Identifying Cognate Sets Across Dictionaries of Related Languages

Adam St Arnaud, David Beck, Grzegorz Kondrak


Abstract
We present a system for identifying cognate sets across dictionaries of related languages. The likelihood of a cognate relationship is calculated on the basis of a rich set of features that capture both phonetic and semantic similarity, as well as the presence of regular sound correspondences. The similarity scores are used to cluster words from different languages that may originate from a common proto-word. When tested on the Algonquian language family, our system detects 63% of cognate sets while maintaining a cluster purity of 70%.
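As a rough illustration of the purity figure reported in the abstract, the sketch below computes cluster purity under the standard definition: each proposed cluster is credited with the size of its largest group of words that share a gold cognate set, and the credits are divided by the total number of clustered words. This definition is an assumption; the paper's exact evaluation procedure may differ. The function name `cluster_purity`, the word identifiers, and the gold labels are hypothetical toy values, not data from the paper.

```python
from collections import Counter


def cluster_purity(clusters, gold_label):
    """Return the fraction of words that carry the majority gold label of their cluster.

    clusters:   list of lists of word identifiers (the system's proposed sets)
    gold_label: dict mapping each word identifier to its gold cognate-set id
    Assumes the standard purity definition, which may not match the paper exactly.
    """
    total = sum(len(cluster) for cluster in clusters)
    if total == 0:
        return 0.0
    majority = 0
    for cluster in clusters:
        if not cluster:
            continue
        # Count how many words in this cluster belong to each gold set,
        # then credit the cluster with its most frequent gold set.
        counts = Counter(gold_label[word] for word in cluster)
        majority += counts.most_common(1)[0][1]
    return majority / total


# Hypothetical toy data: placeholder identifiers, not real dictionary entries.
toy_clusters = [["lang1:w1", "lang2:w1", "lang3:w7"], ["lang1:w2", "lang2:w2"]]
toy_gold = {
    "lang1:w1": "set_A",
    "lang2:w1": "set_A",
    "lang3:w7": "set_B",
    "lang1:w2": "set_B",
    "lang2:w2": "set_B",
}
purity = cluster_purity(toy_clusters, toy_gold)
print(f"purity = {purity:.2f}")
```

In this toy case the first cluster contains one word from a different gold set, so four of the five words agree with their cluster's majority label.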
Anthology ID:
D17-1267
Volume:
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing
Month:
September
Year:
2017
Address:
Copenhagen, Denmark
Editors:
Martha Palmer, Rebecca Hwa, Sebastian Riedel
Venue:
EMNLP
SIG:
SIGDAT
Publisher:
Association for Computational Linguistics
Pages:
2519–2528
URL:
https://aclanthology.org/D17-1267/
DOI:
10.18653/v1/D17-1267
Cite (ACL):
Adam St Arnaud, David Beck, and Grzegorz Kondrak. 2017. Identifying Cognate Sets Across Dictionaries of Related Languages. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2519–2528, Copenhagen, Denmark. Association for Computational Linguistics.
Cite (Informal):
Identifying Cognate Sets Across Dictionaries of Related Languages (St Arnaud et al., EMNLP 2017)
PDF:
https://aclanthology.org/D17-1267.pdf
Code:
ajstarna/SemaPhoR