A Classification-Based Approach to Cognate Detection Combining Orthographic and Semantic Similarity Information

Sofie Labat, Els Lefever


Abstract
This paper presents proof-of-concept experiments for combining orthographic and semantic information to distinguish cognates from non-cognates. To this end, a context-independent gold standard is developed by manually labelling English-Dutch pairs of cognates and false friends in bilingual term lists. These annotated cognate pairs are then used to train and evaluate a supervised binary classification system for the automatic detection of cognates. Two types of information sources are incorporated in the classifier: fifteen string similarity metrics capture form similarity between source and target words, while word embeddings model semantic similarity between the words. The experimental results show that although the system already performs well using orthographic information alone, performance improves further when semantic information in the form of embeddings is added.
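The feature combination described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it computes only two string similarity metrics (normalized Levenshtein similarity and a character-bigram Dice coefficient) rather than the fifteen used in the paper, and it assumes that the source and target word vectors live in a shared cross-lingual embedding space so that cosine similarity is meaningful. The function names and the toy vectors are hypothetical and chosen for illustration only.

```python
import math


def levenshtein(a: str, b: str) -> int:
    """Edit distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def normalized_levenshtein_similarity(a: str, b: str) -> float:
    """Map edit distance to [0, 1], where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def dice_bigram(a: str, b: str) -> float:
    """Dice coefficient over the sets of character bigrams."""
    bigrams_a = {a[i:i + 2] for i in range(len(a) - 1)}
    bigrams_b = {b[i:i + 2] for i in range(len(b) - 1)}
    if not bigrams_a and not bigrams_b:
        return 1.0
    return 2 * len(bigrams_a & bigrams_b) / (len(bigrams_a) + len(bigrams_b))


def cosine(u, v) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return 0.0 if norm_u == 0 or norm_v == 0 else dot / (norm_u * norm_v)


def pair_features(src: str, tgt: str, src_vec, tgt_vec) -> list:
    """Feature vector for one word pair: two orthographic scores, one semantic."""
    s, t = src.lower(), tgt.lower()
    return [
        normalized_levenshtein_similarity(s, t),
        dice_bigram(s, t),
        cosine(src_vec, tgt_vec),  # assumes a shared embedding space
    ]


# Toy example: the vectors are made up, not taken from real embeddings.
features = pair_features("night", "nacht", [0.2, 0.7, 0.1], [0.25, 0.6, 0.2])
```

Feature vectors of this kind, one per annotated English-Dutch pair, could then be passed to any supervised binary classifier together with the gold-standard labels.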
Anthology ID:
R19-1071
Volume:
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)
Month:
September
Year:
2019
Address:
Varna, Bulgaria
Editors:
Ruslan Mitkov, Galia Angelova
Venue:
RANLP
Publisher:
INCOMA Ltd.
Pages:
602–610
URL:
https://aclanthology.org/R19-1071
DOI:
10.26615/978-954-452-056-4_071
Cite (ACL):
Sofie Labat and Els Lefever. 2019. A Classification-Based Approach to Cognate Detection Combining Orthographic and Semantic Similarity Information. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019), pages 602–610, Varna, Bulgaria. INCOMA Ltd.
Cite (Informal):
A Classification-Based Approach to Cognate Detection Combining Orthographic and Semantic Similarity Information (Labat & Lefever, RANLP 2019)
PDF:
https://aclanthology.org/R19-1071.pdf