Building a Dataset of Multilingual Cognates for the Romanian Lexicon

Liviu P. Dinu; Alina Maria Ciobanu

Abstract

Identifying cognates is an interesting task with applications in numerous research areas, such as historical and comparative linguistics, language acquisition, cross-lingual information retrieval, readability and machine translation. We propose a dictionary-based approach to identifying cognates based on etymology and etymons. We account for relationships between languages and we extract etymology-related information from electronic dictionaries. We use the resulting dataset of cognates as a gold standard to evaluate the extent to which orthographic methods can detect cognate pairs. The question is whether these methods can discriminate between cognates and non-cognates, given the orthographic changes that foreign words undergo when entering new languages. We investigate some orthographic approaches widely used in this research area, as well as some original metrics. We run our experiments on the Romanian lexicon, but the method we propose is adaptable to any language, provided that resources are available.
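
To make the notion of an orthographic method concrete, here is a minimal sketch of two string-similarity measures commonly used in cognate detection: normalized edit distance and the longest common subsequence ratio (LCSR). The abstract does not name the metrics the authors evaluate, so the choice of these two, the function names, and the example word pair are illustrative assumptions and not the paper's method.

```python
# Illustrative sketch only: the metrics and names below are assumptions,
# not taken from the paper.

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit-cost insert, delete and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """1 minus the edit distance normalized by the longer word's length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcsr(a: str, b: str) -> float:
    """Longest common subsequence ratio: LCS length over the longer length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else lcs_length(a, b) / longest


if __name__ == "__main__":
    # Hypothetical example pair (Romanian / Italian words for "night").
    ro, it = "noapte", "notte"
    print(edit_similarity(ro, it), lcsr(ro, it))
```

Scores like these could then be compared against a threshold, or fed to a classifier, and checked against a dictionary-derived gold standard such as the one the abstract describes.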

Anthology ID: L14-1184
Volume: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Month: May
Year: 2014
Address: Reykjavik, Iceland
Editors: Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue: LREC
Publisher: European Language Resources Association (ELRA)
Pages: 1038–1043
URL: http://www.lrec-conf.org/proceedings/lrec2014/pdf/175_Paper.pdf
Cite (ACL): Liviu Dinu and Alina Maria Ciobanu. 2014. Building a Dataset of Multilingual Cognates for the Romanian Lexicon. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 1038–1043, Reykjavik, Iceland. European Language Resources Association (ELRA).
Cite (Informal): Building a Dataset of Multilingual Cognates for the Romanian Lexicon (Dinu & Ciobanu, LREC 2014)
PDF: http://www.lrec-conf.org/proceedings/lrec2014/pdf/175_Paper.pdf
