Challenge Dataset of Cognates and False Friend Pairs from Indian Languages

Diptesh Kanojia, Malhar Kulkarni, Pushpak Bhattacharyya, Gholamreza Haffari


Abstract
Cognates are present in multiple variants of the same text across different languages (e.g., “hund” in German and “hound” in English both mean “dog”). They pose a challenge to various Natural Language Processing (NLP) applications such as Machine Translation, Cross-lingual Sense Disambiguation, Computational Phylogenetics, and Information Retrieval. A possible solution to this challenge is to identify cognates across language pairs. In this paper, we describe the creation of two cognate datasets for twelve Indian languages, namely Sanskrit, Hindi, Assamese, Oriya, Kannada, Gujarati, Tamil, Telugu, Punjabi, Bengali, Marathi, and Malayalam. We digitize the cognate data from an Indian language cognate dictionary and utilize linked Indian language Wordnets to generate cognate sets. Additionally, we use the Wordnet data to create a dataset of false friends for eleven language pairs. We evaluate the efficacy of our dataset using previously available baseline cognate detection approaches, perform a manual evaluation with the help of lexicographers, and release the curated gold-standard dataset with this paper.
Anthology ID:
2020.lrec-1.378
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
3096–3102
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.378
Cite (ACL):
Diptesh Kanojia, Malhar Kulkarni, Pushpak Bhattacharyya, and Gholamreza Haffari. 2020. Challenge Dataset of Cognates and False Friend Pairs from Indian Languages. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 3096–3102, Marseille, France. European Language Resources Association.
Cite (Informal):
Challenge Dataset of Cognates and False Friend Pairs from Indian Languages (Kanojia et al., LREC 2020)
PDF:
https://aclanthology.org/2020.lrec-1.378.pdf
Code:
dipteshkanojia/challengeCognateFF