Combining Noisy Semantic Signals with Orthographic Cues: Cognate Induction for the Indic Dialect Continuum

Niyati Bafna, Josef van Genabith, Cristina España-Bonet, Zdeněk Žabokrtský


Abstract
We present a novel method for unsupervised cognate/borrowing identification from monolingual corpora, designed for low- and extremely low-resource scenarios, based on combining noisy semantic signals from joint bilingual spaces with orthographic cues modelling sound change. We apply our method to the North Indian dialect continuum, containing several dozen dialects and languages spoken by more than 100 million people. Many of these languages are zero-resource and therefore natural language processing for them is non-existent. We first collect monolingual data for 26 Indic languages, 16 of which were previously zero-resource, and perform exploratory character, lexical and subword cross-lingual alignment experiments for the first time at this scale on this dialect continuum. We create bilingual evaluation lexicons against Hindi for 20 of the languages. We then apply our cognate identification method to the data, and show that our method outperforms both traditional orthography baselines and EM-style learnt edit distance matrices. To the best of our knowledge, this is the first work to combine traditional orthographic cues with noisy bilingual embeddings to tackle unsupervised cognate detection in a (truly) low-resource setup, showing that even noisy bilingual embeddings can act as good guides for this task. We release our multilingual dialect corpus, called HinDialect, as well as our scripts for evaluation data collection and cognate induction.
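The general idea of combining a noisy semantic signal with an orthographic cue can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the shortlist size `k`, the interpolation weight `alpha`, the toy vectors, and the use of plain Levenshtein distance (in place of a learnt model of sound change) are all assumptions made for illustration. The sketch shortlists target words by cosine similarity in a shared embedding space, then reranks the shortlist with a score that also rewards orthographic closeness.

```python
import math


def levenshtein(a: str, b: str) -> int:
    """Plain unit-cost edit distance (an assumed stand-in for a learnt one)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def orthographic_similarity(a: str, b: str) -> float:
    """Edit distance normalised to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def cosine(u, v) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 for a zero vector."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_cognate_candidates(source_word, source_vec, target_vecs, k=2, alpha=0.5):
    """Shortlist k semantic neighbours, then rerank by a combined score.

    target_vecs maps each target-language word to its vector in the
    (hypothetical) joint bilingual space. alpha weights the semantic signal.
    """
    shortlist = sorted(target_vecs,
                       key=lambda w: cosine(source_vec, target_vecs[w]),
                       reverse=True)[:k]
    scored = [(w,
               alpha * cosine(source_vec, target_vecs[w])
               + (1 - alpha) * orthographic_similarity(source_word, w))
              for w in shortlist]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy, made-up romanised words and 2-d vectors, purely for illustration.
toy_targets = {"pani": [0.9, 0.1], "jal": [1.0, 0.0], "ghar": [0.0, 1.0]}
ranked = rank_cognate_candidates("paani", [1.0, 0.0], toy_targets)
```

In this toy setup the semantically closest candidate is not the orthographically closest one, so the combined score changes the ranking relative to using embeddings alone. A real system would replace the unit-cost edit distance with a model of systematic sound correspondences.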
Anthology ID:
2022.conll-1.9
Volume:
Proceedings of the 26th Conference on Computational Natural Language Learning (CoNLL)
Month:
December
Year:
2022
Address:
Abu Dhabi, United Arab Emirates (Hybrid)
Editors:
Antske Fokkens, Vivek Srikumar
Venue:
CoNLL
SIG:
SIGNLL
Publisher:
Association for Computational Linguistics
Pages:
110–131
URL:
https://aclanthology.org/2022.conll-1.9
DOI:
10.18653/v1/2022.conll-1.9
Cite (ACL):
Niyati Bafna, Josef van Genabith, Cristina España-Bonet, and Zdeněk Žabokrtský. 2022. Combining Noisy Semantic Signals with Orthographic Cues: Cognate Induction for the Indic Dialect Continuum. In Proceedings of the 26th Conference on Computational Natural Language Learning (CoNLL), pages 110–131, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.
Cite (Informal):
Combining Noisy Semantic Signals with Orthographic Cues: Cognate Induction for the Indic Dialect Continuum (Bafna et al., CoNLL 2022)
PDF:
https://aclanthology.org/2022.conll-1.9.pdf