Language Identification for Austronesian Languages

Jonathan Dunn, Wikke Nijhof


Abstract
This paper provides language identification models for low- and under-resourced languages in the Pacific region, with a focus on previously unavailable Austronesian languages. Accurate language identification is an important part of developing language resources. The approach taken in this paper combines 29 Austronesian languages with 171 non-Austronesian languages to create an evaluation set drawn from eight data sources. After evaluating six approaches to language identification, we find that a classifier based on skip-gram embeddings performs significantly better than the alternative methods. We then systematically increase the number of non-Austronesian languages in the model, up to a total of 800 languages, to evaluate whether a larger language inventory leads to less precise predictions for the Austronesian languages of interest. This evaluation finds that increasing the inventory of non-Austronesian languages has only a minimal impact on accuracy. Further experiments adapt these language identification models for code-switching detection, achieving high accuracy across all 29 languages.
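As an illustration of what "less precise predictions for the Austronesian languages of interest" could mean in practice, the sketch below computes precision separately for a set of target languages, so that the same targets can be compared before and after more languages are added to a model. It is a minimal example under stated assumptions and not the authors' evaluation code: the function name `per_language_precision`, the toy labels, and the choice of plain per-label precision as the metric are all hypothetical.

```python
from collections import Counter


def per_language_precision(gold, predicted, targets):
    """Return precision for each target language.

    Precision for a language is the number of samples correctly
    predicted as that language divided by the number of samples
    predicted as that language. Languages that are never predicted
    get a precision of 0.0.
    """
    predicted_counts = Counter(predicted)
    correct_counts = Counter(g for g, p in zip(gold, predicted) if g == p)
    return {
        lang: (correct_counts[lang] / predicted_counts[lang])
        if predicted_counts[lang]
        else 0.0
        for lang in targets
    }


# Hypothetical toy labels using ISO 639-3 codes:
# mri = Maori, haw = Hawaiian, eng = English (a non-target language).
gold = ["mri", "mri", "haw", "eng", "eng", "haw"]
predicted = ["mri", "haw", "haw", "eng", "mri", "haw"]

# Only the target (Austronesian) languages are scored.
scores = per_language_precision(gold, predicted, targets=["mri", "haw"])
print(scores)
```

Running the same function on predictions from a model with a larger language inventory, and comparing the two dictionaries, would show whether the added languages draw predictions away from the target languages.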
Anthology ID: 2022.lrec-1.701
Volume: Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month: June
Year: 2022
Address: Marseille, France
Editors: Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis
Venue: LREC
Publisher: European Language Resources Association
Pages: 6530–6539
URL: https://aclanthology.org/2022.lrec-1.701
Cite (ACL): Jonathan Dunn and Wikke Nijhof. 2022. Language Identification for Austronesian Languages. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 6530–6539, Marseille, France. European Language Resources Association.
Cite (Informal): Language Identification for Austronesian Languages (Dunn & Nijhof, LREC 2022)
PDF: https://aclanthology.org/2022.lrec-1.701.pdf
Code: jonathandunn/pacific_codeswitch