Automatic Detection of Borrowings in Low-Resource Languages of the Caucasus: Andic branch

Konstantin Zaitsev; Anzhelika Minchenko


Abstract

Linguistic borrowings occur in all languages. The Andic languages of the Caucasus have borrowings from several donor languages, such as Russian, Arabic, and Persian. To detect these borrowings automatically, we propose a logistic regression model. The model was trained on a dataset of words in IPA taken from dictionaries of Andic languages. To improve the model's quality, we compared TF-IDF and count vectorizers and chose the latter. We also added new features to the model, extracted by analyzing the vectorizer features and by using a language model. The model was evaluated with classification quality metrics (precision, recall, and F1-score). The best F1-score for words in IPA, averaged over all languages, was about 0.78. Experiments showed that our model reaches good results not only with words in IPA but also with words in Cyrillic.
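The pipeline described in the abstract (count features, logistic regression, precision/recall/F1) can be illustrated with a minimal standard-library sketch. It is not the authors' implementation: the use of character n-grams, the n-gram range, the word-boundary markers, the training hyperparameters, and all function names are assumptions made for illustration, and the training words are synthetic strings, not items from the paper's dataset.

```python
import math
from collections import Counter


def char_ngrams(word, n_min=1, n_max=3):
    """Count character n-grams of a word (assumed feature type, with boundary markers)."""
    padded = f"<{word}>"
    return Counter(
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    )


def predict_proba(weights, bias, feats):
    """Logistic regression: probability that the word is a borrowing."""
    z = bias + sum(weights.get(g, 0.0) * c for g, c in feats.items())
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def train(words, labels, epochs=300, lr=0.1, l2=0.001):
    """Fit weights with plain stochastic gradient descent (hypothetical settings)."""
    weights, bias = {}, 0.0
    data = [(char_ngrams(w), y) for w, y in zip(words, labels)]
    for _ in range(epochs):
        for feats, y in data:
            err = predict_proba(weights, bias, feats) - y
            bias -= lr * err
            for g, c in feats.items():
                w = weights.get(g, 0.0)
                weights[g] = w - lr * (err * c + l2 * w)
    return weights, bias


def prf(gold, pred):
    """Precision, recall and F1 for the positive (borrowing) class."""
    tp = sum(g == 1 and p == 1 for g, p in zip(gold, pred))
    fp = sum(g == 0 and p == 1 for g, p in zip(gold, pred))
    fn = sum(g == 1 and p == 0 for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Synthetic toy data: label 1 = "borrowing", 0 = "native". Invented strings only.
TRAIN_WORDS = ["stabo", "mesta", "kusto", "stiru", "q'abo", "meq'a", "kuq'o", "q'iru"]
TRAIN_LABELS = [1, 1, 1, 1, 0, 0, 0, 0]

weights, bias = train(TRAIN_WORDS, TRAIN_LABELS)
predictions = [
    int(predict_proba(weights, bias, char_ngrams(w)) >= 0.5) for w in TRAIN_WORDS
]
print(prf(TRAIN_LABELS, predictions))
```

Raw counts are used here rather than TF-IDF weights, in line with the vectorizer choice stated in the abstract. The additional language-model features mentioned there are not reproduced in this sketch.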

Anthology ID: 2022.fieldmatters-1.4
Volume: Proceedings of the First Workshop on NLP applications to field linguistics
Month: October
Year: 2022
Address: Gyeongju, Republic of Korea
Editors: Oleg Serikov, Ekaterina Voloshina, Anna Postnikova, Elena Klyachko, Ekaterina Neminova, Ekaterina Vylomova, Tatiana Shavrina, Eric Le Ferrand, Valentin Malykh, Francis Tyers, Timofey Arkhangelskiy, Vladislav Mikhailov, Alena Fenogenova
Venue: FieldMatters
Publisher: International Conference on Computational Linguistics
Pages: 34–41
URL: https://aclanthology.org/2022.fieldmatters-1.4/
Cite (ACL): Konstantin Zaitsev and Anzhelika Minchenko. 2022. Automatic Detection of Borrowings in Low-Resource Languages of the Caucasus: Andic branch. In Proceedings of the First Workshop on NLP applications to field linguistics, pages 34–41, Gyeongju, Republic of Korea. International Conference on Computational Linguistics.
Cite (Informal): Automatic Detection of Borrowings in Low-Resource Languages of the Caucasus: Andic branch (Zaitsev & Minchenko, FieldMatters 2022)
PDF: https://aclanthology.org/2022.fieldmatters-1.4.pdf
