Multi-lect automatic detection of Swadesh list items from raw corpus data in East Slavic languages

Ilia Afanasev


Abstract
The article introduces a novel task of multi-lect automatic detection of Swadesh list items from raw corpora. The task aids the early stage of historical linguistics study by helping the researcher compile word lists for further analysis. In this paper, I test multi-lect automatic detection on the East Slavic lects’ data. The training data consists of Ukrainian, Belarusian, and Russian material. I introduce a new dataset for the Ukrainian language. I implement data augmentation techniques to give automatic tools a better understanding of the searched value. The test data consists of Old East Slavic texts. I train HMM, CRF, and mBERT models, then test and evaluate them with the F1 score. The baseline is a Random Forest classifier. I introduce two different subtasks: the search for new Swadesh list items, and the search for known Swadesh list items in new lects of a well-established group. The first subtask, given the simultaneously diverse and vague nature of the Swadesh list, currently presents an almost unbeatable challenge for machine learning methods. The second subtask, on the other hand, is easier, and the mBERT model achieves a 0.57 F1 score. This is an impressive result, given how hard it is to formalise a token's membership in a very specific and thematically diverse set of concepts.
Anthology ID:
2023.lchange-1.8
Volume:
Proceedings of the 4th Workshop on Computational Approaches to Historical Language Change
Month:
December
Year:
2023
Address:
Singapore
Editors:
Nina Tahmasebi, Syrielle Montariol, Haim Dubossarsky, Andrey Kutuzov, Simon Hengchen, David Alfter, Francesco Periti, Pierluigi Cassotti
Venue:
LChange
Publisher:
Association for Computational Linguistics
Pages:
76–86
URL:
https://aclanthology.org/2023.lchange-1.8
DOI:
10.18653/v1/2023.lchange-1.8
Cite (ACL):
Ilia Afanasev. 2023. Multi-lect automatic detection of Swadesh list items from raw corpus data in East Slavic languages. In Proceedings of the 4th Workshop on Computational Approaches to Historical Language Change, pages 76–86, Singapore. Association for Computational Linguistics.
Cite (Informal):
Multi-lect automatic detection of Swadesh list items from raw corpus data in East Slavic languages (Afanasev, LChange 2023)
PDF:
https://aclanthology.org/2023.lchange-1.8.pdf