NLP for preserving Torlak, a vulnerable low-resource Slavic language

Li Tang; Teodora Vuković

Abstract

Torlak is an endangered, low-resource Slavic language with a high degree of areal and inter-speaker variation. In previous work, interviews were conducted with Torlak speakers in Serbia, near the Bulgarian border, and the transcripts were annotated with lemmas and morphosyntactic descriptions at the token level. Such token-level annotations facilitate cross-language comparison in the context of the Balkan Sprachbund, where multiple languages, including Serbian and Bulgarian, have influenced Torlak over time. Here, we aim to improve the prediction of morphosyntactic annotations for this low-resource language by fine-tuning large language models, comparing several predictive models. We also further fine-tuned the large language models to score the degree of ‘Torlakness’ of a sentence by labeling likely Torlak tokens, to facilitate the documentation of additional transcribed Torlak speech with a high degree of Torlak-style non-standard features compared to standard Serbian. Taken together, we hope that these contributions will help to document this endangered language and improve digital access for its speakers.

Anthology ID: 2025.coling-main.423
Volume: Proceedings of the 31st International Conference on Computational Linguistics
Month: January
Year: 2025
Address: Abu Dhabi, UAE
Editors: Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert
Venue: COLING
Publisher: Association for Computational Linguistics
Pages: 6338–6347
URL: https://aclanthology.org/2025.coling-main.423/
Cite (ACL): Li Tang and Teodora Vuković. 2025. NLP for preserving Torlak, a vulnerable low-resource Slavic language. In Proceedings of the 31st International Conference on Computational Linguistics, pages 6338–6347, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal): NLP for preserving Torlak, a vulnerable low-resource Slavic language (Tang & Vuković, COLING 2025)
PDF: https://aclanthology.org/2025.coling-main.423.pdf
