Joanna Dolińska
Also published as: Joanna Dolinska
2024
Akha, Dara-ang, Karen, Khamu, Mlabri and Urak Lawoi’ language minorities’ subjective perception of their languages and the outlook for development of digital tools
Joanna Dolinska | Shekhar Nayak | Sumittra Suraratdecha
Proceedings of the Seventh Workshop on the Use of Computational Methods in the Study of Endangered Languages
Multilingualism is deeply rooted in the sociopolitical history of Thailand. Some minority language communities entered Thai territory a few decades ago, while the families of other minority speakers have been living in Thailand for at least several generations. The authors of this article address the question of how Akha, Dara-ang, Karen, Khamu, Mlabri and Urak Lawoi’ language speakers perceive the current situation of their languages and whether they see a need for the development of digital tools for documentation, revitalization and daily use of their languages. This objective is complemented by a discussion of the feasibility of developing such tools for some of the above-mentioned languages and of the motivation of their speakers to participate in this process. Furthermore, this article highlights the challenges associated with developing digital tools for these low-resource languages and outlines the standards researchers must adhere to in conceptualizing the development of such tools, collecting data, and engaging with the language communities throughout the collaborative process.
POS Tagging for the Endangered Dagur Language
Joanna Dolińska | Delphine Bernhard
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
The application of natural language processing tools opens new ways for the documentation and revitalization of under-resourced languages. In this article we investigate the feasibility of automatic part-of-speech (POS) tagging for Dagur, an endangered Mongolic language spoken mainly in northeast China, with no official written standard for all Dagur dialects. We present a new manually annotated corpus for Dagur, which includes about 1,200 tokens, and detail the decisions made during the annotation process. This corpus is used to test the transfer of models from other languages, especially from Buryat, which is currently the only Mongolic language included in the Universal Dependencies corpora. We applied the models trained by de Vries et al. (2022) to the Dagur corpus and continued training these models on Buryat. We analyse the results with respect to language families, script and POS distribution, in three different zero-shot settings: (1) unrelated, (2) related and (3) unrelated+related language.