Keeping Up Appearances—or how to get all Uralic languages included into bleeding edge research and software: generate, convert, and LLM your way into multilingual datasets

Flammie A. Pirinen

Abstract

Current trends in natural language processing strongly favor large language models and generative AI as the basis for everything. For Uralic languages that are not well represented in publicly available data on the Internet, this can be problematic. In the current computational linguistics scene, it is very important for a language to be represented in popular datasets. Languages that are included in well-known datasets are also included in shared tasks, products by large technology corporations, and so forth. This inclusion will become especially important for under-resourced, under-studied minority, and Indigenous languages, which will otherwise be easily forgotten. In this article, we present the resources that are often deemed necessary for the digital presence of a language in today's world, which is obsessed with large language models. We show that there are methods and tricks available to alleviate the problems caused by a lack of data and a lack of creators and annotators of that data, some more successful than others.

Anthology ID: 2024.iwclul-1.16
Volume: Proceedings of the 9th International Workshop on Computational Linguistics for Uralic Languages
Month: November
Year: 2024
Address: Helsinki, Finland
Editors: Mika Hämäläinen, Flammie Pirinen, Melany Macias, Mario Crespo Avila
Venue: IWCLUL
SIG: SIGUR
Publisher: Association for Computational Linguistics
Pages: 123–131
URL: https://aclanthology.org/2024.iwclul-1.16/
Cite (ACL): Flammie A Pirinen. 2024. Keeping Up Appearances—or how to get all Uralic languages included into bleeding edge research and software: generate, convert, and LLM your way into multilingual datasets. In Proceedings of the 9th International Workshop on Computational Linguistics for Uralic Languages, pages 123–131, Helsinki, Finland. Association for Computational Linguistics.
Cite (Informal): Keeping Up Appearances—or how to get all Uralic languages included into bleeding edge research and software: generate, convert, and LLM your way into multilingual datasets (Pirinen, IWCLUL 2024)
PDF: https://aclanthology.org/2024.iwclul-1.16.pdf
