Using Large Language Models to Transliterate Endangered Uralic Languages

Niko Partanen


Abstract
This study investigates whether Large Language Models (LLMs) are able to transliterate and normalize endangered Uralic languages, specifically when they have been written in early 20th-century Latin-script-based transcription systems. We test commercially available closed-source systems, for which there is no reason to expect that the models have been specifically adapted to this task or these languages. The output of the transliteration in all experiments is contemporary Cyrillic orthography. We conclude that some of the newer LLMs, especially Claude 3.5 Sonnet, are able to produce high-quality transliterations even for the smaller languages in our test set, both in zero-shot scenarios and with a prompt that contains an example of the desired output. We assume that the good results are connected to the large amount of material in these languages available online, which the LLMs have learned to represent.
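The two prompting setups mentioned in the abstract (zero-shot, and a prompt containing one example of the desired output) could be sketched roughly as follows. This is only an illustrative sketch: the function name `build_prompt`, the prompt wording, and the placeholder strings are assumptions made here and are not taken from the paper, and no particular model API is called.

```python
from typing import Optional, Tuple


def build_prompt(
    source_text: str,
    language: str,
    example: Optional[Tuple[str, str]] = None,
) -> str:
    """Build a transliteration prompt.

    Hypothetical wording, not the prompt used in the paper. If `example`
    is given as (transcribed_input, cyrillic_output), it is included as a
    single demonstration; otherwise the prompt is zero-shot.
    """
    lines = [
        f"Transliterate the following {language} text from an early "
        "20th-century Latin-script transcription into contemporary "
        "Cyrillic orthography. Return only the transliterated text.",
    ]
    if example is not None:
        example_input, example_output = example
        lines += ["", "Example input:", example_input,
                  "Example output:", example_output]
    lines += ["", "Input:", source_text, "Output:"]
    return "\n".join(lines)


# "Komi" is an illustrative choice; the abstract does not name the languages.
# The bracketed strings are placeholders, not real data.
zero_shot = build_prompt("<transcribed sentence>", "Komi")
one_shot = build_prompt(
    "<transcribed sentence>",
    "Komi",
    example=("<example transcription>", "<example Cyrillic text>"),
)
```

The resulting string would then be sent to whichever model is being evaluated; keeping prompt construction separate from the model call makes it easier to compare the zero-shot and one-example conditions on identical input.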
Anthology ID:
2024.iwclul-1.10
Volume:
Proceedings of the 9th International Workshop on Computational Linguistics for Uralic Languages
Month:
November
Year:
2024
Address:
Helsinki, Finland
Editors:
Mika Hämäläinen, Flammie Pirinen, Melany Macias, Mario Crespo Avila
Venue:
IWCLUL
SIG:
SIGUR
Publisher:
Association for Computational Linguistics
Pages:
81–88
URL:
https://aclanthology.org/2024.iwclul-1.10
Cite (ACL):
Niko Partanen. 2024. Using Large Language Models to Transliterate Endangered Uralic Languages. In Proceedings of the 9th International Workshop on Computational Linguistics for Uralic Languages, pages 81–88, Helsinki, Finland. Association for Computational Linguistics.
Cite (Informal):
Using Large Language Models to Transliterate Endangered Uralic Languages (Partanen, IWCLUL 2024)
PDF:
https://aclanthology.org/2024.iwclul-1.10.pdf