Enhancing Tuvan Language Resources through the FLORES Dataset

Ali Kuzhuget, Airana Mongush, Nachyn-Enkhedorzhu Oorzhak


Abstract
FLORES is a benchmark dataset designed for evaluating machine translation systems, particularly for low-resource languages. This paper, conducted as part of the Open Language Data Initiative (OLDI) shared task, presents our contribution to expanding the FLORES dataset with high-quality translations from Russian to Tuvan, an endangered Turkic language. Our approach drew on the linguistic expertise of native speakers to ensure both accuracy and cultural relevance in the translations. This project represents a significant step forward in supporting Tuvan as a low-resource language in the realm of natural language processing (NLP) and machine translation (MT).
Anthology ID:
2024.wmt-1.46
Volume:
Proceedings of the Ninth Conference on Machine Translation
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Barry Haddow, Tom Kocmi, Philipp Koehn, Christof Monz
Venue:
WMT
Publisher:
Association for Computational Linguistics
Pages:
593–599
URL:
https://aclanthology.org/2024.wmt-1.46
Cite (ACL):
Ali Kuzhuget, Airana Mongush, and Nachyn-Enkhedorzhu Oorzhak. 2024. Enhancing Tuvan Language Resources through the FLORES Dataset. In Proceedings of the Ninth Conference on Machine Translation, pages 593–599, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
Enhancing Tuvan Language Resources through the FLORES Dataset (Kuzhuget et al., WMT 2024)
PDF:
https://aclanthology.org/2024.wmt-1.46.pdf