Open Language Data Initiative: Advancing Low-Resource Machine Translation for Karakalpak

Mukhammadsaid Mamasaidov, Abror Shopulatov


Abstract
This study presents several contributions for the Karakalpak language: a FLORES+ devtest dataset translated into Karakalpak; parallel corpora for Uzbek-Karakalpak, Russian-Karakalpak, and English-Karakalpak, each containing 100,000 pairs; and open-source fine-tuned neural models for translation across these languages. Our experiments compare different model variants and training approaches, demonstrating improvements over existing baselines. This work, conducted as part of the Open Language Data Initiative (OLDI) shared task, aims to advance machine translation capabilities for Karakalpak and to contribute to expanding linguistic diversity in NLP technologies.
Anthology ID:
2024.wmt-1.48
Volume:
Proceedings of the Ninth Conference on Machine Translation
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Barry Haddow, Tom Kocmi, Philipp Koehn, Christof Monz
Venue:
WMT
Publisher:
Association for Computational Linguistics
Pages:
606–613
URL:
https://aclanthology.org/2024.wmt-1.48
Cite (ACL):
Mukhammadsaid Mamasaidov and Abror Shopulatov. 2024. Open Language Data Initiative: Advancing Low-Resource Machine Translation for Karakalpak. In Proceedings of the Ninth Conference on Machine Translation, pages 606–613, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
Open Language Data Initiative: Advancing Low-Resource Machine Translation for Karakalpak (Mamasaidov & Shopulatov, WMT 2024)
PDF:
https://aclanthology.org/2024.wmt-1.48.pdf