Bootstrapping Moksha-Erzya Neural Machine Translation from Rule-Based Apertium

Khalid Alnajjar, Mika Hämäläinen, Jack Rueter


Abstract
Neural Machine Translation (NMT) has made significant strides in breaking down language barriers around the globe. For lesser-resourced languages like Moksha and Erzya, however, developing robust NMT systems remains a challenge because parallel corpora are scarce. This paper presents a novel approach to this challenge: using the existing rule-based machine translation system Apertium to generate synthetic training data. We fine-tune NLLB-200 for Moksha-Erzya translation and obtain a BLEU score of 0.73 on the Apertium-generated data. On real-world data, we obtain an improvement of 0.058 BLEU over Apertium.
Anthology ID:
2023.nlp4dh-1.24
Volume:
Proceedings of the Joint 3rd International Conference on Natural Language Processing for Digital Humanities and 8th International Workshop on Computational Linguistics for Uralic Languages
Month:
December
Year:
2023
Address:
Tokyo, Japan
Editors:
Mika Hämäläinen, Emily Öhman, Flammie Pirinen, Khalid Alnajjar, So Miyagawa, Yuri Bizzoni, Niko Partanen, Jack Rueter
Venues:
NLP4DH | IWCLUL
Publisher:
Association for Computational Linguistics
Pages:
213–218
URL:
https://aclanthology.org/2023.nlp4dh-1.24
Cite (ACL):
Khalid Alnajjar, Mika Hämäläinen, and Jack Rueter. 2023. Bootstrapping Moksha-Erzya Neural Machine Translation from Rule-Based Apertium. In Proceedings of the Joint 3rd International Conference on Natural Language Processing for Digital Humanities and 8th International Workshop on Computational Linguistics for Uralic Languages, pages 213–218, Tokyo, Japan. Association for Computational Linguistics.
Cite (Informal):
Bootstrapping Moksha-Erzya Neural Machine Translation from Rule-Based Apertium (Alnajjar et al., NLP4DH-IWCLUL 2023)
PDF:
https://aclanthology.org/2023.nlp4dh-1.24.pdf