Low-Resource Machine Translation Systems for Indic Languages

Ivana Kvapilíková, Ondřej Bojar


Abstract
We present our submission to the WMT23 shared task in translation between English and Assamese, Khasi, Mizo and Manipuri. All our systems were pretrained on the tasks of multilingual masked language modelling and denoising auto-encoding. Our primary systems for translation into English were further pretrained for multilingual MT in all four language directions and then fine-tuned separately on the limited parallel data available for each language pair. We used online back-translation for data augmentation. For translation out of English, the same systems were submitted as contrastive, because the multilingual MT pretraining step seemed to harm translation performance; our primary systems for that direction were trained without it. Other contrastive systems used additional pseudo-parallel data mined from monolingual corpora for pretraining.
Anthology ID:
2023.wmt-1.90
Volume:
Proceedings of the Eighth Conference on Machine Translation
Month:
December
Year:
2023
Address:
Singapore
Editors:
Philipp Koehn, Barry Haddow, Tom Kocmi, Christof Monz
Venue:
WMT
SIG:
SIGMT
Publisher:
Association for Computational Linguistics
Pages:
954–958
URL:
https://aclanthology.org/2023.wmt-1.90
DOI:
10.18653/v1/2023.wmt-1.90
Cite (ACL):
Ivana Kvapilíková and Ondřej Bojar. 2023. Low-Resource Machine Translation Systems for Indic Languages. In Proceedings of the Eighth Conference on Machine Translation, pages 954–958, Singapore. Association for Computational Linguistics.
Cite (Informal):
Low-Resource Machine Translation Systems for Indic Languages (Kvapilíková & Bojar, WMT 2023)
PDF:
https://aclanthology.org/2023.wmt-1.90.pdf