Low-Resource Machine Translation Using Cross-Lingual Language Model Pretraining

Francis Zheng, Machel Reid, Edison Marrese-Taylor, Yutaka Matsuo


Abstract
This paper describes UTokyo’s submission to the AmericasNLP 2021 Shared Task on machine translation systems for indigenous languages of the Americas. We present a low-resource machine translation system that improves translation accuracy using cross-lingual language model pretraining. Our system uses fairseq's implementation of mBART to pretrain on a large set of monolingual data from a diverse set of high-resource languages before finetuning on 10 low-resource indigenous American languages: Aymara, Bribri, Asháninka, Guaraní, Wixarika, Náhuatl, Hñähñu, Quechua, Shipibo-Konibo, and Rarámuri. On average, our system achieved BLEU scores that were 1.64 higher and chrF scores that were 0.0749 higher than the baseline.
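The chrF gain quoted in the abstract is on a 0 to 1 scale, since chrF is an F-score over character n-grams. As a rough illustration of how such a score is formed, the following is a minimal sketch of a simplified chrF-style metric. It is not the shared task's official scorer: the function names, the whitespace handling, the defaults (`max_n=6`, `beta=2.0`), and the sample sentences are all assumptions made for illustration.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams of order n, dropping spaces (a simplification)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str,
                max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF-style score in [0, 1] (hypothetical helper).

    Precision and recall are averaged over n-gram orders 1..max_n,
    then combined into an F-beta score that weights recall more
    heavily when beta > 1.
    """
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            # Skip orders longer than either string.
            continue
        # Clipped overlap: each n-gram counts at most as often as in the reference.
        overlap = sum((hyp & ref).values())
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * p * r / (b2 * p + r)


if __name__ == "__main__":
    # Made-up sentence pair, not taken from the shared task data.
    print(simple_chrf("the cat sat on the mat", "the cat sits on the mat"))
```

On this scale, a difference of 0.0749 corresponds to roughly 7.5 points when chrF is reported as a percentage.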
Anthology ID:
2021.americasnlp-1.26
Volume:
Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas
Month:
June
Year:
2021
Address:
Online
Venues:
AmericasNLP | NAACL
Publisher:
Association for Computational Linguistics
Pages:
234–240
URL:
https://aclanthology.org/2021.americasnlp-1.26
DOI:
10.18653/v1/2021.americasnlp-1.26
Cite (ACL):
Francis Zheng, Machel Reid, Edison Marrese-Taylor, and Yutaka Matsuo. 2021. Low-Resource Machine Translation Using Cross-Lingual Language Model Pretraining. In Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas, pages 234–240, Online. Association for Computational Linguistics.
Cite (Informal):
Low-Resource Machine Translation Using Cross-Lingual Language Model Pretraining (Zheng et al., AmericasNLP 2021)
PDF:
https://aclanthology.org/2021.americasnlp-1.26.pdf