Peru is Multilingual, Its Machine Translation Should Be Too?

Arturo Oncevay


Abstract
Peru is a multilingual country with a long history of contact between the indigenous languages and Spanish. Taking advantage of this context for machine translation is possible with multilingual approaches for learning both unsupervised subword segmentation and neural machine translation models. The study proposes the first multilingual translation models for four languages spoken in Peru: Aymara, Ashaninka, Quechua and Shipibo-Konibo, providing both many-to-Spanish and Spanish-to-many models and outperforming pairwise baselines in most of them. The task exploited a large English-Spanish dataset for pre-training, monolingual texts with tagged back-translation, and parallel corpora aligned with English. Finally, by fine-tuning the best models, we also assessed the out-of-domain capabilities in two evaluation datasets for Quechua and a new one for Shipibo-Konibo.
Anthology ID:
2021.americasnlp-1.22
Volume:
Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas
Month:
June
Year:
2021
Address:
Online
Venues:
AmericasNLP | NAACL
Publisher:
Association for Computational Linguistics
Pages:
194–201
URL:
https://aclanthology.org/2021.americasnlp-1.22
DOI:
10.18653/v1/2021.americasnlp-1.22
Cite (ACL):
Arturo Oncevay. 2021. Peru is Multilingual, Its Machine Translation Should Be Too?. In Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas, pages 194–201, Online. Association for Computational Linguistics.
Cite (Informal):
Peru is Multilingual, Its Machine Translation Should Be Too? (Oncevay, AmericasNLP 2021)
PDF:
https://aclanthology.org/2021.americasnlp-1.22.pdf
Code:
aoncevay/mt-peru