Improving Indigenous Language Machine Translation with Synthetic Data and Language-Specific Preprocessing

Aashish Dhawan, Christopher Driggers-Ellis, Christan Grant, Daisy Zhe Wang


Abstract
Machine translation for Indigenous and other low-resource languages is constrained by limited parallel data, orthographic variation, and evaluation instability for morphologically rich languages. In this work, we study Spanish–Aymara, Spanish–Guarani, and Spanish–Quechua translation in the context of the AmericasNLP benchmarks, focusing on data-centric improvements rather than architectural changes. We augment curated parallel corpora with forward-translated synthetic sentence pairs generated using a high-capacity multilingual translation model, while applying conservative, language-specific preprocessing tailored to each language. Training data is filtered using length-ratio constraints and deduplication, whereas official development sets are left unfiltered to ensure fair evaluation. We fine-tune a multilingual mBART model under curated-only and curated+synthetic settings and evaluate performance primarily using chrF++, which is better suited for agglutinative languages than BLEU. Across all three languages, synthetic data augmentation consistently improves chrF++, with the largest gains observed for Aymara and Guarani, while Quechua benefits primarily from deterministic orthographic normalization. Our analysis highlights both the effectiveness and the limitations of generic preprocessing for highly agglutinative languages, suggesting that data-centric augmentation and language-aware normalization are strong, reproducible baselines for low-resource Indigenous language machine translation.
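The abstract describes filtering the training corpus with length-ratio constraints and deduplication (while leaving the official development sets untouched). A minimal sketch of such a filter is below; the function name, the word-level length measure, and the ratio thresholds are illustrative assumptions, not the paper's actual values.

```python
def filter_parallel_corpus(pairs, min_ratio=0.5, max_ratio=2.0):
    """Filter (source, target) sentence pairs for MT training.

    Applies two conservative checks of the kind described in the
    abstract:
      1. a length-ratio constraint: drop pairs whose target/source
         token-count ratio falls outside [min_ratio, max_ratio];
      2. exact-match deduplication of repeated pairs.

    The thresholds here are placeholders for illustration only.
    """
    seen = set()
    kept = []
    for src, tgt in pairs:
        src_len, tgt_len = len(src.split()), len(tgt.split())
        if src_len == 0 or tgt_len == 0:
            continue  # drop empty sides
        ratio = tgt_len / src_len
        if not (min_ratio <= ratio <= max_ratio):
            continue  # implausible length ratio, likely misaligned
        key = (src.strip(), tgt.strip())
        if key in seen:
            continue  # exact duplicate pair
        seen.add(key)
        kept.append((src, tgt))
    return kept
```

In a setup like the paper's, a filter of this kind would be applied only to the curated and synthetic training pairs; evaluation data passes through unfiltered so scores remain comparable to the official benchmark.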
Anthology ID:
2026.loresmt-1.10
Volume:
Proceedings for the Ninth Workshop on Technologies for Machine Translation of Low Resource Languages (LoResMT 2026)
Month:
March
Year:
2026
Address:
Rabat, Morocco
Editors:
Atul Kr. Ojha, Chao-hong Liu, Ekaterina Vylomova, Flammie Pirinen, Jonathan Washington, Nathaniel Oco, Xiaobing Zhao
Venues:
LoResMT | WS
Publisher:
Association for Computational Linguistics
Pages:
119–126
URL:
https://aclanthology.org/2026.loresmt-1.10/
Cite (ACL):
Aashish Dhawan, Christopher Driggers-Ellis, Christan Grant, and Daisy Zhe Wang. 2026. Improving Indigenous Language Machine Translation with Synthetic Data and Language-Specific Preprocessing. In Proceedings for the Ninth Workshop on Technologies for Machine Translation of Low Resource Languages (LoResMT 2026), pages 119–126, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal):
Improving Indigenous Language Machine Translation with Synthetic Data and Language-Specific Preprocessing (Dhawan et al., LoResMT 2026)
PDF:
https://aclanthology.org/2026.loresmt-1.10.pdf