TIM-UNIGE Translation into Low-Resource Languages of Spain for WMT24

Jonathan Mutal, Lucía Ormaechea


Abstract
We present the results of our constrained submission to the WMT 2024 shared task, which focuses on translating from Spanish into two low-resource languages of Spain: Aranese (spa-arn) and Aragonese (spa-arg). Our system combines real data with synthetic data generated by large language models (e.g., BLOOMZ) and rule-based Apertium translation systems. Built on the pre-trained NLLB system, our translation model follows a multistage approach: it progressively refines the initial model on a sequence of datasets, starting with large-scale synthetic or crawled data and moving to smaller, high-quality parallel corpora. This approach yielded BLEU scores of 30.1 for Spanish to Aranese and 61.9 for Spanish to Aragonese.
Anthology ID:
2024.wmt-1.82
Volume:
Proceedings of the Ninth Conference on Machine Translation
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Barry Haddow, Tom Kocmi, Philipp Koehn, Christof Monz
Venue:
WMT
Publisher:
Association for Computational Linguistics
Pages:
862–870
URL:
https://aclanthology.org/2024.wmt-1.82
Cite (ACL):
Jonathan Mutal and Lucía Ormaechea. 2024. TIM-UNIGE Translation into Low-Resource Languages of Spain for WMT24. In Proceedings of the Ninth Conference on Machine Translation, pages 862–870, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
TIM-UNIGE Translation into Low-Resource Languages of Spain for WMT24 (Mutal & Ormaechea, WMT 2024)
PDF:
https://aclanthology.org/2024.wmt-1.82.pdf