Training and Fine-Tuning NMT Models for Low-Resource Languages Using Apertium-Based Synthetic Corpora

Aleix Sant, Daniel Bardanca, José Ramom Pichel Campos, Francesca De Luca Fornaciari, Carlos Escolano, Javier Garcia Gilabert, Pablo Gamallo, Audrey Mash, Xixian Liao, Maite Melero


Abstract
In this paper, we present the two strategies we employed for the WMT24 Shared Task on Translation into Low-Resource Languages of Spain. We participated in the Spanish-to-Aragonese, Spanish-to-Aranese, and Spanish-to-Asturian language pairs, developing neural translation systems for directions previously served by rule-based approaches. The first strategy involved thorough cleaning and curation of the limited provided data, followed by fine-tuning the multilingual NLLB-200-600M model (Constrained Submission). The second involved training a Transformer from scratch on a large amount of synthetic data (Open Submission). Both approaches relied on synthetic data generated with Apertium and achieved high ChrF and BLEU scores. Given the characteristics of the task, the Constrained Submission surpassed the baselines across all three translation directions, whereas the Open Submission scored slightly below the highest baseline.
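The abstract describes a two-step recipe: generate synthetic parallel data with the rule-based Apertium system, then fine-tune NLLB-200-600M on it. The following is a minimal Python sketch of that recipe, not taken from the paper: it assumes a locally installed Apertium language pair (the mode name `spa-ast` is a guess; `apertium -l` lists installed modes) and the public Hugging Face checkpoint `facebook/nllb-200-distilled-600M`. The seed sentences, language codes, and hyperparameters are illustrative only.

```python
# Minimal sketch (not the authors' code): synthesize Spanish->Asturian pairs
# with the Apertium CLI, then fine-tune NLLB-200-600M on them.
import subprocess
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

def apertium_translate(sentences, mode="spa-ast"):
    """Translate sentences with a locally installed Apertium pair.
    The mode name is an assumption; run `apertium -l` to see what is installed."""
    proc = subprocess.run(
        ["apertium", mode],
        input="\n".join(sentences),
        capture_output=True, text=True, check=True,
    )
    return proc.stdout.strip().split("\n")

# Hypothetical monolingual Spanish seed data; the paper used far larger corpora.
spanish = ["La casa es grande.", "Iremos al mercado por la tarde."]
asturian = apertium_translate(spanish)  # synthetic target side

model_name = "facebook/nllb-200-distilled-600M"
# ast_Latn is a language NLLB-200 already covers; the other task directions
# (Aragonese, Aranese) would need extra handling not shown here.
tokenizer = AutoTokenizer.from_pretrained(
    model_name, src_lang="spa_Latn", tgt_lang="ast_Latn"
)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for src, tgt in zip(spanish, asturian):
    # text_target tokenizes the reference as labels for the seq2seq loss.
    batch = tokenizer(src, text_target=tgt, return_tensors="pt", truncation=True)
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

In practice one would batch examples, filter out sentences where Apertium marks unknown words (it prefixes them with `*`), and train for multiple epochs with a held-out validation set; the loop above only illustrates the data flow.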
Anthology ID: 2024.wmt-1.90
Volume: Proceedings of the Ninth Conference on Machine Translation
Month: November
Year: 2024
Address: Miami, Florida, USA
Editors: Barry Haddow, Tom Kocmi, Philipp Koehn, Christof Monz
Venue: WMT
Publisher: Association for Computational Linguistics
Pages: 925–933
URL: https://aclanthology.org/2024.wmt-1.90
Cite (ACL):
Aleix Sant, Daniel Bardanca, José Ramom Pichel Campos, Francesca De Luca Fornaciari, Carlos Escolano, Javier Garcia Gilabert, Pablo Gamallo, Audrey Mash, Xixian Liao, and Maite Melero. 2024. Training and Fine-Tuning NMT Models for Low-Resource Languages Using Apertium-Based Synthetic Corpora. In Proceedings of the Ninth Conference on Machine Translation, pages 925–933, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
Training and Fine-Tuning NMT Models for Low-Resource Languages Using Apertium-Based Synthetic Corpora (Sant et al., WMT 2024)
PDF: https://aclanthology.org/2024.wmt-1.90.pdf