2024
Lightweight neural translation technologies for low-resource languages
Felipe Sánchez-Martínez | Juan Antonio Pérez-Ortiz | Víctor Sánchez-Cartagena | Andrés Lou | Cristian García-Romero | Aarón Galiano-Jiménez | Miquel Esplà-Gomis
Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 2)
The LiLowLa (“Lightweight neural translation technologies for low-resource languages”) project aims to enhance machine translation (MT) and translation memory (TM) technologies, particularly for low-resource language pairs, for which adequate linguistic resources are scarce. The project started in September 2022 and will run until August 2025.
Expanding the FLORES+ Multilingual Benchmark with Translations for Aragonese, Aranese, Asturian, and Valencian
Juan Antonio Perez-Ortiz | Felipe Sánchez-Martínez | Víctor M. Sánchez-Cartagena | Miquel Esplà-Gomis | Aaron Galiano Jimenez | Antoni Oliver | Claudi Aventín-Boya | Alejandro Pardos | Cristina Valdés | Jusèp Loís Sans Socasau | Juan Pablo Martínez
Proceedings of the Ninth Conference on Machine Translation
In this paper, we describe the process of creating the FLORES+ datasets for several Romance languages spoken in Spain, namely Aragonese, Aranese, Asturian, and Valencian. The Aragonese and Aranese datasets are entirely new additions to the FLORES+ multilingual benchmark. An initial version of the Asturian dataset was already available in FLORES+, and our work focused on a thorough revision. Similarly, FLORES+ included a Catalan dataset, which we adapted to the Valencian variety spoken in the Valencian Community. The development of the Aragonese, Aranese, and revised Asturian FLORES+ datasets was undertaken as part of a WMT24 shared task on translation into low-resource languages of Spain.
Findings of the WMT 2024 Shared Task Translation into Low-Resource Languages of Spain: Blending Rule-Based and Neural Systems
Felipe Sánchez-Martínez | Juan Antonio Perez-Ortiz | Aaron Galiano Jimenez | Antoni Oliver
Proceedings of the Ninth Conference on Machine Translation
This paper presents the results of the Ninth Conference on Machine Translation (WMT24) Shared Task “Translation into Low-Resource Languages of Spain”. The task focused on the development of machine translation systems for three language pairs: Spanish-Aragonese, Spanish-Aranese, and Spanish-Asturian. Seventeen teams participated in the shared task with a total of 87 submissions. The baseline system for all language pairs was Apertium, a rule-based machine translation system that still performs competitively, even in an era dominated by more advanced non-symbolic approaches. We report and discuss the results of the submitted systems, highlighting the strengths of both neural and rule-based approaches.
Universitat d’Alacant’s Submission to the WMT 2024 Shared Task on Translation into Low-Resource Languages of Spain
Aaron Galiano Jimenez | Víctor M. Sánchez-Cartagena | Juan Antonio Perez-Ortiz | Felipe Sánchez-Martínez
Proceedings of the Ninth Conference on Machine Translation
This paper describes the submissions of the Transducens group of the Universitat d’Alacant to the WMT 2024 Shared Task on Translation into Low-Resource Languages of Spain; in particular, the task focuses on translation from Spanish into Aragonese, Aranese and Asturian. Our submissions use parallel and monolingual data to fine-tune the NLLB-1.3B model and to investigate the effectiveness of synthetic corpora and transfer learning between related languages such as Catalan, Galician and Valencian. We also present a many-to-many multilingual neural machine translation model focused on the Romance languages of Spain.
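As an illustration only, not the code released with this submission, the sketch below shows how an NLLB checkpoint can be fine-tuned on a small parallel corpus with the Hugging Face Transformers library. The checkpoint id and the spa_Latn/ast_Latn language codes are real NLLB-200 identifiers; the Spanish-Asturian pair, the data file, and the hyperparameters are assumptions made for the example.

```python
# Hypothetical fine-tuning sketch, not the authors' pipeline. Data paths and
# hyperparameters are illustrative assumptions.
from datasets import load_dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

checkpoint = "facebook/nllb-200-1.3B"
tokenizer = AutoTokenizer.from_pretrained(
    checkpoint, src_lang="spa_Latn", tgt_lang="ast_Latn")  # Spanish -> Asturian
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# Assumed JSON-lines bitext with {"es": ..., "ast": ...} sentence pairs.
raw = load_dataset("json", data_files={"train": "spa-ast.train.jsonl"})

def preprocess(batch):
    # Tokenise source and target sides of each sentence pair.
    return tokenizer(batch["es"], text_target=batch["ast"],
                     truncation=True, max_length=256)

train_set = raw["train"].map(preprocess, batched=True,
                             remove_columns=raw["train"].column_names)

args = Seq2SeqTrainingArguments(
    output_dir="nllb-1.3B-spa-ast",
    per_device_train_batch_size=8,
    learning_rate=1e-4,
    num_train_epochs=3,
)
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_set,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```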
2023
Exploiting large pre-trained models for low-resource neural machine translation
Aarón Galiano-Jiménez | Felipe Sánchez-Martínez | Víctor M. Sánchez-Cartagena | Juan Antonio Pérez-Ortiz
Proceedings of the 24th Annual Conference of the European Association for Machine Translation
Pre-trained models have drastically changed the field of natural language processing by providing a way to leverage large-scale language representations for a variety of tasks. Some pre-trained models offer general-purpose representations, while others are specialized in particular tasks, such as neural machine translation (NMT). Multilingual NMT-targeted systems are often fine-tuned for specific language pairs, but there is a lack of evidence-based best-practice recommendations to guide this process. Moreover, the trend towards ever larger pre-trained models has made it challenging to deploy them in the computationally restrictive environments typically found in developing regions, where low-resource languages are usually spoken. We propose a pipeline to fine-tune the mBART50 pre-trained model for 8 diverse low-resource language pairs and then distil the resulting system to obtain lightweight and more sustainable models. Our pipeline exploits back-translation, synthetic corpus filtering, and knowledge distillation to deliver efficient yet powerful bilingual translation models that are 13 times smaller than the original pre-trained ones while achieving comparable BLEU scores.
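As a hedged illustration of the knowledge-distillation step mentioned in the abstract, and not the project's released code, the sketch below uses an mBART50 teacher to translate monolingual source sentences, producing synthetic bitext on which a much smaller student model could then be trained (sequence-level distillation). The teacher checkpoint id and the mBART50 language codes are real; the Spanish-Galician pair and the input sentence are assumptions for the example.

```python
# Illustrative sequence-level knowledge distillation, not the authors' code.
import torch
from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

teacher_id = "facebook/mbart-large-50-many-to-many-mmt"
tok = MBart50TokenizerFast.from_pretrained(teacher_id, src_lang="es_XX")
teacher = MBartForConditionalGeneration.from_pretrained(teacher_id).eval()

def distil_batch(src_sentences, tgt_lang="gl_ES"):
    """Translate monolingual source text with the teacher; the resulting
    (source, translation) pairs become training data for a small student."""
    inputs = tok(src_sentences, return_tensors="pt",
                 padding=True, truncation=True)
    with torch.no_grad():
        out = teacher.generate(
            **inputs,
            # Force the decoder to start with the target-language code token.
            forced_bos_token_id=tok.convert_tokens_to_ids(tgt_lang),
            num_beams=5, max_length=128)
    return list(zip(src_sentences,
                    tok.batch_decode(out, skip_special_tokens=True)))

# Toy usage with a single assumed monolingual sentence:
pairs = distil_batch(["El proyecto comenzó en septiembre de 2022."])
```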
MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages
Marta Bañón | Mălina Chichirău | Miquel Esplà-Gomis | Mikel Forcada | Aarón Galiano-Jiménez | Taja Kuzman | Nikola Ljubešić | Rik van Noord | Leopoldo Pla Sempere | Gema Ramírez-Sánchez | Peter Rupnik | Vit Suchomel | Antonio Toral | Jaume Zaragoza-Bernabeu
Proceedings of the 24th Annual Conference of the European Association for Machine Translation
We present the most relevant results achieved in the second year of the MaCoCu project (“Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages”). To date, parallel and monolingual corpora have been produced for seven low-resourced European languages by crawling large amounts of textual data from selected top-level domains of the Internet; both human and automatic evaluation show their usefulness. In addition, several large language models pretrained on MaCoCu data have been published, as well as the code used to collect and curate the data.