Jusèp Loís Sans Socasau


2024

pdf bib
Expanding the FLORES+ Multilingual Benchmark with Translations for Aragonese, Aranese, Asturian, and Valencian
Juan Antonio Perez-Ortiz | Felipe Sánchez-Martínez | Víctor M. Sánchez-Cartagena | Miquel Esplà-Gomis | Aaron Galiano Jimenez | Antoni Oliver | Claudi Aventín-Boya | Alejandro Pardos | Cristina Valdés | Jusèp Loís Sans Socasau | Juan Pablo Martínez
Proceedings of the Ninth Conference on Machine Translation

In this paper, we describe the process of creating the FLORES+ datasets for several Romance languages spoken in Spain, namely Aragonese, Aranese, Asturian, and Valencian. The Aragonese and Aranese datasets are entirely new additions to the FLORES+ multilingual benchmark. An initial version of the Asturian dataset was already available in FLORES+, and our work focused on a thorough revision. Similarly, FLORES+ included a Catalan dataset, which we adapted to the Valencian variety spoken in the Valencian Community. The development of the Aragonese, Aranese, and revised Asturian FLORES+ datasets was undertaken as part of a WMT24 shared task on translation into low-resource languages of Spain.