Quality versus Quantity: Building Catalan-English MT Resources

Ona de Gibert Bonet, Ksenia Kharitonova, Blanca Calvo Figueras, Jordi Armengol-Estapé, Maite Melero


Abstract
In this work, we make the case for quality over quantity when training an MT system for a medium-to-low-resource language pair, namely Catalan-English. We compile our training corpus from existing resources of varying quality and a new high-quality corpus. We also provide new translation evaluation datasets in three different domains. In the process of building Catalan-English parallel resources, we evaluate the impact of drastically filtering alignments on the resulting MT engines. Our results show that even when resources are limited, as in this case, it is worth filtering for quality. We further explore the cross-lingual transfer learning capabilities of the proposed model for parallel corpus filtering by applying it to other languages. All resources generated in this work are released under an open license to encourage the development of language technology in Catalan.
Anthology ID:
2022.sigul-1.8
Volume:
Proceedings of the 1st Annual Meeting of the ELRA/ISCA Special Interest Group on Under-Resourced Languages
Month:
June
Year:
2022
Address:
Marseille, France
Editors:
Maite Melero, Sakriani Sakti, Claudia Soria
Venue:
SIGUL
SIG:
SIGUL
Publisher:
European Language Resources Association
Pages:
59–69
URL:
https://aclanthology.org/2022.sigul-1.8
Cite (ACL):
Ona de Gibert Bonet, Ksenia Kharitonova, Blanca Calvo Figueras, Jordi Armengol-Estapé, and Maite Melero. 2022. Quality versus Quantity: Building Catalan-English MT Resources. In Proceedings of the 1st Annual Meeting of the ELRA/ISCA Special Interest Group on Under-Resourced Languages, pages 59–69, Marseille, France. European Language Resources Association.
Cite (Informal):
Quality versus Quantity: Building Catalan-English MT Resources (de Gibert Bonet et al., SIGUL 2022)
PDF:
https://aclanthology.org/2022.sigul-1.8.pdf
Data
FLoRes-101, JW300, OpenSubtitles, WikiMatrix