2024
Lightweight neural translation technologies for low-resource languages
Felipe Sánchez-Martínez | Juan Antonio Pérez-Ortiz | Víctor Sánchez-Cartagena | Andrés Lou | Cristian García-Romero | Aarón Galiano-Jiménez | Miquel Esplà-Gomis
Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 2)
The LiLowLa (“Lightweight neural translation technologies for low-resource languages”) project aims to enhance machine translation (MT) and translation memory (TM) technologies, particularly for low-resource language pairs, where adequate linguistic resources are scarce. The project started in September 2022 and will run until August 2025.
2022
Building Domain-specific Corpora from the Web: the Case of European Digital Service Infrastructures
Rik van Noord | Cristian García-Romero | Miquel Esplà-Gomis | Leopoldo Pla Sempere | Antonio Toral
Proceedings of the BUCC Workshop within LREC 2022
An important goal of the MaCoCu project is to improve EU-specific NLP systems that concern the EU's Digital Service Infrastructures (DSIs). In this paper we aim to boost the creation of such domain-specific NLP systems. To do so, we explore the feasibility of building an automatic classifier that identifies which segments in a generic (potentially parallel) corpus are relevant for a particular DSI. We create an evaluation data set by crawling DSI-specific web domains and then compare different strategies for building our DSI classifier for text in three languages: English, Spanish and Dutch. We use pre-trained (multilingual) language models to perform the classification, with zero-shot classification for Spanish and Dutch. The results are promising: we are able to classify DSIs with between 70% and 80% accuracy, even without in-language training data. A manual annotation of the data revealed that we can also find DSI-specific data in crawled texts from general web domains with reasonable accuracy. We publicly release all data, predictions and code to allow future investigation into whether exploiting this DSI-specific data actually leads to improved performance on particular applications, such as machine translation.
MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages
Marta Bañón | Miquel Esplà-Gomis | Mikel L. Forcada | Cristian García-Romero | Taja Kuzman | Nikola Ljubešić | Rik van Noord | Leopoldo Pla Sempere | Gema Ramírez-Sánchez | Peter Rupnik | Vít Suchomel | Antonio Toral | Tobias van der Werff | Jaume Zaragoza
Proceedings of the 23rd Annual Conference of the European Association for Machine Translation
We introduce the project “MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages”, funded by the Connecting Europe Facility. The project aims to build monolingual and parallel corpora for under-resourced European languages. The approach consists of crawling large amounts of textual data from carefully selected Internet top-level domains and then applying a curation and enrichment pipeline. In addition to corpora, the project will release successive versions of the free/open-source web crawling and curation software used.