Cristian García-Romero


2022

pdf bib
MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages
Marta Bañón | Miquel Esplà-Gomis | Mikel L. Forcada | Cristian García-Romero | Taja Kuzman | Nikola Ljubešić | Rik van Noord | Leopoldo Pla Sempere | Gema Ramírez-Sánchez | Peter Rupnik | Vít Suchomel | Antonio Toral | Tobias van der Werff | Jaume Zaragoza
Proceedings of the 23rd Annual Conference of the European Association for Machine Translation

We introduce the project “MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages”, funded by the Connecting Europe Facility, which is aimed at building monolingual and parallel corpora for under-resourced European languages. The approach followed consists of crawling large amounts of textual data from carefully selected top-level domains of the Internet, and then applying a curation and enrichment pipeline. In addition to corpora, the project will release successive versions of the free/open-source web crawling and curation software used.

pdf bib
Building Domain-specific Corpora from the Web: the Case of European Digital Service Infrastructures
Rik van Noord | Cristian García-Romero | Miquel Esplà-Gomis | Leopoldo Pla Sempere | Antonio Toral
Proceedings of the BUCC Workshop within LREC 2022

An important goal of the MaCoCu project is to improve EU-specific NLP systems that concern their Digital Service Infrastructures (DSIs). In this paper we aim at boosting the creation of such domain-specific NLP systems. To do so, we explore the feasibility of building an automatic classifier that allows to identify which segments in a generic (potentially parallel) corpus are relevant for a particular DSI. We create an evaluation data set by crawling DSI-specific web domains and then compare different strategies to build our DSI classifier for text in three languages: English, Spanish and Dutch. We use pre-trained (multilingual) language models to perform the classification, with zero-shot classification for Spanish and Dutch. The results are promising, as we are able to classify DSIs with between 70 and 80% accuracy, even without in-language training data. A manual annotation of the data revealed that we can also find DSI-specific data on crawled texts from general web domains with reasonable accuracy. We publicly release all data, predictions and code, as to allow future investigations in whether exploiting this DSI-specific data actually leads to improved performance on particular applications, such as machine translation.