Jaume Zaragoza-Bernabeu


pdf bib
MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages
Marta Bañón | Mălina Chichirău | Miquel Esplà-Gomis | Mikel Forcada | Aarón Galiano-Jiménez | Taja Kuzman | Nikola Ljubešić | Rik van Noord | Leopoldo Pla Sempere | Gema Ramírez-Sánchez | Peter Rupnik | Vit Suchomel | Antonio Toral | Jaume Zaragoza-Bernabeu
Proceedings of the 24th Annual Conference of the European Association for Machine Translation

We present the most relevant results of the project MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages in its second year. To date, parallel and monolingual corpora have been produced for seven low-resourced European languages by crawling large amounts of textual data from selected top-level domains of the Internet; both human and automatic evaluation show its usefulness. In addition, several large language models pretrained on MaCoCu data have been published, as well as the code used to collect and curate the data.


pdf bib
Bicleaner AI: Bicleaner Goes Neural
Jaume Zaragoza-Bernabeu | Gema Ramírez-Sánchez | Marta Bañón | Sergio Ortiz Rojas
Proceedings of the Thirteenth Language Resources and Evaluation Conference

This paper describes the experiments carried out during the development of the latest version of Bicleaner, named Bicleaner AI, a tool that aims at detecting noisy sentences in parallel corpora. The tool, which now implements a new neural classifier, uses state-of-the-art techniques based on pre-trained transformer-based language models fine-tuned on a binary classification task. After that, parallel corpus filtering is performed, discarding the sentences that have lower probability of being mutual translations. Our experiments, based on the training of neural machine translation (NMT) with corpora filtered using Bicleaner AI for two different scenarios, show significant improvements in translation quality compared to the previous version of the tool which implemented a classifier based on Extremely Randomized Trees.

pdf bib
Human evaluation of web-crawled parallel corpora for machine translation
Gema Ramírez-Sánchez | Marta Bañón | Jaume Zaragoza-Bernabeu | Sergio Ortiz Rojas
Proceedings of the 2nd Workshop on Human Evaluation of NLP Systems (HumEval)

Quality assessment has been an ongoing activity of the series of ParaCrawl efforts to crawl massive amounts of parallel data from multilingual websites for 29 languages. The goal of ParaCrawl is to get parallel data that is good for machine translation. To prove so, both, automatic (extrinsic) and human (intrinsic and extrinsic) evaluation tasks have been included as part of the quality assessment activity of the project. We sum up the various methods followed to address these evaluation tasks for the web-crawled corpora produced and their results. We review their advantages and disadvantages for the final goal of the ParaCrawl project and the related ongoing project MaCoCu.


pdf bib
Bicleaner at WMT 2020: Universitat d’Alacant-Prompsit’s submission to the parallel corpus filtering shared task
Miquel Esplà-Gomis | Víctor M. Sánchez-Cartagena | Jaume Zaragoza-Bernabeu | Felipe Sánchez-Martínez
Proceedings of the Fifth Conference on Machine Translation

This paper describes the joint submission of Universitat d’Alacant and Prompsit Language Engineering to the WMT 2020 shared task on parallel corpus filtering. Our submission, based on the free/open-source tool Bicleaner, enhances it with Extremely Randomised Trees and lexical similarity features that account for the frequency of the words in the parallel sentences to determine if two sentences are parallel. To train this classifier we used the clean corpora provided for the task and synthetic noisy parallel sentences. In addition we re-score the output of Bicleaner using character-level language models and n-gram saturation.

pdf bib
Bifixer and Bicleaner: two open-source tools to clean your parallel data
Gema Ramírez-Sánchez | Jaume Zaragoza-Bernabeu | Marta Bañón | Sergio Ortiz Rojas
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation

This paper shows the utility of two open-source tools designed for parallel data cleaning: Bifixer and Bicleaner. Already used to clean highly noisy parallel content from crawled multilingual websites, we evaluate their performance in a different scenario: cleaning publicly available corpora commonly used to train machine translation systems. We choose four English–Portuguese corpora which we plan to use internally to compute paraphrases at a later stage. We clean the four corpora using both tools, which are described in detail, and analyse the effect of some of the cleaning steps on them. We then compare machine translation training times and quality before and after cleaning these corpora, showing a positive impact particularly for the noisiest ones.