Jaume Zaragoza-Bernabeu


Human evaluation of web-crawled parallel corpora for machine translation
Gema Ramírez-Sánchez | Marta Bañón | Jaume Zaragoza-Bernabeu | Sergio Ortiz Rojas
Proceedings of the 2nd Workshop on Human Evaluation of NLP Systems (HumEval)

Quality assessment has been an ongoing activity of the series of ParaCrawl efforts to crawl massive amounts of parallel data from multilingual websites for 29 languages. The goal of ParaCrawl is to obtain parallel data that is good for machine translation. To verify this, both automatic (extrinsic) and human (intrinsic and extrinsic) evaluation tasks have been included in the project's quality assessment activity. We summarise the methods used to address these evaluation tasks for the web-crawled corpora produced, along with their results. We review their advantages and disadvantages with respect to the final goal of the ParaCrawl project and the related ongoing project MaCoCu.


Bicleaner at WMT 2020: Universitat d’Alacant-Prompsit’s submission to the parallel corpus filtering shared task
Miquel Esplà-Gomis | Víctor M. Sánchez-Cartagena | Jaume Zaragoza-Bernabeu | Felipe Sánchez-Martínez
Proceedings of the Fifth Conference on Machine Translation

This paper describes the joint submission of Universitat d’Alacant and Prompsit Language Engineering to the WMT 2020 shared task on parallel corpus filtering. Our submission is based on the free/open-source tool Bicleaner, which we enhance with Extremely Randomised Trees and lexical similarity features that account for the frequency of the words in the parallel sentences to determine whether two sentences are parallel. To train this classifier, we used the clean corpora provided for the task and synthetic noisy parallel sentences. In addition, we re-score the output of Bicleaner using character-level language models and n-gram saturation.

Bifixer and Bicleaner: two open-source tools to clean your parallel data
Gema Ramírez-Sánchez | Jaume Zaragoza-Bernabeu | Marta Bañón | Sergio Ortiz Rojas
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation

This paper shows the utility of two open-source tools designed for parallel data cleaning: Bifixer and Bicleaner. Both tools are already used to clean highly noisy parallel content from crawled multilingual websites; here we evaluate their performance in a different scenario, namely cleaning publicly available corpora commonly used to train machine translation systems. We choose four English–Portuguese corpora that we plan to use internally to compute paraphrases at a later stage. We clean the four corpora using both tools, which are described in detail, and analyse the effect of some of the cleaning steps on them. We then compare machine translation training times and quality before and after cleaning these corpora, showing a positive impact, particularly for the noisiest ones.