Jaume Zaragoza


2023

HPLT: High Performance Language Technologies
Mikko Aulamo | Nikolay Bogoychev | Shaoxiong Ji | Graeme Nail | Gema Ramírez-Sánchez | Jörg Tiedemann | Jelmer van der Linde | Jaume Zaragoza
Proceedings of the 24th Annual Conference of the European Association for Machine Translation

We describe the High Performance Language Technologies project (HPLT), a 3-year EU-funded project started in September 2022. HPLT will build a space combining petabytes of natural language data with large-scale model training. It will derive monolingual and bilingual datasets from the Internet Archive and CommonCrawl and build efficient and solid machine translation (MT) as well as large language models (LLMs). HPLT aims at providing free, sustainable and reusable datasets, models and workflows at scale using high-performance computing (HPC).

2022

MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages
Marta Bañón | Miquel Esplà-Gomis | Mikel L. Forcada | Cristian García-Romero | Taja Kuzman | Nikola Ljubešić | Rik van Noord | Leopoldo Pla Sempere | Gema Ramírez-Sánchez | Peter Rupnik | Vít Suchomel | Antonio Toral | Tobias van der Werff | Jaume Zaragoza
Proceedings of the 23rd Annual Conference of the European Association for Machine Translation

We introduce the project “MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages”, funded by the Connecting Europe Facility, which is aimed at building monolingual and parallel corpora for under-resourced European languages. The approach followed consists of crawling large amounts of textual data from carefully selected top-level domains of the Internet, and then applying a curation and enrichment pipeline. In addition to corpora, the project will release successive versions of the free/open-source web crawling and curation software used.

2020

ParaCrawl: Web-Scale Acquisition of Parallel Corpora
Marta Bañón | Pinzhen Chen | Barry Haddow | Kenneth Heafield | Hieu Hoang | Miquel Esplà-Gomis | Mikel L. Forcada | Amir Kamran | Faheem Kirefu | Philipp Koehn | Sergio Ortiz Rojas | Leopoldo Pla Sempere | Gema Ramírez-Sánchez | Elsa Sarrías | Marek Strelec | Brian Thompson | William Waites | Dion Wiggins | Jaume Zaragoza
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

We report on methods to create the largest publicly available parallel corpora by crawling the web, using open source software. We empirically compare alternative methods and publish benchmark data sets for sentence alignment and sentence pair filtering. We also describe the parallel corpora released and evaluate their quality and their usefulness to create machine translation systems.