Quality Beyond A Glance: Revealing Large Quality Differences Between Web-Crawled Parallel Corpora

Rik van Noord, Miquel Esplà-Gomis, Malina Chichirau, Gema Ramírez-Sánchez, Antonio Toral


Abstract
Parallel corpora play a vital role in advanced multilingual natural language processing tasks, notably in machine translation (MT). The recent emergence of numerous large parallel corpora, often extracted from multilingual documents on the Internet, has expanded the available resources. Nevertheless, the quality of these corpora remains largely unexplored, even though there are large differences in how they are constructed. Moreover, how these potential differences affect the performance of neural MT (NMT) systems has also received limited attention. This study addresses this gap by manually and automatically evaluating four well-known publicly available parallel corpora across eleven language pairs. Our findings are quite concerning: all corpora contain a substantial amount of noisy sentence pairs, with CCMatrix and CCAligned containing well below 50% reasonably clean pairs. MaCoCu and ParaCrawl generally have higher-quality texts, though around a third of the texts still have clear issues. While corpus size impacts NMT models’ performance, our study highlights the critical role of quality: higher-quality corpora consistently yield better-performing NMT models when controlling for size.
Anthology ID:
2025.coling-main.124
Volume:
Proceedings of the 31st International Conference on Computational Linguistics
Month:
January
Year:
2025
Address:
Abu Dhabi, UAE
Editors:
Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert
Venue:
COLING
Publisher:
Association for Computational Linguistics
Pages:
1824–1838
URL:
https://aclanthology.org/2025.coling-main.124/
Cite (ACL):
Rik van Noord, Miquel Esplà-Gomis, Malina Chichirau, Gema Ramírez-Sánchez, and Antonio Toral. 2025. Quality Beyond A Glance: Revealing Large Quality Differences Between Web-Crawled Parallel Corpora. In Proceedings of the 31st International Conference on Computational Linguistics, pages 1824–1838, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal):
Quality Beyond A Glance: Revealing Large Quality Differences Between Web-Crawled Parallel Corpora (van Noord et al., COLING 2025)
PDF:
https://aclanthology.org/2025.coling-main.124.pdf