Mălina Chichirău

Also published as: Malina Chichirau


2025

Quality Beyond A Glance: Revealing Large Quality Differences Between Web-Crawled Parallel Corpora
Rik van Noord | Miquel Esplà-Gomis | Malina Chichirau | Gema Ramírez-Sánchez | Antonio Toral
Proceedings of the 31st International Conference on Computational Linguistics

Parallel corpora play a vital role in advanced multilingual natural language processing tasks, notably in machine translation (MT). The recent emergence of numerous large parallel corpora, often extracted from multilingual documents on the Internet, has expanded the available resources. Nevertheless, the quality of these corpora remains largely unexplored, while there are large differences in how the corpora are constructed. Moreover, how the potential differences affect the performance of neural MT (NMT) systems has also received limited attention. This study addresses this gap by manually and automatically evaluating four well-known publicly available parallel corpora across eleven language pairs. Our findings are quite concerning: all corpora contain a substantial amount of noisy sentence pairs, with CCMatrix and CCAligned having well below 50% reasonably clean pairs. MaCoCu and ParaCrawl generally have higher quality texts, though around a third of the texts still have clear issues. While corpus size impacts NMT models’ performance, our study highlights the critical role of quality: higher-quality corpora consistently yield better-performing NMT models when controlling for size.

2023

Automatic Discrimination of Human and Neural Machine Translation in Multilingual Scenarios
Malina Chichirau | Rik van Noord | Antonio Toral
Proceedings of the 24th Annual Conference of the European Association for Machine Translation

We tackle the task of automatically discriminating between human and machine translations. As opposed to most previous work, we perform experiments in a multilingual setting, considering multiple languages and multilingual pretrained language models. We show that a classifier trained on parallel data with a single source language (in our case German–English) can still perform well on English translations that come from different source languages, even when the machine translations were produced by other systems than the one it was trained on. Additionally, we demonstrate that incorporating the source text in the input of a multilingual classifier improves (i) its accuracy and (ii) its robustness on cross-system evaluation, compared to a monolingual classifier. Furthermore, we find that using training data from multiple source languages (German, Russian and Chinese) tends to improve the accuracy of both monolingual and multilingual classifiers. Finally, we show that bilingual classifiers and classifiers trained on multiple source languages benefit from being trained on longer text sequences, rather than on sentences.

MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages
Marta Bañón | Mălina Chichirău | Miquel Esplà-Gomis | Mikel Forcada | Aarón Galiano-Jiménez | Taja Kuzman | Nikola Ljubešić | Rik van Noord | Leopoldo Pla Sempere | Gema Ramírez-Sánchez | Peter Rupnik | Vit Suchomel | Antonio Toral | Jaume Zaragoza-Bernabeu
Proceedings of the 24th Annual Conference of the European Association for Machine Translation

We present the most relevant results of the project MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages in its second year. To date, parallel and monolingual corpora have been produced for seven low-resourced European languages by crawling large amounts of textual data from selected top-level domains of the Internet; both human and automatic evaluation show their usefulness. In addition, several large language models pretrained on MaCoCu data have been published, as well as the code used to collect and curate the data.