How Good is Your Wikipedia? Auditing Data Quality for Low-resource and Multilingual NLP

Kushal Tatariya; Artur Kulmizev; Wessel Poelman; Esther Ploeger; Marcel Bollmann; Johannes Bjerva; Jiaming Luo; Heather Lent; Miryam de Lhoneux

Abstract

Wikipedia’s perceived high quality and broad language coverage have established it as a fundamental resource in NLP. However, in recent years, such assumptions of high quality have become the subject of scrutiny in low-resource and multilingual contexts. In this study, we subject the entirety of non-English Wikipedia to a data filtering procedure typically reserved for noisy web-text — a process which removes a large percentage of the collection’s data. In analysing the removed data, we reveal numerous systematic quality issues, such as script and language contamination, repeated template and placeholder articles, and a high concentration of bot-generated content. We consolidate these findings into a 4-level quality ranking of Wikipedia, which shows strong correspondence with alternative quality measures and heuristics. Lastly, we evaluate the downstream impact of quality filtering in three practical language modelling scenarios, showing that models trained on filtered data largely match or outperform those trained on raw Wikipedia, with the largest gains observed for lower-quality language editions. Ultimately, our experiments serve as a first step in establishing quality-aware best practices for Wikipedia utilization in NLP, laying groundwork that can inform future dataset creation and curation efforts.

Anthology ID: 2026.acl-long.1373
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 29754–29774
URL: https://aclanthology.org/2026.acl-long.1373/
Cite (ACL): Kushal Tatariya, Artur Kulmizev, Wessel Poelman, Esther Ploeger, Marcel Bollmann, Johannes Bjerva, Jiaming Luo, Heather Lent, and Miryam de Lhoneux. 2026. How Good is Your Wikipedia? Auditing Data Quality for Low-resource and Multilingual NLP. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 29754–29774, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): How Good is Your Wikipedia? Auditing Data Quality for Low-resource and Multilingual NLP (Tatariya et al., ACL 2026)
PDF: https://aclanthology.org/2026.acl-long.1373.pdf
Checklist: 2026.acl-long.1373.checklist.pdf