WikiLingDiv: a dataset for quantifying digital linguistic diversity using Wikipedia page views

Hannes Essfors, Andreas Baumann


Abstract
As digital and non-digital spaces become increasingly conflated and NLP technologies are integrated into a growing number of aspects of daily life, linguistic diversity cannot be fully understood without considering how language is used online. While existing models of linguistic diversity have typically relied on speaker numbers or language production, diversity in language consumption remains comparatively understudied. To facilitate such research, we introduce WikiLingDiv, an openly accessible dataset for quantifying linguistic diversity in online knowledge retrieval using Wikipedia page views. The dataset is based on yearly page views of 340 language editions of Wikipedia, aggregated across 239 countries and territories over 10 years (2015–2024). Using the dataset, we illustrate spatial and temporal patterns of digital linguistic diversity. These patterns suggest that diversity has increased in some countries and regions and decreased in others, and they highlight country-specific dynamics in language usage. We release the dataset as an openly available and easily integrable resource for researchers in computational linguistics, digital humanities, and the broader social sciences, enabling further work on linguistic variation, digital inequality, and the interaction between language use and digital technology.
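As a rough illustration of what quantifying diversity from page views can look like, the sketch below computes a Shannon entropy over each country's share of page views per language edition. The abstract does not state which diversity measure WikiLingDiv uses, so the choice of entropy is an assumption, and the record layout, the country codes, and the numbers are hypothetical rather than taken from the dataset.

```python
import math
from collections import defaultdict

# Hypothetical toy records: (country, year, wiki_language, page_views).
# These values are invented for illustration; they are NOT from WikiLingDiv.
records = [
    ("AA", 2015, "en", 500),
    ("AA", 2015, "fr", 500),
    ("BB", 2015, "en", 1000),
]


def shannon_diversity(views_by_language):
    """Return the Shannon entropy (in nats) of the page-view shares."""
    total = sum(views_by_language.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for views in views_by_language.values():
        if views > 0:  # languages with zero views contribute nothing
            share = views / total
            entropy -= share * math.log(share)
    return entropy


def diversity_by_country_year(rows):
    """Group page views by (country, year) and compute entropy per group."""
    grouped = defaultdict(dict)
    for country, year, language, views in rows:
        key = (country, year)
        grouped[key][language] = grouped[key].get(language, 0) + views
    return {key: shannon_diversity(langs) for key, langs in grouped.items()}


result = diversity_by_country_year(records)
```

Under this assumed measure, a country whose views are split evenly between two language editions scores ln 2, while a country that reads a single edition scores 0. Other indices (for example, the exponential of the entropy, read as an effective number of languages) would slot into the same grouping step.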
Anthology ID:
2026.latechclfl-1.19
Volume:
Proceedings of the 10th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature 2026
Month:
March
Year:
2026
Address:
Rabat, Morocco
Editors:
Diego Alves, Yuri Bizzoni, Stefania Degaetano-Ortlieb, Anna Kazantseva, Janis Pagel, Stan Szpakowicz
Venues:
LaTeCH-CLfL | WS
SIG:
SIGHUM
Publisher:
Association for Computational Linguistics
Pages:
202–211
URL:
https://aclanthology.org/2026.latechclfl-1.19/
Cite (ACL):
Hannes Essfors and Andreas Baumann. 2026. WikiLingDiv: a dataset for quantifying digital linguistic diversity using Wikipedia page views. In Proceedings of the 10th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature 2026, pages 202–211, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal):
WikiLingDiv: a dataset for quantifying digital linguistic diversity using Wikipedia page views (Essfors & Baumann, LaTeCH-CLfL 2026)
PDF:
https://aclanthology.org/2026.latechclfl-1.19.pdf
Supplementary material:
2026.latechclfl-1.19.SupplementaryMaterial.zip