Wikimedia data for AI: a review of Wikimedia datasets for NLP tasks and AI-assisted editing

Isaac Johnson, Lucie-Aimée Kaffee, Miriam Redi


Abstract
Wikimedia content is used extensively by the AI community and within the language modeling community in particular. In this paper, we provide a review of the different ways in which Wikimedia data is curated for use in NLP tasks across pre-training, post-training, and model evaluations. We point to opportunities for greater use of Wikimedia content but also identify ways in which the language modeling community could better center the needs of Wikimedia editors. In particular, we call for incorporating additional sources of Wikimedia data, a greater focus on benchmarks for LLMs that encode Wikimedia principles, and greater multilingualism in Wikimedia-derived datasets.
Anthology ID:
2024.wikinlp-1.14
Volume:
Proceedings of the First Workshop on Advancing Natural Language Processing for Wikipedia
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Lucie-Aimée Kaffee, Angela Fan, Tajuddeen Gwadabe, Isaac Johnson, Fabio Petroni, Daniel van Strien
Venue:
WikiNLP
Publisher:
Association for Computational Linguistics
Pages:
91–101
URL:
https://aclanthology.org/2024.wikinlp-1.14
Cite (ACL):
Isaac Johnson, Lucie-Aimée Kaffee, and Miriam Redi. 2024. Wikimedia data for AI: a review of Wikimedia datasets for NLP tasks and AI-assisted editing. In Proceedings of the First Workshop on Advancing Natural Language Processing for Wikipedia, pages 91–101, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
Wikimedia data for AI: a review of Wikimedia datasets for NLP tasks and AI-assisted editing (Johnson et al., WikiNLP 2024)
PDF:
https://aclanthology.org/2024.wikinlp-1.14.pdf