@inproceedings{borkakoty-espinosa-anke-2024-hoaxpedia,
title = "{HOAXPEDIA}: A Unified {W}ikipedia Hoax Articles Dataset",
author = "Borkakoty, Hsuvas and
Espinosa-Anke, Luis",
editor = "Lucie-Aim{\'e}e, Lucie and
Fan, Angela and
Gwadabe, Tajuddeen and
Johnson, Isaac and
Petroni, Fabio and
van Strien, Daniel",
booktitle = "Proceedings of the First Workshop on Advancing Natural Language Processing for Wikipedia",
month = nov,
year = "2024",
address = "Miami, Florida, USA",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.wikinlp-1.11",
pages = "53--66",
abstract = "Hoaxes are a recognised form of disinformation created deliberately, with potential serious implications in the credibility of reference knowledge resources such as Wikipedia. What makes detecting Wikipedia hoaxes hard is that they often are written according to the official style guidelines. In this work, we first provide a systematic analysis of similarities and discrepancies between legitimate and hoax Wikipedia articles, and introduce HOAXPEDIA, a collection of 311 hoax articles (from existing literature and official Wikipedia lists), together with semantically similar legitimate articles, which together form a binary text classification dataset aimed at fostering research in automated hoax detection. In this paper, we report results after analyzing several language models, hoax-to-legit ratios, and the amount of text classifiers are exposed to (full article vs the article{'}s definition alone). Our results suggest that detecting deceitful content in Wikipedia based on content alone is hard but feasible, and complement our analysis with a study on the differences in distributions in edit histories, and find that looking at this feature yields better classification results than context.",
}
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="borkakoty-espinosa-anke-2024-hoaxpedia">
<titleInfo>
<title>HOAXPEDIA: A Unified Wikipedia Hoax Articles Dataset</title>
</titleInfo>
<name type="personal">
<namePart type="given">Hsuvas</namePart>
<namePart type="family">Borkakoty</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Luis</namePart>
<namePart type="family">Espinosa-Anke</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2024-11</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Proceedings of the First Workshop on Advancing Natural Language Processing for Wikipedia</title>
</titleInfo>
<name type="personal">
<namePart type="given">Lucie</namePart>
<namePart type="family">Lucie-Aimée</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Angela</namePart>
<namePart type="family">Fan</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Tajuddeen</namePart>
<namePart type="family">Gwadabe</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Isaac</namePart>
<namePart type="family">Johnson</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Fabio</namePart>
<namePart type="family">Petroni</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Daniel</namePart>
<namePart type="family">van Strien</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>Association for Computational Linguistics</publisher>
<place>
<placeTerm type="text">Miami, Florida, USA</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
</relatedItem>
<abstract>Hoaxes are a recognised form of disinformation created deliberately, with potential serious implications in the credibility of reference knowledge resources such as Wikipedia. What makes detecting Wikipedia hoaxes hard is that they often are written according to the official style guidelines. In this work, we first provide a systematic analysis of similarities and discrepancies between legitimate and hoax Wikipedia articles, and introduce HOAXPEDIA, a collection of 311 hoax articles (from existing literature and official Wikipedia lists), together with semantically similar legitimate articles, which together form a binary text classification dataset aimed at fostering research in automated hoax detection. In this paper, we report results after analyzing several language models, hoax-to-legit ratios, and the amount of text classifiers are exposed to (full article vs the article’s definition alone). Our results suggest that detecting deceitful content in Wikipedia based on content alone is hard but feasible, and complement our analysis with a study on the differences in distributions in edit histories, and find that looking at this feature yields better classification results than context.</abstract>
<identifier type="citekey">borkakoty-espinosa-anke-2024-hoaxpedia</identifier>
<location>
<url>https://aclanthology.org/2024.wikinlp-1.11</url>
</location>
<part>
<date>2024-11</date>
<extent unit="page">
<start>53</start>
<end>66</end>
</extent>
</part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T HOAXPEDIA: A Unified Wikipedia Hoax Articles Dataset
%A Borkakoty, Hsuvas
%A Espinosa-Anke, Luis
%Y Lucie-Aimée, Lucie
%Y Fan, Angela
%Y Gwadabe, Tajuddeen
%Y Johnson, Isaac
%Y Petroni, Fabio
%Y van Strien, Daniel
%S Proceedings of the First Workshop on Advancing Natural Language Processing for Wikipedia
%D 2024
%8 November
%I Association for Computational Linguistics
%C Miami, Florida, USA
%F borkakoty-espinosa-anke-2024-hoaxpedia
%X Hoaxes are a recognised form of disinformation created deliberately, with potential serious implications in the credibility of reference knowledge resources such as Wikipedia. What makes detecting Wikipedia hoaxes hard is that they often are written according to the official style guidelines. In this work, we first provide a systematic analysis of similarities and discrepancies between legitimate and hoax Wikipedia articles, and introduce HOAXPEDIA, a collection of 311 hoax articles (from existing literature and official Wikipedia lists), together with semantically similar legitimate articles, which together form a binary text classification dataset aimed at fostering research in automated hoax detection. In this paper, we report results after analyzing several language models, hoax-to-legit ratios, and the amount of text classifiers are exposed to (full article vs the article’s definition alone). Our results suggest that detecting deceitful content in Wikipedia based on content alone is hard but feasible, and complement our analysis with a study on the differences in distributions in edit histories, and find that looking at this feature yields better classification results than context.
%U https://aclanthology.org/2024.wikinlp-1.11
%P 53-66
Markdown (Informal)
[HOAXPEDIA: A Unified Wikipedia Hoax Articles Dataset](https://aclanthology.org/2024.wikinlp-1.11) (Borkakoty & Espinosa-Anke, WikiNLP 2024)
ACL
- Hsuvas Borkakoty and Luis Espinosa-Anke. 2024. HOAXPEDIA: A Unified Wikipedia Hoax Articles Dataset. In Proceedings of the First Workshop on Advancing Natural Language Processing for Wikipedia, pages 53–66, Miami, Florida, USA. Association for Computational Linguistics.