WAC: A Corpus of Wikipedia Conversations for Online Abuse Detection

Noé Cécillon; Vincent Labatut; Richard Dufour; Georges Linarès

Abstract

With the spread of online social networks, it is increasingly difficult to monitor all user-generated content. Automating the moderation of inappropriate content exchanged on the Internet has thus become a priority. Methods have been proposed for this purpose, but it can be challenging to find a suitable dataset to train and develop them. This issue is especially acute for approaches based on information derived from the structure and dynamics of the conversation. In this work, we propose an original framework, based on the Wikipedia Comment corpus, with comment-level abuse annotations of different types. The major contribution is the reconstruction of conversations, in contrast to existing corpora, which focus only on isolated messages (i.e. taken out of their conversational context). This large corpus of more than 380k annotated messages opens perspectives for online abuse detection, especially for context-based approaches. In addition to this corpus, we also propose a complete benchmarking platform to stimulate and fairly compare scientific work on the problem of content abuse detection, in an effort to avoid the recurring problem of result replication. Finally, we apply two classification methods to our dataset to demonstrate its potential.

Anthology ID: 2020.lrec-1.173
Volume: Proceedings of the Twelfth Language Resources and Evaluation Conference
Month: May
Year: 2020
Address: Marseille, France
Editors: Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue: LREC
Publisher: European Language Resources Association
Pages: 1382–1390
Language: English
URL: https://aclanthology.org/2020.lrec-1.173/
Cite (ACL): Noé Cécillon, Vincent Labatut, Richard Dufour, and Georges Linarès. 2020. WAC: A Corpus of Wikipedia Conversations for Online Abuse Detection. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 1382–1390, Marseille, France. European Language Resources Association.
Cite (Informal): WAC: A Corpus of Wikipedia Conversations for Online Abuse Detection (Cécillon et al., LREC 2020)
PDF: https://aclanthology.org/2020.lrec-1.173.pdf
