DeTox: A Comprehensive Dataset for German Offensive Language and Conversation Analysis

Christoph Demus, Jonas Pitz, Mina Schütz, Nadine Probol, Melanie Siegel, Dirk Labudde


Abstract
In this work, we present a new publicly available offensive language dataset of 10,278 German social media comments collected in the first half of 2021 and annotated by a total of six annotators. With twelve different annotation categories, it is far more comprehensive than other datasets and goes beyond hate speech detection alone. In particular, the labels also cover toxicity, criminal relevance, and types of discrimination. Furthermore, about half of the comments come from coherent parts of conversations, which makes it possible to consider the comments’ contexts and to analyze conversations in order to study the contagion of offensive language.
Anthology ID:
2022.woah-1.14
Volume:
Proceedings of the Sixth Workshop on Online Abuse and Harms (WOAH)
Month:
July
Year:
2022
Address:
Seattle, Washington (Hybrid)
Editors:
Kanika Narang, Aida Mostafazadeh Davani, Lambert Mathias, Bertie Vidgen, Zeerak Talat
Venue:
WOAH
Publisher:
Association for Computational Linguistics
Pages:
143–153
URL:
https://aclanthology.org/2022.woah-1.14
DOI:
10.18653/v1/2022.woah-1.14
Cite (ACL):
Christoph Demus, Jonas Pitz, Mina Schütz, Nadine Probol, Melanie Siegel, and Dirk Labudde. 2022. DeTox: A Comprehensive Dataset for German Offensive Language and Conversation Analysis. In Proceedings of the Sixth Workshop on Online Abuse and Harms (WOAH), pages 143–153, Seattle, Washington (Hybrid). Association for Computational Linguistics.
Cite (Informal):
DeTox: A Comprehensive Dataset for German Offensive Language and Conversation Analysis (Demus et al., WOAH 2022)
PDF:
https://aclanthology.org/2022.woah-1.14.pdf
Video:
https://aclanthology.org/2022.woah-1.14.mp4
Code:
hdasprachtechnologie/detox