DALC: the Dutch Abusive Language Corpus

Tommaso Caselli, Arjan Schelhaas, Marieke Weultjes, Folkert Leistra, Hylke van der Veen, Gerben Timmerman, Malvina Nissim


Abstract
As socially unacceptable language becomes pervasive on social media platforms, the need for automatic content moderation becomes more pressing. This contribution introduces the Dutch Abusive Language Corpus (DALC v1.0), a new dataset with tweets manually annotated for abusive language. The resource addresses a gap in language resources for Dutch and adopts a multi-layer annotation scheme modeling the explicitness and the target of the abusive messages. Baseline experiments on all annotation layers have been conducted, achieving a macro F1 score of 0.748 for binary classification of the explicitness layer and 0.489 for target classification.
Anthology ID:
2021.woah-1.6
Volume:
Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021)
Month:
August
Year:
2021
Address:
Online
Venue:
WOAH
Publisher:
Association for Computational Linguistics
Pages:
54–66
URL:
https://aclanthology.org/2021.woah-1.6
DOI:
10.18653/v1/2021.woah-1.6
Cite (ACL):
Tommaso Caselli, Arjan Schelhaas, Marieke Weultjes, Folkert Leistra, Hylke van der Veen, Gerben Timmerman, and Malvina Nissim. 2021. DALC: the Dutch Abusive Language Corpus. In Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021), pages 54–66, Online. Association for Computational Linguistics.
Cite (Informal):
DALC: the Dutch Abusive Language Corpus (Caselli et al., WOAH 2021)
PDF:
https://aclanthology.org/2021.woah-1.6.pdf
Video:
https://aclanthology.org/2021.woah-1.6.mp4
Code:
tommasoc80/dalc