Ward Ruitenbeek
2022
“Zo Grof!”: A Comprehensive Corpus for Offensive and Abusive Language in Dutch
Ward Ruitenbeek | Victor Zwart | Robin Van Der Noord | Zhenja Gnezdilov | Tommaso Caselli
Proceedings of the Sixth Workshop on Online Abuse and Harms (WOAH)
This paper presents a comprehensive corpus for the study of socially unacceptable language in Dutch. The corpus extends and revises an existing resource with more data and introduces a new annotation dimension for offensive language, making it a unique resource in the Dutch language landscape. Each language phenomenon in the corpus (abusive and offensive language) has been annotated with a multi-layer annotation scheme that models the explicitness and the target(s) of the message. We have conducted a new set of experiments with different classification algorithms on all annotation dimensions. Monolingual pre-trained language models prove to be the best systems, obtaining a macro-average F1 of 0.828 for binary classification of offensive language and 0.579 for the targets of offensive messages. Furthermore, the best system obtains a macro-average F1 of 0.667 for distinguishing between abusive and offensive messages.
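For readers unfamiliar with the reported metric, the following is a minimal sketch of how a macro-average F1 is computed: per-class F1 scores are averaged with equal weight, so a rare class counts as much as a frequent one. The label names `OFF` and `NOT` and the toy predictions are hypothetical illustrations, not data or labels taken from the corpus, and this is not the authors' evaluation code.

```python
from collections import Counter


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over all labels seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[g] += 1  # gold class g was missed
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy labels (assumption): OFF = offensive, NOT = not offensive.
gold = ["OFF", "OFF", "NOT", "NOT", "NOT", "NOT"]
pred = ["OFF", "NOT", "NOT", "NOT", "NOT", "OFF"]
score = macro_f1(gold, pred)
print(f"macro-F1 = {score:.3f}")
```

Because each class contributes equally, macro-averaging penalizes a system that performs well only on the majority class, which is a common reason to prefer it on imbalanced abusive-language data.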