Lower Bias, Higher Density Abusive Language Datasets: A Recipe

Juliet van Rosendaal, Tommaso Caselli, Malvina Nissim


Abstract
Datasets to train models for abusive language detection are at the same time necessary and still scarce. One of the reasons for their limited availability is the cost of their creation. Not only is manual annotation expensive, but the phenomenon is also sparse, so human annotators have to go through a large number of irrelevant examples in order to obtain some significant data. Strategies used until now to increase the density of abusive language and obtain more meaningful data overall include filtering data on the basis of pre-selected keywords and drawing on hate-rich sources of data. We suggest a recipe that can provide meaningful data with a possibly higher density of abusive language while also reducing the top-down biases imposed by corpus creators in the selection of the data to annotate. More specifically, we exploit the controversy channel on Reddit to obtain keywords that are used to filter a Twitter dataset. While the method needs further validation and refinement, our preliminary experiments show a higher density of abusive tweets in the filtered vs. unfiltered dataset, and a more meaningful topic distribution after filtering.
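The filtering and density comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the whitespace-style tokenisation, the toy tweets, their labels, and the keyword list are all assumptions made for illustration, and the keywords here are hard-coded rather than derived from Reddit.

```python
import re

def tokenize(text):
    """Lowercase a text and return its set of word tokens (assumed tokenisation)."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def filter_by_keywords(examples, keywords):
    """Keep (text, label) pairs whose text contains at least one keyword."""
    keyset = {k.lower() for k in keywords}
    return [(text, label) for text, label in examples if tokenize(text) & keyset]

def abusive_density(examples):
    """Fraction of examples labelled abusive (label 1); 0.0 for an empty list."""
    if not examples:
        return 0.0
    return sum(label for _, label in examples) / len(examples)

# Hypothetical toy data: (tweet text, 1 if annotated as abusive else 0).
tweets = [
    ("great weather today", 0),
    ("those politicians are all idiots", 1),
    ("politicians debate the new law tonight", 0),
    ("coffee time", 0),
]

# Hypothetical keyword list standing in for keywords obtained from Reddit.
keywords = ["politicians"]

filtered = filter_by_keywords(tweets, keywords)
density_unfiltered = abusive_density(tweets)
density_filtered = abusive_density(filtered)
```

On real data, the two density values would be computed over annotated samples of the unfiltered and filtered collections; the toy numbers here carry no empirical meaning.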
Anthology ID:
2020.restup-1.4
Volume:
Proceedings of the Workshop on Resources and Techniques for User and Author Profiling in Abusive Language
Month:
May
Year:
2020
Address:
Marseille, France
Editors:
Johanna Monti, Valerio Basile, Maria Pia Di Buono, Raffaele Manna, Antonio Pascucci, Sara Tonelli
Venue:
ResTUP
Publisher:
European Language Resources Association (ELRA)
Pages:
14–19
Language:
English
URL:
https://aclanthology.org/2020.restup-1.4
Cite (ACL):
Juliet van Rosendaal, Tommaso Caselli, and Malvina Nissim. 2020. Lower Bias, Higher Density Abusive Language Datasets: A Recipe. In Proceedings of the Workshop on Resources and Techniques for User and Author Profiling in Abusive Language, pages 14–19, Marseille, France. European Language Resources Association (ELRA).
Cite (Informal):
Lower Bias, Higher Density Abusive Language Datasets: A Recipe (van Rosendaal et al., ResTUP 2020)
PDF:
https://aclanthology.org/2020.restup-1.4.pdf