Detection of Abusive Language: the Problem of Biased Datasets

Michael Wiegand, Josef Ruppenhofer, Thomas Kleinbauer


Abstract
We discuss the impact of data bias on abusive language detection. We show that classification scores on popular datasets reported in previous work are much lower under realistic settings in which this bias is reduced. Such biases are most notably observed on datasets that are created by focused sampling instead of random sampling. Datasets with a higher proportion of implicit abuse are more affected than datasets with a lower proportion.
Anthology ID:
N19-1060
Volume:
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)
Month:
June
Year:
2019
Address:
Minneapolis, Minnesota
Editors:
Jill Burstein, Christy Doran, Thamar Solorio
Venue:
NAACL
Publisher:
Association for Computational Linguistics
Pages:
602–608
URL:
https://aclanthology.org/N19-1060
DOI:
10.18653/v1/N19-1060
Cite (ACL):
Michael Wiegand, Josef Ruppenhofer, and Thomas Kleinbauer. 2019. Detection of Abusive Language: the Problem of Biased Datasets. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 602–608, Minneapolis, Minnesota. Association for Computational Linguistics.
Cite (Informal):
Detection of Abusive Language: the Problem of Biased Datasets (Wiegand et al., NAACL 2019)
PDF:
https://aclanthology.org/N19-1060.pdf
Video:
https://aclanthology.org/N19-1060.mp4