Towards a Comprehensive Taxonomy and Large-Scale Annotated Corpus for Online Slur Usage

Jana Kurrek; Haji Mohammad Saleem; Derek Ruths

doi:10.18653/v1/2020.alw-1.17

Abstract

Abusive language classifiers have been shown to exhibit bias against women and racial minorities. Since these models are trained on data that is collected using keywords, they tend to exhibit a high sensitivity towards pejoratives. As a result, comments written by victims of abuse are frequently labelled as hateful, even if they discuss or reclaim slurs. Any attempt to address bias in keyword-based corpora requires a better understanding of pejorative language, as well as an equitable representation of targeted users in data collection. We make two main contributions to this end. First, we provide an annotation guide that outlines 4 main categories of online slur usage, which we further divide into a total of 12 sub-categories. Second, we present a publicly available corpus based on our taxonomy, with 39.8k human annotated comments extracted from Reddit. This corpus was annotated by a diverse cohort of coders, with Shannon equitability indices of 0.90, 0.92, and 0.87 across sexuality, ethnicity, and gender. Taken together, our taxonomy and corpus allow researchers to evaluate classifiers on a wider range of speech containing slurs.
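The abstract reports annotator diversity as Shannon equitability indices. As a rough illustration of what such a number measures, the sketch below computes the standard index E = H / ln(k), where H is the Shannon entropy of the group proportions and k is the number of groups. The function name and the sample labels are hypothetical, and the paper may define its categories differently (for example, by normalizing over a fixed list of groups rather than the groups actually observed).

```python
import math
from collections import Counter


def shannon_equitability(labels):
    """Return the Shannon equitability index (H / ln k) for a list of group labels.

    Assumption: k is the number of distinct groups observed in `labels`.
    A value of 1.0 means all groups are equally represented; values near 0
    mean one group dominates.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    k = len(counts)
    if k < 2:
        # Evenness is undefined for a single group; return 0.0 by convention here.
        return 0.0
    # Shannon entropy of the observed proportions, using natural logarithms.
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(k)


# Hypothetical annotator cohorts, not data from the paper.
balanced = ["a", "a", "b", "b"]
skewed = ["a"] * 9 + ["b"]

print(round(shannon_equitability(balanced), 3))
print(round(shannon_equitability(skewed), 3))
```

Under this definition, a perfectly balanced cohort scores 1.0, so the reported values of 0.90, 0.92, and 0.87 indicate cohorts that are close to evenly distributed across the recorded groups.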

Anthology ID: 2020.alw-1.17
Volume: Proceedings of the Fourth Workshop on Online Abuse and Harms
Month: November
Year: 2020
Address: Online
Editors: Seyi Akiwowo, Bertie Vidgen, Vinodkumar Prabhakaran, Zeerak Waseem
Venue: ALW
Publisher: Association for Computational Linguistics
Pages: 138–149
URL: https://aclanthology.org/2020.alw-1.17/
DOI: 10.18653/v1/2020.alw-1.17
Cite (ACL): Jana Kurrek, Haji Mohammad Saleem, and Derek Ruths. 2020. Towards a Comprehensive Taxonomy and Large-Scale Annotated Corpus for Online Slur Usage. In Proceedings of the Fourth Workshop on Online Abuse and Harms, pages 138–149, Online. Association for Computational Linguistics.
Cite (Informal): Towards a Comprehensive Taxonomy and Large-Scale Annotated Corpus for Online Slur Usage (Kurrek et al., ALW 2020)
PDF: https://aclanthology.org/2020.alw-1.17.pdf
Optional supplementary material: 2020.alw-1.17.OptionalSupplementaryMaterial.pdf
Video: https://slideslive.com/38939533
