HARALD: Augmenting Hate Speech Data Sets with Real Data

Tal Ilan, Dan Vilenchik


Abstract
Success in hate speech detection hinges on the availability of rich and varied labeled data, which is hard to obtain. In this work, we present a new approach to data augmentation that takes real unlabeled data as input, carefully selected from online platforms where invited hate speech is abundant. We show that by harvesting and processing this data automatically, one can augment existing manually labeled datasets and improve the performance of hate speech classification models. We observed an improvement in F1-score ranging from 2.7% to 9.5%, depending on the task (in-domain or cross-domain) and the model used.
Anthology ID:
2022.findings-emnlp.165
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2022
Month:
December
Year:
2022
Address:
Abu Dhabi, United Arab Emirates
Editors:
Yoav Goldberg, Zornitsa Kozareva, Yue Zhang
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
2241–2248
URL:
https://aclanthology.org/2022.findings-emnlp.165
DOI:
10.18653/v1/2022.findings-emnlp.165
Cite (ACL):
Tal Ilan and Dan Vilenchik. 2022. HARALD: Augmenting Hate Speech Data Sets with Real Data. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 2241–2248, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Cite (Informal):
HARALD: Augmenting Hate Speech Data Sets with Real Data (Ilan & Vilenchik, Findings 2022)
PDF:
https://aclanthology.org/2022.findings-emnlp.165.pdf
Software:
 2022.findings-emnlp.165.software.zip