Exploring Data Augmentation Strategies for Hate Speech Detection in Roman Urdu

Ubaid Azam, Hammad Rizwan, Asim Karim


Abstract
As the number of social media users grows rapidly, there has been a marked increase in the hateful content being generated; to combat this, automatic hate speech detection systems are a necessity. To this end, researchers have recently focused their efforts on developing datasets. However, the vast majority of these have been created for the English language, with only a few available for low-resource languages such as Roman Urdu. Furthermore, the few that are available contain only a small number of samples belonging to hateful classes, and these samples lack variation in topic and content. As a result, deep learning models trained on such datasets perform poorly when deployed in the real world. Collecting and annotating more data to improve performance can be very costly and time-consuming. Data augmentation techniques therefore need to be explored to exploit already available datasets and improve model generalizability. In this paper, we explore different data augmentation techniques for improving hate speech detection in Roman Urdu. We evaluate these augmentation techniques on two datasets. We are able to improve performance in the primary metrics of comparison (F1 and Macro F1) as well as in recall, which is important for human-in-the-loop AI systems.
Anthology ID:
2022.lrec-1.481
Volume:
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month:
June
Year:
2022
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
4523–4531
URL:
https://aclanthology.org/2022.lrec-1.481
Cite (ACL):
Ubaid Azam, Hammad Rizwan, and Asim Karim. 2022. Exploring Data Augmentation Strategies for Hate Speech Detection in Roman Urdu. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 4523–4531, Marseille, France. European Language Resources Association.
Cite (Informal):
Exploring Data Augmentation Strategies for Hate Speech Detection in Roman Urdu (Azam et al., LREC 2022)
PDF:
https://aclanthology.org/2022.lrec-1.481.pdf
Data
Hate Speech and Offensive Language