L-HSAB: A Levantine Twitter Dataset for Hate Speech and Abusive Language

Hala Mulki, Hatem Haddad, Chedi Bechikh Ali, Halima Alshabani


Abstract
Hate speech and abusive language have become a common phenomenon on Arabic social media. Automatic hate speech and abusive language detection systems can facilitate the prohibition of toxic textual content. The complexity, informality and ambiguity of the Arabic dialects have hindered the provision of the resources needed for Arabic abusive/hate speech detection research. In this paper, we introduce the first publicly available Levantine Hate Speech and Abusive (L-HSAB) Twitter dataset, intended to serve as a benchmark dataset for automatic detection of online Levantine toxic content. We further provide a detailed review of the data collection steps and of how we designed the annotation guidelines so that reliable dataset annotation is ensured. This reliability is supported by a comprehensive evaluation of the annotations: the annotation agreement metrics Cohen’s Kappa (κ) and Krippendorff’s alpha (α) indicated that the annotations are consistent.
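As an illustration of the first agreement metric named in the abstract, the following is a minimal sketch of Cohen's Kappa for two annotators, computed as (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance. The function name `cohen_kappa`, the label names and the toy annotations are assumptions made for this example; they are not taken from the paper or from the L-HSAB data.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label proportions, summed over labels.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label; treat as perfect agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy annotations (hypothetical, for illustration only).
annotator_1 = ["normal", "abusive", "hate", "normal", "abusive", "normal"]
annotator_2 = ["normal", "abusive", "normal", "normal", "hate", "normal"]

kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))  # 0.429
```

In this toy case the annotators agree on 4 of 6 items (p_o = 24/36) while chance agreement is p_e = 15/36, giving κ = 9/21 ≈ 0.429. Krippendorff's alpha generalizes the same idea to more than two annotators and to missing labels, and is not covered by this sketch.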
Anthology ID:
W19-3512
Volume:
Proceedings of the Third Workshop on Abusive Language Online
Month:
August
Year:
2019
Address:
Florence, Italy
Editors:
Sarah T. Roberts, Joel Tetreault, Vinodkumar Prabhakaran, Zeerak Waseem
Venue:
ALW
Publisher:
Association for Computational Linguistics
Pages:
111–118
URL:
https://aclanthology.org/W19-3512
DOI:
10.18653/v1/W19-3512
Cite (ACL):
Hala Mulki, Hatem Haddad, Chedi Bechikh Ali, and Halima Alshabani. 2019. L-HSAB: A Levantine Twitter Dataset for Hate Speech and Abusive Language. In Proceedings of the Third Workshop on Abusive Language Online, pages 111–118, Florence, Italy. Association for Computational Linguistics.
Cite (Informal):
L-HSAB: A Levantine Twitter Dataset for Hate Speech and Abusive Language (Mulki et al., ALW 2019)
PDF:
https://aclanthology.org/W19-3512.pdf
Data
Hate Speech