AraTar: A Corpus to Support the Fine-grained Detection of Hate Speech Targets in the Arabic Language

Seham Alghamdi; Youcef Benkhedda; Basma Alharbi; Riza Theresa Batista-Navarro

Abstract

Hate speech is surging across social media platforms, targeting individuals or groups on the basis of protected characteristics such as race, religion, nationality and gender. This paper focuses on detecting the hate type (Task 1) and the hate target (Task 2) in the Arabic language. To address this problem, we combined and re-annotated hate speech tweets from existing publicly available corpora to create AraTar, the first and largest Arabic corpus that supports multi-label classification of both hate speech types and hate speech targets, annotated with high inter-annotator agreement. We also sought to determine the most effective machine learning-based approach to both tasks by comparing and evaluating three families of approaches: (1) traditional machine learning-based models, (2) deep learning-based models fed with contextual embeddings, and (3) fine-tuned language models (LMs). Our results show that fine-tuning LMs, specifically AraBERTv0.2-twitter (base), achieved the highest performance, with micro-averaged F1-scores of 84.5% and 85.03% and macro-averaged F1-scores of 77.46% and 73.15% for Tasks 1 and 2, respectively.
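The abstract reports both micro- and macro-averaged F1-scores for multi-label classification. The following is a minimal sketch of how the two averages differ, not the authors' evaluation code: the label names, the toy annotations and the helper `micro_macro_f1` are all hypothetical and are not drawn from AraTar. Micro-averaging pools true positives, false positives and false negatives across all labels before computing F1, so frequent labels dominate; macro-averaging computes F1 per label and takes the unweighted mean, so rare labels count as much as frequent ones.

```python
from typing import List, Tuple

# Hypothetical target labels, for illustration only (not the AraTar label set).
LABELS = ["religion", "gender", "nationality"]


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(
    y_true: List[List[int]], y_pred: List[List[int]]
) -> Tuple[float, float]:
    """Return (micro_f1, macro_f1) for binary indicator matrices."""
    n_labels = len(y_true[0])
    counts = []
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        counts.append((tp, fp, fn))

    # Macro: average the per-label F1 values, each label weighted equally.
    macro = sum(f1(*c) for c in counts) / n_labels

    # Micro: pool the counts over all labels, then compute a single F1.
    micro = f1(
        sum(c[0] for c in counts),
        sum(c[1] for c in counts),
        sum(c[2] for c in counts),
    )
    return micro, macro


# Toy multi-label data: one row per tweet, one column per label in LABELS.
y_true = [[1, 0, 0], [1, 0, 0], [1, 1, 0], [0, 0, 1], [1, 0, 0]]
y_pred = [[1, 0, 0], [1, 0, 0], [1, 0, 0], [0, 0, 0], [1, 1, 0]]

micro, macro = micro_macro_f1(y_true, y_pred)
print(f"micro-F1 = {micro:.3f}, macro-F1 = {macro:.3f}")
```

In this toy example the frequent label is predicted perfectly while the two rare labels are missed, so the micro-averaged score (8/11) is well above the macro-averaged score (1/3). A gap in the same direction appears in the scores reported above, although the abstract itself does not state its cause.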

Anthology ID: 2024.osact-1.1
Volume: Proceedings of the 6th Workshop on Open-Source Arabic Corpora and Processing Tools (OSACT) with Shared Tasks on Arabic LLMs Hallucination and Dialect to MSA Machine Translation @ LREC-COLING 2024
Month: May
Year: 2024
Address: Torino, Italia
Editors: Hend Al-Khalifa, Kareem Darwish, Hamdy Mubarak, Mona Ali, Tamer Elsayed
Venues: OSACT | WS
SIG: SIGARAB
Publisher: ELRA and ICCL
Pages: 1–12
URL: https://aclanthology.org/2024.osact-1.1/
Cite (ACL): Seham Alghamdi, Youcef Benkhedda, Basma Alharbi, and Riza Batista-Navarro. 2024. AraTar: A Corpus to Support the Fine-grained Detection of Hate Speech Targets in the Arabic Language. In Proceedings of the 6th Workshop on Open-Source Arabic Corpora and Processing Tools (OSACT) with Shared Tasks on Arabic LLMs Hallucination and Dialect to MSA Machine Translation @ LREC-COLING 2024, pages 1–12, Torino, Italia. ELRA and ICCL.
Cite (Informal): AraTar: A Corpus to Support the Fine-grained Detection of Hate Speech Targets in the Arabic Language (Alghamdi et al., OSACT 2024)
PDF: https://aclanthology.org/2024.osact-1.1.pdf
