The Effect of Arabic Dialect Familiarity on Data Annotation

Ibrahim Abu Farha, Walid Magdy


Abstract
Data annotation is the foundation of most natural language processing (NLP) tasks. However, data annotation is complex and there is often no specific correct label, especially in subjective tasks. Data annotation is affected by the annotators’ ability to understand the provided data. In the case of Arabic, this is important due to the large dialectal variety. In this paper, we analyse how Arabic speakers understand other dialects in written text. Also, we analyse the effect of dialect familiarity on the quality of data annotation, focusing on Arabic sarcasm detection. This is done by collecting third-party labels and comparing them to high-quality first-party labels. Our analysis shows that annotators tend to better identify their own dialect and they are prone to confuse dialects they are unfamiliar with. For task labels, annotators tend to perform better on their dialect or dialects they are familiar with. Finally, females tend to perform better than males on the sarcasm detection task. We suggest that to guarantee high-quality labels, researchers should recruit native dialect speakers for annotation.
Anthology ID:
2022.wanlp-1.39
Volume:
Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP)
Month:
December
Year:
2022
Address:
Abu Dhabi, United Arab Emirates (Hybrid)
Editors:
Houda Bouamor, Hend Al-Khalifa, Kareem Darwish, Owen Rambow, Fethi Bougares, Ahmed Abdelali, Nadi Tomeh, Salam Khalifa, Wajdi Zaghouani
Venue:
WANLP
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
399–408
Language:
URL:
https://aclanthology.org/2022.wanlp-1.39
DOI:
10.18653/v1/2022.wanlp-1.39
Bibkey:
Cite (ACL):
Ibrahim Abu Farha and Walid Magdy. 2022. The Effect of Arabic Dialect Familiarity on Data Annotation. In Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP), pages 399–408, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.
Cite (Informal):
The Effect of Arabic Dialect Familiarity on Data Annotation (Abu Farha & Magdy, WANLP 2022)
Copy Citation:
PDF:
https://aclanthology.org/2022.wanlp-1.39.pdf