A Fine-Grained Annotated Multi-Dialectal Arabic Corpus

Anis Charfi, Wajdi Zaghouani, Syed Hassan Mehdi, Esraa Mohamed


Abstract
We present ARAP-Tweet 2.0, a corpus of 5 million dialectal Arabic tweets and 50 million words from about 3,000 Twitter users in 17 Arab countries. Compared to the first version, the new corpus offers significant improvements in data volume and annotation quality. It is fully balanced with respect to dialect, gender, and three age groups: under 25 years, between 25 and 34, and 35 years and above. This paper describes the process of creating the corpus, from gathering the dialectal phrases used to find the users, to annotating their accounts and retrieving their tweets. We also report on the evaluation of the annotation quality using inter-annotator agreement measures, which were applied to the whole corpus and not just a subset. The obtained results were substantial, with average Cohen’s Kappa values of 0.99, 0.92, and 0.88 for the annotation of gender, dialect, and age, respectively. We also discuss some challenges encountered when developing this corpus.
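For readers unfamiliar with the agreement measure named in the abstract, the following is a minimal sketch of how Cohen's Kappa can be computed for two annotators. It is not the authors' evaluation code: the function name `cohens_kappa` and the toy gender labels are assumptions made purely for illustration, and the sketch uses the standard definition, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators who labelled the same items.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")

    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: for each label, the product of the two annotators'
    # marginal proportions, summed over all labels.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)

    # Degenerate case: both annotators used one identical label throughout.
    if expected == 1.0:
        return 1.0

    return (observed - expected) / (1.0 - expected)


# Toy example with invented labels (not data from the corpus).
annotator_1 = ["F", "F", "M", "M"]
annotator_2 = ["F", "F", "M", "F"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

In this toy case the annotators agree on three of four items (p_o = 0.75) while chance agreement is p_e = 0.5, so kappa is 0.5. Values close to 1, such as those reported in the abstract, indicate agreement far above chance.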
Anthology ID:
R19-1023
Volume:
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)
Month:
September
Year:
2019
Address:
Varna, Bulgaria
Editors:
Ruslan Mitkov, Galia Angelova
Venue:
RANLP
Publisher:
INCOMA Ltd.
Pages:
198–204
URL:
https://aclanthology.org/R19-1023
DOI:
10.26615/978-954-452-056-4_023
Cite (ACL):
Anis Charfi, Wajdi Zaghouani, Syed Hassan Mehdi, and Esraa Mohamed. 2019. A Fine-Grained Annotated Multi-Dialectal Arabic Corpus. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019), pages 198–204, Varna, Bulgaria. INCOMA Ltd.
Cite (Informal):
A Fine-Grained Annotated Multi-Dialectal Arabic Corpus (Charfi et al., RANLP 2019)
PDF:
https://aclanthology.org/R19-1023.pdf