Syed Hassan Mehdi


2019

A Fine-Grained Annotated Multi-Dialectal Arabic Corpus
Anis Charfi | Wajdi Zaghouani | Syed Hassan Mehdi | Esraa Mohamed
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

We present ARAP-Tweet 2.0, a corpus of 5 million dialectal Arabic tweets (50 million words) from about 3,000 Twitter users in 17 Arab countries. Compared to the first version, the new corpus offers significant improvements in data volume and annotation quality. It is fully balanced with respect to dialect, gender, and three age groups: under 25 years, between 25 and 34, and 35 years and above. This paper describes the process of creating the corpus, from gathering the dialectal phrases used to find the users, to annotating their accounts and retrieving their tweets. We also report on the evaluation of annotation quality using inter-annotator agreement measures, which were applied to the whole corpus and not just a subset. Agreement was substantial, with average Cohen’s Kappa values of 0.99, 0.92, and 0.88 for the annotation of gender, dialect, and age, respectively. We also discuss some challenges encountered when developing this corpus.
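
For readers unfamiliar with the agreement measure mentioned in the abstract, the following is a minimal sketch of Cohen's Kappa for two annotators. It is a generic illustration, not the authors' evaluation code: the function name `cohen_kappa` and the toy gender labels are assumptions made up for this example, and the abstract does not say which tooling was actually used.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators labelling the same items.

    Hypothetical helper for illustration only.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")

    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: for each category, the product of the two
    # annotators' marginal proportions, summed over categories.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one identical label for every item.
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy data (invented): two annotators labelling the gender of four accounts.
annotator_1 = ["F", "F", "M", "M"]
annotator_2 = ["F", "F", "M", "F"]
print(cohen_kappa(annotator_1, annotator_2))
```

Here the annotators agree on 3 of 4 items (observed agreement 0.75) while chance agreement is 0.5, so Kappa is 0.5. Values such as 0.99 or 0.88 reported above therefore indicate agreement far beyond what chance alone would produce.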