Anis Charfi


2024

MARASTA: A Multi-dialectal Arabic Cross-domain Stance Corpus
Anis Charfi | Mabrouka Ben-Sghaier | Andria Samy Raouf Atalla | Raghda Akasheh | Sara Al-Emadi | Wajdi Zaghouani
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

This paper introduces a cross-domain and multi-dialectal stance corpus for Arabic that includes four regions in the Arab World and covers the main Arabic dialect groups. Our corpus consists of 4657 sentences manually annotated with each sentence’s stance towards a specific topic. For each region, we collected sentences related to two controversial topics. We annotated each sentence by at least two annotators to indicate if its stance favors the topic, is against it, or is neutral. Our corpus is well-balanced concerning dialect and stance. Approximately half of the sentences are in Modern Standard Arabic (MSA) for each region, and the other half is in the region’s respective dialect. We conducted several machine-learning experiments for stance detection using our new corpus. Our most successful model is the Multi-Layer Perceptron (MLP), using Unigram or TF-IDF extracted features, which yielded an F1-score of 0.66 and an accuracy score of 0.66. Compared with the most similar state-of-the-art dataset, our dataset outperformed in specific stance classes, particularly “neutral” and “against”.

2019

A Fine-Grained Annotated Multi-Dialectal Arabic Corpus
Anis Charfi | Wajdi Zaghouani | Syed Hassan Mehdi | Esraa Mohamed
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

We present ARAP-Tweet 2.0, a corpus of 5 million dialectal Arabic tweets and 50 million words of about 3000 Twitter users from 17 Arab countries. Compared to the first version, the new corpus has significant improvements in terms of the data volume and the annotation quality. It is fully balanced with respect to dialect, gender, and three age groups: under 25 years, between 25 and 34, and 35 years and above. This paper describes the process of creating the corpus starting from gathering the dialectal phrases to find the users, to annotating their accounts and retrieving their tweets. We also report on the evaluation of the annotation quality using the inter-annotator agreement measures which were applied to the whole corpus and not just a subset. The obtained results were substantial with average Cohen’s Kappa values of 0.99, 0.92, and 0.88 for the annotation of gender, dialect, and age respectively. We also discuss some challenges encountered when developing this corpus.
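
For readers unfamiliar with the agreement measure cited in this abstract, the following is a minimal sketch of how Cohen's Kappa can be computed for two annotators. It is not the authors' evaluation code: the function name `cohen_kappa` and the toy gender labels are hypothetical and chosen only for illustration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each annotator's
    label distribution.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")

    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: sum over labels of the product of marginal frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    # Degenerate case: both annotators always use the same single label.
    if expected == 1.0:
        return 1.0

    return (observed - expected) / (1 - expected)


# Hypothetical toy annotations (not data from the corpus).
annotator_1 = ["F", "F", "M", "M", "F", "M", "F", "M", "F", "M"]
annotator_2 = ["F", "F", "M", "M", "F", "M", "F", "M", "M", "M"]

kappa = cohen_kappa(annotator_1, annotator_2)
print(f"Cohen's Kappa: {kappa:.2f}")
```

In this toy example the annotators agree on 9 of 10 items (observed agreement 0.9) while chance agreement is 0.5, giving a Kappa of about 0.8. The values reported in the abstract are averages over the corpus annotations.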

2018

Arap-Tweet: A Large Multi-Dialect Twitter Corpus for Gender, Age and Language Variety Identification
Wajdi Zaghouani | Anis Charfi
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)