Diyam Akra

2025

Active Learning for Multidialectal Arabic POS Tagging
Diyam Akra | Mohammed Khalilia | Mustafa Jarrar
Findings of the Association for Computational Linguistics: EMNLP 2025

Multidialectal Arabic POS tagging is challenging due to the morphological richness and high variability among dialects. While POS tagging for MSA has advanced thanks to the availability of annotated datasets, creating similar resources for dialects remains costly and labor-intensive. Increasing the size of annotated datasets does not necessarily result in better performance. Active learning offers a more efficient alternative by prioritizing the annotation of the most informative samples. This paper proposes an active learning approach for multidialectal Arabic POS tagging. Our experiments show that annotating approximately 15,000 tokens is sufficient for high performance. We further demonstrate that using a model fine-tuned on one dialect to guide the selection of initial samples from another dialect accelerates convergence, reducing the annotation requirement by about 2,000 tokens. In conclusion, we propose an active learning pipeline and demonstrate that, upon reaching its defined stopping point of 16,000 annotated tokens, it achieves an accuracy of 97.6% on the Emirati Corpus.

2014

Building a Corpus for Palestinian Arabic: a Preliminary Study
Mustafa Jarrar | Nizar Habash | Diyam Akra | Nasser Zalmout
Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP)

Venues

Findings (1)
WANLP (1)