Diyam Akra
2025
Active Learning for Multidialectal Arabic POS Tagging
Diyam Akra | Mohammed Khalilia | Mustafa Jarrar
Findings of the Association for Computational Linguistics: EMNLP 2025
Multidialectal Arabic POS tagging is challenging due to the morphological richness and high variability among dialects. While POS tagging for MSA has advanced thanks to the availability of annotated datasets, creating similar resources for dialects remains costly and labor-intensive. Increasing the size of annotated datasets does not necessarily result in better performance. Active learning offers a more efficient alternative by prioritizing annotating the most informative samples. This paper proposes an active learning approach for multidialectal Arabic POS tagging. Our experiments revealed that annotating approximately 15,000 tokens is sufficient for high performance. We further demonstrate that using a fine-tuned model from one dialect to guide the selection of initial samples from another dialect accelerates convergence, reducing the annotation requirement by about 2,000 tokens. In conclusion, we propose an active learning pipeline and demonstrate that, upon reaching its defined stopping point of 16,000 annotated tokens, it achieves an accuracy of 97.6% on the Emirati Corpus.
2014
Building a Corpus for Palestinian Arabic: a Preliminary Study
Mustafa Jarrar | Nizar Habash | Diyam Akra | Nasser Zalmout
Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP)