Shahd Abdelmoneim
2026
AMIYA Shared Task: Arabic Modeling In Your Accent at VarDial 2026
Nathaniel R. Robinson | Shahd Abdelmoneim | Anjali Kantharuban | Otba Alsboul | Salima Lamsiyah | Kelly Marchisio | Kenton Murray
Proceedings of the 13th Workshop on NLP for Similar Languages, Varieties and Dialects
Arabic, often considered a single language, actually encompasses a wide range of sometimes mutually unintelligible language varieties. While large language models (LLMs) have revolutionized natural language processing (NLP) with rapid advances, these models still best serve speakers of high-resource and standard language varieties. One particular deficiency is in dialectal Arabic. We present the first-ever shared task for dialectal Arabic language modeling: Arabic Modeling In Your Accent, or AMIYA. The goal of the shared task was to develop LLMs that could (1) respond in the correct dialectal variety when explicitly or implicitly prompted to, (2) translate between dialectal Arabic and standard Arabic or English, (3) adhere to LLM instructions in dialectal Arabic, and (4) produce fluent Arabic outputs. We called for submissions in the dialectal varieties of five countries: Morocco, Egypt, Palestine, Syria, and Saudi Arabia. We received 45 submitted systems from six participating teams. We saw positive results from supervised fine-tuning on a translation objective, and from reinforcement learning to improve dialectness. Manual evaluation also showed that some systems had learned to output dialectal words or phrases, but at the expense of fluency or coherence. Overall, the most effective system involved continual pre-training and supervised fine-tuning of 12 candidate LLMs, followed by selection of the best-performing models.
2025
AL-QASIDA: Analyzing LLM Quality and Accuracy Systematically in Dialectal Arabic
Nathaniel Romney Robinson | Shahd Abdelmoneim | Kelly Marchisio | Sebastian Ruder
Findings of the Association for Computational Linguistics: ACL 2025
Dialectal Arabic (DA) varieties are under-served by language technologies, particularly large language models (LLMs). This trend threatens to exacerbate existing social inequalities and limits LLM applications, yet the research community lacks operationalized performance measurements in DA. We present a framework that comprehensively assesses LLMs’ DA modeling capabilities across four dimensions: fidelity, understanding, quality, and diglossia. We evaluate nine LLMs in eight DA varieties and provide practical recommendations. Our evaluation suggests that LLMs do not produce DA as well as they understand it, not because their DA fluency is poor, but because they are reluctant to generate DA. Further analysis suggests that current post-training can contribute to bias against DA, that few-shot examples can overcome this deficiency, and that otherwise no measurable features of input text correlate well with LLM DA performance.