Current state of LLMs for Arabic dialectal machine translation

Josef Jon, Rawan Bondok, Ondřej Bojar


Abstract
This work presents an evaluation of large language models (LLMs) for machine translation between English and dialectal Arabic on the MADAR dataset. We evaluate both translation directions (English to Arabic and vice versa) on 16 Arabic dialects. Our experiments cover a diverse set of models, including specialized Arabic models (Jais, Nile), multilingual models (Gemma, Command-R, Mistral, Aya), and commercial APIs (GPT-4.1). To assess translation quality, we employ multiple evaluation metrics: BLEU, chrF, COMET (both reference-based and reference-less variants), and GEMBA (LLM-as-a-judge), as well as a small-scale manual evaluation. We discuss the challenges of automatic MT evaluation, especially in the context of Arabic dialects. We also evaluate the ability of LLMs to classify the dialect used in a text. The study offers insights into the capabilities and limitations of current LLMs for dialectal Arabic machine translation, particularly highlighting the difficulty of handling dialectal diversity. The results may, however, be influenced by possible training data contamination, which is always a concern with LLMs.
Anthology ID:
2026.abjadnlp-1.41
Volume:
Proceedings of the 2nd Workshop on NLP for Languages Using Arabic Script
Month:
March
Year:
2026
Address:
Rabat, Morocco
Venues:
AbjadNLP | WS
Publisher:
Association for Computational Linguistics
Pages:
329–363
URL:
https://aclanthology.org/2026.abjadnlp-1.41/
Cite (ACL):
Josef Jon, Rawan Bondok, and Ondřej Bojar. 2026. Current state of LLMs for Arabic dialectal machine translation. In Proceedings of the 2nd Workshop on NLP for Languages Using Arabic Script, pages 329–363, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal):
Current state of LLMs for Arabic dialectal machine translation (Jon et al., AbjadNLP 2026)
PDF:
https://aclanthology.org/2026.abjadnlp-1.41.pdf