One Language, Three of Its Voices: Evaluating Multilingual LLMs Across Persian, Dari, and Tajiki on Translation and Understanding Tasks

Noor Mairukh Khan Arnob; Abu Bakar Siddique Mahi

Abstract

The Iranian linguistic family is pluricentric, encompassing Iranian Persian, Dari (Afghanistan), and Tajiki (Tajikistan). While Multilingual Large Language Models (MLLMs) claim broad coverage, their robustness across these regional variants and script differences (Perso-Arabic vs. Cyrillic) remains under-explored, particularly in the open-weight landscape. We evaluate five open-weight models from the Qwen, Bloomz, and Gemma families across four downstream tasks: Sentiment Analysis, Machine Translation (MT), NLI, and QA. Utilizing a dataset of over 240,000 processed samples, we observe severe performance disparities. While the fine-tuned gemma-3-4b-persian achieves promising results on Iranian Persian (77.3% accuracy in Sentiment), almost all tested models appear to suffer catastrophic degradation on Tajiki script (dropping to 1.0 BLEU). These findings highlight a critical “script barrier” in current open-weight MLLM development for Central Asian languages. Code and data available here.

Anthology ID: 2026.silkroadnlp-1.10
Volume: The Proceedings of the First Workshop on NLP and LLMs for the Iranian Language Family
Month: March
Year: 2026
Address: Rabat, Morocco
Editors: Rayyan Merchant, Karine Megerdoomian
Venues: SilkRoadNLP | WS
Publisher: Association for Computational Linguistics
Pages: 98–104
URL: https://aclanthology.org/2026.silkroadnlp-1.10/
Cite (ACL): Noor Mairukh Khan Arnob and Abu Bakar Siddique Mahi. 2026. One Language, Three of Its Voices: Evaluating Multilingual LLMs Across Persian, Dari, and Tajiki on Translation and Understanding Tasks. In The Proceedings of the First Workshop on NLP and LLMs for the Iranian Language Family, pages 98–104, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal): One Language, Three of Its Voices: Evaluating Multilingual LLMs Across Persian, Dari, and Tajiki on Translation and Understanding Tasks (Arnob & Mahi, SilkRoadNLP 2026)
PDF: https://aclanthology.org/2026.silkroadnlp-1.10.pdf