@inproceedings{faheem-etal-2026-u,
title = "{U}-{MIRAGE}: Benchmarking Chain-of-Thought Reasoning for {U}rdu Medical {QA}",
author = "Faheem, Ali and
Ullah, Faizad and
Hammad, Muhammad and
Hassan, Ahmed and
Ayub, Muhammad Sohaib and
Karim, Asim",
booktitle = "Proceedings of the 2nd Workshop on {NLP} for Languages Using {A}rabic Script",
month = mar,
year = "2026",
address = "Rabat, Morocco",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2026.abjadnlp-1.55/",
pages = "453--460",
    abstract = "Medical AI systems increasingly rely on large language models (LLMs), yet their deployment in linguistically diverse regions remains unexplored. We address this gap by introducing U-MIRAGE, the first medical question-answering benchmark for Urdu and Roman Urdu. Urdu is the 11th most spoken language (with over 246 million speakers) worldwide. Our systematic evaluation of six state-of-the-art LLMs reveals three main findings. (1) 6{\%} to 10{\%} drop in performance when moving from English to Urdu variants, even though medical knowledge should theoretically transfer across languages. (2) Chain-of-Thought (CoT) prompting improves small models by 8{\%} to 20{\%}, while surprisingly the larger models' performance degraded by up to 3{\%}. (3) Quantized small models fail catastrophically in low-resource languages, achieving near-random accuracy regardless of various prompting strategies. These findings challenge core assumptions about multilingual medical AI systems. Roman Urdu consistently outperforms standard Urdu script, suggesting orthographic alignment with pre-training data matters more than linguistic proximity. CoT prompting effectiveness depends critically on model architecture rather than task complexity alone. Our contributions are threefold: (1) U-MIRAGE, (2) systematic benchmarking of LLMs for Urdu and Roman Urdu medical reasoning, and (3) empirical analysis of CoT prompting in low-resource contexts. Our code and datasets are publicly available."
}
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="faheem-etal-2026-u">
<titleInfo>
<title>U-MIRAGE: Benchmarking Chain-of-Thought Reasoning for Urdu Medical QA</title>
</titleInfo>
<name type="personal">
<namePart type="given">Ali</namePart>
<namePart type="family">Faheem</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Faizad</namePart>
<namePart type="family">Ullah</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Muhammad</namePart>
<namePart type="family">Hammad</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Ahmed</namePart>
<namePart type="family">Hassan</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Muhammad</namePart>
<namePart type="given">Sohaib</namePart>
<namePart type="family">Ayub</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Asim</namePart>
<namePart type="family">Karim</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2026-03</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Proceedings of the 2nd Workshop on NLP for Languages Using Arabic Script</title>
</titleInfo>
<originInfo>
<publisher>Association for Computational Linguistics</publisher>
<place>
<placeTerm type="text">Rabat, Morocco</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
</relatedItem>
  <abstract>Medical AI systems increasingly rely on large language models (LLMs), yet their deployment in linguistically diverse regions remains unexplored. We address this gap by introducing U-MIRAGE, the first medical question-answering benchmark for Urdu and Roman Urdu. Urdu is the 11th most spoken language (with over 246 million speakers) worldwide. Our systematic evaluation of six state-of-the-art LLMs reveals three main findings. (1) 6% to 10% drop in performance when moving from English to Urdu variants, even though medical knowledge should theoretically transfer across languages. (2) Chain-of-Thought (CoT) prompting improves small models by 8% to 20%, while surprisingly the larger models’ performance degraded by up to 3%. (3) Quantized small models fail catastrophically in low-resource languages, achieving near-random accuracy regardless of various prompting strategies. These findings challenge core assumptions about multilingual medical AI systems. Roman Urdu consistently outperforms standard Urdu script, suggesting orthographic alignment with pre-training data matters more than linguistic proximity. CoT prompting effectiveness depends critically on model architecture rather than task complexity alone. Our contributions are threefold: (1) U-MIRAGE, (2) systematic benchmarking of LLMs for Urdu and Roman Urdu medical reasoning, and (3) empirical analysis of CoT prompting in low-resource contexts. Our code and datasets are publicly available.</abstract>
<identifier type="citekey">faheem-etal-2026-u</identifier>
<location>
<url>https://aclanthology.org/2026.abjadnlp-1.55/</url>
</location>
<part>
<date>2026-03</date>
<extent unit="page">
<start>453</start>
<end>460</end>
</extent>
</part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T U-MIRAGE: Benchmarking Chain-of-Thought Reasoning for Urdu Medical QA
%A Faheem, Ali
%A Ullah, Faizad
%A Hammad, Muhammad
%A Hassan, Ahmed
%A Ayub, Muhammad Sohaib
%A Karim, Asim
%S Proceedings of the 2nd Workshop on NLP for Languages Using Arabic Script
%D 2026
%8 March
%I Association for Computational Linguistics
%C Rabat, Morocco
%F faheem-etal-2026-u
%X Medical AI systems increasingly rely on large language models (LLMs), yet their deployment in linguistically diverse regions remains unexplored. We address this gap by introducing U-MIRAGE, the first medical question-answering benchmark for Urdu and Roman Urdu. Urdu is the 11th most spoken language (with over 246 million speakers) worldwide. Our systematic evaluation of six state-of-the-art LLMs reveals three main findings. (1) 6% to 10% drop in performance when moving from English to Urdu variants, even though medical knowledge should theoretically transfer across languages. (2) Chain-of-Thought (CoT) prompting improves small models by 8% to 20%, while surprisingly the larger models’ performance degraded by up to 3%. (3) Quantized small models fail catastrophically in low-resource languages, achieving near-random accuracy regardless of various prompting strategies. These findings challenge core assumptions about multilingual medical AI systems. Roman Urdu consistently outperforms standard Urdu script, suggesting orthographic alignment with pre-training data matters more than linguistic proximity. CoT prompting effectiveness depends critically on model architecture rather than task complexity alone. Our contributions are threefold: (1) U-MIRAGE, (2) systematic benchmarking of LLMs for Urdu and Roman Urdu medical reasoning, and (3) empirical analysis of CoT prompting in low-resource contexts. Our code and datasets are publicly available.
%U https://aclanthology.org/2026.abjadnlp-1.55/
%P 453-460
Markdown (Informal)
[U-MIRAGE: Benchmarking Chain-of-Thought Reasoning for Urdu Medical QA](https://aclanthology.org/2026.abjadnlp-1.55/) (Faheem et al., AbjadNLP 2026)
ACL