QARI: Neural Architecture for Urdu Extractive Machine Reading Comprehension

Samreen Kazi, Shakeel Ahmed Khoja


Abstract
Urdu, a morphologically rich and low-resource language spoken by over 300 million people, poses unique challenges for extractive machine reading comprehension (EMRC), particularly in accurately identifying span boundaries involving postpositions and copulas. Existing multilingual models struggle with subword fragmentation and imprecise span extraction in such settings. We introduce QARI (قاری, “reader”), a character-enhanced architecture for Urdu extractive MRC that augments pretrained multilingual encoders with three innovations: (1) a character-level CNN that captures affix patterns and morphological features from full word forms; (2) a gated fusion mechanism that integrates semantic and morphological representations; and (3) a boundary-contrastive learning objective targeting Urdu-specific span errors. Evaluated on UQuAD+, the first native Urdu MRC benchmark, QARI achieves 83.5 F1, a 5.5 point improvement over the previous best result (mT5, 78.0 F1), setting a new state of the art. Ablations show that character-level modeling and boundary supervision contribute +7.5 and +7.0 F1, respectively. Cross-dataset evaluations on UQA and UrFQuAD confirm QARI’s robustness. Error analysis reveals significant reductions in boundary drift, with improvements most notable for short factual questions.
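The abstract does not spell out the fusion equations, but "gated fusion" of a semantic and a morphological representation is conventionally a learned sigmoid gate interpolating between the two vectors. The sketch below is an illustrative assumption, not the paper's implementation: `h_sem` stands for the multilingual-encoder token vector, `h_chr` for the character-CNN vector, and `W`, `b` for hypothetical learned gate parameters.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(h_sem, h_chr, W, b):
    """Element-wise gated fusion of two d-dimensional vectors.

    Assumed form (common in the literature, not taken from the paper):
        g     = sigmoid(W @ [h_sem ; h_chr] + b)   # gate in (0, 1)^d
        fused = g * h_sem + (1 - g) * h_chr        # convex combination

    W is a d x 2d matrix (list of lists), b a length-d bias vector.
    """
    d = len(h_sem)
    x = h_sem + h_chr  # list concatenation: the 2d-dim input [h_sem ; h_chr]
    gate = []
    for i in range(d):
        z = sum(W[i][j] * x[j] for j in range(2 * d)) + b[i]
        gate.append(sigmoid(z))
    # Per-dimension interpolation between semantic and character features.
    return [gate[i] * h_sem[i] + (1.0 - gate[i]) * h_chr[i] for i in range(d)]
```

With zero-initialized `W` and `b` the gate is 0.5 everywhere, so the fused vector is the plain average of the two inputs; training would push the gate toward the character features exactly where morphology (e.g. postposition or copula boundaries) matters.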
Anthology ID:
2026.loreslm-1.16
Volume:
Proceedings of the Second Workshop on Language Models for Low-Resource Languages (LoResLM 2026)
Month:
March
Year:
2026
Address:
Rabat, Morocco
Editors:
Hansi Hettiarachchi, Tharindu Ranasinghe, Alistair Plum, Paul Rayson, Ruslan Mitkov, Mohamed Gaber, Damith Premasiri, Fiona Anting Tan, Lasitha Uyangodage
Venue:
LoResLM
Publisher:
Association for Computational Linguistics
Pages:
168–177
URL:
https://aclanthology.org/2026.loreslm-1.16/
Cite (ACL):
Samreen Kazi and Shakeel Ahmed Khoja. 2026. QARI: Neural Architecture for Urdu Extractive Machine Reading Comprehension. In Proceedings of the Second Workshop on Language Models for Low-Resource Languages (LoResLM 2026), pages 168–177, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal):
QARI: Neural Architecture for Urdu Extractive Machine Reading Comprehension (Kazi & Khoja, LoResLM 2026)
PDF:
https://aclanthology.org/2026.loreslm-1.16.pdf