Building and Evaluating a High Quality Parallel Corpus for English Urdu Low Resource Machine Translation

Munief Hassan Tahir, Hunain Azam, Sana Shams, Sarmad Hussain


Abstract
Low-resource languages such as Urdu suffer from a shortage of high-quality parallel data for machine translation. We introduce a curated English–Urdu corpus of 80,749 high-fidelity sentence pairs across 18 diverse domains, built via ethical collection, manual alignment, deduplication, and strict length-based filtering (AWCD 5). The corpus is converted into a bidirectional SFT dataset with bilingual (English/Urdu) instructions to enhance prompt-language robustness. Fine-tuning Llama-3.1-8B-Instruct (Llama-FT) and UrduLlama 1.1 (UrduLlama-FT) yields major gains over the baseline. sacreBLEU scores reach 24.65–25.24 (En→Ur) and 76.14–77.97 (Ur→En) for Llama-FT, with minimal sensitivity to prompt language. Blind human evaluation on 90 sentences per direction confirms substantial perceptual improvements. The results demonstrate the value of clean parallel data and bilingual instruction tuning, and reveal complementary benefits of general SFT versus Urdu-specific pretraining. This work provides a reproducible resource and pipeline to advance machine translation for Urdu and similar low-resource languages.
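The cleaning and SFT-conversion steps named in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' pipeline: AWCD is read as the absolute word-count difference between the two sides, the threshold is treated as "at most 5", words are split on whitespace, and the function names (`word_count_diff`, `clean_pairs`, `to_sft_records`), the record fields, and the instruction templates are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical instruction templates in both prompt languages;
# the prompts actually used in the paper are not given in the abstract.
INSTRUCTIONS: Dict[str, Dict[str, str]] = {
    "en": {
        "en2ur": "Translate the following English sentence into Urdu.",
        "ur2en": "Translate the following Urdu sentence into English.",
    },
    "ur": {
        "en2ur": "درج ذیل انگریزی جملے کا اردو میں ترجمہ کریں۔",
        "ur2en": "درج ذیل اردو جملے کا انگریزی میں ترجمہ کریں۔",
    },
}


def word_count_diff(en: str, ur: str) -> int:
    """Absolute difference in whitespace-token counts (assumed meaning of AWCD)."""
    return abs(len(en.split()) - len(ur.split()))


def clean_pairs(
    pairs: Iterable[Tuple[str, str]], max_diff: int = 5
) -> List[Tuple[str, str]]:
    """Normalize whitespace, drop empty and duplicate pairs, then length-filter."""
    seen = set()
    kept: List[Tuple[str, str]] = []
    for en, ur in pairs:
        en, ur = " ".join(en.split()), " ".join(ur.split())
        if not en or not ur:
            continue
        key = (en.casefold(), ur)  # case-insensitive on the English side only
        if key in seen:
            continue
        seen.add(key)
        if word_count_diff(en, ur) <= max_diff:  # assumed threshold semantics
            kept.append((en, ur))
    return kept


def to_sft_records(pairs: Iterable[Tuple[str, str]]) -> List[Dict[str, str]]:
    """Emit both translation directions under both instruction languages."""
    records: List[Dict[str, str]] = []
    for en, ur in pairs:
        for prompt_lang in ("en", "ur"):
            templates = INSTRUCTIONS[prompt_lang]
            records.append(
                {"instruction": templates["en2ur"], "input": en, "output": ur}
            )
            records.append(
                {"instruction": templates["ur2en"], "input": ur, "output": en}
            )
    return records
```

In this sketch each surviving sentence pair produces four records (two directions times two instruction languages). Whether the paper pairs every example with both instruction languages or samples one per example is not stated in the abstract.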
Anthology ID:
2026.loresmt-1.8
Volume:
Proceedings for the Ninth Workshop on Technologies for Machine Translation of Low Resource Languages (LoResMT 2026)
Month:
March
Year:
2026
Address:
Rabat, Morocco
Editors:
Atul Kr. Ojha, Chao-hong Liu, Ekaterina Vylomova, Flammie Pirinen, Jonathan Washington, Nathaniel Oco, Xiaobing Zhao
Venues:
LoResMT | WS
Publisher:
Association for Computational Linguistics
Pages:
102–110
URL:
https://aclanthology.org/2026.loresmt-1.8/
Cite (ACL):
Munief Hassan Tahir, Hunain Azam, Sana Shams, and Sarmad Hussain. 2026. Building and Evaluating a High Quality Parallel Corpus for English Urdu Low Resource Machine Translation. In Proceedings for the Ninth Workshop on Technologies for Machine Translation of Low Resource Languages (LoResMT 2026), pages 102–110, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal):
Building and Evaluating a High Quality Parallel Corpus for English Urdu Low Resource Machine Translation (Tahir et al., LoResMT 2026)
PDF:
https://aclanthology.org/2026.loresmt-1.8.pdf