Challenges in Urdu Machine Translation

Abdul Basit; Abdul Hameed Azeemi; Agha Ali Raza

doi:10.18653/v1/2024.loresmt-1.4

Abstract

Recent advancements in Neural Machine Translation (NMT) systems have significantly improved model performance on various translation benchmarks. However, these systems still face numerous challenges when translating low-resource languages such as Urdu. In this work, we highlight the specific issues faced by machine translation systems when translating Urdu. We first conduct a comprehensive evaluation of English-to-Urdu machine translation with four diverse models: GPT-3.5 (a large language model), opus-mt-en-ur (a bilingual translation model), NLLB (a model trained for translating 200 languages), and IndicTrans2 (a specialized model for translating low-resource Indic languages). The results demonstrate that IndicTrans2 significantly outperforms the other models in Urdu machine translation. To understand the differences in the performance of these models, we analyze the Urdu word distribution in different training datasets and compare the training methodologies. Finally, we uncover the specific translation issues and provide suggestions for improvements in Urdu machine translation systems.

Anthology ID: 2024.loresmt-1.4
Volume: Proceedings of the Seventh Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2024)
Month: August
Year: 2024
Address: Bangkok, Thailand
Editors: Atul Kr. Ojha, Chao-hong Liu, Ekaterina Vylomova, Flammie Pirinen, Jade Abbott, Jonathan Washington, Nathaniel Oco, Valentin Malykh, Varvara Logacheva, Xiaobing Zhao
Venues: LoResMT | WS
Publisher: Association for Computational Linguistics
Pages: 44–49
URL: https://aclanthology.org/2024.loresmt-1.4/
DOI: 10.18653/v1/2024.loresmt-1.4
Cite (ACL): Abdul Basit, Abdul Hameed Azeemi, and Agha Ali Raza. 2024. Challenges in Urdu Machine Translation. In Proceedings of the Seventh Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2024), pages 44–49, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal): Challenges in Urdu Machine Translation (Basit et al., LoResMT 2024)
PDF: https://aclanthology.org/2024.loresmt-1.4.pdf