BhasaBodh: Bridging Bangla Dialects and Romanized Forms through Machine Translation

Md. Tofael Ahmed Bhuiyan; Md. Abdur Rahman; Abdul Kadar Muhammad Masum

Abstract

While machine translation has made significant strides for high-resource languages, many regional languages and their dialects, such as the Bangla variants Chittagong and Sylhet, remain underserved. Existing resources are often insufficient for robust sentence-level evaluation and overlook romanization, the widespread practice of typing native languages in the Latin script in digital communication. To address these gaps, we introduce BhasaBodh, a comprehensive benchmark for Bangla dialectal machine translation. We construct and release a sentence-level parallel dataset for the Chittagong and Sylhet dialects aligned with Standard Bangla and English, create a novel romanized version of the dialectal data to facilitate evaluation in realistic multi-script scenarios, and provide the first comprehensive performance baselines by fine-tuning two powerful multilingual models, NLLB-200 and mBART-50, on seven distinct translation tasks. Our experiments reveal that mBART-50 outperforms NLLB-200 on most dialectal and romanized tasks, achieving a BLEU score as high as 87.44 on the Romanized-to-Standard Bangla normalization task. However, complex cross-lingual and cross-script translation remains a significant challenge. BhasaBodh lays the groundwork for future research in low-resource dialectal NLP, offering a valuable resource for developing more inclusive and practical translation systems.

Anthology ID: 2025.banglalp-1.9
Volume: Proceedings of the Second Workshop on Bangla Language Processing (BLP-2025)
Month: December
Year: 2025
Address: Mumbai, India
Editors: Firoj Alam, Sudipta Kar, Shammur Absar Chowdhury, Naeemul Hassan, Enamul Hoque Prince, Mohiuddin Tasnim, Md Rashad Al Hasan Rony, Md Tahmid Rahman Rahman
Venues: BanglaLP | WS
Publisher: Association for Computational Linguistics
Pages: 113–118
URL: https://aclanthology.org/2025.banglalp-1.9/
Cite (ACL): Md. Tofael Ahmed Bhuiyan, Md. Abdur Rahman, and Abdul Kadar Muhammad Masum. 2025. BhasaBodh: Bridging Bangla Dialects and Romanized Forms through Machine Translation. In Proceedings of the Second Workshop on Bangla Language Processing (BLP-2025), pages 113–118, Mumbai, India. Association for Computational Linguistics.
Cite (Informal): BhasaBodh: Bridging Bangla Dialects and Romanized Forms through Machine Translation (Bhuiyan et al., BanglaLP 2025)
PDF: https://aclanthology.org/2025.banglalp-1.9.pdf