BanglaSTEM: A Parallel Corpus and Term-Weighted Evaluation for Technical Bangla-English Translation

Kazi Reyazul Hasan; A. B. M. Alim Al Islam; Muhammad Abdullah Adnan

Abstract

Large language models excel at technical problem solving in English but struggle when questions are posed in Bangla. While translation offers a practical solution, existing Bangla-English systems frequently mistranslate specialized terminology, altering problem semantics and degrading downstream performance. We present BanglaSTEM, a dataset of 5,000 Bangla-English sentence pairs covering computer science, mathematics, physics, chemistry, and biology. Our pipeline extracts matching passages from official bilingual curriculum textbooks using OCR, then uses LLMs to align sentences and mark technical terms. These aligned examples serve as few-shot prompts for generating over 12,000 new translation pairs from LLMs, avoiding copyright issues. Human evaluators then select the best 5,000 pairs that correctly preserve technical terminology. We also test a term-weighted BLEU metric that gives higher weight to technical words, since standard metrics treat terminology errors and common word errors equally. We show that our weighted metric correlates better with downstream accuracy in code generation and math solving, while standard BLEU gives high scores even for wrong translations. The full implementation, dataset, and model will be made publicly available.
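To make the idea of term weighting concrete, the following is a minimal sketch of a unigram, BLEU-style score in which technical terms count more than ordinary words. It is not the formula used in the paper, which the abstract does not specify. The function name `term_weighted_bleu1`, the `term_weight` value of 3.0, whitespace tokenization, and the example sentences are all assumptions made for illustration.

```python
import math
from collections import Counter


def term_weighted_bleu1(candidate, reference, technical_terms, term_weight=3.0):
    """Unigram BLEU-style score where technical terms count term_weight times.

    Hypothetical illustration only; the weighting scheme is an assumption.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0

    cand_counts = Counter(cand)
    ref_counts = Counter(ref)

    def weight(token):
        return term_weight if token in technical_terms else 1.0

    # Clipped matches, as in BLEU's modified precision, scaled by token weight.
    matched = sum(weight(t) * min(c, ref_counts[t]) for t, c in cand_counts.items())
    total = sum(weight(t) * c for t, c in cand_counts.items())
    precision = matched / total

    # Standard BLEU brevity penalty for candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision


reference = "push the value onto the stack"
technical_terms = {"push", "stack"}            # assumed term list
term_error = "push the value onto the pile"    # technical term mistranslated
common_error = "push a value onto the stack"   # common word mistranslated

weighted_term = term_weighted_bleu1(term_error, reference, technical_terms)
weighted_common = term_weighted_bleu1(common_error, reference, technical_terms)
plain_term = term_weighted_bleu1(term_error, reference, technical_terms, term_weight=1.0)
plain_common = term_weighted_bleu1(common_error, reference, technical_terms, term_weight=1.0)
```

With `term_weight=1.0` the two erroneous candidates receive the same score, which mirrors the abstract's point that standard metrics treat terminology errors and common-word errors equally. With a higher term weight, the candidate that mistranslates "stack" scores lower than the one that mistranslates "the".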

Anthology ID: 2026.acl-srw.34
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Santosh T.Y.S.S., Juan Diego Rodriguez, Ona de Gibert
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 403–412
URL: https://aclanthology.org/2026.acl-srw.34/
Cite (ACL): Kazi Reyazul Hasan, A. B. M. Alim Al Islam, and Muhammad Abdullah Adnan. 2026. BanglaSTEM: A Parallel Corpus and Term-Weighted Evaluation for Technical Bangla-English Translation. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026), pages 403–412, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): BanglaSTEM: A Parallel Corpus and Term-Weighted Evaluation for Technical Bangla-English Translation (Hasan et al., ACL 2026)
PDF: https://aclanthology.org/2026.acl-srw.34.pdf
