The Bangla/Bengali Seed Dataset Submission to the WMT24 Open Language Data Initiative Shared Task

Firoz Ahmed, Nitin Venkateswaran, Sarah Moeller


Abstract
We contribute a seed dataset for the Bangla/Bengali language as part of the WMT24 Open Language Data Initiative shared task. We validate the quality of the dataset against a mined and automatically aligned dataset (NLLBv1) and two other existing datasets of crowdsourced manual translations. The validation is performed by investigating the performance of state-of-the-art translation models fine-tuned on the different datasets after controlling for training set size. Machine translation models fine-tuned on our dataset outperform models tuned on the other datasets in both translation directions (English-Bangla and Bangla-English). These results confirm the quality of our dataset. We hope our dataset will support machine translation for the Bangla/Bengali community and related low-resource languages.
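The abstract states that the comparison controls for training set size but does not say how. One common way to do this is to downsample every corpus to the size of the smallest one before fine-tuning. The following is a minimal sketch under that assumption, not the authors' procedure: the function name `subsample_to_common_size`, the dataset labels, and the toy sentence pairs are all hypothetical.

```python
import random


def subsample_to_common_size(datasets, seed=0):
    """Downsample each parallel corpus to the size of the smallest one.

    `datasets` maps a (hypothetical) dataset name to a list of
    (english, bangla) sentence pairs. Sampling is without replacement
    and seeded so the subsets are reproducible.
    """
    target = min(len(pairs) for pairs in datasets.values())
    rng = random.Random(seed)
    return {name: rng.sample(pairs, target) for name, pairs in datasets.items()}


# Toy stand-ins for the corpora being compared (placeholder data only).
datasets = {
    "seed": [(f"en {i}", f"bn {i}") for i in range(5)],
    "mined": [(f"en {i}", f"bn {i}") for i in range(20)],
    "crowdsourced": [(f"en {i}", f"bn {i}") for i in range(8)],
}

balanced = subsample_to_common_size(datasets)
```

Each entry of `balanced` then holds the same number of sentence pairs, so differences in downstream translation quality are less likely to reflect corpus size alone.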
Anthology ID:
2024.wmt-1.42
Volume:
Proceedings of the Ninth Conference on Machine Translation
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Barry Haddow, Tom Kocmi, Philipp Koehn, Christof Monz
Venue:
WMT
Publisher:
Association for Computational Linguistics
Pages:
556–566
URL:
https://aclanthology.org/2024.wmt-1.42
Cite (ACL):
Firoz Ahmed, Nitin Venkateswaran, and Sarah Moeller. 2024. The Bangla/Bengali Seed Dataset Submission to the WMT24 Open Language Data Initiative Shared Task. In Proceedings of the Ninth Conference on Machine Translation, pages 556–566, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
The Bangla/Bengali Seed Dataset Submission to the WMT24 Open Language Data Initiative Shared Task (Ahmed et al., WMT 2024)
PDF:
https://aclanthology.org/2024.wmt-1.42.pdf