Firoz Ahmed
2024
The Bangla/Bengali Seed Dataset Submission to the WMT24 Open Language Data Initiative Shared Task
Firoz Ahmed | Nitin Venkateswaran | Sarah Moeller
Proceedings of the Ninth Conference on Machine Translation
We contribute a seed dataset for the Bangla/Bengali language as part of the WMT24 Open Language Data Initiative shared task. We validate the quality of the dataset against a mined and automatically aligned dataset (NLLBv1) and two other existing datasets of crowdsourced manual translations. To do so, we compare the performance of state-of-the-art translation models fine-tuned on each dataset, controlling for training set size. Machine translation models fine-tuned on our dataset outperform models fine-tuned on the other datasets in both translation directions (English-Bangla and Bangla-English). These results confirm the quality of our dataset. We hope our dataset will support machine translation for the Bangla/Bengali community and related low-resource languages.
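The abstract does not say how training set size was controlled. One common approach, sketched here purely as an assumption, is to randomly subsample every dataset down to the size of the smallest one before fine-tuning. The function name `subsample_to_common_size`, the dataset labels, and the toy sentence pairs are hypothetical placeholders, not the authors' code or data.

```python
import random


def subsample_to_common_size(datasets, size=None, seed=0):
    """Cut each parallel dataset down to the same number of sentence pairs.

    datasets: dict mapping a dataset label to a list of (source, target) pairs.
    size: target number of pairs; defaults to the smallest dataset's size.
    seed: fixed seed so the comparison is reproducible.
    """
    target = min(len(pairs) for pairs in datasets.values()) if size is None else size
    balanced = {}
    for name, pairs in datasets.items():
        if len(pairs) < target:
            raise ValueError(f"{name} has only {len(pairs)} pairs, need {target}")
        # A fresh generator per dataset keeps each draw independent of dict order.
        rng = random.Random(seed)
        balanced[name] = rng.sample(pairs, target)
    return balanced


# Toy placeholder data (hypothetical sizes, not the real corpora).
toy = {
    "seed": [(f"en {i}", f"bn {i}") for i in range(5)],
    "mined": [(f"en {i}", f"bn {i}") for i in range(50)],
    "crowdsourced": [(f"en {i}", f"bn {i}") for i in range(20)],
}
balanced = subsample_to_common_size(toy)
print({name: len(pairs) for name, pairs in balanced.items()})
```

With every training set the same size, any difference in the fine-tuned models' translation scores is easier to attribute to data quality than to data quantity.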