SPRING Lab IITM’s Submission to Low Resource Indic Language Translation Shared Task

Hamees Sayed, Advait Joglekar, Srinivasan Umesh


Abstract
We develop a robust translation model for four low-resource Indic languages: Khasi, Mizo, Manipuri, and Assamese. Our approach includes a comprehensive pipeline from data collection and preprocessing to training and evaluation, leveraging data from WMT task datasets, BPCC, PMIndia, and OpenLanguageData. To address the scarcity of bilingual data, we use back-translation techniques on monolingual datasets for Mizo and Khasi, significantly expanding our training corpus. We fine-tune the pre-trained NLLB 3.3B model for Assamese, Mizo, and Manipuri, achieving improved performance over the baseline. For Khasi, which is not supported by the NLLB model, we introduce special tokens and train the model on our Khasi corpus. Our training involves masked language modelling, followed by fine-tuning for English-to-Indic and Indic-to-English translations.
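The back-translation step mentioned in the abstract can be illustrated with a minimal sketch. It assumes a target-to-English translation function is available; `back_translate`, `toy_to_english`, and the placeholder sentences are hypothetical names introduced here for illustration and are not taken from the paper, and the stand-in translator does no real translation.

```python
from typing import Callable, Iterable, List, Tuple


def back_translate(
    monolingual: Iterable[str],
    to_english: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Build synthetic (English, target) pairs from monolingual target text.

    The target side stays human-written; only the English side is
    machine-generated, which is the usual back-translation setup.
    """
    pairs: List[Tuple[str, str]] = []
    for target_sentence in monolingual:
        target_sentence = target_sentence.strip()
        if not target_sentence:
            continue  # skip empty lines in the monolingual corpus
        synthetic_source = to_english(target_sentence).strip()
        if synthetic_source:
            pairs.append((synthetic_source, target_sentence))
    return pairs


def toy_to_english(sentence: str) -> str:
    # Hypothetical stand-in for a real target-to-English model.
    return "EN(" + sentence + ")"


# Placeholder strings standing in for monolingual Mizo or Khasi sentences.
synthetic_pairs = back_translate(
    ["target sentence one", "   ", "target sentence two"],
    toy_to_english,
)
```

In practice the resulting pairs would be mixed with the genuine bilingual data before fine-tuning; how the authors filtered or weighted synthetic pairs is not stated in the abstract.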
Anthology ID:
2024.wmt-1.68
Volume:
Proceedings of the Ninth Conference on Machine Translation
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Barry Haddow, Tom Kocmi, Philipp Koehn, Christof Monz
Venue:
WMT
Publisher:
Association for Computational Linguistics
Pages:
770–774
URL:
https://aclanthology.org/2024.wmt-1.68
Cite (ACL):
Hamees Sayed, Advait Joglekar, and Srinivasan Umesh. 2024. SPRING Lab IITM’s Submission to Low Resource Indic Language Translation Shared Task. In Proceedings of the Ninth Conference on Machine Translation, pages 770–774, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
SPRING Lab IITM’s Submission to Low Resource Indic Language Translation Shared Task (Sayed et al., WMT 2024)
PDF:
https://aclanthology.org/2024.wmt-1.68.pdf