MTNLP-IIITH: Machine Translation for Low-Resource Indic Languages

Abhinav P M, Ketaki Shetye, Parameswari Krishnamurthy


Abstract
Machine Translation for low-resource languages presents significant challenges, primarily due to limited data availability. We develop two systems: a baseline model and a primary model. For the baseline, we fine-tune the mBART model (mbart-large-50-many-to-many-mmt) on the language pairs English-Khasi, Khasi-English, English-Manipuri, and Manipuri-English. We then augment the dataset by back-translating from the Indic languages into English. To improve data quality, we fine-tune the LaBSE model for Khasi and Manipuri, generate sentence embeddings, and apply a cosine similarity threshold of 0.84 to filter out low-quality back-translations. The filtered data is combined with the original training data and used to further fine-tune the mBART model, yielding our primary model. The results show that the primary model slightly outperforms the baseline, with the best performance achieved by the English-to-Khasi (en-kh) primary model, which obtains a BLEU score of 0.0492, a chrF score of 0.3316, and a METEOR score of 0.2589 (on a scale of 0 to 1); other language pairs show similar results.
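
A minimal sketch of the similarity-based filtering step described above, assuming the publicly available LaBSE checkpoint from sentence-transformers rather than the authors' fine-tuned weights; the sentence lists and variable names are hypothetical illustrations, not the released code.

from sentence_transformers import SentenceTransformer

THRESHOLD = 0.84  # cosine-similarity cut-off reported in the abstract

# Hypothetical parallel lists: Indic source sentences and their English back-translations.
indic_sentences = ["Khasi or Manipuri sentence 1", "Khasi or Manipuri sentence 2"]
back_translations = ["English back-translation 1", "English back-translation 2"]

model = SentenceTransformer("sentence-transformers/LaBSE")

# Normalized embeddings, so the elementwise dot product equals cosine similarity.
src_emb = model.encode(indic_sentences, convert_to_tensor=True, normalize_embeddings=True)
bt_emb = model.encode(back_translations, convert_to_tensor=True, normalize_embeddings=True)
scores = (src_emb * bt_emb).sum(dim=1)

# Keep only sentence pairs whose similarity clears the threshold.
filtered_pairs = [
    (src, bt)
    for src, bt, score in zip(indic_sentences, back_translations, scores.tolist())
    if score >= THRESHOLD
]

In the paper's pipeline, the surviving pairs are merged with the original training data before the second round of mBART fine-tuning that produces the primary model.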
Anthology ID:
2024.wmt-1.65
Volume:
Proceedings of the Ninth Conference on Machine Translation
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Barry Haddow, Tom Kocmi, Philipp Koehn, Christof Monz
Venue:
WMT
Publisher:
Association for Computational Linguistics
Pages:
751–755
URL:
https://aclanthology.org/2024.wmt-1.65
Cite (ACL):
Abhinav P M, Ketaki Shetye, and Parameswari Krishnamurthy. 2024. MTNLP-IIITH: Machine Translation for Low-Resource Indic Languages. In Proceedings of the Ninth Conference on Machine Translation, pages 751–755, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
MTNLP-IIITH: Machine Translation for Low-Resource Indic Languages (P M et al., WMT 2024)
PDF:
https://aclanthology.org/2024.wmt-1.65.pdf