Neural Machine Translation for a Low Resource Language Pair: English-Bodo

Boruah Parvez, Talukdar Kuwali, Ahmed Mazida, Kashyap Kishore


Abstract
This paper presents work on Neural Machine Translation for the English-Bodo language pair. English is spoken around the world, whereas Bodo is spoken mostly in the north-eastern region of India. The work is carried out on a relatively small amount of parallel data, as little parallel corpus is available for the English-Bodo pair. The corpus is drawn mainly from available sources, National Platform of Language Technology (NPLT), Data Management Unit (DMU), Mission Bhashini, Ministry of Electronics and Information Technology, and is also generated in-house. Raw text is tokenized using the IndicNLP library for Bodo and mosesdecoder for English. Subword tokenization is performed using BPE (Byte Pair Encoding), SentencePiece, and WordPiece. Experiments are run with two vocabulary sizes, 8000 and 16000, on a total of around 92,410 parallel sentences. Two standard Transformer encoder-decoder models with varying numbers of layers and hidden sizes are built and trained using the OpenNMT-py framework. The results are evaluated by BLEU score on an additional test set. The highest BLEU scores achieved on the test set are 11.01 for English-to-Bodo and 14.62 for Bodo-to-English translation.
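To illustrate the subword step mentioned in the abstract, the following is a minimal pure-Python sketch of how byte pair encoding learns and applies merge rules. It is a toy illustration only, not the authors' pipeline, which relies on existing toolkits; the function names `learn_bpe` and `apply_bpe` and the `</w>` end-of-word marker are assumptions made for this example. In a real setup the vocabulary size (8000 or 16000 here) roughly bounds the number of merges learned.

```python
from collections import Counter


def learn_bpe(words, num_merges):
    """Learn up to num_merges merge rules from a list of word tokens (toy sketch)."""
    # Each word becomes a tuple of characters plus a hypothetical end-of-word marker.
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Pick the most frequent pair; ties are broken deterministically.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[tuple(_merge(list(symbols), best))] += freq
        vocab = new_vocab
    return merges


def _merge(symbols, pair):
    """Replace each adjacent occurrence of pair in symbols with the joined symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def apply_bpe(word, merges):
    """Segment a single word by replaying the learned merges in order."""
    symbols = list(word) + ["</w>"]
    for pair in merges:
        symbols = _merge(symbols, pair)
    return symbols
```

Calling `learn_bpe` on a tokenized training corpus and then `apply_bpe` on each word yields subword units; unseen words fall back to smaller pieces, down to single characters, which is what makes the approach attractive when parallel data is scarce.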
Anthology ID:
2023.icon-1.21
Volume:
Proceedings of the 20th International Conference on Natural Language Processing (ICON)
Month:
December
Year:
2023
Address:
Goa University, Goa, India
Editors:
D. Pawar Jyoti, Lalitha Devi Sobha
Venue:
ICON
SIG:
SIGLEX
Publisher:
NLP Association of India (NLPAI)
Pages:
295–300
URL:
https://aclanthology.org/2023.icon-1.21
Cite (ACL):
Boruah Parvez, Talukdar Kuwali, Ahmed Mazida, and Kashyap Kishore. 2023. Neural Machine Translation for a Low Resource Language Pair: English-Bodo. In Proceedings of the 20th International Conference on Natural Language Processing (ICON), pages 295–300, Goa University, Goa, India. NLP Association of India (NLPAI).
Cite (Informal):
Neural Machine Translation for a Low Resource Language Pair: English-Bodo (Parvez et al., ICON 2023)
PDF:
https://aclanthology.org/2023.icon-1.21.pdf