Neural Machine Translation for Assamese-Bodo, a Low Resourced Indian Language Pair

Kuwali Talukdar; Shikhar Kr. Sarma; Farha Naznin; Kishore Kashyap; Mazida Akhtara Ahmed; Parvez Aziz Boruah

Abstract

Neural Machine Translation (NMT) has produced impressive results in various works on low-resource languages, where parallel datasets are relatively small. This work presents machine translation experiments on the low-resource Indian language pair Assamese-Bodo, using a relatively small amount of parallel data. Raw data is tokenized with the IndicNLP tool. The NMT model is trained on the preprocessed dataset, and its performance is observed under varying hyperparameters. Experiments were run with vocabulary sizes of 8000 and 16000. Doubling the vocabulary size produced a significant increase in BLEU score, and increasing the data size also improved overall performance. BLEU scores were first recorded after training on a dataset of 70000 parallel sentences, and the results were compared with a second round of training on a dataset augmented with 11500 items of Wordnet parallel data. A gold-standard test set of 500 sentences was used to record BLEU. The first round yielded an overall BLEU of 4.0 with a vocabulary size of 8000. With the same vocabulary size and the Wordnet-augmented dataset, a BLEU score of 4.33 was recorded. A significant increase in BLEU (to 6.94) was observed with a vocabulary size of 16000. The next round of experiments added 7000 new data items and filtered the entire dataset; the resulting BLEU was 9.68 with a vocabulary size of 16000. Cross-validation was also designed and performed using 8 folds prepared from the 80K total dataset. BLEU scores for fold 1 through fold 8 were 18.12, 16.28, 18.90, 19.25, 19.60, 18.43, 16.28, and 7.70. The BLEU score of the 8th fold deviates from the trend, possibly because the data in the last fold is non-homogeneous.
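To make the cross-validation figures easier to read, the following minimal Python sketch summarises the eight fold-level BLEU scores quoted in the abstract and shows one way an 80K dataset could be cut into eight chunks. The helper name `contiguous_folds` and the choice of contiguous, unshuffled chunks are assumptions made for illustration only; the abstract does not state how the folds were actually prepared.

```python
from statistics import mean

# BLEU per fold as reported in the abstract (fold 1 through fold 8).
fold_bleu = [18.12, 16.28, 18.90, 19.25, 19.60, 18.43, 16.28, 7.70]


def contiguous_folds(n_items, k):
    """Split indices 0..n_items-1 into k contiguous chunks of near-equal size.

    Hypothetical helper: contiguous chunking is an assumption, not the
    procedure documented by the authors.
    """
    base, extra = divmod(n_items, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(range(start, start + size))
        start += size
    return folds


# 80K sentence pairs split into 8 chunks of 10,000 each.
folds = contiguous_folds(80_000, 8)

# Average BLEU over all folds, and over the first seven only,
# to show how much the last fold pulls the average down.
mean_all = mean(fold_bleu)
mean_first7 = mean(fold_bleu[:-1])

print(f"fold sizes: {[len(f) for f in folds]}")
print(f"mean BLEU, all 8 folds:   {mean_all:.2f}")
print(f"mean BLEU, folds 1 to 7:  {mean_first7:.2f}")
```

The average over all eight folds is about 16.82, while the first seven folds alone average about 18.12, which quantifies how far the 8th fold departs from the rest.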

Anthology ID: 2023.icon-1.71
Volume: Proceedings of the 20th International Conference on Natural Language Processing (ICON)
Month: December
Year: 2023
Address: Goa University, Goa, India
Editors: Jyoti D. Pawar, Sobha Lalitha Devi
Venue: ICON
SIG: SIGLEX
Publisher: NLP Association of India (NLPAI)
Pages: 714–719
URL: https://aclanthology.org/2023.icon-1.71/
Cite (ACL): Kuwali Talukdar, Shikhar Kumar Sarma, Farha Naznin, Kishore Kashyap, Mazida Akhtara Ahmed, and Parvez Aziz Boruah. 2023. Neural Machine Translation for Assamese-Bodo, a Low Resourced Indian Language Pair. In Proceedings of the 20th International Conference on Natural Language Processing (ICON), pages 714–719, Goa University, Goa, India. NLP Association of India (NLPAI).
Cite (Informal): Neural Machine Translation for Assamese-Bodo, a Low Resourced Indian Language Pair (Talukdar et al., ICON 2023)
PDF: https://aclanthology.org/2023.icon-1.71.pdf
