Synthetic Data and Model Dynamics based Performance Analysis for Assamese-Bodo Low Resource NMT

Kuwali Talukdar; Shikhar Kumar Sarma; Kishore Kashyap

Abstract

This paper describes the modelling and performance analysis of Neural Machine Translation (NMT) for the low-resource Assamese-Bodo language pair, focusing on model tuning and the use of synthetic data. Because parallel corpora for these languages are scarce, synthetic data generation techniques such as back-translation were employed to improve translation performance. The NMT architecture was used together with the preprocessing steps required by the NMT pipeline. Experiments were run across varying model parameters and the scores recorded. Performance was evaluated using the BLEU score, which improved significantly when synthetic data was incorporated into training. While a base model trained on a relatively small gold-standard dataset yielded an overall BLEU of 11.35, the tuned model trained with synthetic data produced considerable improvements in BLEU across domains, with an overall BLEU of up to 14.74. Challenges related to data scarcity and model optimization are also discussed, along with potential future improvements.
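As an illustration of the back-translation idea mentioned in the abstract, the following is a minimal sketch, not the authors' pipeline. It assumes that monolingual Bodo (target-side) sentences are translated into Assamese by some reverse model, and that the resulting synthetic pairs are merged with the gold parallel data. The names `back_translate`, `build_training_set`, and `toy_reverse_model` are hypothetical, and the toy model merely tags its input so the example runs without a trained system.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def back_translate(
    target_monolingual: Iterable[str],
    reverse_translate: Callable[[str], str],
) -> List[Pair]:
    """Create synthetic (source, target) pairs from target-side monolingual text.

    `reverse_translate` stands in for a target-to-source model (assumption:
    any callable mapping one sentence to one sentence).
    """
    pairs: List[Pair] = []
    for sentence in target_monolingual:
        target = sentence.strip()
        if not target:
            continue  # skip empty lines
        synthetic_source = reverse_translate(target).strip()
        if synthetic_source:
            # The target side is genuine text; only the source side is synthetic.
            pairs.append((synthetic_source, target))
    return pairs


def build_training_set(gold: Iterable[Pair], synthetic: Iterable[Pair]) -> List[Pair]:
    """Concatenate gold and synthetic pairs, dropping exact duplicates in order."""
    seen = set()
    combined: List[Pair] = []
    for pair in list(gold) + list(synthetic):
        if pair not in seen:
            seen.add(pair)
            combined.append(pair)
    return combined


def toy_reverse_model(sentence: str) -> str:
    """Placeholder for a real reverse NMT model (hypothetical)."""
    return "as:" + sentence


if __name__ == "__main__":
    gold_pairs = [("as:a", "a")]
    synthetic_pairs = back_translate(["a", "", " b "], toy_reverse_model)
    print(build_training_set(gold_pairs, synthetic_pairs))
```

In practice the reverse model would itself be trained on the available gold parallel data, and the synthetic pairs are often filtered or tagged before being mixed in; whether the paper does so is not stated in the abstract.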

Anthology ID: 2024.icon-1.20
Volume: Proceedings of the 21st International Conference on Natural Language Processing (ICON)
Month: December
Year: 2024
Address: AU-KBC Research Centre, Chennai, India
Editors: Sobha Lalitha Devi, Karunesh Arora
Venue: ICON
Publisher: NLP Association of India (NLPAI)
Pages: 178–187
URL: https://aclanthology.org/2024.icon-1.20/
Cite (ACL): Kuwali Talukdar, Shikhar Kumar Sarma, and Kishore Kashyap. 2024. Synthetic Data and Model Dynamics based Performance Analysis for Assamese-Bodo Low Resource NMT. In Proceedings of the 21st International Conference on Natural Language Processing (ICON), pages 178–187, AU-KBC Research Centre, Chennai, India. NLP Association of India (NLPAI).
Cite (Informal): Synthetic Data and Model Dynamics based Performance Analysis for Assamese-Bodo Low Resource NMT (Talukdar et al., ICON 2024)
PDF: https://aclanthology.org/2024.icon-1.20.pdf
