A Baseline System for Khasi and Assamese Bidirectional NMT with Zero available Parallel Data: Dataset Creation and System Development

Kashyap Kishore, Talukdar Kuwali, Ahmed Mazida, Boruah Parvez


Abstract
In this work, we build a baseline Neural Machine Translation (NMT) system for Khasi and Assamese in both directions. Both are considered low-resource Indic languages. Assamese belongs to the Indo-Aryan language family, while Khasi belongs to the Mon-Khmer branch of the Austroasiatic language family. No prior work has investigated the performance of NMT for these two diverse low-resource languages, and no parallel corpus or test data is available for the pair. The main contribution of this work is the creation of a Khasi-Assamese parallel corpus and test set. In addition, we build baseline systems in both directions for this language pair. The best bilingual evaluation understudy (BLEU) scores obtained are 2.78 for Khasi-to-Assamese translation and 5.51 for Assamese-to-Khasi translation. Applying phrase table injection (phrase augmentation) raises the BLEU scores to 5.01 and 7.28 for Khasi-to-Assamese and Assamese-to-Khasi translation, respectively.
Anthology ID:
2023.icon-1.69
Volume:
Proceedings of the 20th International Conference on Natural Language Processing (ICON)
Month:
December
Year:
2023
Address:
Goa University, Goa, India
Editors:
D. Pawar Jyoti, Lalitha Devi Sobha
Venue:
ICON
SIG:
SIGLEX
Publisher:
NLP Association of India (NLPAI)
Pages:
696–702
URL:
https://aclanthology.org/2023.icon-1.69
Cite (ACL):
Kashyap Kishore, Talukdar Kuwali, Ahmed Mazida, and Boruah Parvez. 2023. A Baseline System for Khasi and Assamese Bidirectional NMT with Zero available Parallel Data: Dataset Creation and System Development. In Proceedings of the 20th International Conference on Natural Language Processing (ICON), pages 696–702, Goa University, Goa, India. NLP Association of India (NLPAI).
Cite (Informal):
A Baseline System for Khasi and Assamese Bidirectional NMT with Zero available Parallel Data: Dataset Creation and System Development (Kishore et al., ICON 2023)
PDF:
https://aclanthology.org/2023.icon-1.69.pdf