EduMT: Developing Machine Translation System for Educational Content in Indian Languages

Ramakrishna Appicharla, Asif Ekbal, Pushpak Bhattacharyya


Abstract
In this paper, we explore various approaches to building Hindi-to-Bengali Neural Machine Translation (NMT) systems for the educational domain. Translating educational content poses several challenges: the unavailability of gold-standard data for model building, extensive use of domain-specific terms, and two sources of noise, namely spontaneous speech (the corpus is prepared from subtitle data) and errors introduced when the corpus is created through back-translation. We create an educational parallel corpus by crawling lecture subtitles and translating them into Hindi and Bengali using Google Translate. We also create a clean parallel corpus by post-editing the synthetic corpus through annotation and crowd-sourcing. We build NMT systems on the prepared corpus with domain adaptation objectives. We also explore data augmentation by automatically cleaning the synthetic corpus and using it to further train the models. We experiment with combining the domain adaptation objective with multilingual NMT. We report BLEU and TER scores for all models on a manually created Hindi-Bengali educational test set. Our experiments show that the multilingual domain adaptation model outperforms all other models, achieving a BLEU score of 34.8 and a TER score of 0.466.
Anthology ID:
2021.icon-main.6
Volume:
Proceedings of the 18th International Conference on Natural Language Processing (ICON)
Month:
December
Year:
2021
Address:
National Institute of Technology Silchar, Silchar, India
Editors:
Sivaji Bandyopadhyay, Sobha Lalitha Devi, Pushpak Bhattacharyya
Venue:
ICON
Publisher:
NLP Association of India (NLPAI)
Pages:
35–43
URL:
https://aclanthology.org/2021.icon-main.6
Cite (ACL):
Ramakrishna Appicharla, Asif Ekbal, and Pushpak Bhattacharyya. 2021. EduMT: Developing Machine Translation System for Educational Content in Indian Languages. In Proceedings of the 18th International Conference on Natural Language Processing (ICON), pages 35–43, National Institute of Technology Silchar, Silchar, India. NLP Association of India (NLPAI).
Cite (Informal):
EduMT: Developing Machine Translation System for Educational Content in Indian Languages (Appicharla et al., ICON 2021)
PDF:
https://aclanthology.org/2021.icon-main.6.pdf
Data
Samanantar