MUCS@Adap-MT 2020: Low Resource Domain Adaptation for Indic Machine Translation

Asha Hegde, H.l. Shashirekha


Abstract
Machine Translation (MT) is the task of automatically converting text in a source language into text in a target language while preserving its meaning. MT usually requires a large corpus for training the translation models. Owing to the scarcity of resources, little attention has been given to translating into low resource languages, and into Indic languages in particular. To address this, a shared task called “Adap-MT 2020: Low Resource Domain Adaptation for Indic Machine Translation” was organized to illustrate the capability of general domain MT when translating into Indic languages and the low resource domain adaptation of MT systems. In this paper, we, team MUCS, describe a simple word extraction based domain adaptation approach applied to English-Hindi MT only. MT in the proposed model is carried out using Open-NMT, a popular Neural Machine Translation tool. A general domain corpus is built by combining the available English-Hindi corpora and removing duplicate sentences. The domain specific corpus is then extended by extracting from the general domain corpus those sentences that contain words found in the domain specific corpus. On the small domain specific AI and CHE corpora provided by the organizers, the proposed model obtained BLEU scores of 1.25 and 2.72 respectively, which we consider satisfactory. Further, this methodology is quite generic and can easily be extended to other low resource language pairs.
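The corpus preparation described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names, whitespace tokenization, lowercasing, optional stopword filter, and matching on the English side only are all assumptions, and the target strings are placeholders rather than real Hindi translations.

```python
def deduplicate(pairs):
    """Drop repeated (source, target) pairs, keeping the first occurrence."""
    seen = set()
    unique = []
    for pair in pairs:
        if pair not in seen:
            seen.add(pair)
            unique.append(pair)
    return unique


def domain_vocabulary(domain_pairs, stopwords=frozenset()):
    """Collect lowercased source-side words of the domain corpus (assumed tokenization: whitespace)."""
    vocab = set()
    for src, _ in domain_pairs:
        vocab.update(w for w in src.lower().split() if w not in stopwords)
    return vocab


def augment_domain_corpus(domain_pairs, general_pairs, stopwords=frozenset()):
    """Extend the domain corpus with general-corpus pairs sharing at least one domain word."""
    vocab = domain_vocabulary(domain_pairs, stopwords)
    extracted = [
        (src, tgt)
        for src, tgt in deduplicate(general_pairs)
        if vocab & set(src.lower().split())
    ]
    return deduplicate(list(domain_pairs) + extracted)


# Hypothetical toy data; "<hi-N>" stands in for a Hindi sentence.
general = [
    ("the reaction is fast", "<hi-1>"),
    ("the reaction is fast", "<hi-1>"),  # duplicate, removed
    ("birds sing", "<hi-2>"),            # shares no domain word, not extracted
]
domain = [("acid reaction", "<hi-3>")]
augmented = augment_domain_corpus(domain, general)
```

In practice a stopword list would likely be needed, since very common words such as "the" would otherwise match almost every general-domain sentence; whether the original system applied such a filter is not stated in the abstract.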
Anthology ID:
2020.icon-adapmt.5
Volume:
Proceedings of the 17th International Conference on Natural Language Processing (ICON): Adap-MT 2020 Shared Task
Month:
December
Year:
2020
Address:
Patna, India
Editors:
Dipti Misra Sharma, Asif Ekbal, Karunesh Arora, Sudip Kumar Naskar, Dipankar Ganguly, Sobha L, Radhika Mamidi, Sunita Arora, Pruthwik Mishra, Vandan Mujadia
Venue:
ICON
Publisher:
NLP Association of India (NLPAI)
Pages:
24–28
URL:
https://aclanthology.org/2020.icon-adapmt.5
Cite (ACL):
Asha Hegde and H.l. Shashirekha. 2020. MUCS@Adap-MT 2020: Low Resource Domain Adaptation for Indic Machine Translation. In Proceedings of the 17th International Conference on Natural Language Processing (ICON): Adap-MT 2020 Shared Task, pages 24–28, Patna, India. NLP Association of India (NLPAI).
Cite (Informal):
MUCS@Adap-MT 2020: Low Resource Domain Adaptation for Indic Machine Translation (Hegde & Shashirekha, ICON 2020)
PDF:
https://aclanthology.org/2020.icon-adapmt.5.pdf