Multilingual Multi-Domain NMT for Indian Languages

Sourav Kumar, Salil Aggarwal, Dipti Sharma


Abstract
India is known as a land of many tongues and dialects. Neural machine translation (NMT) is the current state-of-the-art approach to machine translation (MT), but it performs well only with large datasets, which Indian languages usually lack, making the approach infeasible for them. In this paper, we address this data scarcity by efficiently training multilingual and multilingual multi-domain NMT systems involving languages of the Indian subcontinent. We propose a technique for using joint domain and language tags in a multilingual setup. We draw three major conclusions from our experiments: (i) training a multilingual system that exploits lexical similarity based on language family yields an overall average improvement of 3.25 BLEU points over bilingual baselines; (ii) incorporating domain information into the language tokens gives the multilingual multi-domain system a significant average improvement of 6 BLEU points over the baselines; (iii) multistage fine-tuning yields a further improvement of 1-1.5 BLEU points for the language pair of interest.
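The idea of folding domain information into the language token can be illustrated with a minimal preprocessing sketch. The tag format (`<2hi_health>`), the function names, and the example domain label are assumptions made for illustration only; the abstract does not specify how the tags are spelled or where they are placed.

```python
from typing import Optional


def make_tag(target_lang: str, domain: Optional[str] = None) -> str:
    """Build a control token for the source side.

    Hypothetical format: <2hi> for language only, <2hi_health> when a
    domain label is joined to the language tag.
    """
    if domain:
        return f"<2{target_lang}_{domain}>"
    return f"<2{target_lang}>"


def tag_source(sentence: str, target_lang: str, domain: Optional[str] = None) -> str:
    """Prepend the (joint) tag to a source sentence before training or decoding."""
    return f"{make_tag(target_lang, domain)} {sentence.strip()}"


if __name__ == "__main__":
    # Language-only tag, as in a plain multilingual system.
    print(tag_source("The patient has a fever.", "hi"))
    # Joint language and domain tag, as in a multilingual multi-domain system.
    print(tag_source("The patient has a fever.", "hi", "health"))
```

With a single joint token, one shared model can be steered toward both the target language and the domain without any change to the network architecture; only the training and inference data need the extra prefix.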
Anthology ID:
2021.ranlp-1.83
Volume:
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)
Month:
September
Year:
2021
Address:
Held Online
Venue:
RANLP
Publisher:
INCOMA Ltd.
Pages:
727–733
URL:
https://aclanthology.org/2021.ranlp-main.83
PDF:
https://aclanthology.org/2021.ranlp-main.83.pdf