Efficient Neural Machine Translation for Low-Resource Languages via Exploiting Related Languages

Vikrant Goyal, Sourav Kumar, Dipti Misra Sharma


Abstract
A large percentage of the world’s population speaks a language of the Indian subcontinent, which comprises languages from both the Indo-Aryan (e.g. Hindi, Punjabi, Gujarati) and Dravidian (e.g. Tamil, Telugu, Malayalam) families. A universal characteristic of Indian languages is their complex morphology, which, combined with the general lack of sufficient high-quality parallel data, makes developing machine translation (MT) systems for these languages difficult. Neural Machine Translation (NMT) is a rapidly advancing MT paradigm and has shown promising results for many language pairs, especially when large amounts of training data are available. Since large parallel corpora are not available for Indian-English language pairs, we present our efforts towards building efficient NMT systems between Indian languages (specifically Indo-Aryan languages) and English by efficiently exploiting parallel data from related languages. We propose a technique called Unified Transliteration and Subword Segmentation to leverage language similarity while exploiting parallel data from related language pairs. We also propose a Multilingual Transfer Learning technique that leverages parallel data from multiple related languages to assist translation for the low-resource language pair of interest. Our experiments demonstrate an overall average improvement of 5 BLEU points over standard Transformer-based NMT baselines.
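The abstract does not spell out how the unified transliteration step works. One common way to bring related Indo-Aryan scripts into a shared script, so that a single subword vocabulary can be learned over them, is to exploit the parallel layout of the Indic Unicode blocks. The Python sketch below illustrates that idea only and is not taken from the paper: the names `to_devanagari` and `SCRIPT_STARTS` are hypothetical, and the choice of Devanagari as the common script is an assumption. The mapping is approximate, since a few code points have no counterpart in the target block.

```python
# Illustrative sketch (an assumption, not the authors' implementation):
# map text in a related Indic script to Devanagari by Unicode block offset.

DEVANAGARI_START = 0x0900  # first code point of the Devanagari block
BLOCK_SIZE = 0x80          # each of these Indic blocks spans 128 code points

# Hypothetical lookup table: first code point of each source script's block.
SCRIPT_STARTS = {
    "gurmukhi": 0x0A00,  # script used for Punjabi
    "gujarati": 0x0A80,
}


def to_devanagari(text: str, script: str) -> str:
    """Shift characters of `script` into the Devanagari block.

    Characters outside the source block (spaces, Latin letters,
    punctuation) are passed through unchanged.
    """
    start = SCRIPT_STARTS[script]
    out = []
    for ch in text:
        offset = ord(ch) - start
        if 0 <= offset < BLOCK_SIZE:
            out.append(chr(DEVANAGARI_START + offset))
        else:
            out.append(ch)
    return "".join(out)
```

Once all related-language data is in one script, a shared subword model (for example BPE) could be trained on the combined text; that step is omitted here because the abstract gives no detail about it.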
Anthology ID:
2020.acl-srw.22
Volume:
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop
Month:
July
Year:
2020
Address:
Online
Editors:
Shruti Rijhwani, Jiangming Liu, Yizhong Wang, Rotem Dror
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
162–168
URL:
https://aclanthology.org/2020.acl-srw.22
DOI:
10.18653/v1/2020.acl-srw.22
Cite (ACL):
Vikrant Goyal, Sourav Kumar, and Dipti Misra Sharma. 2020. Efficient Neural Machine Translation for Low-Resource Languages via Exploiting Related Languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, pages 162–168, Online. Association for Computational Linguistics.
Cite (Informal):
Efficient Neural Machine Translation for Low-Resource Languages via Exploiting Related Languages (Goyal et al., ACL 2020)
PDF:
https://aclanthology.org/2020.acl-srw.22.pdf
Dataset:
 2020.acl-srw.22.Dataset.zip
Software:
 2020.acl-srw.22.Software.zip
Video:
 http://slideslive.com/38928675