Iterative Data Augmentation for Neural Machine Translation: a Low Resource Case Study for English-Telugu

Sandipan Dandapat; Christian Federmann

Abstract

Telugu is the fifteenth most commonly spoken language in the world, with an estimated reach of 75 million people in the Indian subcontinent. At the same time, it is a severely low-resource language. In this paper, we present work on English–Telugu general-domain machine translation (MT) systems using small amounts of parallel data. The baseline statistical (SMT) and neural MT (NMT) systems do not yield acceptable translation quality, mostly due to limited resources. However, the use of synthetic parallel data (generated using back translation, based on an NMT engine) significantly improves translation quality and allows NMT to outperform SMT. We extend back translation and propose a new, iterative data augmentation (IDA) method. Filtering of synthetic data and IDA both further boost the translation quality of our final NMT systems, as measured by BLEU scores on all test sets and by state-of-the-art human evaluation.
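
The abstract names back translation, filtering of synthetic data, and an iterative loop, but does not spell out the procedure. The following is a minimal sketch of one plausible reading, not the authors' implementation: the function `iterative_augmentation`, its parameters (`rounds`, `threshold`), and the toy helpers `toy_train`, `toy_translate`, and `toy_score` are all hypothetical names introduced here for illustration, and the word-for-word "model" merely stands in for an NMT engine.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def iterative_augmentation(
    parallel: List[Pair],
    mono_target: List[str],
    train: Callable[[List[Pair]], object],
    translate: Callable[[object, str], str],
    score: Callable[[object, str, str], float],
    rounds: int = 2,
    threshold: float = 0.5,
) -> List[Pair]:
    """Return the real parallel data plus filtered synthetic pairs.

    Assumed loop (hypothetical, not taken from the paper): train a
    target-to-source model, back translate monolingual target text,
    keep only pairs that pass a score threshold, retrain, repeat.
    """
    data = list(parallel)
    for _ in range(rounds):
        # Train the reverse (target -> source) model on the current data.
        reverse_model = train([(tgt, src) for src, tgt in data])
        synthetic = []
        for tgt in mono_target:
            src = translate(reverse_model, tgt)  # back translation
            if score(reverse_model, tgt, src) >= threshold:  # filtering
                synthetic.append((src, tgt))
        # Rebuild from the real data so older synthetic pairs are replaced.
        data = list(parallel) + synthetic
    return data


# Toy stand-ins for an NMT engine (assumptions for illustration only).
def toy_train(pairs: List[Pair]) -> dict:
    """Build a word-for-word lexicon from equal-length sentence pairs."""
    lexicon = {}
    for a, b in pairs:
        words_a, words_b = a.split(), b.split()
        if len(words_a) == len(words_b):
            lexicon.update(zip(words_a, words_b))
    return lexicon


def toy_translate(lexicon: dict, sentence: str) -> str:
    """Translate word by word, marking unknown words."""
    return " ".join(lexicon.get(w, "<unk>") for w in sentence.split())


def toy_score(lexicon: dict, sentence: str, translation: str) -> float:
    """Fraction of input words covered by the lexicon."""
    words = sentence.split()
    return sum(w in lexicon for w in words) / len(words)


parallel = [("red house", "rot haus"), ("big dog", "gross hund")]
mono_target = ["gross haus", "kleine katze"]
augmented = iterative_augmentation(
    parallel, mono_target, toy_train, toy_translate, toy_score
)
```

In this toy run, the fully covered sentence yields a synthetic pair that is kept, while the sentence made only of unknown words is filtered out. A real setup would replace the three toy callables with an actual NMT training, decoding, and confidence-scoring pipeline.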

Anthology ID: 2018.eamt-main.29
Volume: Proceedings of the 21st Annual Conference of the European Association for Machine Translation
Month: May
Year: 2018
Address: Alicante, Spain
Editors: Juan Antonio Pérez-Ortiz, Felipe Sánchez-Martínez, Miquel Esplà-Gomis, Maja Popović, Celia Rico, André Martins, Joachim Van den Bogaert, Mikel L. Forcada
Venue: EAMT
Pages: 307–312
URL: https://aclanthology.org/2018.eamt-main.29/
Cite (ACL): Sandipan Dandapat and Christian Federmann. 2018. Iterative Data Augmentation for Neural Machine Translation: a Low Resource Case Study for English-Telugu. In Proceedings of the 21st Annual Conference of the European Association for Machine Translation, pages 307–312, Alicante, Spain.
Cite (Informal): Iterative Data Augmentation for Neural Machine Translation: a Low Resource Case Study for English-Telugu (Dandapat & Federmann, EAMT 2018)
PDF: https://aclanthology.org/2018.eamt-main.29.pdf
