Improving Low-Resource Neural Machine Translation with Filtered Pseudo-Parallel Corpus

Aizhan Imankulova, Takayuki Sato, Mamoru Komachi


Abstract
Large-scale parallel corpora are indispensable for training highly accurate machine translation systems. However, manually constructed large-scale parallel corpora are not freely available for many language pairs. Previous studies have expanded training data with a pseudo-parallel corpus obtained by machine-translating a monolingual corpus in the target language. However, for low-resource language pairs, where only low-accuracy machine translation systems are available, translation quality is reduced when such a pseudo-parallel corpus is used naively. To improve machine translation performance for low-resource language pairs, we propose a method that expands the training data effectively by filtering the pseudo-parallel corpus with a quality estimate based on back-translation. In experiments on three language pairs with small, medium, and large parallel corpora, the language pairs with less training data filtered out more sentence pairs and obtained larger BLEU improvements.
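The abstract describes filtering back-translated pseudo-parallel data by a quality estimate. Below is a minimal illustrative sketch, assuming the quality estimate is a sentence-level BLEU score between each monolingual target sentence and its round-trip translation; the function names `translate_tgt2src`, `translate_src2tgt`, and the threshold value are placeholders, not taken from the paper or its released code.

```python
# Illustrative sketch of back-translation-based filtering of a pseudo-parallel corpus.
# The two translate_* callables stand in for the target->source and source->target
# NMT systems; sentence-level BLEU is assumed as the quality-estimation score.
from sacrebleu import sentence_bleu


def filter_pseudo_parallel(monolingual_tgt, translate_tgt2src, translate_src2tgt,
                           threshold=30.0):
    """Keep pseudo sentence pairs whose round-trip translation reproduces
    the original target sentence well enough (hypothetical threshold)."""
    kept = []
    for tgt in monolingual_tgt:
        pseudo_src = translate_tgt2src(tgt)         # target -> pseudo source
        round_trip = translate_src2tgt(pseudo_src)  # pseudo source -> target again
        score = sentence_bleu(round_trip, [tgt]).score
        if score >= threshold:                      # keep only high-quality pairs
            kept.append((pseudo_src, tgt))
    return kept
```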
Anthology ID:
W17-5704
Volume:
Proceedings of the 4th Workshop on Asian Translation (WAT2017)
Month:
November
Year:
2017
Address:
Taipei, Taiwan
Editors:
Toshiaki Nakazawa, Isao Goto
Venue:
WAT
Publisher:
Asian Federation of Natural Language Processing
Pages:
70–78
URL:
https://aclanthology.org/W17-5704
Cite (ACL):
Aizhan Imankulova, Takayuki Sato, and Mamoru Komachi. 2017. Improving Low-Resource Neural Machine Translation with Filtered Pseudo-Parallel Corpus. In Proceedings of the 4th Workshop on Asian Translation (WAT2017), pages 70–78, Taipei, Taiwan. Asian Federation of Natural Language Processing.
Cite (Informal):
Improving Low-Resource Neural Machine Translation with Filtered Pseudo-Parallel Corpus (Imankulova et al., WAT 2017)
PDF:
https://aclanthology.org/W17-5704.pdf
Code:
aizhanti/filtered-pseudo-parallel-corpora