Data augmentation using back-translation for context-aware neural machine translation

Amane Sugiyama, Naoki Yoshinaga


Abstract
A single sentence does not always convey enough information to translate it into other languages. Some target languages require adding or specializing words that are omitted or ambiguous in the source language (e.g., zero pronouns when translating Japanese to English, or epicene pronouns when translating English to French). Translating such ambiguous sentences requires context beyond a single sentence, which has motivated work on context-aware neural machine translation (NMT). However, large parallel corpora are not easily available for training accurate context-aware NMT models. In this study, we first obtain large-scale pseudo-parallel corpora by back-translating monolingual data, and then investigate their impact on the translation accuracy of context-aware NMT models. We evaluated context-aware NMT models trained with small parallel corpora and the large-scale pseudo-parallel corpora on English-Japanese and English-French datasets to demonstrate the large impact of data augmentation on context-aware NMT models.
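The back-translation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `back_translate_documents`, the `reverse_translate` callable (standing in for any target-to-source translation model), and the choice of exactly one preceding sentence as context are all assumptions made for illustration.

```python
from typing import Callable, List, Tuple

# One training example: (source context, source sentence,
#                        target context, target sentence).
Example = Tuple[str, str, str, str]


def back_translate_documents(
    target_docs: List[List[str]],
    reverse_translate: Callable[[str], str],
) -> List[Example]:
    """Build pseudo-parallel examples from monolingual target-side documents.

    `reverse_translate` is a hypothetical target-to-source translator.
    Using one preceding sentence as context is an assumption of this
    sketch, not a detail taken from the abstract.
    """
    examples: List[Example] = []
    for doc in target_docs:
        # Back-translate each target sentence to get a pseudo source sentence.
        pseudo_source = [reverse_translate(sentence) for sentence in doc]
        for i, (src, tgt) in enumerate(zip(pseudo_source, doc)):
            # Context never crosses a document boundary; the first
            # sentence of each document gets an empty context.
            src_ctx = pseudo_source[i - 1] if i > 0 else ""
            tgt_ctx = doc[i - 1] if i > 0 else ""
            examples.append((src_ctx, src, tgt_ctx, tgt))
    return examples


# Toy usage: str.upper stands in for a real reverse translation model.
toy_docs = [["a", "b"], ["c"]]
toy_examples = back_translate_documents(toy_docs, str.upper)
```

Processing whole documents rather than isolated sentences keeps sentence order intact, so each pseudo source sentence can be paired with the context that preceded it in the original monolingual text.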
Anthology ID:
D19-6504
Volume:
Proceedings of the Fourth Workshop on Discourse in Machine Translation (DiscoMT 2019)
Month:
November
Year:
2019
Address:
Hong Kong, China
Venue:
DiscoMT
Publisher:
Association for Computational Linguistics
Pages:
35–44
URL:
https://aclanthology.org/D19-6504
DOI:
10.18653/v1/D19-6504
Cite (ACL):
Amane Sugiyama and Naoki Yoshinaga. 2019. Data augmentation using back-translation for context-aware neural machine translation. In Proceedings of the Fourth Workshop on Discourse in Machine Translation (DiscoMT 2019), pages 35–44, Hong Kong, China. Association for Computational Linguistics.
Cite (Informal):
Data augmentation using back-translation for context-aware neural machine translation (Sugiyama & Yoshinaga, DiscoMT 2019)
PDF:
https://aclanthology.org/D19-6504.pdf