Exploiting Monolingual Data at Scale for Neural Machine Translation

Lijun Wu, Yiren Wang, Yingce Xia, Tao Qin, Jianhuang Lai, Tie-Yan Liu


Abstract
While target-side monolingual data has been proven to be very useful to improve neural machine translation (briefly, NMT) through back translation, source-side monolingual data is not well investigated. In this work, we study how to use both the source-side and target-side monolingual data for NMT, and propose an effective strategy leveraging both of them. First, we generate synthetic bitext by translating monolingual data from the two domains into the other domain using the models pretrained on genuine bitext. Next, a model is trained on a noised version of the concatenated synthetic bitext where each source sequence is randomly corrupted. Finally, the model is fine-tuned on the genuine bitext and a clean version of a subset of the synthetic bitext without adding any noise. Our approach achieves state-of-the-art results on WMT16, WMT17, WMT18 EnglishGerman translations and WMT19 GermanFrench translations, which demonstrate the effectiveness of our method. We also conduct a comprehensive study on how each part in the pipeline works.
Anthology ID:
D19-1430
Volume:
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)
Month:
November
Year:
2019
Address:
Hong Kong, China
Editors:
Kentaro Inui, Jing Jiang, Vincent Ng, Xiaojun Wan
Venues:
EMNLP | IJCNLP
SIG:
SIGDAT
Publisher:
Association for Computational Linguistics
Note:
Pages:
4207–4216
Language:
URL:
https://aclanthology.org/D19-1430/
DOI:
10.18653/v1/D19-1430
Bibkey:
Cite (ACL):
Lijun Wu, Yiren Wang, Yingce Xia, Tao Qin, Jianhuang Lai, and Tie-Yan Liu. 2019. Exploiting Monolingual Data at Scale for Neural Machine Translation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4207–4216, Hong Kong, China. Association for Computational Linguistics.
Cite (Informal):
Exploiting Monolingual Data at Scale for Neural Machine Translation (Wu et al., EMNLP-IJCNLP 2019)
Copy Citation:
PDF:
https://aclanthology.org/D19-1430.pdf
Attachment:
 D19-1430.Attachment.zip
Data
WMT 2016