Using Monolingual Data in Neural Machine Translation: a Systematic Study

Franck Burlot, François Yvon


Abstract
Neural Machine Translation (MT) has radically changed the way systems are developed. A major difference from the previous generation (Phrase-Based MT) is the way monolingual target data, which often abounds, is used in these two paradigms. While Phrase-Based MT can seamlessly integrate very large language models trained on billions of sentences, the best option for Neural MT developers seems to be the generation of artificial parallel data through back-translation, a technique that fails to fully take advantage of existing datasets. In this paper, we conduct a systematic study of back-translation, comparing alternative uses of monolingual data, as well as multiple data generation procedures. Our findings confirm that back-translation is very effective and give new explanations as to why this is the case. We also introduce new data simulation techniques that are almost as effective, yet much cheaper to implement.
Anthology ID:
W18-6315
Volume:
Proceedings of the Third Conference on Machine Translation: Research Papers
Month:
October
Year:
2018
Address:
Brussels, Belgium
Venues:
EMNLP | WMT | WS
SIG:
SIGMT
Publisher:
Association for Computational Linguistics
Pages:
144–155
URL:
https://aclanthology.org/W18-6315
DOI:
10.18653/v1/W18-6315
Cite (ACL):
Franck Burlot and François Yvon. 2018. Using Monolingual Data in Neural Machine Translation: a Systematic Study. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 144–155, Brussels, Belgium. Association for Computational Linguistics.
Cite (Informal):
Using Monolingual Data in Neural Machine Translation: a Systematic Study (Burlot & Yvon, 2018)
PDF:
https://aclanthology.org/W18-6315.pdf
Code:
franckbrl/nmt-pseudo-source-discriminator