Denoising Neural Machine Translation Training with Trusted Data and Online Data Selection

Wei Wang, Taro Watanabe, Macduff Hughes, Tetsuji Nakagawa, Ciprian Chelba


Abstract
Measuring the domain relevance of data and identifying or selecting well-fitting domain data for machine translation (MT) is a well-studied topic, but denoising is not yet. Denoising is concerned with a different type of data quality and aims to reduce the negative impact of data noise on MT training, in particular neural MT (NMT) training. This paper generalizes methods for measuring and selecting data for domain MT and applies them to denoising NMT training. The proposed approach uses trusted data and a denoising curriculum realized by online data selection. Intrinsic and extrinsic evaluations show that the approach is highly effective for training NMT on data with severe noise.
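
To make the idea concrete, below is a minimal, illustrative Python sketch of online data selection driven by a quality score that contrasts a trusted model with a noisy one, combined with an annealed selection curriculum. The scoring functions, model names, and annealing schedule are hypothetical placeholders for exposition only, not the paper's exact formulation.

# Illustrative sketch of denoising NMT training data via online data selection.
# NOTE: log_prob, the model handles, and the annealing schedule are hypothetical
# placeholders; the paper derives its noise measure from actual NMT models.

import random

def log_prob(model, source, target):
    """Hypothetical per-word log-probability of `target` given `source` under `model`."""
    # Placeholder: a real implementation would score the pair with an NMT model.
    random.seed(hash((model, source, target)) & 0xFFFFFFFF)
    return -random.uniform(1.0, 5.0)

def quality_score(pair, trusted_model="trusted", noisy_model="noisy"):
    """Higher score = cleaner pair: log P_trusted - log P_noisy (illustrative contrast)."""
    src, tgt = pair
    return log_prob(trusted_model, src, tgt) - log_prob(noisy_model, src, tgt)

def keep_fraction(step, total_steps, start=1.0, end=0.3):
    """Curriculum: anneal from using all candidates down to the cleanest `end` fraction."""
    t = min(step / total_steps, 1.0)
    return start + (end - start) * t

def select_batch(candidates, step, total_steps):
    """Online data selection: keep only the top-scoring fraction of each candidate batch."""
    scored = sorted(candidates, key=quality_score, reverse=True)
    k = max(1, int(len(scored) * keep_fraction(step, total_steps)))
    return scored[:k]

if __name__ == "__main__":
    batch = [("guten morgen", "good morning"),
             ("hallo welt", "hello world"),
             ("zufaelliger laerm", "unrelated noisy text")]
    for step in (0, 500, 1000):
        kept = select_batch(batch, step, total_steps=1000)
        print(f"step {step}: keep {len(kept)} of {len(batch)} pairs")

In this sketch the model gradually concentrates training on the pairs that the trusted model prefers over the noisy one, which mirrors the paper's denoising-curriculum intuition at a very high level.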
Anthology ID:
W18-6314
Volume:
Proceedings of the Third Conference on Machine Translation: Research Papers
Month:
October
Year:
2018
Address:
Brussels, Belgium
Editors:
Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Christof Monz, Matteo Negri, Aurélie Névéol, Mariana Neves, Matt Post, Lucia Specia, Marco Turchi, Karin Verspoor
Venue:
WMT
SIG:
SIGMT
Publisher:
Association for Computational Linguistics
Pages:
133–143
URL:
https://aclanthology.org/W18-6314
DOI:
10.18653/v1/W18-6314
Cite (ACL):
Wei Wang, Taro Watanabe, Macduff Hughes, Tetsuji Nakagawa, and Ciprian Chelba. 2018. Denoising Neural Machine Translation Training with Trusted Data and Online Data Selection. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 133–143, Brussels, Belgium. Association for Computational Linguistics.
Cite (Informal):
Denoising Neural Machine Translation Training with Trusted Data and Online Data Selection (Wang et al., WMT 2018)
PDF:
https://aclanthology.org/W18-6314.pdf