Denoising Neural Machine Translation Training with Trusted Data and Online Data Selection
Wei Wang, Taro Watanabe, Macduff Hughes, Tetsuji Nakagawa, Ciprian Chelba
Abstract
Measuring domain relevance of data and identifying or selecting well-fit domain data for machine translation (MT) is a well-studied topic, but denoising is not yet. Denoising is concerned with a different type of data quality and tries to reduce the negative impact of data noise on MT training, in particular, neural MT (NMT) training. This paper generalizes methods for measuring and selecting data for domain MT and applies them to denoising NMT training. The proposed approach uses trusted data and a denoising curriculum realized by online data selection. Intrinsic and extrinsic evaluations of the approach show its significant effectiveness for NMT to train on data with severe noise.- Anthology ID:
- W18-6314
- Volume:
- Proceedings of the Third Conference on Machine Translation: Research Papers
- Month:
- October
- Year:
- 2018
- Address:
- Brussels, Belgium
- Editors:
- Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Christof Monz, Matteo Negri, Aurélie Névéol, Mariana Neves, Matt Post, Lucia Specia, Marco Turchi, Karin Verspoor
- Venue:
- WMT
- SIG:
- SIGMT
- Publisher:
- Association for Computational Linguistics
- Note:
- Pages:
- 133–143
- Language:
- URL:
- https://aclanthology.org/W18-6314
- DOI:
- 10.18653/v1/W18-6314
- Bibkey:
- Cite (ACL):
- Wei Wang, Taro Watanabe, Macduff Hughes, Tetsuji Nakagawa, and Ciprian Chelba. 2018. Denoising Neural Machine Translation Training with Trusted Data and Online Data Selection. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 133–143, Brussels, Belgium. Association for Computational Linguistics.
- Cite (Informal):
- Denoising Neural Machine Translation Training with Trusted Data and Online Data Selection (Wang et al., WMT 2018)
- Copy Citation:
- PDF:
- https://aclanthology.org/W18-6314.pdf
Export citation
@inproceedings{wang-etal-2018-denoising, title = "Denoising Neural Machine Translation Training with Trusted Data and Online Data Selection", author = "Wang, Wei and Watanabe, Taro and Hughes, Macduff and Nakagawa, Tetsuji and Chelba, Ciprian", editor = "Bojar, Ond{\v{r}}ej and Chatterjee, Rajen and Federmann, Christian and Fishel, Mark and Graham, Yvette and Haddow, Barry and Huck, Matthias and Yepes, Antonio Jimeno and Koehn, Philipp and Monz, Christof and Negri, Matteo and N{\'e}v{\'e}ol, Aur{\'e}lie and Neves, Mariana and Post, Matt and Specia, Lucia and Turchi, Marco and Verspoor, Karin", booktitle = "Proceedings of the Third Conference on Machine Translation: Research Papers", month = oct, year = "2018", address = "Brussels, Belgium", publisher = "Association for Computational Linguistics", url = "https://aclanthology.org/W18-6314", doi = "10.18653/v1/W18-6314", pages = "133--143", abstract = "Measuring domain relevance of data and identifying or selecting well-fit domain data for machine translation (MT) is a well-studied topic, but denoising is not yet. Denoising is concerned with a different type of data quality and tries to reduce the negative impact of data noise on MT training, in particular, neural MT (NMT) training. This paper generalizes methods for measuring and selecting data for domain MT and applies them to denoising NMT training. The proposed approach uses trusted data and a denoising curriculum realized by online data selection. Intrinsic and extrinsic evaluations of the approach show its significant effectiveness for NMT to train on data with severe noise.", }
<?xml version="1.0" encoding="UTF-8"?> <modsCollection xmlns="http://www.loc.gov/mods/v3"> <mods ID="wang-etal-2018-denoising"> <titleInfo> <title>Denoising Neural Machine Translation Training with Trusted Data and Online Data Selection</title> </titleInfo> <name type="personal"> <namePart type="given">Wei</namePart> <namePart type="family">Wang</namePart> <role> <roleTerm authority="marcrelator" type="text">author</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Taro</namePart> <namePart type="family">Watanabe</namePart> <role> <roleTerm authority="marcrelator" type="text">author</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Macduff</namePart> <namePart type="family">Hughes</namePart> <role> <roleTerm authority="marcrelator" type="text">author</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Tetsuji</namePart> <namePart type="family">Nakagawa</namePart> <role> <roleTerm authority="marcrelator" type="text">author</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Ciprian</namePart> <namePart type="family">Chelba</namePart> <role> <roleTerm authority="marcrelator" type="text">author</roleTerm> </role> </name> <originInfo> <dateIssued>2018-10</dateIssued> </originInfo> <typeOfResource>text</typeOfResource> <relatedItem type="host"> <titleInfo> <title>Proceedings of the Third Conference on Machine Translation: Research Papers</title> </titleInfo> <name type="personal"> <namePart type="given">Ondřej</namePart> <namePart type="family">Bojar</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Rajen</namePart> <namePart type="family">Chatterjee</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Christian</namePart> <namePart type="family">Federmann</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Mark</namePart> <namePart type="family">Fishel</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Yvette</namePart> <namePart type="family">Graham</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Barry</namePart> <namePart type="family">Haddow</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Matthias</namePart> <namePart type="family">Huck</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Antonio</namePart> <namePart type="given">Jimeno</namePart> <namePart type="family">Yepes</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Philipp</namePart> <namePart type="family">Koehn</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Christof</namePart> <namePart type="family">Monz</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Matteo</namePart> <namePart type="family">Negri</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Aurélie</namePart> <namePart type="family">Névéol</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Mariana</namePart> <namePart type="family">Neves</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Matt</namePart> <namePart type="family">Post</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Lucia</namePart> <namePart type="family">Specia</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Marco</namePart> <namePart type="family">Turchi</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Karin</namePart> <namePart type="family">Verspoor</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <originInfo> <publisher>Association for Computational Linguistics</publisher> <place> <placeTerm type="text">Brussels, Belgium</placeTerm> </place> </originInfo> <genre authority="marcgt">conference publication</genre> </relatedItem> <abstract>Measuring domain relevance of data and identifying or selecting well-fit domain data for machine translation (MT) is a well-studied topic, but denoising is not yet. Denoising is concerned with a different type of data quality and tries to reduce the negative impact of data noise on MT training, in particular, neural MT (NMT) training. This paper generalizes methods for measuring and selecting data for domain MT and applies them to denoising NMT training. The proposed approach uses trusted data and a denoising curriculum realized by online data selection. Intrinsic and extrinsic evaluations of the approach show its significant effectiveness for NMT to train on data with severe noise.</abstract> <identifier type="citekey">wang-etal-2018-denoising</identifier> <identifier type="doi">10.18653/v1/W18-6314</identifier> <location> <url>https://aclanthology.org/W18-6314</url> </location> <part> <date>2018-10</date> <extent unit="page"> <start>133</start> <end>143</end> </extent> </part> </mods> </modsCollection>
%0 Conference Proceedings %T Denoising Neural Machine Translation Training with Trusted Data and Online Data Selection %A Wang, Wei %A Watanabe, Taro %A Hughes, Macduff %A Nakagawa, Tetsuji %A Chelba, Ciprian %Y Bojar, Ondřej %Y Chatterjee, Rajen %Y Federmann, Christian %Y Fishel, Mark %Y Graham, Yvette %Y Haddow, Barry %Y Huck, Matthias %Y Yepes, Antonio Jimeno %Y Koehn, Philipp %Y Monz, Christof %Y Negri, Matteo %Y Névéol, Aurélie %Y Neves, Mariana %Y Post, Matt %Y Specia, Lucia %Y Turchi, Marco %Y Verspoor, Karin %S Proceedings of the Third Conference on Machine Translation: Research Papers %D 2018 %8 October %I Association for Computational Linguistics %C Brussels, Belgium %F wang-etal-2018-denoising %X Measuring domain relevance of data and identifying or selecting well-fit domain data for machine translation (MT) is a well-studied topic, but denoising is not yet. Denoising is concerned with a different type of data quality and tries to reduce the negative impact of data noise on MT training, in particular, neural MT (NMT) training. This paper generalizes methods for measuring and selecting data for domain MT and applies them to denoising NMT training. The proposed approach uses trusted data and a denoising curriculum realized by online data selection. Intrinsic and extrinsic evaluations of the approach show its significant effectiveness for NMT to train on data with severe noise. %R 10.18653/v1/W18-6314 %U https://aclanthology.org/W18-6314 %U https://doi.org/10.18653/v1/W18-6314 %P 133-143
Markdown (Informal)
[Denoising Neural Machine Translation Training with Trusted Data and Online Data Selection](https://aclanthology.org/W18-6314) (Wang et al., WMT 2018)
- Denoising Neural Machine Translation Training with Trusted Data and Online Data Selection (Wang et al., WMT 2018)
ACL
- Wei Wang, Taro Watanabe, Macduff Hughes, Tetsuji Nakagawa, and Ciprian Chelba. 2018. Denoising Neural Machine Translation Training with Trusted Data and Online Data Selection. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 133–143, Brussels, Belgium. Association for Computational Linguistics.