Extracting In-domain Training Corpora for Neural Machine Translation Using Data Selection Methods

Catarina Cruz Silva; Chao-Hong Liu; Alberto Poncelas; Andy Way

doi:10.18653/v1/W18-6323

Abstract

Data selection is the process of choosing a subset of parallel data for training machine translation (MT) systems, so that 1) the resources needed for training might be reduced, 2) trained models could perform better than those trained on the whole corpus, and/or 3) trained models are more tailored to specific domains. It has been shown that for statistical MT (SMT), data selection improves MT performance significantly. In this study, we reviewed three data selection approaches for MT, namely Term Frequency–Inverse Document Frequency, Cross-Entropy Difference and Feature Decay Algorithm, and conducted experiments on Neural Machine Translation (NMT) with the data selected by the three approaches. The results showed that data selection also improved the performance of NMT systems, though the gain was smaller than for SMT systems.
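To make the idea of scoring candidate sentences concrete, the following is a minimal sketch of Cross-Entropy Difference selection, one of the three approaches named above. It is not the paper's implementation. It assumes add-one-smoothed unigram language models over whitespace-tokenised text, where practical systems would typically use higher-order n-gram models, and the function names (`train_unigram`, `cross_entropy`, `select_by_ced`) and the toy sentences are hypothetical. Each general-domain sentence is scored by its cross-entropy under the in-domain model minus its cross-entropy under the general-domain model, and the lowest-scoring sentences are kept.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Return (token counts, total token count) for whitespace-tokenised text."""
    counts = Counter(tok for s in sentences for tok in s.split())
    return counts, sum(counts.values())


def cross_entropy(sentence, model, vocab_size):
    """Per-token cross-entropy in bits under an add-one-smoothed unigram model."""
    counts, total = model
    tokens = sentence.split()
    if not tokens:
        return 0.0
    log_prob = sum(
        math.log2((counts[t] + 1) / (total + vocab_size)) for t in tokens
    )
    return -log_prob / len(tokens)


def select_by_ced(in_domain, general, top_n):
    """Keep the top_n general-domain sentences with the lowest H_in - H_gen."""
    vocab = {tok for s in in_domain + general for tok in s.split()}
    vocab_size = len(vocab) + 1  # one extra slot for unseen tokens (assumption)
    model_in = train_unigram(in_domain)
    model_gen = train_unigram(general)
    ranked = sorted(
        general,
        key=lambda s: cross_entropy(s, model_in, vocab_size)
        - cross_entropy(s, model_gen, vocab_size),
    )
    return ranked[:top_n]


# Toy data, invented purely for illustration.
in_domain = ["the patient received a dose", "dose of the drug"]
general = [
    "the patient took the drug",
    "stock prices fell sharply",
    "the dose was reduced",
]
selected = select_by_ced(in_domain, general, top_n=2)
```

With this toy data the finance sentence ranks last, because none of its words occur in the in-domain sample, so it is the one left out when two sentences are selected.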

Anthology ID: W18-6323
Volume: Proceedings of the Third Conference on Machine Translation: Research Papers
Month: October
Year: 2018
Address: Brussels, Belgium
Editors: Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Christof Monz, Matteo Negri, Aurélie Névéol, Mariana Neves, Matt Post, Lucia Specia, Marco Turchi, Karin Verspoor
Venue: WMT
SIG: SIGMT
Publisher: Association for Computational Linguistics
Pages: 224–231
URL: https://aclanthology.org/W18-6323/
DOI: 10.18653/v1/W18-6323
Cite (ACL): Catarina Cruz Silva, Chao-Hong Liu, Alberto Poncelas, and Andy Way. 2018. Extracting In-domain Training Corpora for Neural Machine Translation Using Data Selection Methods. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 224–231, Brussels, Belgium. Association for Computational Linguistics.
Cite (Informal): Extracting In-domain Training Corpora for Neural Machine Translation Using Data Selection Methods (Silva et al., WMT 2018)
PDF: https://aclanthology.org/W18-6323.pdf
