Data selection for NMT using Infrequent n-gram Recovery

Zuzanna Parcheta, Germán Sanchis-Trilles, Francisco Casacuberta


Abstract
Neural Machine Translation (NMT) has achieved promising results comparable with Phrase-Based Statistical Machine Translation (PBSMT). However, training a neural translation engine requires much more powerful machines than developing a translation engine based on PBSMT. One way to reduce the training cost of NMT systems is to reduce the training corpus through data selection (DS) techniques. Many DS techniques have been applied in PBSMT with good results. In this work, we show that the data selection technique based on infrequent n-gram occurrence described in (Gascó et al., 2012), commonly used for PBSMT systems, also works well for NMT systems. We focus on selecting data according to specific-domain corpora using this technique. The specific-domain corpora used in our experiments come from the IT domain and the medical domain. The DS technique significantly reduces the execution time required to train the model, by between 87% and 93%. It also improves translation quality by up to 2.8 BLEU points. These improvements are obtained with just a small fraction of the data, between 6% and 20% of the total.
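The abstract does not spell out the selection procedure, so the following is only a minimal sketch of the general idea behind infrequent n-gram selection, not the authors' implementation. It assumes a greedy loop that repeatedly picks the pool sentence covering the most in-domain n-grams that the selection has so far seen fewer than `threshold` times. The function names, the whitespace tokenization, the default values of `max_n` and `threshold`, and the choice to start all counts at zero are illustrative assumptions.

```python
from collections import Counter


def ngrams(tokens, max_n):
    """Yield every n-gram (as a tuple) of order 1..max_n."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def select_infrequent(pool, in_domain, max_n=2, threshold=1):
    """Greedily select pool sentences that cover under-represented n-grams.

    Assumption: an n-gram counts as "infrequent" while the selection made
    so far contains it fewer than `threshold` times.
    """
    # n-grams worth recovering: those that occur in the in-domain text.
    needed = {g for sent in in_domain for g in ngrams(sent.split(), max_n)}
    counts = Counter()  # occurrences of needed n-grams in the selection
    remaining = list(pool)
    selected = []

    def score(sent):
        # Sum the remaining "deficit" of each needed n-gram in the sentence.
        grams = set(ngrams(sent.split(), max_n)) & needed
        return sum(max(0, threshold - counts[g]) for g in grams)

    while remaining:
        best = max(remaining, key=score)
        if score(best) == 0:
            break  # no remaining sentence adds an infrequent n-gram
        selected.append(best)
        remaining.remove(best)
        counts.update(g for g in ngrams(best.split(), max_n) if g in needed)
    return selected


pool = [
    "the cat sat",
    "install the driver",
    "reboot the server now",
    "the cat sat",
]
in_domain = ["install the driver and reboot"]
subset = select_infrequent(pool, in_domain)
```

In this toy run only the two pool sentences that share otherwise uncovered n-grams with the in-domain text are kept, which illustrates how a small fraction of a large corpus can be retained. The scoring is recomputed from scratch on every iteration for readability; a real corpus would need an indexed variant.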
Anthology ID:
2018.eamt-main.22
Volume:
Proceedings of the 21st Annual Conference of the European Association for Machine Translation
Month:
May
Year:
2018
Address:
Alicante, Spain
Editors:
Juan Antonio Pérez-Ortiz, Felipe Sánchez-Martínez, Miquel Esplà-Gomis, Maja Popović, Celia Rico, André Martins, Joachim Van den Bogaert, Mikel L. Forcada
Venue:
EAMT
Pages:
239–248
URL:
https://aclanthology.org/2018.eamt-main.22
Cite (ACL):
Zuzanna Parcheta, Germán Sanchis-Trilles, and Francisco Casacuberta. 2018. Data selection for NMT using Infrequent n-gram Recovery. In Proceedings of the 21st Annual Conference of the European Association for Machine Translation, pages 239–248, Alicante, Spain.
Cite (Informal):
Data selection for NMT using Infrequent n-gram Recovery (Parcheta et al., EAMT 2018)
PDF:
https://aclanthology.org/2018.eamt-main.22.pdf