Bilingual Methods for Adaptive Training Data Selection for Machine Translation

Boxing Chen, Roland Kuhn, George Foster, Colin Cherry, Fei Huang


Abstract
In this paper, we propose a new data selection method, which uses semi-supervised convolutional neural networks based on bitokens (Bi-SSCNNs), for training machine translation systems from a large bilingual corpus. In earlier work, we devised a data selection method based on semi-supervised convolutional neural networks (SSCNNs). The new method, Bi-SSCNN, is based on bitokens, which use bilingual information. When the new methods are tested on two translation tasks (Chinese-to-English and Arabic-to-English), they significantly outperform the other three data selection methods in the experiments. We also show that the Bi-SSCNN method is much more effective than other methods in preventing noisy sentence pairs from being chosen for training. More interestingly, this method needs only a tiny amount of in-domain data to train the selection model, which makes fine-grained topic-dependent translation adaptation possible. In follow-up experiments, we find that neural machine translation (NMT) is more sensitive to noisy data than statistical machine translation (SMT). Therefore, Bi-SSCNN, which can effectively screen out noisy sentence pairs, can benefit NMT much more than SMT. We observed a BLEU improvement of over 3 points on an English-to-French WMT task when Bi-SSCNNs were used.
Anthology ID:
2016.amta-researchers.8
Volume:
Conferences of the Association for Machine Translation in the Americas: MT Researchers' Track
Month:
October 28 - November 1
Year:
2016
Address:
Austin, TX, USA
Editors:
Spence Green, Lane Schwartz
Venue:
AMTA
Publisher:
The Association for Machine Translation in the Americas
Pages:
93–106
URL:
https://aclanthology.org/2016.amta-researchers.8
Cite (ACL):
Boxing Chen, Roland Kuhn, George Foster, Colin Cherry, and Fei Huang. 2016. Bilingual Methods for Adaptive Training Data Selection for Machine Translation. In Conferences of the Association for Machine Translation in the Americas: MT Researchers' Track, pages 93–106, Austin, TX, USA. The Association for Machine Translation in the Americas.
Cite (Informal):
Bilingual Methods for Adaptive Training Data Selection for Machine Translation (Chen et al., AMTA 2016)
PDF:
https://aclanthology.org/2016.amta-researchers.8.pdf