Clean data for training statistical MT: the case of MT contamination

Michel Simard


Abstract
Users of Statistical Machine Translation (SMT) sometimes turn to the Web to obtain data to train their systems. One problem with this approach is the potential for “MT contamination”: when large amounts of parallel data are collected automatically, there is a risk that a non-negligible portion consists of machine-translated text. Theoretically, using this kind of data to train SMT systems is likely to reinforce the errors committed by other systems, or even by an earlier version of the same system. In this paper, we study the effect of MT-contaminated training data on SMT quality, by performing controlled simulations under a wide range of conditions. Our experiments highlight situations in which MT contamination can be harmful, and assess the potential of decontamination techniques.
Anthology ID:
2014.amta-researchers.6
Volume:
Proceedings of the 11th Conference of the Association for Machine Translation in the Americas: MT Researchers Track
Month:
October 22-26
Year:
2014
Address:
Vancouver, Canada
Editors:
Yaser Al-Onaizan, Michel Simard
Venue:
AMTA
Publisher:
Association for Machine Translation in the Americas
Pages:
69–82
URL:
https://aclanthology.org/2014.amta-researchers.6
Cite (ACL):
Michel Simard. 2014. Clean data for training statistical MT: the case of MT contamination. In Proceedings of the 11th Conference of the Association for Machine Translation in the Americas: MT Researchers Track, pages 69–82, Vancouver, Canada. Association for Machine Translation in the Americas.
Cite (Informal):
Clean data for training statistical MT: the case of MT contamination (Simard, AMTA 2014)
PDF:
https://aclanthology.org/2014.amta-researchers.6.pdf