The Impact of Sentence Alignment Errors on Phrase-Based Machine Translation Performance

Cyril Goutte, Marine Carpuat, George Foster


Abstract
When parallel or comparable corpora are harvested from the web, there is typically a tradeoff between the size and quality of the data. To improve quality, corpus collection efforts often attempt to fix or remove misaligned sentence pairs. At the same time, Statistical Machine Translation (SMT) systems are widely assumed to be relatively robust to sentence alignment errors, yet there is little empirical evidence to support and characterize this robustness. This contribution investigates the impact of sentence alignment errors on a typical phrase-based SMT system. We confirm that SMT systems are highly tolerant of noise, and that performance degrades seriously only at very high noise levels. Our findings suggest that when collecting larger, noisy parallel data for training phrase-based SMT, cleaning up by trying to detect and remove incorrect alignments can actually degrade performance. Although fixing errors, when applicable, is preferable to removal, its benefits become apparent only at fairly high misalignment rates. We provide several explanations to support these findings.
Anthology ID:
2012.amta-papers.7
Volume:
Proceedings of the 10th Conference of the Association for Machine Translation in the Americas: Research Papers
Month:
October 28-November 1
Year:
2012
Address:
San Diego, California, USA
Venue:
AMTA
Publisher:
Association for Machine Translation in the Americas
URL:
https://aclanthology.org/2012.amta-papers.7
Cite (ACL):
Cyril Goutte, Marine Carpuat, and George Foster. 2012. The Impact of Sentence Alignment Errors on Phrase-Based Machine Translation Performance. In Proceedings of the 10th Conference of the Association for Machine Translation in the Americas: Research Papers, San Diego, California, USA. Association for Machine Translation in the Americas.
Cite (Informal):
The Impact of Sentence Alignment Errors on Phrase-Based Machine Translation Performance (Goutte et al., AMTA 2012)
PDF:
https://aclanthology.org/2012.amta-papers.7.pdf