Using Source-Language Transformations to Address Register Mismatches in SMT

Manny Rayner, Pierrette Bouillon, Barry Haddow


Abstract
Mismatches between training and test data are a ubiquitous problem for real SMT applications. In this paper, we examine a type of mismatch that commonly arises when translating from French and similar languages: available training data is mostly formal register, but test data may well be informal register. We consider methods for defining surface transformations that map common informal language constructions into their formal language counterparts, or vice versa; we then describe two ways to use these mappings, either to create artificial training data or to pre-process source text at run-time. An initial evaluation performed using crowd-sourced comparisons of alternate translations produced by a French-to-English SMT system suggests that both methods can improve performance, with run-time pre-processing being the more effective of the two.
Anthology ID:
2012.amta-papers.25
Volume:
Proceedings of the 10th Conference of the Association for Machine Translation in the Americas: Research Papers
Month:
October 28-November 1
Year:
2012
Address:
San Diego, California, USA
Venue:
AMTA
Publisher:
Association for Machine Translation in the Americas
URL:
https://aclanthology.org/2012.amta-papers.25
Cite (ACL):
Manny Rayner, Pierrette Bouillon, and Barry Haddow. 2012. Using Source-Language Transformations to Address Register Mismatches in SMT. In Proceedings of the 10th Conference of the Association for Machine Translation in the Americas: Research Papers, San Diego, California, USA. Association for Machine Translation in the Americas.
Cite (Informal):
Using Source-Language Transformations to Address Register Mismatches in SMT (Rayner et al., AMTA 2012)
PDF:
https://aclanthology.org/2012.amta-papers.25.pdf