Overcoming Vocabulary Sparsity in MT Using Lattices

Steve DeNeefe, Ulf Hermjakob, Kevin Knight


Abstract
Source languages with complex word-formation rules present a challenge for statistical machine translation (SMT). In this paper, we take on three facets of this challenge: (1) common stems are fragmented into many different forms in training data, (2) rare and unknown words are frequent in test data, and (3) spelling variation creates additional sparseness problems. We present a novel, lightweight technique for dealing with this fragmentation, based on bilingual data, and we also present a combination of linguistic and statistical techniques for dealing with rare and unknown words. Taking these techniques together, we demonstrate +1.3 and +1.6 BLEU increases on top of strong baselines for Arabic-English machine translation.
Anthology ID:
2008.amta-papers.7
Volume:
Proceedings of the 8th Conference of the Association for Machine Translation in the Americas: Research Papers
Month:
October 21-25
Year:
2008
Address:
Waikiki, USA
Venue:
AMTA
Publisher:
Association for Machine Translation in the Americas
Pages:
89–96
URL:
https://aclanthology.org/2008.amta-papers.7
Cite (ACL):
Steve DeNeefe, Ulf Hermjakob, and Kevin Knight. 2008. Overcoming Vocabulary Sparsity in MT Using Lattices. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas: Research Papers, pages 89–96, Waikiki, USA. Association for Machine Translation in the Americas.
Cite (Informal):
Overcoming Vocabulary Sparsity in MT Using Lattices (DeNeefe et al., AMTA 2008)
PDF:
https://aclanthology.org/2008.amta-papers.7.pdf