Investigations on large-scale lightly-supervised training for statistical machine translation.

Holger Schwenk


Abstract
Sentence-aligned bilingual texts are a crucial resource for building statistical machine translation (SMT) systems. In this paper we propose to apply lightly-supervised training to produce additional parallel data. The idea is to translate large amounts of monolingual data (up to 275M words) with an SMT system, and to use the resulting translations as additional training data. Results are reported for translation from French into English. We consider two setups: in the first, the initial SMT system is trained with only a very limited amount of human-produced translations; in the second, more than 100 million words are available. In both conditions, lightly-supervised training achieves significant improvements in BLEU score.
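The data flow described in the abstract can be sketched as follows. This is only an illustrative sketch, not the paper's implementation: the function names, the `translate` callable, the toy lexicon, and the score-based filter (`min_score`) are all hypothetical and are not taken from the abstract, which states only that monolingual data is translated automatically and added to the training data. The monolingual side is assumed here to be French.

```python
from typing import Callable, List, Tuple

SentencePair = Tuple[str, str]  # (source sentence, target sentence)


def lightly_supervised_augment(
    human_pairs: List[SentencePair],
    monolingual_source: List[str],
    translate: Callable[[str], Tuple[str, float]],
    min_score: float = float("-inf"),
) -> List[SentencePair]:
    """Return the human-produced pairs plus automatically translated pairs.

    `translate` stands in for an existing SMT system (hypothetical interface):
    it maps a source sentence to (hypothesis, score).
    `min_score` is an assumed filtering step; with the default value every
    non-empty hypothesis is kept.
    """
    augmented = list(human_pairs)
    for src in monolingual_source:
        hyp, score = translate(src)
        if hyp and score >= min_score:
            augmented.append((src, hyp))
    return augmented


def toy_translate(sentence: str) -> Tuple[str, float]:
    """Toy word-by-word 'system' used only to make the sketch runnable."""
    lexicon = {"le": "the", "chat": "cat", "dort": "sleeps"}
    words = sentence.lower().split()
    known = [lexicon[w] for w in words if w in lexicon]
    # Score = fraction of source words the toy lexicon could translate.
    score = len(known) / len(words) if words else 0.0
    return " ".join(known), score


human = [("le chat", "the cat")]
mono = ["le chat dort", "bonjour monde"]
corpus = lightly_supervised_augment(human, mono, toy_translate, min_score=0.5)
```

In a real setting the augmented corpus would then be used to retrain the translation system; that step is outside the scope of this sketch.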
Anthology ID:
2008.iwslt-papers.6
Volume:
Proceedings of the 5th International Workshop on Spoken Language Translation: Papers
Month:
October 20-21
Year:
2008
Address:
Waikiki, Hawaii
Venue:
IWSLT
SIG:
SIGSLT
Pages:
182–189
URL:
https://aclanthology.org/2008.iwslt-papers.6
Cite (ACL):
Holger Schwenk. 2008. Investigations on large-scale lightly-supervised training for statistical machine translation. In Proceedings of the 5th International Workshop on Spoken Language Translation: Papers, pages 182–189, Waikiki, Hawaii.
Cite (Informal):
Investigations on large-scale lightly-supervised training for statistical machine translation. (Schwenk, IWSLT 2008)
PDF:
https://aclanthology.org/2008.iwslt-papers.6.pdf