Creating the best development corpus for Statistical Machine Translation systems

Mara Chinea-Rios, Germán Sanchis-Trilles, Francisco Casacuberta


Abstract
We propose and study three novel approaches to the problem of development set selection in Statistical Machine Translation. We focus on a scenario where a machine translation system is used to translate a specific test set, with no further data available from the domain at hand. This test set stems from a real application of machine translation, in which the texts of a specific e-commerce site were to be translated. To develop our development-set selection techniques, we first conducted experiments in a controlled scenario where labelled data from different domains was available, and evaluated the techniques with both classification and translation quality metrics. The best-performing techniques were then evaluated on the e-commerce data, yielding consistent improvements across two language directions.
Anthology ID:
2018.eamt-main.10
Volume:
Proceedings of the 21st Annual Conference of the European Association for Machine Translation
Month:
May
Year:
2018
Address:
Alicante, Spain
Editors:
Juan Antonio Pérez-Ortiz, Felipe Sánchez-Martínez, Miquel Esplà-Gomis, Maja Popović, Celia Rico, André Martins, Joachim Van den Bogaert, Mikel L. Forcada
Venue:
EAMT
Pages:
119–128
URL:
https://aclanthology.org/2018.eamt-main.10
Cite (ACL):
Mara Chinea-Rios, Germán Sanchis-Trilles, and Francisco Casacuberta. 2018. Creating the best development corpus for Statistical Machine Translation systems. In Proceedings of the 21st Annual Conference of the European Association for Machine Translation, pages 119–128, Alicante, Spain.
Cite (Informal):
Creating the best development corpus for Statistical Machine Translation systems (Chinea-Rios et al., EAMT 2018)
PDF:
https://aclanthology.org/2018.eamt-main.10.pdf