Adrià Giménez


pdf bib
Comparison of data selection techniques for the translation of video lectures
Joern Wuebker | Hermann Ney | Adrià Martínez-Villaronga | Adrià Giménez | Alfons Juan | Christophe Servan | Marc Dymetman | Shachar Mirkin
Proceedings of the 11th Conference of the Association for Machine Translation in the Americas: MT Researchers Track

For the task of online translation of scientific video lectures, using huge models is not possible. In order to get smaller and efficient models, we perform data selection. In this paper, we perform a qualitative and quantitative comparison of several data selection techniques, based on cross-entropy and infrequent n-gram criteria. In terms of BLEU, a combination of translation and language model cross-entropy achieves the most stable results. As another important criterion for measuring translation quality in our application, we identify the number of out-of-vocabulary words. Here, infrequent n-gram recovery shows superior performance. Finally, we combine the two selection techniques in order to benefit from both their strengths.