UTFPR at WMT 2018: Minimalistic Supervised Corpora Filtering for Machine Translation

Gustavo Paetzold


Abstract
We present the UTFPR systems for the WMT 2018 parallel corpus filtering task. Our supervised approach distinguishes good translations from bad ones by training classic binary classification models on an artificially produced dataset derived from a high-quality translation set, using a minimalistic set of 6 semantic distance features that rely only on easy-to-gather resources. We rank translations by the probability the model assigns to the “good” label. Our results show that logistic regression pairs best with our approach, yielding more consistent results across the different settings evaluated.
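The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the author's implementation: the abstract does not say how the artificial dataset is built or what the 6 features are. The sketch therefore assumes that negative examples come from misaligning clean sentence pairs, and it substitutes two toy features (a length ratio and a lexicon coverage score) for the paper's semantic distance features. The names `LEXICON`, `CLEAN`, `features`, `train_logreg`, `p_good`, and `rank` are all hypothetical, and the logistic regression is a plain gradient-descent version written with the standard library only.

```python
import math

# Hypothetical toy bilingual lexicon standing in for an "easy-to-gather resource".
LEXICON = {
    "das": "the", "haus": "house", "ist": "is", "klein": "small",
    "ich": "i", "mag": "like", "katzen": "cats",
    "er": "he", "liest": "reads", "ein": "a", "buch": "book",
    "wir": "we", "trinken": "drink", "wasser": "water",
}

# Hypothetical high-quality translation set (source, target).
CLEAN = [
    ("das haus ist klein", "the house is small"),
    ("ich mag katzen", "i like cats"),
    ("er liest ein buch", "he reads a book"),
    ("wir trinken wasser", "we drink water"),
]


def features(src, tgt):
    """Two toy features; stand-ins for the paper's 6 semantic distance features."""
    s, t = src.lower().split(), tgt.lower().split()
    length_ratio = min(len(s), len(t)) / max(len(s), len(t), 1)
    covered = sum(1 for w in s if LEXICON.get(w) in t)
    coverage = covered / max(len(s), 1)
    return [length_ratio, coverage]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(X, y, lr=0.5, epochs=2000):
    """Batch gradient descent on the logistic loss; returns (weights, bias)."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def p_good(model, src, tgt):
    """Probability the model assigns to the "good" label for one pair."""
    w, b = model
    return sigmoid(sum(wj * xj for wj, xj in zip(w, features(src, tgt))) + b)


def rank(model, pairs):
    """Sort candidate pairs from most to least likely to be good."""
    return sorted(pairs, key=lambda p: p_good(model, *p), reverse=True)


# Assumed construction of the artificial dataset: clean pairs are positives,
# and each source paired with the next pair's target is a negative.
positives = CLEAN
negatives = [(CLEAN[i][0], CLEAN[(i + 1) % len(CLEAN)][1]) for i in range(len(CLEAN))]
X = [features(s, t) for s, t in positives + negatives]
y = [1] * len(positives) + [0] * len(negatives)
model = train_logreg(X, y)

candidates = [
    ("wir trinken wasser", "he reads a book"),  # misaligned
    ("ich mag wasser", "i like water"),         # plausible translation
]
ranked = rank(model, candidates)
```

Filtering then amounts to keeping the top of `ranked` up to whatever size budget the task sets. In practice a library implementation of logistic regression would replace the hand-written trainer.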
Anthology ID:
W18-6483
Volume:
Proceedings of the Third Conference on Machine Translation: Shared Task Papers
Month:
October
Year:
2018
Address:
Belgium, Brussels
Editors:
Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Christof Monz, Matteo Negri, Aurélie Névéol, Mariana Neves, Matt Post, Lucia Specia, Marco Turchi, Karin Verspoor
Venue:
WMT
SIG:
SIGMT
Publisher:
Association for Computational Linguistics
Pages:
923–927
URL:
https://aclanthology.org/W18-6483
DOI:
10.18653/v1/W18-6483
Cite (ACL):
Gustavo Paetzold. 2018. UTFPR at WMT 2018: Minimalistic Supervised Corpora Filtering for Machine Translation. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 923–927, Belgium, Brussels. Association for Computational Linguistics.
Cite (Informal):
UTFPR at WMT 2018: Minimalistic Supervised Corpora Filtering for Machine Translation (Paetzold, WMT 2018)
PDF:
https://aclanthology.org/W18-6483.pdf