Prompsit’s submission to WMT 2018 Parallel Corpus Filtering shared task

Víctor M. Sánchez-Cartagena, Marta Bañón, Sergio Ortiz-Rojas, Gema Ramírez


Abstract
This paper describes Prompsit Language Engineering’s submissions to the WMT 2018 parallel corpus filtering shared task. Our four submissions were based on an automatic classifier for identifying pairs of sentences that are mutual translations. A set of hand-crafted hard rules for discarding sentences with evident flaws was applied before the classifier. We explored different strategies for achieving a training corpus with diverse vocabulary and fluent sentences: language model scoring, an active-learning-inspired data selection algorithm and n-gram saturation. Our submissions were very competitive in comparison with those of other participants on the 100 million word training corpus.
Anthology ID:
W18-6488
Volume:
Proceedings of the Third Conference on Machine Translation: Shared Task Papers
Month:
October
Year:
2018
Address:
Belgium, Brussels
Venues:
EMNLP | WMT | WS
SIG:
SIGMT
Publisher:
Association for Computational Linguistics
Pages:
955–962
URL:
https://aclanthology.org/W18-6488
DOI:
10.18653/v1/W18-6488
Cite (ACL):
Víctor M. Sánchez-Cartagena, Marta Bañón, Sergio Ortiz-Rojas, and Gema Ramírez. 2018. Prompsit’s submission to WMT 2018 Parallel Corpus Filtering shared task. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 955–962, Belgium, Brussels. Association for Computational Linguistics.
Cite (Informal):
Prompsit’s submission to WMT 2018 Parallel Corpus Filtering shared task (Sánchez-Cartagena et al., 2018)
PDF:
https://aclanthology.org/W18-6488.pdf
Code:
bitextor/bicleaner