An exploratory approach to the Parallel Corpus Filtering shared task WMT20

Ankur Kejriwal, Philipp Koehn


Abstract
In this document we describe our submission to the parallel corpus filtering task, which uses multilingual word embeddings, language models, and an ensemble of pre- and post-filtering rules. We use the norms of the embeddings and the perplexities of the language models, together with the pre- and post-filtering rules, to complement the LASER baseline scores, and we obtain an improvement on the dev set for both language pairs.
Anthology ID:
2020.wmt-1.108
Volume:
Proceedings of the Fifth Conference on Machine Translation
Month:
November
Year:
2020
Address:
Online
Editors:
Loïc Barrault, Ondřej Bojar, Fethi Bougares, Rajen Chatterjee, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Alexander Fraser, Yvette Graham, Paco Guzman, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, André Martins, Makoto Morishita, Christof Monz, Masaaki Nagata, Toshiaki Nakazawa, Matteo Negri
Venue:
WMT
SIG:
SIGMT
Publisher:
Association for Computational Linguistics
Pages:
959–965
URL:
https://aclanthology.org/2020.wmt-1.108
Cite (ACL):
Ankur Kejriwal and Philipp Koehn. 2020. An exploratory approach to the Parallel Corpus Filtering shared task WMT20. In Proceedings of the Fifth Conference on Machine Translation, pages 959–965, Online. Association for Computational Linguistics.
Cite (Informal):
An exploratory approach to the Parallel Corpus Filtering shared task WMT20 (Kejriwal & Koehn, WMT 2020)
PDF:
https://aclanthology.org/2020.wmt-1.108.pdf
Video:
https://slideslive.com/38939649