Zipporah: a Fast and Scalable Data Cleaning System for Noisy Web-Crawled Parallel Corpora

Hainan Xu, Philipp Koehn


Abstract
We introduce Zipporah, a fast and scalable data cleaning system. We propose a novel type of bag-of-words translation feature, and train logistic regression models to distinguish good data from synthetic noisy data in the proposed feature space. The trained model is used to score parallel sentences in the data pool for selection. As shown in experiments, Zipporah selects a high-quality parallel corpus from a large, mixed-quality data pool. In particular, for one noisy dataset, Zipporah achieves a 2.1 BLEU score improvement when using 1/5 of the data compared with using the entire corpus.
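The pipeline described in the abstract (compute a bag-of-words translation feature, train a logistic regression model on good versus synthetic noisy pairs, then score candidate pairs) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and not the authors' implementation: the toy dictionary `TOY_DICT`, the overlap-based feature `bow_feature`, the way noisy pairs are synthesized (pairing each source with the wrong target), and the hand-written gradient-descent trainer are all hypothetical stand-ins.

```python
import math

# Hypothetical toy dictionary: source word -> set of acceptable translations.
TOY_DICT = {
    "das": {"the", "that"},
    "haus": {"house", "home"},
    "ist": {"is"},
    "rot": {"red"},
    "katze": {"cat"},
    "klein": {"small", "little"},
    "hund": {"dog"},
}

def bow_feature(src, tgt):
    """Assumed feature: fraction of source words that have at least one
    dictionary translation present in the target (word order is ignored)."""
    src_words = src.lower().split()
    tgt_words = set(tgt.lower().split())
    if not src_words:
        return 0.0
    hits = sum(1 for w in src_words if TOY_DICT.get(w, set()) & tgt_words)
    return hits / len(src_words)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(xs, ys, lr=0.5, epochs=2000):
    """One-feature logistic regression fitted by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        w -= lr * sum(e * x for e, x in zip(errors, xs)) / n
        b -= lr * sum(errors) / n
    return w, b

# Good pairs (label 1): a tiny stand-in for a clean parallel corpus.
good_pairs = [
    ("das haus ist rot", "the house is red"),
    ("katze klein", "little cat"),
    ("hund rot", "red dog"),
]

# Synthetic noisy pairs (label 0): each source is paired with the next
# sentence's target, so the two sides no longer translate each other.
targets = [t for _, t in good_pairs]
noisy_pairs = [
    (s, targets[(i + 1) % len(targets)]) for i, (s, _) in enumerate(good_pairs)
]

xs = [bow_feature(s, t) for s, t in good_pairs + noisy_pairs]
ys = [1] * len(good_pairs) + [0] * len(noisy_pairs)
w, b = train_logreg(xs, ys)

def score(src, tgt):
    """Model probability that (src, tgt) is a good pair; higher is cleaner."""
    return sigmoid(w * bow_feature(src, tgt) + b)

# Rank a small data pool by score, best first; a selection step would keep the top.
pool = [("das haus ist rot", "red dog"), ("das haus ist rot", "the house is red")]
ranked = sorted(pool, key=lambda p: score(*p), reverse=True)
```

In this sketch, selection amounts to sorting the pool by `score` and keeping the top-ranked pairs. A realistic system would use features estimated from data and a much larger training set.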
Anthology ID:
D17-1319
Volume:
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing
Month:
September
Year:
2017
Address:
Copenhagen, Denmark
Editors:
Martha Palmer, Rebecca Hwa, Sebastian Riedel
Venue:
EMNLP
SIG:
SIGDAT
Publisher:
Association for Computational Linguistics
Pages:
2945–2950
URL:
https://aclanthology.org/D17-1319
DOI:
10.18653/v1/D17-1319
Cite (ACL):
Hainan Xu and Philipp Koehn. 2017. Zipporah: a Fast and Scalable Data Cleaning System for Noisy Web-Crawled Parallel Corpora. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2945–2950, Copenhagen, Denmark. Association for Computational Linguistics.
Cite (Informal):
Zipporah: a Fast and Scalable Data Cleaning System for Noisy Web-Crawled Parallel Corpora (Xu & Koehn, EMNLP 2017)
PDF:
https://aclanthology.org/D17-1319.pdf
Video:
https://aclanthology.org/D17-1319.mp4