A Comparison of Data Filtering Methods for Neural Machine Translation

Fred Bane, Celia Soler Uguet, Wiktor Stribiżew, Anna Zaretskaya


Abstract
With the increasing availability of large-scale parallel corpora derived from web crawling and bilingual text mining, data filtering is becoming an increasingly important step in neural machine translation (NMT) pipelines. This paper applies several available tools to the task of data filtering and compares their performance in removing different types of noisy data. We also study the effect of filtering with each tool on model performance in the downstream task of NMT by creating a dataset containing a combination of clean and noisy data, filtering the data with each tool, and training NMT engines on the resulting filtered corpora. We evaluate the performance of each engine with a combination of direct assessment (DA) and automated metrics. Our best results are obtained by training for a short time on all available data, then filtering the corpus with cross-entropy filtering and training until convergence.
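The cross-entropy filtering mentioned in the abstract can be illustrated with a minimal sketch: score each sentence pair by its average per-token cross-entropy under some scoring model, and keep only pairs below a threshold (lower cross-entropy suggests a cleaner pair). The names `cross_entropy_filter` and `toy_score` are hypothetical, and the stand-in scorer below only mimics one common noise signal (source/target length mismatch); in practice the score would come from a trained language or translation model.

```python
def cross_entropy_filter(pairs, score_fn, threshold):
    """Keep sentence pairs whose average per-token cross-entropy,
    as judged by score_fn, is below threshold (lower = cleaner)."""
    kept = []
    for src, tgt in pairs:
        # score_fn returns the total negative log-likelihood of tgt given src
        nll = score_fn(src, tgt)
        tokens = max(len(tgt.split()), 1)
        if nll / tokens < threshold:
            kept.append((src, tgt))
    return kept

# Toy stand-in scorer: penalizes large length mismatch, a common noise signal.
def toy_score(src, tgt):
    ls, lt = len(src.split()), len(tgt.split())
    return 2.0 * lt + 5.0 * abs(ls - lt)

pairs = [
    ("the cat sat", "le chat est assis"),                       # plausible pair
    ("hello", "this is a completely unrelated long sentence"),  # noisy pair
]
clean = cross_entropy_filter(pairs, toy_score, threshold=4.0)
# → keeps only the plausible pair
```

A real pipeline would replace `toy_score` with per-token scores from an NMT or language model; the filtering logic itself stays the same.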
Anthology ID:
2022.amta-upg.22
Volume:
Proceedings of the 15th Biennial Conference of the Association for Machine Translation in the Americas (Volume 2: Users and Providers Track and Government Track)
Month:
September
Year:
2022
Address:
Orlando, USA
Venue:
AMTA
Publisher:
Association for Machine Translation in the Americas
Pages:
313–325
URL:
https://aclanthology.org/2022.amta-upg.22
Cite (ACL):
Fred Bane, Celia Soler Uguet, Wiktor Stribiżew, and Anna Zaretskaya. 2022. A Comparison of Data Filtering Methods for Neural Machine Translation. In Proceedings of the 15th Biennial Conference of the Association for Machine Translation in the Americas (Volume 2: Users and Providers Track and Government Track), pages 313–325, Orlando, USA. Association for Machine Translation in the Americas.
Cite (Informal):
A Comparison of Data Filtering Methods for Neural Machine Translation (Bane et al., AMTA 2022)
PDF:
https://aclanthology.org/2022.amta-upg.22.pdf