Filtering Matters: Experiments in Filtering Training Sets for Machine Translation

Steinþór Steingrímsson, Hrafn Loftsson, Andy Way


Abstract
We explore different approaches for filtering parallel data for MT training, whether the same filtering approaches suit different datasets, and whether separate filters should be applied to a dataset depending on the translation direction. We evaluate the results of the different approaches both manually and on a downstream NMT task. We find, first, that it is beneficial to inspect how well different filtering approaches suit different datasets and, second, that although MT systems trained on data prepared using different filters do not differ substantially in quality, the differences between them are nonetheless statistically significant. Finally, we find that the same training sets do not seem to suit different translation directions.
Anthology ID:
2023.nodalida-1.58
Volume:
Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)
Month:
May
Year:
2023
Address:
Tórshavn, Faroe Islands
Editors:
Tanel Alumäe, Mark Fishel
Venue:
NoDaLiDa
Publisher:
University of Tartu Library
Pages:
588–600
URL:
https://aclanthology.org/2023.nodalida-1.58
Cite (ACL):
Steinþór Steingrímsson, Hrafn Loftsson, and Andy Way. 2023. Filtering Matters: Experiments in Filtering Training Sets for Machine Translation. In Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa), pages 588–600, Tórshavn, Faroe Islands. University of Tartu Library.
Cite (Informal):
Filtering Matters: Experiments in Filtering Training Sets for Machine Translation (Steingrímsson et al., NoDaLiDa 2023)
PDF:
https://aclanthology.org/2023.nodalida-1.58.pdf