Detecting Various Types of Noise for Neural Machine Translation

Christian Herold, Jan Rosendahl, Joris Vanvinckenroye, Hermann Ney


Abstract
Filtering and selecting training data is one of the core aspects to consider when building a strong machine translation system. In their influential work, Khayrallah and Koehn (2018) investigated the impact of different types of noise on the performance of machine translation systems. In the same year, WMT introduced a shared task on parallel corpus filtering, which was repeated in the following years and resulted in many different filtering approaches being proposed. In this work, we aim to combine the recent achievements in data filtering with the original analysis of Khayrallah and Koehn (2018) and investigate whether state-of-the-art filtering systems are capable of removing all the suggested noise types. We observe that most of these types of noise can be detected with an accuracy of over 90% by modern filtering systems when operating in a well-studied, high-resource setting. However, we also find that when confronted with more refined noise categories or when working with a less common language pair, the performance of the filtering systems is far from optimal, showing that there is still room for improvement in this area of research.
Anthology ID:
2022.findings-acl.200
Volume:
Findings of the Association for Computational Linguistics: ACL 2022
Month:
May
Year:
2022
Address:
Dublin, Ireland
Editors:
Smaranda Muresan, Preslav Nakov, Aline Villavicencio
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
2542–2551
URL:
https://aclanthology.org/2022.findings-acl.200
DOI:
10.18653/v1/2022.findings-acl.200
Cite (ACL):
Christian Herold, Jan Rosendahl, Joris Vanvinckenroye, and Hermann Ney. 2022. Detecting Various Types of Noise for Neural Machine Translation. In Findings of the Association for Computational Linguistics: ACL 2022, pages 2542–2551, Dublin, Ireland. Association for Computational Linguistics.
Cite (Informal):
Detecting Various Types of Noise for Neural Machine Translation (Herold et al., Findings 2022)
PDF:
https://aclanthology.org/2022.findings-acl.200.pdf
Video:
https://aclanthology.org/2022.findings-acl.200.mp4