@inproceedings{pal-heafield-2023-cheating,
    title = "Cheating to Identify Hard Problems for Neural Machine Translation",
    author = "Pal, Proyag  and
      Heafield, Kenneth",
    editor = "Vlachos, Andreas  and
      Augenstein, Isabelle",
    booktitle = "Findings of the Association for Computational Linguistics: EACL 2023",
    month = may,
    year = "2023",
    address = "Dubrovnik, Croatia",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-eacl.120",
    doi = "10.18653/v1/2023.findings-eacl.120",
    pages = "1620--1631",
    abstract = "We identify hard problems for neural machine translation models by analyzing progressively higher-scoring translations generated by letting models cheat to various degrees. If a system cheats and still gets something wrong, that suggests it is a hard problem. We experiment with two forms of cheating: providing the model a compressed representation of the target as an additional input, and fine-tuning on the test set. Contrary to popular belief, we find that the most frequent tokens are not necessarily the most accurately translated due to these often being function words and punctuation that can be used more flexibly in translation, or content words which can easily be paraphrased. We systematically analyze system outputs to identify categories of tokens which are particularly hard for the model to translate, and find that this includes certain types of named entities, subordinating conjunctions, and unknown and foreign words. We also encounter a phenomenon where words, often names, which were not infrequent in the training data are still repeatedly mistranslated by the models {---} we dub this the Fleetwood Mac problem.",
}
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="pal-heafield-2023-cheating">
    <titleInfo>
        <title>Cheating to Identify Hard Problems for Neural Machine Translation</title>
    </titleInfo>
    <name type="personal">
        <namePart type="given">Proyag</namePart>
        <namePart type="family">Pal</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Kenneth</namePart>
        <namePart type="family">Heafield</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <originInfo>
        <dateIssued>2023-05</dateIssued>
    </originInfo>
    <typeOfResource>text</typeOfResource>
    <relatedItem type="host">
        <titleInfo>
            <title>Findings of the Association for Computational Linguistics: EACL 2023</title>
        </titleInfo>
        <name type="personal">
            <namePart type="given">Andreas</namePart>
            <namePart type="family">Vlachos</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Isabelle</namePart>
            <namePart type="family">Augenstein</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <originInfo>
            <publisher>Association for Computational Linguistics</publisher>
            <place>
                <placeTerm type="text">Dubrovnik, Croatia</placeTerm>
            </place>
        </originInfo>
        <genre authority="marcgt">conference publication</genre>
    </relatedItem>
    <abstract>We identify hard problems for neural machine translation models by analyzing progressively higher-scoring translations generated by letting models cheat to various degrees. If a system cheats and still gets something wrong, that suggests it is a hard problem. We experiment with two forms of cheating: providing the model a compressed representation of the target as an additional input, and fine-tuning on the test set. Contrary to popular belief, we find that the most frequent tokens are not necessarily the most accurately translated due to these often being function words and punctuation that can be used more flexibly in translation, or content words which can easily be paraphrased. We systematically analyze system outputs to identify categories of tokens which are particularly hard for the model to translate, and find that this includes certain types of named entities, subordinating conjunctions, and unknown and foreign words. We also encounter a phenomenon where words, often names, which were not infrequent in the training data are still repeatedly mistranslated by the models — we dub this the Fleetwood Mac problem.</abstract>
    <identifier type="citekey">pal-heafield-2023-cheating</identifier>
    <identifier type="doi">10.18653/v1/2023.findings-eacl.120</identifier>
    <location>
        <url>https://aclanthology.org/2023.findings-eacl.120</url>
    </location>
    <part>
        <date>2023-05</date>
        <extent unit="page">
            <start>1620</start>
            <end>1631</end>
        </extent>
    </part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T Cheating to Identify Hard Problems for Neural Machine Translation
%A Pal, Proyag
%A Heafield, Kenneth
%Y Vlachos, Andreas
%Y Augenstein, Isabelle
%S Findings of the Association for Computational Linguistics: EACL 2023
%D 2023
%8 May
%I Association for Computational Linguistics
%C Dubrovnik, Croatia
%F pal-heafield-2023-cheating
%X We identify hard problems for neural machine translation models by analyzing progressively higher-scoring translations generated by letting models cheat to various degrees. If a system cheats and still gets something wrong, that suggests it is a hard problem. We experiment with two forms of cheating: providing the model a compressed representation of the target as an additional input, and fine-tuning on the test set. Contrary to popular belief, we find that the most frequent tokens are not necessarily the most accurately translated due to these often being function words and punctuation that can be used more flexibly in translation, or content words which can easily be paraphrased. We systematically analyze system outputs to identify categories of tokens which are particularly hard for the model to translate, and find that this includes certain types of named entities, subordinating conjunctions, and unknown and foreign words. We also encounter a phenomenon where words, often names, which were not infrequent in the training data are still repeatedly mistranslated by the models — we dub this the Fleetwood Mac problem.
%R 10.18653/v1/2023.findings-eacl.120
%U https://aclanthology.org/2023.findings-eacl.120
%U https://doi.org/10.18653/v1/2023.findings-eacl.120
%P 1620-1631
Markdown (Informal)

[Cheating to Identify Hard Problems for Neural Machine Translation](https://aclanthology.org/2023.findings-eacl.120) (Pal & Heafield, Findings 2023)

ACL

Proyag Pal and Kenneth Heafield. 2023. [Cheating to Identify Hard Problems for Neural Machine Translation](https://aclanthology.org/2023.findings-eacl.120). In *Findings of the Association for Computational Linguistics: EACL 2023*, pages 1620–1631, Dubrovnik, Croatia. Association for Computational Linguistics.