Original or Translated? A Causal Analysis of the Impact of Translationese on Machine Translation Performance

Jingwei Ni, Zhijing Jin, Markus Freitag, Mrinmaya Sachan, Bernhard Schölkopf


Abstract
Human-translated text displays distinct features from naturally written text in the same language. This phenomenon, known as translationese, has been argued to confound machine translation (MT) evaluation. Yet, we find that existing work on translationese neglects some important factors, and its conclusions are mostly correlational rather than causal. In this work, we collect CausalMT, a dataset where the MT training data are also labeled with the human translation directions. We inspect two critical factors: the train-test direction match (whether the human translation directions in the training and test sets are aligned) and the data-model direction match (whether the model learns in the same direction as the human translation direction in the dataset). We show that these two factors have a large causal effect on MT performance, in addition to the test-model direction mismatch highlighted by existing work on the impact of translationese. In light of our findings, we provide a set of suggestions for MT training and evaluation. Our code and data are at https://github.com/EdisonNi-hku/CausalMT
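The two factors defined in the abstract can be read as simple equality checks between translation directions. The following is a minimal illustrative sketch, not code from the CausalMT repository: the `Direction` type, the function names, and the German/English example are all assumptions introduced here for illustration.

```python
from typing import NamedTuple


class Direction(NamedTuple):
    """A translation direction (hypothetical helper type).

    For a corpus, `source` is the language the text was originally written in
    and `target` is the language a human translated it into. For a model,
    `source` is the input language and `target` is the output language.
    """
    source: str
    target: str


def train_test_direction_match(train_human: Direction, test_human: Direction) -> bool:
    """True when the training and test sets were human-translated in the same direction."""
    return train_human == test_human


def data_model_direction_match(data_human: Direction, model: Direction) -> bool:
    """True when the model translates in the same direction as the humans who produced the data."""
    return data_human == model


# Assumed example: training data originally written in German and translated
# into English, a test set produced the other way round, and a German-to-English model.
train_human = Direction("de", "en")
test_human = Direction("en", "de")
model = Direction("de", "en")

print(train_test_direction_match(train_human, test_human))
print(data_model_direction_match(train_human, model))
```

In this assumed setup the training and test sets disagree on the human translation direction, while the training data and the model agree; the same `data_model_direction_match` check applied to the test set would express the test-model direction factor that the abstract attributes to earlier work.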
Anthology ID:
2022.naacl-main.389
Volume:
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Month:
July
Year:
2022
Address:
Seattle, United States
Editors:
Marine Carpuat, Marie-Catherine de Marneffe, Ivan Vladimir Meza Ruiz
Venue:
NAACL
Publisher:
Association for Computational Linguistics
Pages:
5303–5320
URL:
https://aclanthology.org/2022.naacl-main.389
DOI:
10.18653/v1/2022.naacl-main.389
Cite (ACL):
Jingwei Ni, Zhijing Jin, Markus Freitag, Mrinmaya Sachan, and Bernhard Schölkopf. 2022. Original or Translated? A Causal Analysis of the Impact of Translationese on Machine Translation Performance. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5303–5320, Seattle, United States. Association for Computational Linguistics.
Cite (Informal):
Original or Translated? A Causal Analysis of the Impact of Translationese on Machine Translation Performance (Ni et al., NAACL 2022)
PDF:
https://aclanthology.org/2022.naacl-main.389.pdf
Video:
https://aclanthology.org/2022.naacl-main.389.mp4
Code:
edisonni-hku/causalmt