Original or Translated? A Causal Analysis of the Impact of Translationese on Machine Translation Performance
Jingwei Ni | Zhijing Jin | Markus Freitag | Mrinmaya Sachan | Bernhard Schölkopf
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Human-translated text displays distinct features from naturally written text in the same language. This phenomena, known as translationese, has been argued to confound the machine translation (MT) evaluation. Yet, we find that existing work on translationese neglects some important factors and the conclusions are mostly correlational but not causal. In this work, we collect CausalMT, a dataset where the MT training data are also labeled with the human translation directions. We inspect two critical factors, the train-test direction match (whether the human translation directions in the training and test sets are aligned), and data-model direction match (whether the model learns in the same direction as the human translation direction in the dataset). We show that these two factors have a large causal effect on the MT performance, in addition to the test-model direction mismatch highlighted by existing work on the impact of translationese. In light of our findings, we provide a set of suggestions for MT training and evaluation. Our code and data are at https://github.com/EdisonNi-hku/CausalMT
Causal Direction of Data Collection Matters: Implications of Causal and Anticausal Learning for NLP
Zhijing Jin | Julius von Kügelgen | Jingwei Ni | Tejas Vaidhya | Ayush Kaushal | Mrinmaya Sachan | Bernhard Schoelkopf
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
The principle of independent causal mechanisms (ICM) states that generative processes of real world data consist of independent modules which do not influence or inform each other. While this idea has led to fruitful developments in the field of causal inference, it is not widely-known in the NLP community. In this work, we argue that the causal direction of the data collection process bears nontrivial implications that can explain a number of published NLP findings, such as differences in semi-supervised learning (SSL) and domain adaptation (DA) performance across different settings. We categorize common NLP tasks according to their causal direction and empirically assay the validity of the ICM principle for text data using minimum description length. We conduct an extensive meta-analysis of over 100 published SSL and 30 DA studies, and find that the results are consistent with our expectations based on causal insights. This work presents the first attempt to analyze the ICM principle in NLP, and provides constructive suggestions for future modeling choices.
- Zhijing Jin 2
- Mrinmaya Sachan 2
- Markus Freitag 1
- Bernhard Schölkopf 1
- Julius von Kügelgen 1
- show all...