Causal Direction of Data Collection Matters: Implications of Causal and Anticausal Learning for NLP

The principle of independent causal mechanisms (ICM) states that generative processes of real world data consist of independent modules which do not influence or inform each other. While this idea has led to fruitful developments in the field of causal inference, it is not widely-known in the NLP community. In this work, we argue that the causal direction of the data collection process bears nontrivial implications that can explain a number of published NLP findings, such as differences in semi-supervised learning (SSL) and domain adaptation (DA) performance across different settings. We categorize common NLP tasks according to their causal direction and empirically assay the validity of the ICM principle for text data using minimum description length. We conduct an extensive meta-analysis of over 100 published SSL and 30 DA studies, and find that the results are consistent with our expectations based on causal insights. This work presents the first attempt to analyze the ICM principle in NLP, and provides constructive suggestions for future modeling choices.


Introduction
NLP practitioners typically do not pay great attention to the causal direction of the data collection process. As a motivating example, consider the case of collecting a dataset to train a machine translation (MT) model to translate from English (En) to Spanish (Es): it is common practice to mix all available En-Es sentence pairs together and train the model on the entire pooled dataset (Bahdanau et al., 2015; Cho et al., 2014). However, such mixed corpora actually consist of two distinct types of data: (i) sentences that originated in English and have been translated (by human translators) into Spanish (En→Es); and (ii) sentences that originated in Spanish and have subsequently been translated into English (Es→En). Intuitively, these two subsets are qualitatively different, and an increasing number of observations by the NLP community indeed suggests that they exhibit different properties (Freitag et al., 2019; Edunov et al., 2020; Riley et al., 2020; Shen et al., 2021). In the case of MT, for example, researchers find that training models on each of these two types of data separately leads to different test performance, as well as different performance improvements from semi-supervised learning (SSL) (Bogoychev and Sennrich, 2019; Graham et al., 2020; Edunov et al., 2020). Motivated by this observation that the data collection process seems to matter for model performance, in this work we provide an explanation of this phenomenon from the perspective of causality (Pearl, 2009; Peters et al., 2017).

Figure 1: Annotation process for NLP data: the random variable that exists first is typically the cause (e.g., a given prompt), and the one generated afterwards is typically the effect (e.g., the annotated answer).
First, we introduce the notion of the causal direction for a given NLP task; see Fig. 1 for an example. Throughout, we denote the input of a learning task by X and the output which is to be predicted by Y. If, during the data collection process, X is generated first, and then Y is collected based on X (e.g., through annotation), we say that X causes Y, and denote this by X → Y. If, on the other hand, Y is generated first, and then X is collected based on Y, we say that Y causes X (Y → X).³ Based on whether the direction of prediction aligns with the causal direction of the data collection process or not, Schölkopf et al. (2012) categorize these types of tasks as causal learning (X → Y) or anticausal learning (Y → X), respectively; see Fig. 2 for an illustration. In the context of our motivating MT example, this means that, if the goal is to translate from English (X = En) into Spanish (Y = Es), training only on subset (i) of the data, consisting of En→Es pairs, corresponds to causal learning (X → Y), whereas training only on subset (ii), consisting of Es→En pairs, is categorised as anticausal learning (Y → X).

Figure 2: (Top) A causal graph C → E, where C is the cause and E is the effect. The function f(·, N_E) denotes the causal process, or mechanism, P_{E|C}, by which the effect E is generated from C and unobserved noise N_E. (Bottom) Based on whether the direction of prediction aligns with the direction of causation or not, we distinguish two types of tasks: (i) causal learning, i.e., predicting the effect from the cause; and (ii) anticausal learning, i.e., predicting the cause from the effect.
Based on the principle of independent causal mechanisms (ICM) (Peters et al., 2017), it has been hypothesized that the causal direction of data collection (i.e., whether a given NLP learning task can be classified as causal or anticausal) has implications for the effectiveness of commonly used techniques such as SSL and domain adaptation (DA) (Schölkopf et al., 2012). We will argue that this can explain performance differences reported by the NLP community across different data collection processes and tasks. In particular, we make the following contributions:
1. We categorize a number of common NLP tasks according to the causal direction of the underlying data collection process (§ 2).
2. We review the ICM principle and its implications for common techniques of using unlabelled data, such as SSL and DA, in the context of causal and anticausal NLP tasks (§ 3).
3. We empirically assay the validity of ICM for NLP data using minimum description length in a machine translation setting (§ 4).
4. We verify, experimentally and through a meta-study of over 100 (SSL) and 30 (DA) published findings, that the differences in SSL (§ 5) and DA (§ 6) performance on causal vs. anticausal datasets reported in the literature are consistent with what is predicted by the ICM principle.
5. We make suggestions on how to use the findings in this paper for future work in NLP (§ 7).

³This corresponds to an interventional notion of causation: if one were to manipulate the cause, the annotation process would lead to a potentially different effect. A manipulation of the effect, in contrast, would not change the cause.

Categorization of Common NLP Tasks into Causal and Anticausal Learning
We start by categorizing common NLP tasks which use an input variable X to predict a target or output variable Y into causal learning (X → Y ), anticausal learning (Y → X), and other tasks that do not have a clear underlying causal direction, or which typically rely on mixed (causal and anticausal) types of data, as summarised in Tab. 1. Key to this categorization is determining whether the input X corresponds to the cause or the effect in the data collection process. As illustrated in Fig. 1, if the input X and output Y are generated at two different time steps, then the variable that is generated first is typically the cause, and the other that is subsequently generated is typically the effect, provided it is generated based on the previous one (rather than, say, on a common confounder that causes both variables). If X and Y are generated jointly, then we need to distinguish based on the underlying generative process whether one of the two variables is causing the other variable.

Learning Effect from Cause (Causal Learning)
Causal (X → Y) NLP tasks typically aim to predict a post-hoc generated human annotation (i.e., the target Y is the effect) from a given input X (the cause). Examples include: summarization (article→summary), where the goal is to produce a summary Y of a given input text X; parsing and tagging (text→linguists' annotated structure), where the goal is to predict an annotated syntactic structure Y of a given input sentence X; data-to-text generation (data→description), where the goal is to produce a textual description Y of a set of structured input data X; and information extraction (text→entities/relations/etc.), where the goal is to extract structured information from a given text.

Learning Cause from Effect (Anticausal Learning)
Anticausal (Y → X) NLP tasks typically aim to predict or infer some latent target property Y, such as an unobserved prompt, from an observed input X which takes the form of one of its effects. Typical anticausal NLP learning problems include, for example, author attribute identification (author attribute→text), where the goal is to predict some unobserved attribute Y of the writer of a given text snippet X; and review sentiment classification (sentiment→review text), where the goal is to predict the latent sentiment Y that caused an author to write a particular review X.

Other/Mixed
Some tasks can be categorized as either causal or anticausal, depending on how exactly the data is collected. In § 1, we discussed the example of MT, where different types of (causal and anticausal) data are typically mixed. Another example is the task of intent classification: if the same author reveals their intent before the writing (i.e., intent→text), it can be viewed as an anticausal learning task; if, on the other hand, the data is annotated by other people who are not the original author (i.e., text→annotated intent), it can be viewed as a causal learning task. A similar reasoning applies to question answering and generation tasks, which respectively aim to provide an answer to a given question, or vice versa: if first a piece of informative text is selected and annotators are then asked to come up with a corresponding question (answer→question) as, e.g., in the SQuAD dataset (Rajpurkar et al., 2016), then question answering is an anticausal and question generation a causal learning task; if, on the other hand, a question such as a search query is selected first and subsequently an answer is provided (question→answer) as, e.g., in the Natural Questions dataset (Kwiatkowski et al., 2019), then question answering is a causal and question generation an anticausal learning task. Often, multiple such datasets are combined without regard for their causal direction.
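The categorization above can be condensed into a small lookup table. The sketch below follows Tab. 1; the task names and labels are our own shorthand, and "mixed" means the direction must be resolved per dataset:

```python
# Shorthand rendering of the categorization (following Tab. 1; labels
# are ours). "mixed" means the causal direction depends on how the
# particular dataset was collected.
TASK_DIRECTION = {
    "summarization":            "causal",      # article -> summary
    "parsing/tagging":          "causal",      # text -> annotated structure
    "data-to-text generation":  "causal",      # data -> description
    "information extraction":   "causal",      # text -> entities/relations
    "author attribute ID":      "anticausal",  # attribute -> text
    "sentiment classification": "anticausal",  # sentiment -> review text
    "machine translation":      "mixed",       # depends on original language
    "intent classification":    "mixed",       # self-reported vs. annotated
    "question answering":       "mixed",       # NQ-style vs. SQuAD-style
}

def learning_type(task, input_is_cause=None):
    """Resolve a task to causal/anticausal learning, if determinable."""
    direction = TASK_DIRECTION[task]
    if direction != "mixed":
        return direction
    if input_is_cause is None:
        return "mixed (needs dataset-level metadata)"
    return "causal" if input_is_cause else "anticausal"
```

For the "mixed" tasks, dataset-level metadata (e.g., which language a translation pair originated in) is exactly what resolves the ambiguity.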

Implications of ICM for Causal and Anticausal Learning Problems
Whether we are in a causal or anticausal learning scenario has important implications for semi-supervised learning (SSL) and domain adaptation (DA) (Schölkopf et al., 2012; Sgouritsa et al., 2015; Zhang et al., 2013, 2015; Gong et al., 2016; von Kügelgen et al., 2019, 2020), which are techniques also commonly used in NLP. These implications are derived from the principle of independent causal mechanisms (ICM) (Schölkopf et al., 2012; Lemeire and Dirkx, 2006), which states that "the causal generative process of a system's variables is composed of autonomous modules that do not inform or influence each other" (Peters et al., 2017).
In the bivariate case, this amounts to a type of independence assumption between the distribution P_C of the cause C and the causal process, or mechanism, P_{E|C} that generates the effect from the cause. For example, for a question answering task, the generative process P_C by which one person comes up with a question C is "independent" of the process P_{E|C} by which another person produces an answer E for question C. Here, "independent" is not meant in the sense of statistical independence of random variables, but rather as independence at the level of generative processes or distributions, in the sense that P_C and P_{E|C} do not share information (the person asking the question and the one answering may not know each other) and can be manipulated independently of each other (we can swap either of the two for another participant without the other one being influenced by this). Crucially, this type of independence is generally violated in the opposite, i.e., anticausal, direction: P_E and P_{C|E} may share information and change dependently (Daniušis et al., 2010). This has two important implications for common learning tasks (Schölkopf et al., 2012), which are illustrated in Fig. 3.

Figure 3: The ICM principle assumes that the generative process P_C of the cause C is independent of the causal mechanism P_{E|C}: the two distributions share no information, and each may be changed or manipulated without affecting the other. In the anticausal direction, on the other hand, the effect distribution P_E is (in the generic case) not independent of the inverse mechanism P_{C|E}: they may share information and change dependently. (Left) SSL, which aims to improve an estimate of the target conditional P_{Y|X} given additional unlabelled input data from P_X, should therefore not help for causal learning (X → Y), but may help in the anticausal direction (Y → X). (Right) DA, which aims to adapt a model of P_{Y|X} from a source domain to a target domain (e.g., by fine-tuning on a smaller dataset), should work better for causal learning settings, where a change in P_C is not expected to lead to a change in the mechanism P_{E|C}, whereas in the anticausal direction P_E and P_{C|E} may change in a dependent manner.

Implications of ICM for SSL First, if P_C shares no information with P_{E|C}, then SSL (where one has additional unlabelled input data from P_X and aims to improve an estimate of the target conditional P_{Y|X}) should not work in the causal direction (X → Y), but may work in the anticausal direction (Y → X), since P_E and P_{C|E} may share information. Causal NLP tasks should thus be less likely than anticausal tasks to show improvements over a supervised baseline when using SSL.
Implications of ICM for DA Second, according to the ICM principle, the causal mechanism P_{E|C} should be invariant to changes in the cause distribution P_C. Domain adaptation under covariate shift (Shimodaira, 2000; Sugiyama and Kawanabe, 2012), where P_X changes but P_{Y|X} is assumed to stay invariant, should therefore work in the causal direction, but not necessarily in the anticausal direction. Hence, DA should be easier for causal NLP tasks than for anticausal NLP tasks.
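A toy numerical example (ours, not from the paper) makes this asymmetry concrete: with a fixed binary mechanism P(E|C), shifting P(C) across domains leaves P(E|C) untouched by construction, yet both factors of the anticausal factorization, P(E) and P(C|E), change, and they change dependently.

```python
import numpy as np

# Toy example (ours): binary cause C and effect E.
# The causal mechanism P(E|C) is fixed across domains.
P_E_GIVEN_C = np.array([[0.9, 0.1],   # P(E | C=0)
                        [0.2, 0.8]])  # P(E | C=1)

def anticausal_factors(p_c):
    """Return P(E) and P(C|E) implied by P(C) and the fixed P(E|C)."""
    joint = p_c[:, None] * P_E_GIVEN_C   # joint P(C, E)
    p_e = joint.sum(axis=0)              # marginal of the effect
    p_c_given_e = joint / p_e            # columns indexed by E
    return p_e, p_c_given_e

# Covariate shift in the causal direction: only P(C) differs between
# source and target; the mechanism P(E|C) is untouched by construction.
p_e_src, p_cge_src = anticausal_factors(np.array([0.5, 0.5]))
p_e_tgt, p_cge_tgt = anticausal_factors(np.array([0.9, 0.1]))

# In the anticausal factorization, BOTH factors change -- exactly the
# dependence the ICM principle predicts for the non-causal direction.
assert not np.allclose(p_e_src, p_e_tgt)
assert not np.allclose(p_cge_src, p_cge_tgt)
```

A model of P(C|E) fitted on the source domain is thus stale in the target domain, even though nothing about the underlying mechanism changed.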

Investigating the Validity of ICM for NLP Data Using MDL
Traditionally, the ICM principle is thought of in the context of physical processes or mechanisms, rather than social or linguistic ones such as language. Since ICM amounts to an independence assumption that, while well motivated in principle, may not always hold in practice,⁵ we now assay its validity on NLP data. Recall that ICM postulates a type of independence between P_C and P_{E|C}. One way to formalize this uses Kolmogorov complexity K(·) as a measure of algorithmic information, which can be understood as the length of the shortest program that computes a particular algorithmic object, such as a distribution or a function (Solomonoff, 1964; Kolmogorov, 1965). ICM then reads

K(P_{C,E}) = K(P_C) + K(P_{E|C}) ≤ K(P_E) + K(P_{C|E}),   (1)

where the equality and inequality hold up to an additive constant. In other words, the shortest description of the joint distribution P_{C,E} corresponds to describing P_C and P_{E|C} separately (i.e., they share no information), whereas there may be redundant (shared) information in the non-causal direction, such that a separate description of P_E and P_{C|E} will generally be longer than that of the joint distribution P_{C,E}.

⁵E.g., due to confounding influences from unobserved variables, or mechanisms which have co-evolved to be dependent.

Estimation by MDL
Since Kolmogorov complexity is not computable (Li et al., 2008), we adopt a commonly used proxy, the minimum description length (MDL) (Grünwald, 2007), to test the applicability of ICM for NLP data. Given an input, such as a collection of observations (c_1, e_1), ..., (c_n, e_n) drawn from P_{C,E}, MDL returns the shortest codelength (in bits) needed to compress the input, together with the parameters needed to decompress it. We use MDL to approximate (1) as follows:

MDL(c_{1:n}, e_{1:n}) = MDL(c_{1:n}) + MDL(e_{1:n} | c_{1:n}) ≤ MDL(e_{1:n}) + MDL(c_{1:n} | e_{1:n}),   (2)

where MDL(·|·) denotes a conditional compression in which the second argument is treated as "free parameters" that do not count towards the compression length of the first argument, and the equality and inequality hold up to a constant due to the choice of Turing machine in the definition of algorithmic information. Eq. (2) can thus be interpreted as a comparison between two ways of compressing the same data (c_{1:n}, e_{1:n}): either we first compress c_{1:n} and then compress e_{1:n} conditional on c_{1:n}, or vice versa. According to the ICM principle, the first way should tend to be more "concise" than the second.

Calculating MDL Using Machine Translation as a Case Study
To empirically assess the validity of ICM for NLP data using MDL as a proxy, we turn to MT as a case study. We choose MT because its input and output spaces are relatively symmetric, as opposed to other NLP tasks such as text classification, where the input space consists of sequences but the output space is a small set of labels. There are only very few studies which calculate MDL on NLP data, so we extend the method of Voita and Titov (2020), which calculates MDL using online codes (Rissanen, 1984) for deep learning tasks (Blier and Ollivier, 2018). Since the original calculation method for MDL by Voita and Titov (2020) was developed for classification, we extend it to sequence-to-sequence (Seq2Seq) generation. Specifically, given a translation dataset D = {(x_1, y_1), ..., (x_n, y_n)} of n pairs of sentences x_i with translations y_i, denote the size of the vocabulary of the source language by V_x and the size of the vocabulary of the target language by V_y. In order to assess whether (2) holds, we need to calculate four different terms: two marginal terms, MDL(x_{1:n}) and MDL(y_{1:n}), and two conditional terms, MDL(y_{1:n} | x_{1:n}) and MDL(x_{1:n} | y_{1:n}).

Codelength of the Conditional Terms
To calculate the codelength of the two conditional terms, we extend the method of Voita and Titov (2020) from classification to Seq2Seq generation. Following the setting of Voita and Titov (2020), we break the dataset D into 10 disjoint subsets of increasing size and denote the end index of each subset by t_i.⁷ We then estimate MDL(y_{1:n} | x_{1:n}) as

MDL(y_{1:n} | x_{1:n}) = Σ_{i=1}^{t_1} length(y_i) · log₂ V_y − Σ_{i=1}^{9} log₂ p_{θ_i}(y_{t_i+1:t_{i+1}} | x_{t_i+1:t_{i+1}}),

where length(y_i) refers to the number of tokens in the sequence y_i, θ_i are the parameters of a translation model h_i trained on the first t_i data points, and seq_{idx_1:idx_2} refers to the set of sequences from the idx_1-th to the idx_2-th sample in the dataset D, where seq ∈ {x, y} and idx_i ∈ {1, ..., n}. The first term encodes the first subset with a uniform code over the target vocabulary; each subsequent subset is encoded using a model trained on all preceding data.

⁷The sizes of the 10 subsets are 0.1, 0.2, 0.4, 0.8, 1.6, 3.2, 6.25, 12.5, 25, and 50 percent of the dataset size, respectively; e.g., t_1 = 0.1% · n, t_2 = (0.1% + 0.2%) · n, etc.

Similarly, when calculating MDL(x 1:n |y 1:n ), we simply swap the roles of x and y.
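The online-code bookkeeping described above can be sketched as follows. This is a simplified sketch with our own function and argument names: the real computation also trains a new model h_i on each data prefix, which is omitted here.

```python
import math

def online_codelength(first_block_token_count, vocab_size,
                      block_neg_log2_likelihoods):
    """Prequential (online) codelength in bits, in the spirit of the
    Seq2Seq MDL estimate above (a sketch; names are ours).

    first_block_token_count: total tokens in the first subset, encoded
        with a uniform code over the vocabulary (log2 V bits per token).
    block_neg_log2_likelihoods: for each later subset i, the summed
        -log2 p(subset_i) under a model trained on all preceding subsets.
    """
    uniform_cost = first_block_token_count * math.log2(vocab_size)
    return uniform_cost + sum(block_neg_log2_likelihoods)

# e.g., 4 first-block tokens over a 256-symbol vocabulary, then two
# later blocks costing 10 and 20 bits under their prefix-trained models:
print(online_codelength(4, 256, [10.0, 20.0]))  # 62.0
```

The same helper serves both the conditional terms (translation models) and the marginal terms (language models); only the per-block likelihoods differ.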
Codelength of the Marginal Terms
When calculating the two marginal terms, MDL(x_{1:n}) and MDL(y_{1:n}), we make two changes to the above calculation of the conditional terms: first, we replace the translation models h_i with language models; second, we remove the conditioning.
That is, we calculate MDL(x_{1:n}) as

MDL(x_{1:n}) = Σ_{i=1}^{t_1} length(x_i) · log₂ V_x − Σ_{i=1}^{9} log₂ p_{θ_i}(x_{t_i+1:t_{i+1}}),

where θ_i are the parameters of a language model h_i trained on the first t_i data points. We apply the same method to calculate MDL(y_{1:n}). For the language model, we use GPT-2 (Radford et al., 2019), and for the translation model, we use the Marian neural machine translation model (Junczys-Dowmunt et al., 2018) trained on the OPUS corpus (Tiedemann and Nygaard, 2004). For a fair comparison, all models adopt the transformer architecture (Vaswani et al., 2017) and have roughly the same number of parameters. See Appendix B for more experimental details.

CausalMT Corpus
For our MDL experiment, we need datasets for which the causal direction of data collection is known, i.e., for which we have ground-truth annotation of which text is the original and which is a translation, instead of a mixture of both. Since existing MT corpora do not have this property as discussed in § 1, we curate our own corpus, which we call the CausalMT corpus.
Specifically, we consider the existing MT dataset WMT'19 and identify subsets that have a clear notion of causality. The subsets we use are the EuroParl (Koehn, 2005) and Global Voices translation corpora. For EuroParl, each text has meta-information such as the speaker's language; for Global Voices, each text has meta-information about whether or not it is a translation. We regard text that is in the speaker's native language in EuroParl (and non-translated text in Global Voices) as the original (i.e., the cause). We then retrieve the corresponding effect by using the cause text to match the parallel pairs in the processed dataset. In this way, we compile six translation datasets with clear causal direction, as summarized in Tab. 2. For each dataset, we use 1K samples each as test and validation sets, and use the rest for training.
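The split can be sketched as follows. The `"original"` key marking which side came first is illustrative, not the actual EuroParl/Global Voices metadata schema (which encodes this via speaker language or a translation flag, as described above):

```python
# Hypothetical sketch of a CausalMT-style split of a parallel corpus
# into the two causal directions. Field names are illustrative.
def split_by_causal_direction(pairs):
    en_to_es, es_to_en = [], []
    for p in pairs:
        if p["original"] == "en":        # English came first: En -> Es
            en_to_es.append((p["en"], p["es"]))
        elif p["original"] == "es":      # Spanish came first: Es -> En
            es_to_en.append((p["es"], p["en"]))
        # pairs whose origin is unknown are discarded
    return en_to_es, es_to_en
```

Each returned list is then a (cause, effect) corpus with a single, known causal direction, unlike the usual pooled MT training data.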

Results
The results of our MDL experiment on the six CausalMT datasets are summarised in Tab. 3. If ICM holds, we expect the sum of codelengths to be smaller for the causal direction than for the anticausal one; see (2). As can be seen from the last column, this is the case for five out of the six datasets. For example, on one of the largest datasets (En→Es), the MDL difference is 346 kbits.¹⁰ Comparing the dataset sizes in Tab. 2 and the results in Tab. 3, we observe that the absolute MDL values are roughly proportional to dataset size, but other factors such as language and task complexity also play a role. This is inherent to the nature of MDL as the sum of the codelengths of the model and of the data given the model. Since we use equally-sized datasets for each language pair in the CausalMT corpus (i.e., in both the X → Y and Y → X directions; see Tab. 2), numbers for the same language pair in Tab. 3, including the most important column "MDL(X)+MDL(Y|X) vs. MDL(Y)+MDL(X|Y)", form a valid comparison. That is, the En&Es experiments are comparable among themselves, and so are the other language pairs. For some of the smaller differences in the last column of Tab. 3, and in particular the reversed inequality in row 4, a potential explanation may be the relatively small dataset size, as well as the fact that text data may be confounded (e.g., through shared grammar and semantics).

¹⁰As far as we know, determining statistical significance in the investigated setting remains an open problem. While, in theory, one may use information entropy to estimate it, in practice this may be inaccurate since (i) MDL is only a proxy for algorithmic information; and (ii) ICM may not hold exactly, but only approximately. We evaluate on six different datasets so that the overall results can show a general trend.

SSL for Causal vs. Anticausal Models
In semi-supervised learning (SSL), we are given a typically small set of k labeled observations, D_L = {(x_1, y_1), ..., (x_k, y_k)}, and a typically large set of m unlabeled observations of the input, D_U = {x'_1, ..., x'_m}. SSL then aims to use the additional information about the input distribution P_X from the unlabeled dataset D_U to improve a model of P_{Y|X} learned on the labeled dataset D_L.
As explained in § 3, SSL should only work for anticausal (or confounded) learning tasks, according to the ICM principle. Schölkopf et al. (2012) have observed this trend on a number of classification and regression tasks on small-scale numerical inputs, such as predicting Boston housing prices from quantifiable neighborhood features (causal learning), or breast cancer from lab statistics (anticausal learning). However, there exist no studies investigating the implications of ICM for SSL on NLP data, which is of a more complex nature due to the high dimensionality of the input and output spaces, as well as potentially large confounding. In the following, we use a sequence-to-sequence decipherment experiment ( § 5.1) and a meta-study of existing literature ( § 5.2) to showcase that the same phenomenon also occurs in NLP.

Decipherment Experiment
To have control over causal direction of the data collection process, we use a synthetic decipherment dataset to test the difference in SSL improvement between causal and anticausal learning tasks.
Dataset We create a synthetic dataset of encrypted sequences. Specifically, we (i) adopt a monolingual English corpus (for convenience, we use the English side of the En→Es dataset in our CausalMT corpus), (ii) apply the ROT13 encryption algorithm (Schneier, 1996) to obtain the encrypted corpus, and then (iii) apply noise to whichever corpus is chosen as the effect corpus.
In the encryption step (ii), for each English sentence x, its encryption ROT13(x) replaces each letter with the 13th letter after it in the alphabet, e.g., "A"→"N," "B"→"O." We choose ROT13 due to its invertibility, since ROT13(ROT13(x)) = x. Therefore, without any noise, the English corpus and the ROT13-encrypted corpus are symmetric.
In the noising step (iii), we apply noise either to the English text or to the ciphertext, thus creating two datasets, Cipher→En and En→Cipher, respectively. When applying noise to a sequence, we use the implementation of the Fairseq library. Namely, we mask some random words in the sequence (word masking), permute a part of the sequence (permuted noise), randomly shift the ending of the sequence to the beginning (rolling noise), and insert some random characters or masks into the sequence (insertion noise). We set the probability of all noise types to p = 5%.
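A minimal sketch of steps (ii) and (iii) follows. This is our own reimplementation, not the Fairseq code; only word masking is shown, and the other three noise types follow the same per-token random pattern.

```python
import codecs
import random

def rot13(text):
    # ROT13 is an involution: rot13(rot13(x)) == x, which keeps the two
    # sides of the synthetic dataset symmetric before noise is applied.
    return codecs.encode(text, "rot13")

def word_mask_noise(tokens, p=0.05, mask="<mask>", rng=None):
    """Word masking, one of the four noise types above (our sketch, not
    the Fairseq implementation). p = 5% as in the experiment."""
    rng = rng or random.Random(0)
    return [mask if rng.random() < p else t for t in tokens]
```

For example, `rot13("Attack at dawn")` gives `"Nggnpx ng qnja"`, and applying `rot13` a second time recovers the original string.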
Results For each of the two datasets En→Cipher and Cipher→En, we perform SSL in the causal and anticausal direction by either treating the input X as the cause and the target Y as the effect, or vice versa. Specifically, we use a standard Transformer architecture for the supervised model, and for SSL, we multitask the translation task with an additional denoising autoencoder (Vincent et al., 2008) using the Fairseq Python package. The results are shown in Tab. 4. It can be seen that in both cases, anticausal models show a substantially larger SSL improvement than causal models.
We also note that there is a substantial gap in supervised performance between causal and anticausal learning tasks on the same underlying data. This is also expected, as causal learning is typically easier than anticausal learning: it corresponds to learning the "natural" forward function, or causal mechanism, while anticausal learning corresponds to learning the less natural, non-causal inverse mechanism.

SSL Improvements in Existing Work
After verifying the different behaviour in SSL improvement predicted by the ICM principle in the decipherment experiment, we conduct an extensive meta-study to survey whether this trend is also reflected in published NLP findings. To this end, we consider a diverse set of tasks and SSL methods. The tasks covered in our meta-study include machine translation, summarization, parsing, tagging, information extraction, review sentiment classification, text category classification, word sense disambiguation, and chunking. The SSL methods include self-training, co-training (Blum and Mitchell, 1998), tri-training (Zhou and Li, 2005), transductive support vector machines (Joachims, 1999), expectation maximization (Nigam et al., 2006), multitasking with language modeling (Dai and Le, 2015), multitasking with sentence reordering (as used in Zhang and Zong (2016)), and cross-view training (Clark et al., 2018). Further details on our meta-study are given in Appendix A. We cover 55 instances of causal learning and 50 instances of anticausal learning. A summary of the trends in causal and anticausal SSL is given in Tab. 5. Echoing the implications of ICM stated in § 3, for causal learning tasks, the average improvement by SSL is only very small: 0.04%. In contrast, the anticausal SSL improvement is larger: 1.70% on average. We use Welch's t-test (Welch, 1947) to assess whether the difference in means between the two distributions of SSL improvement (with unequal variances) is significant, and obtain a p-value of 0.011.
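Welch's t-test used here can be computed with the standard library alone, as sketched below; the actual inputs would be the two lists of per-study improvement numbers from the appendix, and the names are ours.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom
    for two samples with (possibly) unequal variances. A sketch: the
    real inputs are the per-study improvement percentages."""
    va, vb = variance(a), variance(b)   # sample variances (n - 1)
    na, nb = len(a), len(b)
    se2 = va / na + vb / nb
    t = (mean(a) - mean(b)) / math.sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df
```

The p-value then follows from the t-distribution with `df` degrees of freedom (e.g., via `scipy.stats.t.sf`, or `scipy.stats.ttest_ind(a, b, equal_var=False)` directly).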

DA for Causal vs. Anticausal Models
We also consider a supervised domain adaptation (DA) setting in which the goal is to adapt a model trained on a large labeled dataset from a source domain to a potentially different target domain, from which we only have a small labeled dataset. As explained in § 3, DA should only work well for causal learning, but not necessarily for anticausal learning, according to the ICM principle. Similar to the meta-study on SSL, we also review the existing NLP literature on DA. We focus on the DA improvement, i.e., the performance gain of using DA over an unadapted baseline that only learns from the source data and is tested on the target domain. Since the number of studies on DA that we can find is smaller than for SSL, we cover 22 instances of DA on causal tasks and 11 instances of DA on anticausal tasks.
The results are summarised in Tab. 6. We find that the observations again echo our expectation (according to ICM) that DA should work better for causal than for anticausal learning tasks. Again, we use Welch's t-test (Welch, 1947) to verify that the DA improvements of causal and anticausal learning are statistically different, and obtain a p-value of 0.023.

How to Use the Findings in this Study
Data Collection Practice in NLP Due to the different implications of causal and anticausal learning tasks, we strongly suggest annotating the causal direction when collecting new NLP data. One way to do this is to only collect data from one causal direction and to mention this in the meta information. For example, summarization data collected from the TL;DR of scientific papers SciTldr (Cachola et al., 2020) should be causal, as the TL;DR summaries on OpenReview (some from authors when submitting the paper, others derived from the beginning of peer reviews) were likely composed after the original papers or reviews were written. Alternatively, one may allow mixed corpora, but label the causal direction for each (x, y) pair, e.g., which is the original vs. translated text in a translation pair. Since more data often leads to better model performance, it is common to mix data from both causal directions, e.g., training on both En→Es and Es→En data. Annotating the causal direction for each pair allows future users of the dataset to potentially handle the causal and anticausal parts of the data differently.
Causality-Aware Modeling When building NLP models, the causal direction provides additional information that can potentially be built into the model. In the MT case, since causal and anticausal learning can lead to different performance (Ni et al., 2021), one way to take advantage of the known causal direction is to add a prefix such as "[Modeling-Effect-to-Cause]" to the original input, so that the model can learn from causally-annotated input-output pairs. For example, Riley et al. (2020) use labels of the causal direction to elicit different behavior at inference time. Another option is to carefully design a combination of different modeling techniques, such as limiting self-training (a method for SSL) to the anticausal direction only while allowing back-translation in both directions, as preliminarily explored by Shen et al. (2021).
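A minimal sketch of such prefix tagging is shown below. The "[Modeling-Effect-to-Cause]" tag appears in the text above; the cause-to-effect counterpart tag is our assumption.

```python
def tag_source(text, direction):
    """Prepend a causal-direction prefix to a model input.
    "[Modeling-Effect-to-Cause]" is the tag mentioned in the text;
    the cause-to-effect tag is an illustrative counterpart (ours)."""
    prefix = ("[Modeling-Effect-to-Cause] " if direction == "anticausal"
              else "[Modeling-Cause-to-Effect] ")
    return prefix + text
```

At training time, each pair is tagged according to its annotated direction; at inference time, the tag lets the same model elicit direction-specific behavior.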
Causal Discovery Suppose that we are given measurements of two types of NLP data X and Y (e.g., text, parse tree, intent type) whose collection process is unknown, i.e., which is the cause and which the effect. One key finding of our study is that there is typically a causal footprint of the data collection process which manifests itself, e.g., when computing the description length in different directions ( § 4) or when performing SSL ( § 5) or DA ( § 6). Based on which direction has the shorter MDL, or allows better SSL or DA, we can thus infer one causal direction over the other.
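A minimal MDL-based decision rule for this kind of causal discovery might look as follows (a sketch of ours, following inequality (2)):

```python
def infer_direction(mdl_x, mdl_y_given_x, mdl_y, mdl_x_given_y):
    """Toy decision rule (ours): prefer the factorization with the
    shorter total codelength, following the ICM inequality (2)."""
    if mdl_x + mdl_y_given_x < mdl_y + mdl_x_given_y:
        return "X -> Y"   # X is inferred to be the cause
    return "Y -> X"
```

The same comparison could instead be driven by SSL or DA improvements, since those likewise differ systematically between the two directions.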
Prediction of SSL and DA Effectiveness Being able to predict the effectiveness of SSL or DA for a given NLP task can be very useful, e.g., to set the weights in an ensemble of different models (Søgaard, 2013). While predicting SSL performance has previously been studied from a non-causal perspective (Nigam and Ghani, 2000;Asch and Daelemans, 2016), our findings suggest that a simple qualitative description of the data collection process in terms of its causal direction (as summarised for the most common NLP tasks in Tab. 1) can also be surprisingly effective to evaluate whether SSL or DA should be expected to work well.

Limitations and Future Work
We note that ICM, when taken strictly, is an idealized assumption that may not hold exactly for a given real-world data set, e.g., due to confounding, i.e., when both variables are influenced by a third, unobserved variable. In this case, one may observe less of a difference between causal and anticausal learning tasks.
We also note that, while we have made an effort to classify different NLP tasks as typically causal or anticausal, our categorization should not be applied blindly without regard for the specific generative process at hand: deviations are possible, as explained in the Mixed/Other category.
Another limitation is that the SSL and DA settings considered in this paper are only a subset of the various settings that exist in NLP. Our study does not cover, for example, SSL that uses additional output data (e.g., Jean et al. (2015); Gülçehre et al. (2015); Sennrich and Zhang (2019)), or unsupervised DA (as reviewed by Ramponi and Plank (2020)). In addition, in our meta-study of published SSL and DA findings, the observed differences between causal and anticausal learning might be amplified by the differing scale of research efforts across tasks, and might suffer from selection bias.
Finally, we remark that, in the present work, we have focused on bivariate prediction tasks with an input X and output Y. Future work may also apply ICM-based reasoning to more complex NLP settings, for example, by (i) incorporating additional (sequential/temporal) structure of the data (e.g., for MT or language modeling) or (ii) considering settings in which the input X consists of both cause X_CAU and effect X_EFF features of the target Y (von Kügelgen et al., 2019, 2020).

Related Work
NLP and Causality Existing work on NLP and causality mainly focuses on extracting text features for causal inference. Researchers first propose a causal graph based on domain knowledge, and then use text features to represent some elements in the causal graph, e.g., the cause (Egami et al., 2018), effect (Fong and Grimmer, 2016), and confounders (Roberts et al., 2020; Veitch et al., 2020; Keith et al., 2020). Another line of work mines causal relations among events from textual expressions, and uses them to perform relation extraction (Do et al., 2011; Mirza and Tonelli, 2014; Dunietz et al., 2017; Hosseini et al., 2021), question answering (Oh et al., 2016), or commonsense reasoning (Bosselut et al., 2019). For a recent survey, we refer to Feder et al. (2021).
Usage of MDL in NLP Although MDL has been used for causal discovery on low-dimensional data (Budhathoki and Vreeken, 2017; Mian et al., 2021), only very few studies adopt MDL for high-dimensional NLP data. Most existing uses of MDL in NLP are for probing and interpretability: e.g., Voita and Titov (2020) use it for probing of a small Bayesian model and network pruning, based on the method proposed by Blier and Ollivier (2018) to calculate MDL for deep learning. We are not aware of existing work using MDL for causal discovery, or to verify causal concepts such as ICM, in the context of NLP.
Existing Discussions on SSL and DA in NLP SSL and DA have long been used in NLP, as reviewed by Søgaard (2013) and Ramponi and Plank (2020). However, there have been a number of studies that report negative results for SSL (Steedman et al., 2003; Reichart and Rappoport, 2007; Abney, 2007; Spreyer and Kuhn, 2009; Søgaard and Rishøj, 2010) and DA (Plank et al., 2014). Our work constitutes the first explanation of the ineffectiveness of SSL and DA on certain NLP tasks from the perspective of causal and anticausal learning.

Conclusion
This work presents the first effort to use causal concepts such as the ICM principle and the distinction between causal and anticausal learning to shed light on some commonly observed trends in NLP. Specifically, we provide an explanation of observed differences in SSL (Tabs. 4 and 5) and DA (Tab. 6) performance on a number of NLP tasks: DA tends to work better for causal learning tasks, whereas SSL typically only works for anticausal learning tasks, as predicted by the ICM principle. These insights, together with our categorization of common NLP tasks (Tab. 1) into causal and anticausal learning, may prove useful for future NLP efforts. Moreover, we empirically confirm using MDL that the description of data is typically shorter in the causal than in the anticausal direction (Tab. 3), suggesting that a causal footprint can also be observed for text data. This has interesting potential implications for discovering causal relations between different types of NLP data.

Ethical Considerations
Use of Data This paper uses two types of data: a subset of an existing machine translation dataset, and synthetic decipherment data. To the best of our knowledge, there are no sensitive issues, such as privacy concerns, regarding the data usage.
Potential Stakeholders This research focuses on meta-properties of two methodologies commonly applied in NLP, SSL and DA. Although this research is not directly connected to specific applications in society, its findings can benefit future research on SSL and DA.

A Meta Study Settings of SSL and DA
For the meta-study of SSL, we covered (but were not limited to) all relevant papers cited by the review of SSL in NLP by Søgaard (2013). We also went through the leaderboards of many NLP tasks and included the SSL papers listed there. The papers covered by our meta-study are available on our GitHub.
For supervised DA, we searched for papers using the keyword "domain adaptation" together with task names from a wide range of tasks that use supervised DA.
Note that, for a fair comparison, we do not consider SSL papers without a comparable supervised baseline, or DA papers without a comparable unadapted baseline. We also do not consider MT DA that tackles the out-of-vocabulary (OOV) problem, because P(E|C) may be different for OOV words (Habash, 2008; Daumé III and Jagarlamudi, 2011).

B Experimental Details of Minimum Description Length
We calculate MDL(X) and MDL(Y) with a language model, and obtain MDL(X|Y) and MDL(Y|X) using translation models. For the language model, we use the autoregressive GPT-2 (Radford et al., 2019); for the translation model, we use the Marian neural machine translation model (Junczys-Dowmunt et al., 2018) trained on the OPUS corpus (Tiedemann and Nygaard, 2004). Both models are based on the Transformer architecture (Vaswani et al., 2017): the autoregressive language model consists only of decoder layers, whereas the translation model uses six encoder and six decoder layers. Both models have roughly the same number of parameters. We used the Hugging Face implementation (Wolf et al., 2020) of these models for their respective sets of languages.
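For completeness, the basic codelength computation can be sketched as follows. This simplified version only converts per-token log-probabilities (as produced by any of the above models) into bits, and ignores the model-cost term handled by the prequential coding of Blier and Ollivier (2018); the function name is our own:

```python
import math

def codelength_bits(token_logprobs):
    """Shannon codelength in bits of a sequence under a model:
    L = -sum_i log2 p(token_i | context). `token_logprobs` holds the
    per-token natural-log probabilities, e.g., from GPT-2 for MDL(X)
    or from a translation model for MDL(Y|X)."""
    return -sum(token_logprobs) / math.log(2)

# Four tokens, each with probability 1/2, cost about 4 bits in total.
```

Comparing the resulting totals in the two directions, e.g., MDL(X) + MDL(Y|X) versus MDL(Y) + MDL(X|Y), yields the quantities reported in § 4.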