Invariant Language Modeling

Modern pretrained language models are critical components of NLP pipelines. Yet, they suffer from spurious correlations, poor out-of-domain generalization, and biases. Inspired by recent progress in causal machine learning, in particular the invariant risk minimization (IRM) paradigm, we propose invariant language modeling, a framework for learning invariant representations that generalize better across multiple environments. In particular, we adapt a game-theoretic implementation of IRM (IRM-games) to language models, where the invariance emerges from a specific training schedule in which all the environments compete to optimize their own environment-specific loss by updating subsets of the model in a round-robin fashion. We focus on controlled experiments to precisely demonstrate the ability of our method to (i) remove structured noise, (ii) ignore specific spurious correlations without affecting global performance, and (iii) achieve better out-of-domain generalization. These benefits come with a negligible computational overhead compared to standard training, do not require changing the local loss, and can be applied to any language model. We believe this framework is promising for helping mitigate spurious correlations and biases in language models.


Introduction
While modern pretrained transformer models have led to dramatic progress on many NLP tasks, important limitations remain. In particular, pretrained language models suffer from poor generalization, even under small perturbations of the input distribution (Moradi and Samwald, 2021). Indeed, these models encode (Moradi and Samwald, 2021) and exploit (Tu et al., 2020; Niven and Kao, 2019) spurious correlations, i.e., correlations that do not generalize across data distributions. Since language models are trained on large unverified corpora, they also suffer from biases (Nadeem et al., 2021; Bordia and Bowman, 2019). Here the term "biases" refers to correlations that may or may not be spurious according to the available textual data distributions, but are nevertheless undesired. Existing techniques aiming to remove spuriousness or biases involve computationally expensive domain alignment (Akuzawa et al., 2019; Liu et al., 2020; Zhao et al., 2020), domain transfer (Balaji et al., 2018), or the addition of penalty terms to the loss targeted at specific undesired correlations (Qian et al., 2019; Zhao et al., 2018). Alternatively, data preprocessing (Zhao et al., 2017; Zhou et al., 2021) or manipulation such as counterfactual data augmentation (Lu et al., 2018) can yield datasets where undesired correlations are less present. Pretraining with larger and more diverse datasets can also help (Tu et al., 2020; Brown et al., 2020).
However, recent works on the theory of causality (Pearl, 2018; Schölkopf, 2019) argue that the removal of spurious correlations requires altogether different learning and training paradigms going beyond purely statistical learning. Indeed, generalization, spuriousness, and biases are all better understood in the language of causality (Pearl, 2018). Intuitively, causal relationships are the ones expected to be stable (Schölkopf et al., 2021; Peters et al., 2017) and generalizable (Peters et al., 2016). When the causal graph underlying the data generation mechanism is known, there exist causal identification algorithms to distinguish desired from undesired correlations (Shpitser and Pearl, 2008). However, for complex tasks of interest, the underlying causal model is not known. Language modeling is one of these tasks: it is unclear what would even be the relevant random variables constituting the causal model. Therefore, causal identification from the causal graph seems out of reach for language modeling. Similarly, removing undesired correlations one by one is impractical due to the sheer number of possible correlations to consider. In this work, we propose to benefit from recent progress in causal machine learning to offer a new and more flexible lever for dealing with spuriousness and biases. We take inspiration from the invariance principle, which states that only relationships invariant across training environments should be learned (Peters et al., 2016). Under specific assumptions, the invariant representation would then only encode the causal relationships relevant to the task and should thus generalize. Environments correspond to different views of the learning task, i.e., different data distributions. The invariance principle is illustrated in Fig. 1 with a simplified causal model as an example. A practical formulation of the invariance principle was proposed by Arjovsky et al. (2019). They introduced invariant risk minimization (IRM), an alternative to empirical risk minimization (ERM), as a training objective enforcing the learning of invariant representations. Ahuja et al. (2020) later improved the training procedure for solving the IRM objective with a method called IRM-games. Unlike previous methods for removing biases and spurious correlations, IRM-games does not modify the loss with a regularization term and does not compute domain alignment (or matching) statistics. The invariance benefits come from the specific training schedule in which environments compete to optimize their own environment-specific loss by updating subsets of the model in a round-robin fashion.
We argue that the IRM paradigm, and IRM-games specifically, is well-suited to improve NLP systems. Textual data naturally comes from different environments, e.g., encyclopedic texts, social media posts, news articles, etc. Moreover, not knowing the causal mechanisms behind language generation within these environments is not a blocker, as the relevant variables can now remain latent. By adapting IRM-games to language modeling, we introduce invariant language modeling (iLM), where the training of existing pretrained models is continued so as to enforce invariant representations, using a simple and efficient modification of the training process. We then investigate the ability of iLM to deal with undesired correlations in a series of controlled experiments, answering our core research question: Does the invariance principle give rise to a practical strategy for dealing with spurious correlations in language models?

Contributions. (i) We introduce a new training paradigm (iLM) for language models based on the invariance principle (Sec. 3). Thanks to the use of the IRM-games training schedule (see Sec. 2), our iLM framework results in negligible computational overhead compared to standard ERM training, does not require changing the local loss, and is agnostic to the language model architecture. (ii) In a series of controlled experiments (Sec. 4), we demonstrate the ability of iLM to remove structured noise (Sec. 4.1), ignore specific spurious correlations without affecting global performance (Sec. 4.2), and achieve better out-of-domain generalization (Sec. 4.3). (iii) We discuss our contributions in relation to previous work (Sec. 5). (iv) Finally, we release Huggingface-compatible code for training iLM using existing language model checkpoints (Wolf et al., 2020): https://github.com/epfl-dlab/invariant-language-models

Background

Invariance Across Environments (IaE)
Recent works on the theory of causality (Pearl, 2018; Schölkopf, 2019) have argued that out-of-distribution generalization and removal of spurious correlations require going beyond purely statistical learning. This is motivated by the intuition that causal relationships are the ones that are expected to be robust and generalizable (Peters et al., 2016). In causal machine learning, these ideas crystallized in the invariance principle, which states that only relationships invariant across training environments should be learned (Peters et al., 2016; Muandet et al., 2013). In this paradigm, different environments correspond to data collected in different setups, i.e., different data distributions (Pearl, 2018). For NLP, spurious correlations and lack of out-of-distribution generalization are particularly well-documented and important problems (Moradi and Samwald, 2021; Tu et al., 2020; Niven and Kao, 2019). Fortunately, separations between environments naturally emerge in textual data: encyclopedic, news, social media, movie subtitles, etc. This separation makes invariance-based approaches particularly well-suited for NLP.

Invariant Risk Minimization (IRM)
While the invariance principle is a general and powerful idea, works based on this principle often require knowing which random variables are part of the causal model (Akuzawa et al., 2019; Peters et al., 2016). Arjovsky et al. (2019) introduced invariant risk minimization (IRM), an alternative to empirical risk minimization (ERM), as a practical training objective enforcing invariance in the learned latent representation. IRM also builds on the idea that the training data comes from different environments e ∈ E. Each environment e ∈ E induces i.i.d. samples D_e from a distribution P(X_e, Y_e). The goal, then, is to use these multiple datasets to learn a predictor Y ≈ f(X) that performs well across the set of all environments E*, only part of which were seen during training: E ⊂ E*. This is accomplished by decomposing f into a feature representation ϕ and a classifier w, as f = w • ϕ, where • denotes function composition. The feature representation ϕ elicits an invariant representation of the data if the same classifier w is simultaneously optimal for all environments e ∈ E. Intuitively, ϕ learns a representation that is invariant with respect to the environments if its representation is equally useful for all environments.
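The criterion "the same classifier is simultaneously optimal for all environments" can be checked numerically on a toy problem. The sketch below is purely illustrative (the data generator, the linear stand-ins for ϕ, and all names are our own assumptions, not the paper's setup): two environments share a stable signal, while a second feature is tied to the label with an environment-dependent strength. The per-environment optimal heads coincide on top of the invariant representation but disagree on top of the full one.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_env(n, spurious_strength):
    # x[:, 0] carries the stable signal; x[:, 1] is tied to y with an
    # environment-dependent strength (the "spurious" feature).
    x = rng.normal(size=(n, 2))
    y = x[:, 0] + rng.normal(scale=0.05, size=n)
    x[:, 1] = spurious_strength * y + rng.normal(scale=0.05, size=n)
    return x, y

envs = [make_env(2000, 2.0), make_env(2000, -1.0)]

def optimal_head(feats, y):
    # Least-squares classifier w on top of a fixed representation phi(x).
    return np.linalg.lstsq(feats, y, rcond=None)[0]

phi_inv = np.array([[1.0], [0.0]])  # keeps only the stable feature
phi_full = np.eye(2)                # keeps both features

gaps = {}
for name, phi in [("invariant", phi_inv), ("full", phi_full)]:
    heads = [optimal_head(x @ phi, y) for x, y in envs]
    gaps[name] = np.linalg.norm(heads[0] - heads[1])
    print(name, "head disagreement:", round(gaps[name], 3))
# The same head is (near-)optimal in both environments only for phi_inv.
```

The "invariant" representation yields near-identical per-environment heads, while the "full" representation forces each environment toward a very different classifier.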
For NLP, we propose to use the main body of a language model as the invariant feature learner ϕ. When trained on a language modeling task, the language modeling heads play the role of w. Then, Y is the masked word and X the context.

IRM-Games
IRM is a challenging bi-level optimization problem, originally solved (Arjovsky et al., 2019) by setting the invariance criterion as a regularizer. Later, Ahuja et al. (2020) improved the training procedure by using a game-theoretic perspective in which each environment e is tied to its own classifier w_e. A global classifier w is then defined as the ensemble of all environment-specific classifiers: w = (1/|E|) ∑_{e∈E} w_e (where the predictions, not the weights, are averaged). Then, environments take turns making a stochastic gradient update to minimize their own local empirical risk, by updating only the weights of their own classifier w_e, while the shared ϕ is updated periodically. For more details, see the algorithm called V-IRM in the original paper. Ahuja et al. (2020) showed that the equilibrium of this game is a solution to the IRM objective, i.e., the resulting ϕ learns invariant features. For NLP, we argue that IRM-games is a particularly meaningful candidate to adapt to language modeling because it requires few structural modifications.

Why Invariance Is Needed for NLP
Textual data is particularly subject to distribution shifts and out-of-domain distributions, as texts naturally come from different environments. This creates a highly non-i.i.d. setting with problems of generalizability and spurious correlations. The curse becomes a blessing when moving to invariance-based ideas, as having diverse and naturally emerging environments is the necessary starting point of algorithms like IRM-games.
As a simple example, consider gender bias in pretrained language models. When the model is queried with q = "MASK is the best doctor", it feeds q into its main body ϕ, from which a language modeling head w outputs softmax scores w • ϕ(q). Despite the context q containing no gender information, existing models score the pronoun he much higher than she. The problem comes from the presence of spurious correlations, where the context, here the word doctor, is correlated with he. In an invariance-based approach, the training data comes from different environments. Suppose there is an environment e in which the data is not gender-biased, i.e., there is no correlation between the latent representation ϕ(q) and he. The correlation between doctor and he is thus not stable across environments (not invariant) and will not be learned. Now, consider the slightly different query q′ = "MASK is the best doctor, she is great!". Here, the context ϕ(q′) contains gender information. In all environments, the pronoun she should be preferred. This association arises not from a spurious correlation in the data but from a commonsense, almost grammatical, constraint. Therefore, this correlation is invariant and will be learned by invariance-based approaches.
This exemplifies the potential benefits of invariance-based approaches and illustrates the importance of choosing environment splits appropriately.One should not expect any arbitrary split of environments to magically yield generalization benefits.However, the choice of environments within the invariance-based learning framework provides a flexible new lever to inject (i) inductive biases, (ii) knowledge about the data generation mechanism, and (iii) desirable stable properties (like removing gender bias).

Model
We introduce a way to train language models inspired by the IRM-games setup. This involves distinguishing the shared invariant feature extractor ϕ from the environment-specific heads w_e. With modern language model architectures, a natural choice emerges: ϕ is the main body of the encoder, and w_e is the language modeling head that outputs the logits after the last layer.
Formally, suppose we have n environments with data {(X_e, Y_e)}_{e=1,...,n}. For a batch (x_i, y_i) ∼ P(X_i, Y_i) from environment i, the model output is formed using an ensemble of n language modeling heads {w_e}_{e=1,...,n} on top of the transformer body ϕ: ŷ = (1/n) ∑_{e=1}^{n} w_e(ϕ(x_i)). Then, a (masked) language modeling loss L is computed on the model output ŷ. Note that it is the predictions of the n heads that are averaged, not the weights or the gradients. No head gets to predict alone; the n heads always predict together as an ensemble. The heads are subject to competitive gradient updates in a round-robin fashion as described below, which in turn creates the conditions that enforce invariance.
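A minimal sketch of this forward pass follows. All shapes and the linear stand-ins for ϕ and the heads are illustrative assumptions (a real iLM uses a transformer body); the point is only that the n heads score the shared features and their predictions, here the softmax outputs, are averaged:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d_model, vocab, n_envs = 8, 20, 3  # toy sizes

# Linear stand-ins for the shared body phi and the per-environment heads w_e.
phi = rng.normal(size=(d_model, d_model))
heads = [rng.normal(size=(d_model, vocab)) for _ in range(n_envs)]

def forward(x):
    feats = x @ phi  # shared features phi(x)
    # Ensemble: average the heads' *predictions*, never their weights.
    return np.mean([softmax(feats @ w) for w in heads], axis=0)

x = rng.normal(size=(4, d_model))  # a batch of 4 masked positions
probs = forward(x)
print(probs.shape)  # (4, 20); each row is a distribution over the vocabulary
```

Averaging probabilities (rather than weights) keeps each head an independent player in the game while the model still emits a single prediction.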
Training. The training of iLM follows the pseudocode described in Alg. 1, where environments take turns sending a batch of data and updating ϕ and their associated head. An illustration is provided in Appendix A. Each head periodically gets an opportunity to pull the global ensemble classifier w and the feature learner ϕ towards fitting the distribution of its associated environment. Intuitively, since each head gets the same number of updates, the game converges to a global classifier that is simultaneously optimal for each environment, as demonstrated by Ahuja et al. (2020). While the V-IRM algorithm of Ahuja et al. (2020) only updates ϕ periodically, we found it more stable to update it together with every head update.
Algorithm 1: iLM training
1: for t = 1, ..., T do
2:   for environment e ∈ E do
3:     sample a batch (x_e, y_e) from D_e
4:     CompetitiveUpdate(x_e, y_e, ϕ, {w_e}_{e∈E})
5:   end for
6: end for
7: function CompetitiveUpdate(x_i, y_i, ϕ, {w_e})
8:   ŷ ← (1/|E|) ∑_{e∈E} w_e(ϕ(x_i))
9:   GradientUpdate(L(ŷ, y_i), ϕ, w_i)
10: end function

An advantage of this implementation is that invariance is obtained with few modifications to language models. Such simplicity arises from our leveraging of IRM-games, where invariance comes from the training schedule and the ensembling of classifiers. Furthermore, we implement two baselines that appear similar but do not enjoy the same theoretical properties: mtLM and ensLM. The multi-task baseline (Liu et al., 2019a), mtLM, also uses data split into environments, with one head per environment and each environment being seen as a different task. The ensemble baseline (Lan et al., 2018), ensLM, has a similar architecture to iLM, ensembling n heads for predictions but always updating every head with every batch. The ensemble baseline has the same forward pass as iLM but does not perform the competitive gradient update. These baselines serve as ablations of iLM to demonstrate the importance of splitting the data into environments, ensembling the heads, and using the competitive gradient update.
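The round-robin schedule of Alg. 1 can be run end-to-end on a toy problem. The sketch below is a linear-regression stand-in we made up for illustration (the real iLM uses a transformer body and the MLM loss): each environment updates only its own head plus the shared ϕ, while predictions always come from the ensemble.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two toy environments sharing the same stable mechanism y = 2 * x[:, 0],
# observed with different noise levels.
envs = []
for noise in (0.1, 0.5):
    x = rng.normal(size=(200, 3))
    y = 2.0 * x[:, 0] + rng.normal(scale=noise, size=200)
    envs.append((x, y))

n_envs = len(envs)
phi = 0.1 * rng.normal(size=(3, 3))              # shared body
heads = [0.1 * rng.normal(size=3) for _ in envs]  # one head per environment

def mean_loss():
    w_bar = np.mean(heads, axis=0)  # ensemble head (averaged predictions)
    return np.mean([np.mean(((x @ phi) @ w_bar - y) ** 2) for x, y in envs])

initial, lr = mean_loss(), 0.01
for _ in range(500):
    for e, (x, y) in enumerate(envs):  # round-robin over environments
        feats = x @ phi
        w_bar = np.mean(heads, axis=0)
        err = feats @ w_bar - y        # ensemble prediction error
        # Competitive update: environment e updates only its own head w_e ...
        heads[e] -= lr * 2 * (feats.T @ err) / (len(y) * n_envs)
        # ... and the shared phi (updated with every head update, as in iLM).
        phi -= lr * 2 * (x.T @ np.outer(err, w_bar)) / len(y)

final = mean_loss()
print(round(initial, 2), "->", round(final, 2))  # ensemble loss drops
```

Each head only ever sees gradients through the ensemble prediction, which is what makes the updates "competitive" rather than independent per-environment fits.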

Experiments
Invariance training comes with the promise of robustness and generalization (Peters et al., 2016; Muandet et al., 2013; Ahuja et al., 2020). In the following series of experiments, we test whether our proposed architecture for language modeling can provide such benefits. Since our approach is agnostic to the language model, we focus on two small LMs used heavily in practice: distilBERT (Sanh et al., 2019) and RoBERTa (Liu et al., 2019b). In this work, we do not aim to engineer the best possible LM but rather to precisely test iLM in controlled setups by crafting environments whose differences are known and for which we know the expected behavior. We describe three experiments: robustness to noise, bias removal, and out-of-domain generalization.
Throughout the experiments, we report estimated uncertainties with 95% confidence intervals.We repeat experiments for varying hyperparameters and different random seeds (see Appendix B).

Robustness to Noise
In this experiment, we test robustness in a controlled setup. We craft two environments: Env-A, made of clean Wikipedia articles, and Env-B, made of full HTML pages of Wikipedia articles. We use 120K articles split equally into the two environments (see Appendix B.1 for data details). Then, starting from existing checkpoints, we continue training with the masked language modeling (MLM) loss for each of iLM, eLM (the standard ERM-trained baseline), mtLM, and ensLM on these two environments, and evaluate the MLM perplexity on a held-out dataset of clean Wikipedia articles (25K held-out sentences). Intuitively, eLM should try to fit the HTML part of the training data and should thus be more surprised by the clean Wikipedia articles at test time. However, iLM should learn to ignore the HTML because it does not generalize from Env-B to Env-A.
Results. The results, averaged over 16 hyper-parameter choices, are reported in Table 1 (see Appendix B.1 for the hyper-parameters considered). For reference, the perplexities on the same test set of off-the-shelf pretrained distilBERT and RoBERTa are, respectively, 14.43 and 6.71. We observe that iLM systematically has a significantly better test perplexity. Also, ensLM and mtLM perform significantly better than eLM but significantly worse than iLM. This indicates that splitting data into n environments and ensembling n heads gives some robustness benefits. The full benefit comes when further combined with the training schedule of iLM. We come back to this discussion in Sec. 4.4.
To compare architectures over the test set with different hyper-parameters, base transformers, and random seeds, we also performed paired aggregation comparisons based on the Bradley-Terry model, following the recommendations of Peyrard et al. (2021). The Pairformance tool measures the probability that iLM beats eLM when hyper-parameters are matched. We obtain that iLM significantly beats eLM with .98 estimated probability. Similarly, iLM beats ensLM with .89 estimated probability and mtLM with .92 estimated probability. In these experiments, paired comparisons are particularly important because varying the hyper-parameters results in large variations of perplexity, such that blindly averaging can amplify the variance and hide the structure of model performance.

Bias Removal
[Figure 2: the x-axis is the relative size (p, in percentages) between the modified environment and the unmodified one; the y-axis is the average bias for both iLM and eLM. According to Pairformance, P(iLM beats eLM) > 0.95 when the relative size is < 80%, and eLM and iLM become indistinguishable for relative sizes > 80%. Due to space, we report the results obtained by ensLM and mtLM in Appendix B.2, which also shows that they perform in between iLM and eLM.]

In this experiment, we test the capacity to remove one precise and known correlation by crafting two
environments differing only in this specific correlation. We use binary gendered terms and create two environments where the gendered terms are used differently. We follow the standard setup of Counterfactual Data Augmentation (CDA) (Lu et al., 2018): we take a textual data source with known gender bias, in this case Wikitext-2 (Merity et al., 2016). A fraction p of the data goes into Env-A; the rest (1 − p) goes into Env-B. Env-A remains untouched and preserves all the properties of the original data source, whereas Env-B is intervened upon by inverting all gendered terms based on a dictionary provided by previous work (Bordia and Bowman, 2019). When p = 1 − p = 0.5 and the language model is finetuned with eLM, this setup matches the CDA method (Lu et al., 2018) used to mitigate gender bias in NLP. Intuitively, iLM should learn to ignore gender-based correlations regardless of the fraction p. However, eLM is only expected to ignore them when p = 1 − p = 0.5, i.e., when the two environments precisely balance each other.
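The intervention applied to Env-B can be sketched as a dictionary-based token swap. The dictionary below is a tiny hypothetical stand-in for the one of Bordia and Bowman (2019):

```python
# Tiny hypothetical stand-in for the gendered-term dictionary.
SWAP = {"he": "she", "she": "he", "his": "her", "her": "his",
        "man": "woman", "woman": "man"}

def invert_gendered_terms(sentence):
    # Token-level swap; a real pipeline would also handle casing and
    # morphological ambiguity (e.g., "her" as possessive vs. object pronoun).
    return " ".join(SWAP.get(tok, tok) for tok in sentence.split())

flipped = invert_gendered_terms("she is the best doctor and he likes her work")
print(flipped)  # -> "he is the best doctor and she likes his work"
```

Because the swap dictionary is symmetric, applying the intervention twice recovers the original sentence, which is what makes Env-B an exact gender-flipped counterpart of its source text.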
Experimental setup. To measure whether the correlation has been successfully removed, we (i) take all gendered terms in the test set, (ii) replace them with the MASK token, (iii) use the trained models to predict the missing term, and (iv) look in the softmax for the scores received by the terms of the target gendered pair. We denote by s_f and s_m the scores assigned to the female and male terms in the softmax. Finally, (v) we compute an entropy-based bias measure: B_H = 1 − H_2(s_f / (s_f + s_m)), where H_2 is the binary entropy (note that H_2(1/2) = 1). B_H measures the extent to which a softmax has a preference for the male or female term in a gendered pair of terms. For example, in the sentence "MASK is the best doctor" we look at the softmax scores of the gendered pair [he, she]. If a model has learned to ignore gender-based correlations, the entropy should be high (entropy bias low), not favoring one gendered term over the other. We remove sentences with several gendered terms from the test set to avoid penalizing models for preferring a gender when the context contains gender information.
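The bias measure can be computed directly from the two softmax scores. A minimal sketch (function names are ours; we assume the measure is B_H = 1 − H_2(s_f / (s_f + s_m)), which is consistent with H_2(1/2) = 1 and with low bias meaning high entropy):

```python
import math

def binary_entropy(p):
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def entropy_bias(s_f, s_m):
    # B_H = 1 - H_2(s_f / (s_f + s_m)): 0 means no preference between the
    # two terms of the pair, 1 means all renormalized mass on one term.
    return 1.0 - binary_entropy(s_f / (s_f + s_m))

print(entropy_bias(0.3, 0.3))              # equal scores -> 0.0 (unbiased)
print(round(entropy_bias(0.09, 0.01), 3))  # strong preference -> 0.531
```

Note that only the ratio of the two scores matters, so the measure is insensitive to how much total softmax mass the gendered pair receives.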
We ran the experiments for varying values of p, averaging across different hyper-parameters, and report the results in Fig. 2 for iLM and eLM. The results for ensLM and mtLM are reported in Appendix B.2, which also lists the hyper-parameters considered. For reference, the entropy biases of distilBERT and RoBERTa before training are, respectively, 0.39 and 0.46.
Analysis. Compared to off-the-shelf models, both eLM and iLM largely decrease the average entropy bias in the balanced setup, but only iLM succeeds in the unbalanced setup. In the balanced setup (relative sizes close to 100%), eLM and iLM perform within each other's confidence intervals. However, in the unbalanced setup, iLM largely outperforms eLM. We note that, according to Pairformance, the probability that iLM beats eLM for any given hyper-parameter configuration is > 0.9 for both distilBERT and RoBERTa when the relative size is below 80%. As desired, iLM is not affected by the relative size of the environments. These results confirm the hypothesis that bias removal needs a precisely balanced dataset for eLM (Lu et al., 2018), while it does not matter for iLM. Furthermore, this entropy-bias reduction does not happen at the cost of worse general perplexities (see Appendix B.2). These findings are significant for the field of bias removal, as iLM offers a practical and efficient way of removing biases. It is no longer necessary to carefully counter-balance the bias in the augmented data. In Fig. 2, we see that already at a relative size of 10%, iLM performs as well as existing approaches (100% relative size + eLM).

[Table 2: the first column is for language modeling evaluation in-domain (perplexity, lower is better), the second column for language modeling evaluation out-of-domain (perplexity, lower is better), and the last column for GLUE tasks averaged (higher is better). We mark with * the cases where iLM is statistically significantly better than the other architectures (paired t-test).]

Out-of-Domain Generalization
In this experiment, we venture beyond controlled environments and test out-of-domain generalization with naturally occurring environments. We use the Pile dataset (Gao et al., 2020), which contains 20 very diverse textual domains: OpenSubtitles, ArXiv papers, News, GitHub comments, etc.
Experimental setup. We randomly sample 11 domains from the Pile for training; the remaining 9 domains are used for testing language models out-of-domain. Once the models are trained, using domains as environments, we evaluate their perplexity in-domain (InD) using held-out data from the training environments and out-of-domain (OoD) using data from unseen environments. See Appendix B.3 for details regarding training domains and hyper-parameters. Furthermore, the trained models are evaluated on the GLUE benchmark. Indeed, models trained with iLM can be used downstream exactly as if they were trained with eLM. We report aggregated results in Table 2. The results show significant improvement of iLM over the other architectures across the board. In particular, iLM is beneficial for both in-domain (InD) and out-of-domain (OoD) evaluation.

Ablation
The eLM, mtLM, and ensLM architectures serve as ablated versions of iLM, testing its three main components: splitting the data into environments, ensembling the heads, and using the competitive gradient update. The results are reported in Table 3 and confirm the intuition built up in previous experiments that simply having n environments with n heads is not beneficial on its own, as mtLM does not provide benefits over eLM. However, when combined with head ensembling (ensLM), significant improvements can be observed over both eLM and mtLM. Further significant benefits arise from the competitive gradient update specific to iLM. While both mtLM and ensLM have a slightly better capacity to overfit with their n heads, they do not benefit from the invariance regularization provided by competitive gradient updates. Notice that iLM is significantly better than any other architecture, as shown by the last row of Table 3 (or equivalently, the last column).

Discussion
In this section, we discuss our contributions in the context of previous work.

Related Work
Domain generalization. The performance of deep learning models substantially degrades on out-of-domain (OoD) datasets, even in the face of small variations of the data-generating process (Hendrycks and Dietterich, 2019). Blanchard et al. (2011) proposed domain generalization (DG) as a formalism for studying this problem. In DG, the goal is to learn a model using data from a single or multiple related but distinct training domains, in such a way that the model generalizes well to any OoD testing domain unknown during training. Recently, the problem of DG has attracted a lot of attention and has been approached from different facets. Most of the existing methods fall under the paradigm of domain alignment (Muandet et al., 2013; Li et al., 2018b; Akuzawa et al., 2019; Liu et al., 2020; Zhao et al., 2020). Motivated by the idea that features that are stable across the training domains should also be robust to the unseen testing domains, these methods try to learn domain-invariant representations. A group of other methods is based on meta-learning (Dou et al., 2019; Balaji et al., 2018; Li et al., 2018a). The motivation behind this approach is that exposing the model to domain shifts during training will allow it to generalize better during testing. Regularization through data augmentation is commonly used in the training of machine learning models to alleviate overfitting and thereby improve generalization (Zhou et al., 2020, 2021). Based on this idea, Zhou et al. (2020, 2021) apply transformations on the original data to simulate a domain shift in training.
Domain generalization applied to language models. In NLP, the default pipeline involves pretraining a task-agnostic language model, which is then finetuned on downstream tasks. This pretraining/finetuning division of learning is already known to improve robustness on downstream tasks (Hendrycks and Dietterich, 2019). However, the language models themselves suffer from spurious correlations and poor generalization, even under small perturbations of the inputs (Moradi and Samwald, 2021). To alleviate such problems, Oren et al. (2019) adapted distributionally robust optimization (Ben-Tal et al., 2013) to language models. This resulted in a new loss minimizing the worst-case performance over subsamples of the training set; they focused on domains with topic shifts. Later, Vernikos et al. (2020) used domain-adversarial regularization to improve testing performance on unseen domains.
Compared to these previous works, iLM enjoys theoretical justification rooted in the causal framework of invariance (Peters et al., 2016). Our implementation is simple, comes at negligible computational cost, and can be applied directly to any transformer LM.

Environment Design
One question that might arise from the iLM training schedule is what happens when environments have no lexical overlap: would any correlation remain for iLM to model? We emphasize that iLM learns a latent representation ϕ, and stable correlations are those connecting this latent representation to observables, not surface correlations between observables. To demonstrate that iLM operates on latent variables and not just on surface-level correlations, we performed a simple experiment with languages as environments. We trained iLM with a pretrained multilingual model (XLM-RoBERTa) using English Wikipedia articles and Farsi Wikipedia articles as two environments. Despite almost no surface-level overlap, iLM is still able to improve perplexity in each language individually and does not destroy previously learned correlations. This experiment is detailed in Appendix B.4. A second question: if the number of environments grows arbitrarily large, would iLM still find any stable correlations in the data? We emphasize that the choice of environments is not intended to be arbitrary; simply contriving as many environments as possible cannot be expected to be useful. Rather, the choice of environments has to reflect assumptions about the underlying data generation mechanism; iLM then leverages the assumptions encoded in the choice of environments.
Now that this work has shown that iLM can effectively remove unstable correlations, the next question becomes that of environment design: how should environment splits be chosen to be useful in practice? Useful environment splits will likely differ across tasks and purposes. This work already demonstrated that the new paradigm of (i) environment design followed by (ii) iLM training is practical for language-related problems. Choosing environment splits is a flexible way to inject priors and inductive biases, compared to manually deciding which correlations are desired (as in bias removal) or fully learning the causal graph (as in causal reasoning). iLM provides a computationally efficient framework to inject such priors and moves the discussion from model inductive biases to data inductive biases. It already offers robustness to noise, a ready-to-use bias-removal strategy for any existing language model requiring few data points, and improved OoD generalization.

Limitations
In this work, we focus on crafting controlled experiments with easily manageable dataset and language model sizes to carefully test the invariance benefits of iLM. However, recently deployed large-scale language models may exhibit qualitatively different behavior due to emergent properties.
Our implementation could largely be applied to various downstream tasks other than language modeling measured by perplexity. Here, we focus on the language modeling task and the perplexity measure because they allow clear and precise experiments measuring the ability of iLM to deal with spurious correlations. The strong positive results observed in this work motivate future work testing iLM in other setups closer to direct practical use cases.
It is expected that different choices of environment splits will be useful for different downstream tasks. While this work demonstrates that iLM is useful for removing spurious correlations, it does not say how to choose environments for specific tasks. For instance, we observed smaller improvements when using the Pile datasets and evaluating on the downstream GLUE tasks, indicating that the Pile environment splits are not optimal for these downstream tasks. We believe that environment design is an important avenue for future research.

A Illustration of iLM Architecture
In the main paper, we formally described the pseudo-code for training iLM models. The model architecture and the logic of the training schedule are illustrated in Fig. 3 for the special case of two environments (n = 2).
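To make the training schedule concrete, the following is a minimal numerical sketch of the round-robin schedule and the ensemble of heads, not the full IRM-games algorithm: a squared-error loss stands in for the language-modeling loss, the "body" ϕ and the heads are plain linear maps, and all shapes and names are illustrative assumptions.

```python
import numpy as np

class ILMSketch:
    """Toy iLM schedule: shared linear body `phi`, one linear head per
    environment, heads predicting as an ensemble (averaged logits)."""

    def __init__(self, n_envs, d=8, v=16, seed=0):
        rng = np.random.default_rng(seed)
        self.phi = rng.normal(scale=0.1, size=(d, d))   # shared body
        self.heads = [rng.normal(scale=0.1, size=(v, d)) for _ in range(n_envs)]
        self.t = 0                                      # global step counter

    def forward(self, x):
        h = self.phi @ x
        # heads predict as an ensemble: average their logits
        return np.mean([w @ h for w in self.heads], axis=0)

    def train_step(self, batch_for_env, lr=0.05):
        e = self.t % len(self.heads)      # round-robin over environments
        x, y = batch_for_env(e)           # batch from environment e
        h = self.phi @ x
        g = self.forward(x) - y           # d(squared loss)/d(logits)
        w_bar = np.mean(self.heads, axis=0)
        # only the shared body and the head tied to environment e are updated
        self.heads[e] = self.heads[e] - lr * np.outer(g, h) / len(self.heads)
        self.phi = self.phi - lr * np.outer(w_bar.T @ g, x)
        self.t += 1
        return e
```

A batch from environment e thus leaves every other head untouched, which is the mechanism through which the environments compete in IRM-games.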

A.1 mtLM and ensLM baselines
We implemented two similar architectures that do not enjoy the same theoretical justifications.
In the mtLM baseline, the data is also split into n environments with one head per environment. As in iLM, environments take turns sending a batch of data and performing a batch update on the body of the transformer ϕ and the head associated with that environment. This amounts to viewing different environments as different tasks with uniform weights, even though they all use the same language-modeling loss.
In the ensLM baseline, the data is also split into n environments with one head per environment, and, as in iLM, the heads always predict as an ensemble. Here too, the environments take turns sending a batch of data, and the forward pass is exactly the same as that of iLM. In the backward pass, however, every head and the transformer body ϕ are updated for every batch of every environment.
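The three variants differ only in which parameters receive gradients on a batch from a given environment (iLM and mtLM further differ in the forward pass: iLM ensembles the heads, mtLM predicts with the environment's own head). A small sketch, with hypothetical names, summarizing the update rules:

```python
def params_updated(variant, env, n_envs):
    """Which parameters receive gradients on a batch from environment
    `env`, for the three architectures compared here (illustrative
    sketch; "phi" is the shared transformer body, "w_e" the heads)."""
    heads = [f"w_{e}" for e in range(n_envs)]
    if variant in ("iLM", "mtLM"):
        # body plus only the head tied to the current environment
        return ["phi", heads[env]]
    if variant == "ensLM":
        # body plus every head, on every batch
        return ["phi"] + heads
    raise ValueError(f"unknown variant: {variant}")
```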

B Details about Experiments
Hyper-parameters. We ran the experiments reported in the main paper while varying several hyper-parameters: base transformers (ϕ): [distilBERT, RoBERTa]; learning rates: [1e-5, 5e-5]; number of training steps: [10, 100, 200, 500, 2500, 5000]. These results were obtained with environments containing the same number of articles. However, the HTML articles have more lines and thus more sentences. Therefore, we also report in Fig. 4 the same analysis repeated when the number of lines in Env-A and Env-B is the same, meaning that Env-B contains fewer articles. The conclusion remains largely unchanged in this scenario. As seen in Fig. 4 (c), iLM still beats eLM with probability close to 1 for matched hyper-parameters, highly significantly above 0.5.
Results for mtLM and ensLM. In Fig. 4, we report the average bias as a function of the relative sizes of environments for mtLM and ensLM, alongside those of iLM and eLM. We again observe that iLM outperforms the other architectures. Interestingly, ensLM seems to bring benefits compared to eLM and mtLM.
Details about the results. Here, we report analyses complementary to the results described in the paper. We report the performance of eLM and iLM as a function of the number of training steps, and the probability that iLM is better than eLM when matched on hyper-parameter configuration, as computed by the Bradley-Terry model. This is reported in Fig. 5 for two relative sizes: 25% (the modified environment has 4 times fewer examples) and 100%.
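As a side note on the statistic involved: with only two systems, the Bradley-Terry maximum-likelihood estimate of P(iLM beats eLM) reduces to iLM's win fraction over matched comparisons. A minimal sketch of this two-system special case (the paper's analysis uses the Pairformance tool; this reduction is an assumption of ours for illustration):

```python
def bradley_terry_two(wins_a, wins_b):
    """Two-system Bradley-Terry: the MLE of P(A beats B) is simply A's
    fraction of wins over the matched comparisons (no ties assumed)."""
    return wins_a / (wins_a + wins_b)

# e.g., if iLM wins 19 of 20 matched hyper-parameter configurations:
p = bradley_terry_two(19, 1)   # 0.95
```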
Perplexities after training. To ensure that the gender-based correlations were not removed at the cost of worse perplexity, we report in Table 5 the perplexities of iLM models in comparison to eLM ones on the test set of Wikitext-2. For reference, before our training, distilBERT and RoBERTa had perplexities of 14.25 and 6.92, respectively, on this same test set.
In Table 5, the 95% confidence intervals all give uncertainties ≈ 0.15, meaning that for a fixed base model (distilBERT or RoBERTa) all perplexities are within each other's error bounds.There is no significant perplexity difference between eLM and iLM or between the unbalanced and balanced setups.
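For completeness, perplexity throughout is the standard exponentiated average negative log-likelihood over tokens; a minimal sketch (the toy input is hypothetical, not the paper's evaluation pipeline):

```python
import math

def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities:
    ppl = exp(-mean log p(token))."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# a model assigning every token probability 1/2 has perplexity exactly 2
ppl = perplexity([math.log(0.5)] * 10)
```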

B.3 Out-of-domain Generalization
Data. The data used for this experiment comes from subsamples of the Pile (Gao et al., 2020).
We randomly selected train and test domains as follows:
Results. We found that before finetuning, XLM-RoBERTa had a perplexity of 14.56 on the held-out test set, and iLM improved this perplexity to 6.44. This indicates that iLM with environments having no lexical overlap does not destroy previously learned correlations. It can even improve the perplexities for each language. A possible reason why iLM improves so dramatically over the pretrained model may be that ϕ learns to recognize the languages, separate them, and treat them separately. Similar effects have been observed in previous work (Guo et al., 2021) when the correlation between the environment index and the target variable is very strong (which is the case here).

B.5 Head dynamics
The main components of our framework are the heads and their training dynamics. Therefore, we investigate several aspects of the heads' behavior.
Description. During training, the loss of each head is still entangled with the predictions of every other head. We therefore wonder whether each head still captures information related to the environment it is tied to during training. In particular, we ask: (i) do the parameters of the heads for different environments drift apart during training? (Indeed, all heads are initialized to the same pretrained weights.) (ii) Are the parameters of the heads predictive of which environments are more similar?
Experimental setup. To answer these two questions in one go, we take two environments A and B and split each of them into two new environments, resulting in A1, A2, B1, and B2, such that A1 and A2 are very similar, B1 and B2 are very similar, but the Ai and Bi are different. We then train iLM with the four environments and, thus, with four heads w_A1, w_A2, w_B1, and w_B2. We measure whether the heads' weights can predict the similarities between A's and B's environments.
We compute

D_in = (1/2) [d(w_A1, w_A2) + d(w_B1, w_B2)],    D_out = (1/4) Σ_{i,j ∈ {1,2}} d(w_Ai, w_Bj),

where d is the L2 distance between the linearized weights of two heads. Then, D_in is the average distance between heads tied to the same domain, and D_out is the average distance between heads tied to different domains. Remember that in this case, there are 2 domains, A and B, and 4 environments, the Ai and Bi.
In this experiment, we randomly select the base environments A and B from the domains of the Pile (A is Enron Emails, and B is PubMed Abstracts). We create the Ai and Bi by randomly subsampling two environments of the same size from each domain. We train iLM with RoBERTa for 5000 training steps, taking checkpoints of the heads every 500 steps. We perform 10 random restarts with different seeds to obtain uncertainty estimates. In Fig. 6, we report D_in and D_out as functions of the number of training steps.

Analysis. We first notice that the heads indeed drift apart from each other as training advances. More interestingly, the distance between heads from the same domain is significantly smaller than the distance between heads from different domains. We conclude that heads retain environment-specific information in their parameters and are predictive of environment similarities.

Next, we visualize the geometry of head similarity by training iLM with RoBERTa for 5000 steps with 9 environments from the Pile. After training, we take the heads' parameters, compute the pairwise distances between all 9 heads, and embed them in 2D with Multi-Dimensional Scaling to visualize the similarity structure. The result is depicted in Fig. 7.
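The D_in / D_out computation can be sketched as follows; head names and toy weight vectors are illustrative assumptions, and d is the L2 distance between flattened head weights as described above.

```python
import numpy as np
from itertools import combinations

def head_distances(heads, domains):
    """heads: dict name -> weight array; domains: dict name -> domain label.
    Returns (D_in, D_out): average L2 distance between linearized weights
    of same-domain head pairs, and of cross-domain head pairs."""
    def d(u, v):
        return float(np.linalg.norm(np.ravel(u) - np.ravel(v)))
    pairs = list(combinations(heads, 2))
    d_in = float(np.mean([d(heads[a], heads[b]) for a, b in pairs
                          if domains[a] == domains[b]]))
    d_out = float(np.mean([d(heads[a], heads[b]) for a, b in pairs
                           if domains[a] != domains[b]]))
    return d_in, d_out
```

With four heads A1, A2, B1, B2, the same-domain average runs over the two pairs (A1, A2) and (B1, B2), and the cross-domain average over the remaining four pairs, matching the definitions of D_in and D_out.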

Figure 1 :
Figure 1: High-level overview using a simplified causal structure. The distinction between environments makes it possible to separate spurious from stable features. Indeed, the relationship between the target variable Y and the stable features X_C is invariant across environments: E[Y | X_C, E] = E[Y | X_C]. However, the correlation between Y and the spurious features X_S does not generalize across environments: E[Y | X_S, E = e] ≠ E[Y | X_S, E = e′] for e ≠ e′. Language models trained with standard empirical risk minimization (ERM), denoted as eLM in this work, exploit all correlations available during training and aim to learn E[Y | X_C, X_S]. Our proposed invariant language models, denoted as iLM, focus on invariant features and aim to learn E[Y | X_C]. In language modeling, Y could represent the missing-word prediction task. In practice, since the causal model is unknown, it is the choice of environments that defines which correlations are spurious. Invariant learning with appropriate choices of environments is the lever we propose for dealing more flexibly with spuriousness and biases.

Figure 2 :
Figure 2: Bias removal. The x-axis represents the relative size (x = (1 − p)/p, in percentages) between the modified environment and the unmodified one, and the y-axis is the average bias for both iLM and eLM. Note that, according to Pairformance, P(iLM beats eLM) > 0.95 when the relative size is < 80%, and that eLM and iLM become indistinguishable for relative sizes > 80%. Due to space constraints, we report the results obtained by ensLM and mtLM in Appendix B.2, which also shows that they perform in between iLM and eLM.

B.1 Robustness to Noise
Data. The data used for this experiment comes from an HTML Wikipedia dump of August 2018. The files were pre-processed to remove the HTML content, resulting in clean text articles. We randomly selected 60K articles with HTML (Env-B) and 60K different articles without HTML (Env-A). The test set contains 25K sentences coming from Wikipedia without HTML.

Figure 3 :
Figure 3: Model description. In the forward pass, input text goes through the main body of the language model, noted ϕ (e.g., a Transformer (Devlin et al., 2019)); then one head per environment predicts logits over the vocabulary. These predictions are averaged over all heads and go through a softmax. During training, the model receives a batch of data from one environment e and performs a gradient update only on the parameters of the main body of the language model (ϕ) and on the parameters of the head tied to this environment, w_e. Batches are taken from each environment in a round-robin fashion.

Figure 4 :
Figure 4: Structured-noise removal experiment with environments having the same number of lines: (a) average perplexity over all hyper-parameters, (b) average perplexity as a function of the number of training steps (for learning rate 1e-5), and (c) the probability that iLM is better than eLM when compared on the same hyper-parameters.

Figure 6 :
Figure 6: Comparing distances between head weights in-domain and out-of-domain as functions of the number of training steps. (95% confidence intervals from random restarts with different seeds.)

Figure 7 :
Figure 7: Head embeddings: 2D projection of the heads' parameter-similarity structure after training iLM with RoBERTa for 5000 steps with 9 domains. Each dot represents one head of the model after training, and the labels indicate which domain it is tied to.

Table 1 :
Robustness to noise. Average perplexity over hyper-parameters (lower is better). The differences between iLM and the others are statistically significant (paired t-test, p < 10^-7).