Debiasing Methods in Natural Language Understanding Make Bias More Accessible

Model robustness to bias is often assessed by generalization to carefully designed out-of-distribution datasets. Recent debiasing methods in natural language understanding (NLU) improve performance on such datasets by pressuring models into making unbiased predictions. An underlying assumption behind such methods is that this also leads to the discovery of more robust features in the model's inner representations. We propose a general probing-based framework that allows for post-hoc interpretation of biases in language models, and use an information-theoretic approach to measure the extractability of certain biases from the model's representations. We experiment with several NLU datasets and known biases, and show that, counter-intuitively, the more a language model is pushed towards a debiased regime, the more bias is actually encoded in its inner representations.


Introduction
State-of-the-art neural language models such as BERT (Devlin et al., 2019) usually work by pretraining an encoder to learn universal word representations, and then fine-tuning it on a classification or regression task. From a robustness point of view, such pretrain-and-fine-tune pipelines are known to be prone to biases present in the data (Gururangan et al., 2018; Poliak et al., 2018; McCoy et al., 2019; Schuster et al., 2019). Various methods have been proposed to mitigate such biases in a form of robust training, where a bias model is trained to capture the bias and then used to relax the predictions of a main model, so that the main model can focus less on biased examples and more on the "hard", more challenging examples (Clark et al., 2019; Mahabadi et al., 2020; Utama et al., 2020b; Sanh et al., 2021, inter alia). The resulting model is then evaluated on out-of-distribution (o.o.d) data, in the form of challenge datasets containing "hard" examples that were deliberately constructed to be anti-biased. Examples of such datasets include HANS (McCoy et al., 2019) for natural language inference (NLI) and FEVER-Symmetric (Schuster et al., 2019) for fact verification. An underlying assumption behind this methodology is that better generalization out of distribution also means that the model learned more robust features. However, while evaluation on challenge datasets relays information about the model's generalization through its predictions, it does not reveal what actually caused the improvement or how the internal representations were affected. To assess whether bias has been removed from the internal representations, we design probing tasks targeting several known biases: lexical overlap bias and negative word bias. (This work was supported by the Viterbi Fellowship in the Center for Computer Engineering at the Technion. Our code and data are available at: https://github.com/technion-cs-nlp/bias-probing.)
While probing is usually concerned with simple linguistic properties such as part-of-speech tags (Belinkov and Glass, 2019), we instead define probing tasks with the purpose of revealing bias in the representations. An example of such a probing task is to predict whether a sentence pair is lexically overlapping, given only access to its joint representation: a classifier that is able to label the pair by this property must consequently use information about the bias encoded in the representation. We construct probing datasets for assessing bias in several natural language understanding (NLU) datasets. Lastly, we use information-theoretic probing (Voita and Titov, 2020) to analyze the extractability of bias from vanilla and debiased models using the probing classifier.
We conduct experiments on two NLI datasets and one fact verification dataset, across a variety of debiasing methods and bias types, and test whether the bias removal is as successful as o.o.d evaluation suggests. Surprisingly, we discover that making models robust from the perspective of the downstream task causes the inner representations to encode more information about the specific bias in question. Figure 1 shows an example of this trend in NLI: as the fine-tuned model's robustness to biased predictions increases, so does the ability of the probing classifier to extract bias.
To summarize, we make the following contributions:
• We present a general probing-based framework to measure extractability of bias from inner model representations.
• We use this framework to construct several new probing tasks based on well-studied dataset biases in NLU tasks.
• We show that pressuring a model into making unbiased predictions actually makes biased features more extractable from the model representations.
2 Related Work

Dataset Biases
Deep neural models are prone to shortcut learning (Geirhos et al., 2020): they discover and use idiosyncratic biases, heuristics, and statistical cues in the data. For example, Poliak et al. (2018) showed that the Stanford natural language inference dataset (SNLI; Bowman et al. 2015) contains "give-away" words, i.e., words w which have a high value of p(l | w) with respect to a given label l. They noticed that 4 out of the 10 words with the highest p(contradiction | w) are universal negation words, suggesting that negation is strongly correlated with contradiction in the data. These clues appear on the hypothesis side, making them a kind of hypothesis-only bias, where a classifier receiving only the hypothesis as input is able to correctly predict the label (Poliak et al., 2018; Gururangan et al., 2018). A similar type of bias, known as claim-only bias, is found in the FEVER fact verification dataset (Thorne et al., 2018), and was also associated with a strong correlation of negation words with the labels in the dataset (Schuster et al., 2019). Another kind of bias is the association of entailment with cases of lexical overlap between the premise and hypothesis. This bias leads to poor performance of models on the HANS challenge dataset (McCoy et al., 2019), in which all samples contain lexical overlap and the non-entailed samples are constructed such that the overlap heuristic leads to a wrong prediction. This suggests that models rely on cues for lexical overlap when predicting the entailment of premise-hypothesis pairs.

Bias Mitigation and Robustness
Recent work on bias mitigation attempts to create more robust models by training a combination model built around the main model. The main model, parameterized by θ_m, is a non-robust language model. The bias model, parameterized by θ_b, is a weak model whose purpose is to capture the biases during training, by minimizing a loss L_b. The objective of the combination model is to minimize a combined loss function L_c(θ_m, θ_b), such that the main model leverages knowledge about bias in the data, obtained using the weak model. This pipeline is general, and it allows models to be trained either end-to-end, or step-by-step, by first training the bias model and then using its predictions to robustly train the main model. Recent papers show that such techniques are effective when evaluated on challenge datasets specifically designed to target known biases and hard examples (He et al., 2019; Clark et al., 2019; Utama et al., 2020a,b; Sanh et al., 2021; Mahabadi et al., 2020). However, this approach does not ensure that the model indeed learns more robust features, nor does it shed light on exactly how the feature detectors react to this change and how the bias is represented in the model.

Probing
Probing has been used, with some success, to analyze sentence embeddings and to show that such models capture surface features such as sentence length, word content, and word order (Adi et al., 2017), as well as various syntactic and semantic features (Conneau et al., 2018); see Belinkov and Glass (2019) for a survey. In contrast, we focus our analysis on biased features, and employ advances in probing methodology to analyze two kinds of bias: lexical overlap and negation bias. Designing probes that accurately interpret the desired behavior is not trivial, and measuring their accuracy is insufficient, since probing classifiers are prone to memorization and bias as well (Hewitt and Liang, 2019), among other shortcomings (Belinkov, 2021). Recently, Voita and Titov (2020) presented an information-theoretic approach for evaluating probing classifiers which accounts for the complexity of the probing classifier by measuring its minimum description length (MDL). MDL measures how efficiently a model can extract information about the labels from the inputs, and we use it as a measure of extractability of certain biases from model representations.

Methods
We lay down a general framework for interpreting bias in inner model representations. Given a model f_θ : X → Y with learnable parameters θ, we assume that it can be decoupled into two stages:
• A representation layer (or multiple layers) with learnable parameters θ_1, which we denote R_θ1 : X → Z, maps samples from the input space to a latent space Z, the "representation".
• A classification layer with learnable parameters θ_2, which we denote F_θ2 : Z → Y, maps the latent representations to the final output.
We can thus re-define our classifier as the composition f_θ(x) = F_θ2(R_θ1(x)). For example, in NLI we assume that data samples are given as sentence pairs x = (p, h), where p is a premise and h is a hypothesis. R(p, h) is the joint representation of the two, and this representation is then used by F to produce a prediction.
In this work, we compare baseline models fine-tuned on some downstream task to models debiased during the fine-tuning step. We produce representations from both types of models and measure the extractability of bias using a probing classifier. Our probing tasks are defined in terms of "bias-revealing" properties, which are based on a priori knowledge of the bias in question and are able to distinguish between biased and unbiased samples from the original dataset. We next describe how to construct such probing tasks and appropriate datasets.

Probing Tasks and Datasets
We define a probing classifier as a classifier g_Ψ : Z → Y_P with learnable parameters Ψ, which maps inputs from a latent representation space Z to a probing property space Y_P, where P : X → Y_P is some real property of the original input, which we call the probing property. Next, we define a probing dataset for each probing task: D_P = {(R(x_i), P(x_i)) | x_i ∈ X}. Lastly, we train the probing classifier on the constructed dataset and evaluate its performance on the probing task. We introduce two new probing tasks that target well-researched types of bias present in several datasets: lexical bias and negative word bias. For presentation purposes, consider the NLI task, where data samples are given as sentence pairs x = (p, h), where p is a premise and h is a hypothesis. The extension to fact verification and other pair-relationship classification tasks is straightforward.
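The construction of D_P can be sketched in a few lines (a minimal illustration; `encode` and `prop` are hypothetical stand-ins for the encoder R and the bias-revealing property P):

```python
def build_probing_dataset(samples, encode, prop):
    """Pair each sample's representation with its bias-revealing property
    label: D_P = {(R(x), P(x)) for x in X}."""
    return [(encode(x), prop(x)) for x in samples]

# Toy usage with dummy stand-ins: the "encoder" hashes the pair, and the
# "property" marks pairs whose two sides are identical strings.
samples = [("a man sleeps", "a man sleeps"), ("a man sleeps", "a dog runs")]
encode = lambda x: hash(x) % 97        # placeholder for R_theta1
prop = lambda x: int(x[0] == x[1])     # placeholder for P
dataset = build_probing_dataset(samples, encode, prop)
```

In practice `encode` would return a model's pooled representation of the pair, and the probe g_Ψ would be trained on `dataset`.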
NegWords To analyze negation bias in NLI and fact verification, we define a list of negative words V and a sentence-pair property P_neg(p, h) = 1 if h contains at least one word from V, and 0 otherwise. That is, an example is positive if its hypothesis (in the case of NLI) or claim (in the case of fact verification) contains at least one negative word from the list. This method poses some limitations: for example, we do not consider double negatives in the hypothesis that affect its meaning, or the presence of negation in both premise and hypothesis. However, our construction is consistent with prior findings on negation bias (Gururangan et al., 2018; Poliak et al., 2018; Schuster et al., 2019).
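The property amounts to a token-level membership test (a minimal sketch; the word list here is illustrative, not the paper's exact V, and whitespace tokenization simplifies the real preprocessing):

```python
# Illustrative negative-word list; the paper's list V may differ.
NEGATIVE_WORDS = {"no", "not", "never", "nothing", "nobody", "none"}

def neg_words_property(premise: str, hypothesis: str) -> int:
    """1 iff the hypothesis (or claim) contains at least one negative word."""
    hyp_tokens = hypothesis.lower().split()
    return int(any(tok in NEGATIVE_WORDS for tok in hyp_tokens))
```

For example, the pair ("a man is here", "nobody is here") is a positive probing example, while ("a man is here", "a man is present") is negative.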
Overlap/Subsequence Based on the analysis of McCoy et al. (2019), we define a class of probing tasks for identifying the different lexical heuristics in NLI. We focus on lexical overlap and subsequences, and define two sentence-pair properties: P_overlap(p, h) = 1 if all the hypothesis words are found in the premise (regardless of word order), and P_subseq(p, h) = 1 if the hypothesis is a contiguous subsequence of the premise; both are 0 otherwise.
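The two properties can be sketched as follows (a simplified version using whitespace tokenization; function names are ours):

```python
def overlap_property(premise: str, hypothesis: str) -> int:
    """1 iff every hypothesis word also occurs in the premise (order ignored)."""
    p, h = premise.lower().split(), hypothesis.lower().split()
    return int(set(h) <= set(p))

def subsequence_property(premise: str, hypothesis: str) -> int:
    """1 iff the hypothesis is a contiguous subsequence of the premise."""
    p, h = premise.lower().split(), hypothesis.lower().split()
    return int(any(p[i:i + len(h)] == h for i in range(len(p) - len(h) + 1)))
```

Note that a subsequence example is also an overlap example, but not vice versa: "the lawyer saw the doctor" overlaps "the doctor saw the lawyer" without being a contiguous subsequence of it.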

Data Processing
To alleviate issues of data balancing, we take the following steps when processing the probing datasets: First, we identify all the biased samples in a given dataset, according to the probing property. Since in all our datasets the positive class (biased samples) is the minority class, we subsample the same number of samples from the remaining subset (the majority class). We end up with a balanced probing dataset. This ensures that when splitting the data during online code training, and when measuring performance on the entire dataset, the process is unaffected by the bias evidence, that is, the amount of bias in the original dataset. The probing datasets are constructed from three base NLU datasets: SNLI (Bowman et al., 2015), MultiNLI (Williams et al., 2018) and FEVER (Thorne et al., 2018), following the original train/validation/test splits. Inspired by previous work on biases in NLU datasets (Section 2), we construct NegWords probing datasets from all three base NLU datasets, and Overlap/Subsequence probing datasets from SNLI and MNLI. The dataset statistics are presented in Table 1.
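The balancing step above can be sketched as follows (a minimal sketch under our assumptions; `prop` stands for the probing property, and the fixed seed is only for reproducibility of the subsample):

```python
import random

def balance_probing_dataset(examples, prop, seed=0):
    """Subsample the majority class down to the size of the minority
    (biased) class, yielding a label-balanced probing dataset."""
    positives = [x for x in examples if prop(x) == 1]
    negatives = [x for x in examples if prop(x) == 0]
    minority, majority = sorted([positives, negatives], key=len)
    rng = random.Random(seed)
    downsampled = rng.sample(majority, len(minority))
    balanced = minority + downsampled
    rng.shuffle(balanced)
    return balanced
```

After this step, a probe that ignores the representations can do no better than 50% accuracy, so any compression above the uniform code reflects information in the representations.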

Evaluation
We use a linear probe across all experiments. We evaluate both the probe's accuracy and its minimum description length (MDL; Voita and Titov 2020) to measure bias extractability. Formally, given a dataset D = {(x_1, y_1), ..., (x_n, y_n)} and a probabilistic model p_θ(y | x), the description length of the model is defined as the number of bits required to transmit the labels Y = (y_1, ..., y_n) given X = (x_1, ..., x_n). We estimate MDL using Voita and Titov's online coding, and denote the result L_online. Given a uniform distribution over the K labels, we get L_unif = |D| log K. Thus, the compression is defined as C = L_unif / L_online, and it holds that 1 ≤ C ≤ C*, where C* is the compression given by a perfect model. We interpret a lower MDL score (and consequently, a higher compression score) to mean that the probing property is more extractable from the model representation. The hyperparameters we use in the evaluation process are outlined in Appendix A.1.
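Online coding can be illustrated with a toy prequential sketch: the first block of labels is sent with a uniform code, and each subsequent block is coded with a model trained on everything seen so far. For brevity the "probe" below is a stand-in that only learns label frequencies (a real probe would be retrained on the representations at each step), and the fractions are illustrative rather than the paper's timestamps:

```python
import math

def online_codelength(dataset, n_classes, fractions=(0.1, 0.2, 0.4, 0.8, 1.0)):
    """Prequential (online) codelength in bits. dataset is a list of (x, y)
    pairs; the stand-in probe ignores x and codes y by smoothed label
    frequencies over the data seen so far."""
    n = len(dataset)
    bounds = [int(f * n) for f in fractions]
    bits = bounds[0] * math.log2(n_classes)        # first block: uniform code
    for start, end in zip(bounds, bounds[1:]):
        seen = [y for _, y in dataset[:start]]
        counts = {c: seen.count(c) + 1 for c in range(n_classes)}  # add-one smoothing
        total = sum(counts.values())
        for _, y in dataset[start:end]:
            bits += -math.log2(counts[y] / total)
        # (a real implementation retrains the probe on dataset[:end] here)
    return bits

def compression(dataset, n_classes):
    """C = L_unif / L_online; higher means the labels are easier to extract."""
    uniform = len(dataset) * math.log2(n_classes)
    return uniform / online_codelength(dataset, n_classes)
```

A dataset whose labels are trivially predictable compresses well (C much larger than 1), while unpredictable labels give C close to 1.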

Debiasing Methods
To deploy our framework in the context of robustness to bias, we examine several proposed strategies for debiasing NLU models. In all cases, a weak learner models the bias and is combined with a main model to produce less biased predictions.
We note that there are three different criteria for controlling the debiasing strategy: (1) Models may be trained end-to-end, by propagating errors to the weak learner as well as the main model (Mahabadi et al., 2020), or in a pipeline, where the weak learner is trained first and frozen, such that only its predictions are used to tune the combination loss (He et al., 2019; Clark et al., 2019; Sanh et al., 2021; Utama et al., 2020a). (2) The bias may be modeled explicitly, with a weak learner built on hand-crafted features that target a known bias (e.g., a hypothesis-only model), or implicitly, with a weak learner (e.g., TinyBERT) that is expected to pick up biases on its own. (3) The objective function by which the main and bias model are combined can vary. Below we describe three common objective functions. We test different combinations of all strategies where they are feasible, resulting in a wide array of debiased models.

Debiasing Objectives
Debiased Focal Loss (DFL) Focal loss was first proposed by Lin et al. (2017) to encourage a classifier to focus on the harder examples, for which the model is less confident. This is achieved by weighting standard cross-entropy with (1 − p_m)^γ, where p_m is the class probability and γ is the focusing parameter. Mahabadi et al. (2020) propose DFL, where the weighting instead uses a bias-only model's class probability p_b, and the loss becomes L_DFL = −(1 − p_b)^γ log p_m. We re-implement their model with two bias-only models: a hypothesis-only model, and a lexical bias model that uses the same input features as Mahabadi et al. (2020), outlined in Appendix A.2.

Product of Experts (PoE) Product of experts was first proposed by Hinton (2000) as a method for training ensembles of models that are experts at specific sub-spaces of the entire distribution space. Each model can focus on an "area of expertise", and their multiplied predictions form the combination model. This idea was utilized in several studies (He et al., 2019; Clark et al., 2019; Mahabadi et al., 2020; Sanh et al., 2021) to train a combination of models where the experts are weak models. The combination model output becomes ŷ = softmax(log p_m + log p_b), and it is trained with standard cross-entropy.

Confidence Regularization (ConfReg) Utama et al. (2020b) propose a self-distillation objective: a teacher model's output distribution is smoothed in proportion to the bias model's confidence on each example, and the main model is trained to match the smoothed distribution, which discourages over-confident predictions on biased examples.

FEVER The Fact Extraction and VERification (FEVER) dataset (Thorne et al., 2018) contains around 180k claim-evidence pairs, where the task is to predict one of three labels: the evidence supports the claim, refutes it, or provides not enough information. We evaluate on FEVER-Symmetric, which was designed such that it cannot be predicted by a claim-only classifier (Schuster et al., 2019).
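The DFL and PoE combinations can be sketched per example over plain probability vectors (a minimal sketch; `p_main` and `p_bias` correspond to the main and bias models' class probabilities p_m and p_b, the function names are ours, and the published implementations differ in batching and normalization details):

```python
import math

def dfl_loss(p_main, p_bias, gold, gamma=2.0):
    """Debiased focal loss for one example: cross-entropy on the main model,
    down-weighted by the bias model's confidence in the gold label."""
    return -((1.0 - p_bias[gold]) ** gamma) * math.log(p_main[gold])

def poe_probs(p_main, p_bias):
    """Product-of-experts combination: elementwise product of the two
    distributions, renormalized (softmax of summed log-probabilities)."""
    scores = [math.exp(math.log(pm) + math.log(pb)) for pm, pb in zip(p_main, p_bias)]
    z = sum(scores)
    return [s / z for s in scores]
```

When the bias model is confident in the gold label (p_b close to 1), the DFL weight vanishes, so biased examples contribute little gradient to the main model; a uniform bias model recovers (scaled) standard cross-entropy.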

Models
We test different models based on BERT, by removing the classification head and using the pooled representation of the [CLS] token as input to our probes. In settings where previous work compared in-distribution and o.o.d performance, we use hyperparameters which are known to work well for the task and dataset. For new settings not reported in previous work, we sweep for the best hyperparameters based on in-distribution accuracy on the validation set. (In our experiments, some methods did not converge, notably PoE and DFL with a weak model trained by subset sampling; this weak model was used to train ConfReg models, and is likely much more sensitive to the selection of the weak model.) All hyperparameters are available in Appendix A.3. We train all models with five random seeds and report means and standard deviations, to account for the known variability of fine-tuned models, especially when evaluated out of distribution (McCoy et al., 2020). Our probing tasks contain examples from all original labels of the datasets. (A reviewer pointed out that one could construct probing datasets whose examples are drawn only from a specific downstream label, but our experiments found that splitting per label does not reveal trends different from those we observe here.) We reimplement all debiasing methods in a unified codebase to facilitate a fair comparison. Training details are available in Appendix A.4.
Baselines We use the standard BERT-base implementation of Wolf et al. (2020). We take the pretrained model without further fine-tuning on any downstream task (denoted Pretrained), and we also fine-tune the model on the target dataset (Base). To obtain a lower bound on the performance of these models, we take the same model and randomly initialize its weights (Random).

Results
In this section, we first report our main finding: the correlation between the robustness of models and the extractability of bias from their representations. We then analyze each bias type and dataset in a more fine-grained manner. Table 3 shows the Pearson correlations (ρ) between robustness and bias extractability. Robustness is measured as the difference between the performance of a debiased model on a relevant o.o.d dataset and that of a baseline model; higher values mean that the debiased model is more robust. Bias extractability is measured as the compression score of a probing classifier designed to target the bias. In all but one case, we find positive correlations, indicating that the more successful a method is in debiasing model predictions, the more it makes the bias accessible in the inner representations.
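For reference, the statistic is the standard Pearson correlation between per-model robustness deltas and compression scores; a dependency-free sketch (the numbers below are illustrative only, not values from the paper):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-model values: o.o.d accuracy gain over the baseline,
# and the probe's compression score on the same models.
robustness = [0.5, 2.1, 4.0, 6.3]
compression_scores = [3.0, 3.1, 3.3, 3.6]
rho = pearson(robustness, compression_scores)
```

A ρ near +1 means that the models gaining the most out of distribution are also the ones whose representations compress the bias labels best.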
The only exception is NegWords bias on MNLI, where we report a negative correlation. As we analyze below, in this case some models do not improve on o.o.d data, but their compression still increases. This suggests that even though various debiasing methods are not always successful on different datasets and bias types, they still make bias more accessible in the representations.
Lexical Overlap Bias

MNLI As performance on the anti-biased examples from the HANS subset increases, so does the compression of the probe; Figure 1 shows an example of this trend in the subsequence case. DFL with implicit bias from the TinyBERT model (trained either end-to-end or in a pipeline) has the highest compression values, as well as the biggest improvement out of distribution.
SNLI Table 4 shows results for the Overlap and Subsequence probing tasks. All debiasing methods lead to improved o.o.d performance, as expected. Compression of the random and pretrained baselines remains very close, with most of the bias becoming extractable only in the representations of the fine-tuned baseline (Base). Most of the debiased models still largely surpass the baseline in compression and probing accuracy, indicating that they make bias more extractable. ConfReg and DFL with a fine-tuned TinyBERT are exceptions; they do not exhibit higher compression than the baseline, but still improve out of distribution.

Table 5: Results on FEVER-Symmetric (Schuster et al., 2019), which is designed such that a claim-only classifier cannot achieve higher-than-guess performance on it. Probing accuracy is reported in Appendix A.5.

Negative Word Bias
FEVER Table 5 shows the results for the NegWords task on FEVER. All models improve on FEVER-Symmetric compared to the baseline (Base), indicating that they are less biased in their predictions. However, when probed for the bias, all models achieve higher compression than the baseline and outperform it in terms of probing accuracy. That is, this bias is more extractable in the debiased models than in the baseline model. As a point of reference, the compression of the random model is the smallest, closely followed by the pretrained model. Any fine-tuning leads to significantly larger compression scores. These trends are consistent with the Overlap/Subsequence results. The best model in terms of o.o.d accuracy is DFL with an implicit TinyBERT bias model. We also see that bias is most extractable in this model, compared to the baseline. While previous work used statistical tools to show that the REFUTES label is spuriously correlated with negative bigrams (Schuster et al., 2019), we reveal that this information is preserved and even amplified in the model when an attempt is made to make the predictions less reliant on it.
SNLI In this case, all models perform better or as well as the baseline model when evaluated on the hard subset, yet the compression values of all models significantly surpass the baseline. While any form of debiasing makes bias more available in the representations, it does not necessarily lead to an improvement on the o.o.d set. Models with a hypothesis-only model perform best out of distribution, and also expose the most bias. Similarly to the results on FEVER, the compression of the random and pretrained models is significantly lower and close to each other, with most of the bias being made available by fine-tuning the model (Base). Table 7 in Appendix A.5 provides the full results.
MNLI Compression results are much closer to the fine-tuned baseline, but all debiased models still contain more information about negation words. This is in line with previous results that analyzed the statistical correlation of negation words with the CONTRADICTION label (Gururangan et al., 2018; Poliak et al., 2018); we show that not only does the correlation exist in the data, but attempts to remove such evidence result in more extractability. Still, most of the growth in compression compared to the random and pretrained models is attributed to the fine-tuning process itself (without debiasing). Interestingly, some of the models do not improve performance on the hard test set, but their compression still increases, suggesting that the more accessible bias can also be decoupled from the predictions of the model. Table 8 in Appendix A.5 provides the full results.

Varying the Debiasing Effect
So far we evaluated the effect of debiasing on bias extractability across debiasing methods. To evaluate this effect within the same method, we analyze the effect of stronger debiasing in the DFL method, by increasing the focusing parameter γ.
We test our probing tasks on models trained with increasing values of γ ∈ {1, 2, 3, 4}. Figure 2 shows the results for the Overlap/Subsequence tasks. As we increase γ, the extractability of bias from the model's representations increases. This is consistent with our main results.

Linguistic Information in Debiased Models
Following the main results, a useful question to ask is whether debiased models also tend to learn useful linguistic information more broadly, which may explain the noticeable increase in performance out of distribution. To test this, we take our models trained for NLI on the MNLI dataset and apply the SentEval probing tasks (Conneau et al., 2018), which test ten different linguistic properties in model representations. We exclude the word content (WC) task, because it is a 1000-way classification problem and takes substantially more time to train with an MDL probe. Table 6 shows the average results for all debiased models and the remaining nine tasks, compared to our three baselines (random, pretrained, fine-tuned). First, we notice that for 8/9 tasks, compression decreases when the model is fine-tuned, compared to the pretrained model. This can be explained by the close connection between the linguistic phenomena and the masked language modeling (MLM) objective, compared to fine-tuning on NLI. Furthermore, on average, debiased models do not decrease in compression compared to the fine-tuned model, but the differences are very subtle and generally within standard deviation bounds. This suggests that while debiasing does not make linguistic information measured in these probing tasks less extractable, it also does not substantially amplify it, as opposed to extractability of bias information.

Table 6: Average accuracy and compression scores for debiased models and baselines, when probed for the SentEval tasks (Conneau et al., 2018). Random is the randomly initialized model, Pretrained is the pretrained model without fine-tuning, and Base is the fine-tuned model. Accuracy and Average denote the average accuracy and compression score of M debiased models trained on MNLI (M = 8).

Discussion and Conclusion
All of our experiments tested model-based debiasing, where a weak learner is used to capture biased features and discourage their use in model predictions. We discover that for both explicit and implicit modeling of the bias, this method exposes the biased features in the representation. When we fix the model and vary the strength of debiasing (through the focusing parameter of DFL), we observe the same trend, where stronger bias mitigation leads to higher extractability of the modeled bias. Based on our results, we posit that while current debiasing methods are good at making model predictions less biased, they are a poor proxy for learning unbiased text representations. The increased extractability of bias from the representations is not necessarily a bad trait: for example, the NegWords probe cannot distinguish surface negation cues from more granular semantics of negation, which may be useful for the generalization of the model. By probing for linguistic properties using the SentEval tasks, we also observe that debiased models do not make linguistic information less extractable, which can also contribute to their improvement in performance.
We argue that future research should look for more interpretable methods for debiasing language models, and consider the problem of finding robust, bias-free feature detectors.
Another domain where this finding may be alarming is social bias. Previous studies show that word vectors contain social bias (Caliskan et al., 2017), and that debiasing them does not necessarily remove this information (Gonen and Goldberg, 2019). Our work shows that debiasing sometimes increases the information available about bias in the representations, albeit in the context of dataset bias rather than social bias.
Our work shows that unbiased predictions do not imply unbiased representations; on the contrary, pushing for unbiased predictions can yield more biased representations. We speculate that there may exist a training procedure that removes bias information from the representations and consequently improves the generalization of predictions out of distribution. Future work could focus on methods that are both representation-robust and prediction-robust with respect to various biases. Finding such methods can help alleviate leakage of bias from data to the model's representations, without sacrificing in-distribution performance.

A.1 Online Code Evaluation
Following Voita and Titov (2020), we evaluate our models using an online code probe, with timestamps [2.0, 3.0, 4.4, 6.5, 9.5, 14.0, 21.0, 31.0, 45.7, 67.6, 100], where each timestamp corresponds to a percentage of the samples in the training dataset. We use a slightly different scale than Voita and Titov (2020), to account for the smaller datasets and the resulting instability in the first fractions of training. The last timestamp is used to train the probe on the full training dataset, which is then evaluated for accuracy on the entire test set. During all training phases, we employ early stopping when the validation accuracy does not improve over four epochs, with a tolerance of 10^-3.

A.2 Bias-only Models
For the lexical bias-only model, we use the following bias input features: 1) whether all words in the hypothesis are included in the premise; 2) whether the hypothesis is a contiguous subsequence of the premise; 3) whether the hypothesis is a subtree of the premise's parse tree; 4) the number of tokens shared between premise and hypothesis, normalized by the number of tokens in the premise; and 5) the cosine similarity between the premise's and hypothesis's pooled token representations from BERT, followed by min, mean, and max pooling. Following Mahabadi et al. (2020), we also give equal weights to the neutral and contradiction labels (by calculating a weighted cross-entropy loss) to encourage the model towards biased predictions.
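The surface-level features (1), (2), and (4) can be sketched as follows (a simplified version with whitespace tokenization; the parse-tree and BERT-similarity features are omitted, and the function name is ours):

```python
def lexical_bias_features(premise: str, hypothesis: str):
    """Hand-crafted inputs for a lexical bias-only model: overlap indicator,
    contiguous-subsequence indicator, and normalized token overlap."""
    p, h = premise.lower().split(), hypothesis.lower().split()
    all_in_premise = float(set(h) <= set(p))
    contiguous = float(any(p[i:i + len(h)] == h for i in range(len(p) - len(h) + 1)))
    # Shared token types, normalized by premise length (one simplification of
    # "number of tokens shared ... normalized by the number of tokens in the premise").
    shared = len(set(p) & set(h))
    overlap_rate = shared / len(p) if p else 0.0
    return [all_in_premise, contiguous, overlap_rate]
```

The resulting feature vector is fed to a small classifier that serves as the explicit bias-only expert in DFL or PoE.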

A.3 Hyperparameters
ConfReg We train all models for five epochs and use the same hyperparameters as Utama et al. (2020b): 2000 samples for the weak-learner sub-sampling, a batch size of 32, a learning rate of 5·10^-5, a weight decay of 0.01, and a linear scheduler for modulating the learning rate with a 10% warm-up proportion. For training on FEVER, we set a learning rate of 2·10^-5 and sub-sample 500 samples. For SNLI we use the same parameters as for MNLI, but we sub-sample 3,000 samples to account for the larger dataset, and make sure that the weak model still satisfies the constraints: at least 90% of its predictions on the sampled training set fall within the 0.9 probability bin, and the weak learner achieves more than 60% accuracy on the entire training set.

DFL and PoE
We train all models for three epochs on MNLI and SNLI with a batch size of 32, a learning rate of 5·10^-5, a weight decay of 0.0, and a linear scheduler for modulating the learning rate with a 10% warm-up proportion. We choose γ = 2.0 for most of the DFL models. Exceptions are made for DFL with the subsampled bias model and end-to-end DFL with a TinyBERT bias model, where we sweep γ ∈ {1.0, 2.0} and choose γ = 1.0 based on the highest in-distribution validation accuracy. Another exception is made for FEVER, where we set the learning rate to 2·10^-5 to be consistent with previous work.

A.4 Training Details
To train all models, we used single instances of an NVIDIA GeForce RTX 2080 Ti, with an average training time of 1-7 hours. Models where the weak learner is frozen have 110M trainable parameters, as in the base BERT model. TinyBERT models have 4.4M parameters (Turc et al., 2019), and the parameter count of any combination of a weak model and a main model follows directly.