End-to-End Self-Debiasing Framework for Robust NLU Training

Existing Natural Language Understanding (NLU) models have been shown to incorporate dataset biases, leading to strong performance on in-distribution (ID) test sets but poor performance on out-of-distribution (OOD) ones. We introduce a simple yet effective debiasing framework whereby the shallow representations of the main model are used to derive a bias model, and both models are trained simultaneously. We demonstrate on three well-studied NLU tasks that, despite its simplicity, our method leads to competitive OOD results. It significantly outperforms other debiasing approaches on two tasks, while still delivering high in-distribution performance.

The most common approach to tackling the problem consists of training a bias model with handcrafted features, with the goal of identifying biased training examples. This information is used at a later stage to discourage the main model from adopting the naive strategy of the bias model. Several debiasing training paradigms have been proposed to adjust the importance of biased training samples, such as product of experts (Clark et al., 2019; He et al., 2019), learned-mixin (Clark et al., 2019), example reweighting (Schuster et al., 2019), debiased focal loss (Mahabadi et al., 2020), and confidence regularization (Utama et al., 2020a).
Recently, there have been a number of endeavours to produce a bias model without prior knowledge of the targeted biases and without the need for manually designed features. Utama et al. (2020b) propose instead to use, as the bias model, a model trained for a few epochs on a tiny fraction (< 1%) of the training data, while Clark et al. (2020) and Sanh et al. (2020) train a low-capacity model on the full training set. These approaches target the training of the bias model alone, which is subsequently queried while training the main model of interest.
In this paper, we propose an end-to-end debiasing framework that requires neither an extra training stage nor manual engineering of bias features. The bias model is simply an attention-based classification layer on top of the main model's intermediate representations. Both models are trained simultaneously in an end-to-end manner, as in Mahabadi et al. (2020), where the importance of training samples for both models is adjusted using the example reweighting technique of Schuster et al. (2019). The idea of using intermediate classifiers has previously been explored to reduce the inference cost of Transformer-based (Vaswani et al., 2017) models (Schwartz et al., 2020; Zhou et al., 2020; Xin et al., 2021).
In contrast to all previous works, our bias model helps locate lexico-syntactic bias features inside the main model's intermediate layers, whose importance is then reduced by adding a noise vector, thereby preventing the main model from relying on them. During training, the main and bias models interact by interchangeably re-weighting the importance of each other's examples.
Our learning framework, when applied to a vanilla BERT (Devlin et al., 2019) model, yields competitive OOD results on three well-studied NLU tasks while maintaining high in-distribution performance.

Method
We describe an end-to-end solution for debiasing a Transformer-based classification model. We denote by x = {x_1, x_2, ..., x_n} the input sequence of length n, and by y ∈ {1, 2, ..., T} the gold label, where T is the number of classes. Figure 1 shows a diagram of our method.

Main Model
The main model is a Transformer-based BERT encoder (Vaswani et al., 2017; Devlin et al., 2019) with a classifier on top of the [CLS] classification token of the last layer. This classifier generates the probability distribution p_m ∈ R^T over the output classes given an input x. At each layer k, BERT produces an internal hidden representation h_i^k ∈ R^d for each token x_i in the input.
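To make the shapes concrete, here is a minimal numpy sketch of the classification head described above. All sizes and weights are illustrative stand-ins, not the actual BERT-base parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(v):
    """Numerically stable softmax."""
    e = np.exp(v - v.max())
    return e / e.sum()

# Illustrative sizes (not the real BERT-base dimensions): n tokens,
# hidden size d, T output classes; position 0 stands in for [CLS].
n, d, T = 5, 8, 3
h_last = rng.normal(size=(n, d))   # stand-in for last-layer hidden states

# Classifier on top of the [CLS] representation: p_m in R^T.
W_cls = rng.normal(size=(d, T))
p_m = softmax(h_last[0] @ W_cls)
```

In practice `h_last` would come from the encoder's final layer; only the [CLS] position feeds the classifier.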

Bias Model
We hypothesize that the bottom layers of the main model can serve as input for the bias model, thus avoiding the need for an external model or for manually designed features. Our intuition is built on the observation made by Jawahar et al. (2019) that the bottom layers of BERT mainly encode lexical and syntactic information, and on the fact that such models tend to quickly overfit to this type of information (Zellers et al., 2018; McCoy et al., 2019).
Our bias model is composed of an additive attention layer (Bahdanau et al., 2014) followed by a softmax layer. This module takes the k-th layer of the main model as input and produces a probability distribution over the classes. First, a scalar value ã_i ∈ R is computed for each token using a feed-forward neural network with weight matrices W_e ∈ R^{d×d} and W_a ∈ R^{1×d}:

ã_i = W_a tanh(W_e h_i^k)

These scalar values are normalized into attention weights a_i:

a_i = exp(ã_i) / Σ_{j=1}^{n} exp(ã_j)

The representation vector v_b of the bias model is computed as the weighted sum of the intermediate representations of the k-th layer:

v_b = Σ_{i=1}^{n} a_i h_i^k

We obtain the probability distribution p_b over the classes by applying a projection layer (parameterized by W_y ∈ R^{d×T}) followed by a softmax function:

p_b = softmax(v_b W_y)
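The bias model above can be sketched in a few lines of numpy. The tanh non-linearity inside the additive attention is the standard Bahdanau formulation and is assumed here; all dimensions and weights are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(v):
    """Numerically stable softmax."""
    e = np.exp(v - v.max())
    return e / e.sum()

n, d, T = 5, 8, 3                    # illustrative sizes
h_k = rng.normal(size=(n, d))        # main model's hidden states at layer k

# Bias-model parameters theta_B = {W_e, W_a, W_y}.
W_e = rng.normal(size=(d, d))
W_a = rng.normal(size=(1, d))
W_y = rng.normal(size=(d, T))

# Additive attention: one scalar score per token ...
a_tilde = (W_a @ np.tanh(W_e @ h_k.T)).ravel()   # shape (n,)
# ... normalized into attention weights a_i.
a = softmax(a_tilde)

# Weighted sum of the layer-k representations, then projection + softmax.
v_b = a @ h_k                 # shape (d,)
p_b = softmax(v_b @ W_y)      # distribution over the T classes
```

Note that the only trainable parameters are the three small matrices, which keeps the bias model deliberately low-capacity.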

Biased Features Regularization
The attention weights of the bias model reveal which hidden representations are the most informative for that model's classification decision. We propose to de-emphasize these "bias features" in the main model. We do so by adding a weighted noise vector to the hidden representations of the main model at layer k. Let z = {z_1, ..., z_n} be a set of zero-mean Gaussian noise vectors, one per token in the sequence, where z_i ∈ R^d. The intermediate representations of the main model at layer k are updated as follows:

h_i^k ← h_i^k + a_i · z_i

where a_i determines the relative amplitude of noise added to each token representation. A high value of a_i means that h_i^k most likely contains bias features. Consequently, a large amount of noise is added at this position, which prevents the main model from overfitting to these features at subsequent layers.
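The noise injection step is a single broadcasted update. A minimal sketch, with illustrative attention weights (the last one set to zero so that token receives no noise):

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 5, 8
h_k = rng.normal(size=(n, d))   # layer-k representations of the main model

# Attention weights from the bias model (illustrative values; the last
# weight is 0, so the last token is left untouched).
a = np.array([0.5, 0.3, 0.1, 0.1, 0.0])

# One zero-mean Gaussian noise vector z_i per token.
z = rng.normal(loc=0.0, scale=1.0, size=(n, d))

# Corrupt each token representation in proportion to its attention weight:
# the tokens the bias model relies on most are perturbed the most.
h_k_noised = h_k + a[:, None] * z
```

At inference time this step is skipped entirely, so the update only regularizes training.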

Example Re-weighting
For both models, we adjust the importance of a training sample by directly assigning it a scalar weight that indicates whether the sample exhibits a bias. Let p_m^c and p_b^c be the probabilities predicted for the ground-truth class by the main and bias models, respectively. The weight assigned to a main model training sample is calculated as follows:

w_m = 1 − p_b^c if p_b^c > γ, and w_m = 1 otherwise.

Differently from the reweighting methods of Clark et al. (2019) and Schuster et al. (2019), we add a hard threshold γ ∈ [0, 1] to control the number of samples to be re-weighted. The goal is to ensure that the importance of main model samples is attenuated only if the bias model prediction falls into the high-confidence bin (p_b^c > γ, with γ ≥ 0.8) on biased samples. To further strengthen this constraint, we down-weight the importance of bias model training samples if the main model confidence is below a threshold β:

w_b = p_m^c if p_m^c < β, and w_b = 1 otherwise.

This ensures that the bias model focuses on easy examples at early training stages, while challenging ones are gradually fed in at later training steps. Since the main and bias models are trained simultaneously, they interchangeably re-weight each other's examples. This differs from previous works, where all samples have the same importance during the training phase of the bias model. For a single training instance, the individual loss terms for the main and bias models are:

L_m = −w_m log p_m^c and L_b = −w_b log p_b^c

where θ_B = {W_e, W_a, W_y} are the trainable parameters of the bias model, and θ_M are those of the main model (the BERT parameters). We train both models by minimizing the aforementioned losses. At inference time, only the main model is used for prediction, and no noise is added to h^k. It is worth mentioning that our proposed extension has no impact on the overall training time.
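The thresholded re-weighting described above can be expressed as two small functions. The exact functional forms (1 − p_b^c for the main model, p_m^c for the bias model) are one plausible reading of the thresholding scheme, with the default thresholds taken from the hyper-parameter discussion:

```python
import math

GAMMA = 0.8  # confidence threshold on the bias model
BETA = 0.5   # confidence threshold on the main model

def main_weight(p_b_c, gamma=GAMMA):
    """Attenuate a main-model sample only when the bias model is highly
    confident on its gold class; otherwise keep full weight."""
    return 1.0 - p_b_c if p_b_c > gamma else 1.0

def bias_weight(p_m_c, beta=BETA):
    """Down-weight a bias-model sample while the main model is still
    unsure about it, so the bias model sees easy examples first."""
    return p_m_c if p_m_c < beta else 1.0

def weighted_nll(weight, p_c):
    """Re-weighted cross-entropy term on the gold-class probability."""
    return -weight * math.log(p_c)
```

For instance, a sample on which the bias model assigns 0.95 to the gold class keeps only 0.05 of its weight in the main model's loss, while one at 0.5 keeps full weight.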

Datasets
We test our method on three sentence-pair classification tasks, each supported by ID training and validation sets as well as an OOD test set specifically built to measure robustness to dataset bias.
For the Natural Language Inference task, we use MNLI (Williams et al., 2017), with HANS (McCoy et al., 2019) as its OOD test set. We also evaluate on the FEVER fact verification task and on QQP paraphrase identification, with PAWS as its OOD test set.

Implementation
We use the 12-layer BERT-base model (Devlin et al., 2019) as our main model, so that our results can be compared with prior works. We adopt the standard BERT setup and represent a pair of sentences as: [CLS] 1st sentence [SEP] 2nd sentence [SEP]. For BERT hyper-parameters, we use those of the baseline: a batch size of 64 and a learning rate of 1e-5 with the Adam optimizer (Kingma and Ba, 2014).
For the hyper-parameters of our method, different settings performed roughly equally well, provided that k is 3 or 4, γ ≥ 0.8, and β is set to 0.5. We use early stopping and report mean performance and standard deviation over 6 runs with different seeds.

Table 1 reports accuracy scores on the development and OOD test sets of the three benchmarks we consider. The baseline is the vanilla BERT-base model, which serves as the backbone of the main model in all reported configurations. The table also shows ablation results for our method without adding noise, and without reweighting the training instances of the bias and main models (setting β and γ to 0 or 1, respectively). Symbols distinguish variants from the same paper that use a different training technique: (†) example reweighting, (‡) confidence regularization, (♠) product of experts (PoE), (♣) PoE + cross-entropy.

Results
First, we notice high variance in performance across runs in the debiasing setting, as also reported by Clark et al. (2019), Mahabadi et al. (2020), Utama et al. (2020b), and Sanh et al. (2020). Second, we observe that our method offers a good balance between gains over the baseline on OOD test sets and losses on ID sets. More precisely, we report the best results on FEVER (both ID and OOD test sets), while we improve the HANS score on the MNLI task but fail to maintain the baseline score on dev, as Utama et al. (2020a) did.
On QQP, Utama et al. (2020b) report a much higher score on PAWS (57.4% vs. 46.5%), but at the expense of a substantial drop on the ID dev set (85.2% vs. 90.2%). However, our method outperforms this particular model on both MNLI sets, and also shows better cross-task performance compared to prior works. The results are satisfactory, especially considering the simplicity and efficiency of our approach. Moreover, the fact that a single configuration works well on all three tasks indicates that our method has the potential to generalize to completely unknown OOD sets (Clark et al., 2020).
Expectedly, deactivating the main model reweighting mechanism results in near-baseline performance. Solely adding the noise signal leads to a modest gain of 2-3% on the OOD test sets and a slight drop (< 1%) on the dev sets compared to the baseline. On one hand, without adding noise, our scores are comparable with previous works across the three tasks, that is, a significant drop on OOD test sets and minor gains on ID ones. These observations suggest that down-weighting biased examples is important, while de-emphasizing bias features further improves robustness.

On the other hand, we observe that not reweighting bias model examples results in the worst performance on the ID sets. We conduct an analysis of the bias model to better understand the impact of reweighting its examples (Table 2). We inspected the confidence scores of the main and bias models (with and without example reweighting) on the MNLI dev set; an excerpt is reported in Figure 2. On one hand, we observe that bias models successfully assign high probabilities to samples that can be easily classified via keywords (e.g. "not" in example a) and to those with high lexical overlap (example b). On the other hand, we notice that the bias model performs undesirably well on some challenging examples, such as (c) and (d) of Figure 2. However, the bias model probability always decreases when we re-weight its examples, which eventually leads to correct predictions by the main model, as in examples (c) and (e). Interestingly, this observation suggests that training a pure bias model is as important (and challenging) as training a robust one.

Conclusion
Our key contribution is a framework that jointly identifies biased examples and features in an end-to-end manner. Our approach is geared towards addressing lexico-syntactic bias features in Transformer-based NLU models. Future work includes testing our approach on other tasks such as Question Answering (Rajpurkar et al., 2016; Agrawal et al., 2018), exploring methods to obtain proxy OOD data for hyper-parameter selection, and making our method hyper-parameter free.