Robust Natural Language Understanding with Residual Attention Debiasing

Natural language understanding (NLU) models often suffer from unintended dataset biases. Among bias mitigation methods, ensemble-based debiasing methods, especially product-of-experts (PoE), have stood out for their impressive empirical success. However, previous ensemble-based debiasing methods typically apply debiasing on top-level logits without directly addressing biased attention patterns. Attention serves as the main medium of feature interaction and aggregation in PLMs and plays a crucial role in providing robust predictions. In this paper, we propose REsidual Attention Debiasing (READ), an end-to-end debiasing method that mitigates unintended biases from attention. Experiments on three NLU tasks show that READ significantly improves the performance of BERT-based models on OOD data with shortcuts removed, including +12.9% accuracy on HANS, +11.0% accuracy on FEVER-Symmetric, and +2.7% F1 on PAWS. Detailed analyses demonstrate the crucial role of unbiased attention in robust NLU models and that READ effectively mitigates biases in attention. Code is available at https://github.com/luka-group/READ.

Although the attention mechanism (Vaswani et al., 2017) is essential to the success of Transformer-based pretrained language models (PLMs), attention can also capture potentially spurious shortcut features, leading to prediction biases. For example, too much or too little attention across sentences in natural language inference may lead to the lexical overlap bias (McCoy et al., 2019; Rajaee et al., 2022) or the hypothesis-only bias (Poliak et al., 2018). Since attention serves as the main medium for feature interactions in PLMs, many of the aforementioned biases can be associated with biased attention patterns. In fact, a number of recent studies have shown that appropriate attention plays a critical role in ensuring robust prediction (Chen et al., 2020; Li et al., 2020; Stacey et al., 2022). However, existing ensemble-based debiasing methods typically apply debiasing on top-level logits (Clark et al., 2019a; He et al., 2019; Sanh et al., 2020; Utama et al., 2020b; Ghaddar et al., 2021). These methods do not proactively mitigate attention biases, but instead rely on debiasing signals being propagated from final predictions to the attention modules in a top-down manner. Top-level logits are highly compressed, and the propagation may suffer from information loss, thus providing limited debiasing signal to low-level attention. Instead, we seek an effective attention debiasing method that prevents models from learning spurious shortcuts, especially those captured by the attention mechanism.
In this paper, we propose REsidual Attention Debiasing (READ), an end-to-end debiasing method that mitigates unintended biases from attention. Our method is inspired by the recent success of one-stage PoE (Ghaddar et al., 2021). As an ensemble-based debiasing method, it trains a biased model to capture spurious in-distribution shortcuts, and trains the ensemble of the biased model and a main model to prevent the main model from relying on spurious shortcuts. To do this end-to-end, one-stage PoE trains the biased model and its ensemble with the main model simultaneously in a weight-sharing manner. In READ, we let the two models share all weights except attention modules and classification heads, allowing the main model to fit the unbiased attention residual with respect to the attention of the biased model. Intuitively, since they are trained on the same batch of data at each iteration, the biased model's attention and the main model's attention are likely to capture similar spurious features, making their residual free of such biases. Fig. 1 presents an example of the attention change. Given a non-duplicate sentence pair, BERT, which suffers from lexical overlap bias, does not aggregate much information from non-overlapping tokens. In contrast, READ learns to pay more attention to informative non-overlapping tokens.
Experiments on three NLU tasks show that READ significantly improves the performance of BERT-based models on OOD data where common types of shortcuts are removed, including +12.9% accuracy on HANS, +11.0% accuracy on FEVER-Symmetric, and +2.7% F1 on PAWS. We further examine the attention scores of the debiased main model and find that its distribution is more balanced (§4.1). These results indicate the crucial role of unbiased attention in robust NLU models. We also demonstrate that our method is still effective when using a biased model with the same parameter size as the main model (§4.2), which differs from the previous assumption that the biased model for unknown biases should be weaker (Sanh et al., 2020; Utama et al., 2020b).
Our contributions are three-fold. First, we propose READ, an ensemble-based debiasing method for NLU models that mitigates attention biases through learning attention residual. Second, experiments on three NLU tasks consistently demonstrate that READ can significantly improve OOD performance under various dataset biases. Third, detailed analyses provide useful insights for developing robust NLP models, including the importance and properties of unbiased attention and the design of biased models in ensemble-based debiasing methods.

Method
Our method, READ, combines one-stage product-of-experts (PoE) with learning attention residual to mitigate unknown dataset biases for NLU tasks. After defining the problem, we introduce the two key components of our method, followed by the details of training and inference.

Problem Definition
For a discriminative task, given a dataset D = {x_i, y_i}, where x_i is the raw input and y_i is the gold label, our goal is to learn a robust function f with parameters θ that can predict a probability distribution p = f(x_i; θ) without relying on spurious features. In NLU tasks, x_i is typically a textual sequence. As discussed in prior studies (Gardner et al., 2021; Eisenstein, 2022), spurious features captured by f, such as particular words (Gardner et al., 2021) and lexical overlap ratios (Rajaee et al., 2022), although they may be statistically correlated with y_i due to dataset artifacts (Gururangan et al., 2018), should not be regarded as useful information for predicting y_i. In other words, the prediction of a robust and faithful model should be independent of these non-causal features. Since diverse spurious features may exist in NLU datasets, we focus on mitigating dataset biases without any assumption about the type or structure of the bias, so that the proposed method is generalizable to most unknown biases.

One-Stage PoE
Due to the automated process of feature extraction in neural networks, it is impractical to train a robust model that directly identifies all robust features in the tremendous feature space. Considering that spurious features are often simple features (typically from easy-to-learn instances) that models tend to memorize first (Shah et al., 2020), an ensemble-based debiasing method trains a biased model to collect a biased prediction p_b, and approximates an ensemble prediction p_e based on p_b and another prediction p_m from a main model towards the observations in training data. Considering both parts of the ensemble prediction p_e, since the biased model mainly captures spurious shortcuts, the main model, as its complement, then focuses on capturing robust features. READ adopts PoE (Clark et al., 2019a) to obtain a multiplicative ensemble of the two models' predictions:

p_e = softmax(log p_m + log p_b).   (1)

Specifically, READ follows the one-stage PoE framework (Ghaddar et al., 2021) that simultaneously optimizes the ensemble prediction and the biased prediction, and shares weights between the main model and the biased model, as shown in Fig. 2. When using a PLM as the main model, one-stage PoE typically uses one or a few bottom layers of the PLM stacked with an independent classification head as the biased model, because these low-level layers preserve rich surface features (Jawahar et al., 2019) which can easily cause unintended biases (McCoy et al., 2019; Gardner et al., 2021). The main model has shared encoder layers at the bottom, followed by independent encoder layers and its classification head. This weight-sharing design makes it possible to debias the model end-to-end with few additional parameters. However, shared layers result in shared biases in these layers. Although PoE mitigates biases from predictions, it preserves biases in shared layers.
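As a concrete illustration, the multiplicative ensemble in Eq. 1 can be sketched in a few lines of numpy (function names are ours, not from the released code):

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def poe_ensemble(logits_main, logits_biased):
    """Product-of-experts: multiply the two predicted distributions and
    renormalize, i.e. a softmax over the sum of log-probabilities (Eq. 1)."""
    log_pm = np.log(softmax(logits_main))
    log_pb = np.log(softmax(logits_biased))
    return softmax(log_pm + log_pb)
```

Note that when the biased model is confident on a shortcut-solvable example, its log-probabilities dominate the product, so the training signal on such examples pushes the main model toward other features; when the biased model is uniform, the ensemble reduces to the main prediction.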

Learning Attention Residual
Ensemble prediction with PoE cannot effectively mitigate unintended biases in attention, which is the major part of feature aggregation and interaction in PLMs. For example, the [CLS] representation aggregates information from all token representations according to the attention distribution, and all token representations interact with each other based on the attention values. Therefore, biased attention becomes the direct source of many spurious features, such as lexical overlap in natural language inference and semantic-neutral phrases in sentiment analysis (Friedman et al., 2022). To prevent the main model from learning biased attention, READ further conducts an additive ensemble of the attention distributions of the main and biased models. Similar to ensemble prediction, the attention ensemble here encourages the main model attention to learn from the residual of the biased model attention, so as to remove from the former the biases captured by the latter.
Fig. 2 shows the workflow of learning attention residual. The self-attention mechanism (Vaswani et al., 2017) allows each vector in a matrix to interact with all vectors in the same matrix. Specifically, the input matrix H is first projected to a query matrix Q, a key matrix K, and a value matrix V. The attention scores of all vectors with respect to each vector form a probability distribution computed from the dot product between Q and K. With attention scores as weights, the self-attention module maps each vector in H to a weighted average of V. In READ, the main attention and biased attention use distinct projection weights for Q and K, but take the same H as input and share the same projection weights for V. Distinct Q and K allow the two models to have their own attention. Sharing H and V ensures the attention distributions of the biased and main models lie in the same semantic space, so that they are additive. The ensemble attention a_e combines the main attention a_m and the biased attention a_b with a weighted average.
This additive ensemble is inspired by the success of using the probability difference for post-hoc debiasing (Niu et al., 2021; Qian et al., 2021; Wang et al., 2022c) and for preventing overconfidence (Miao et al., 2021). In our case, the main attention is the difference between the ensemble attention and the biased attention. READ also adds a coefficient α ∈ (0, 1) to balance the ensemble ratio. An appropriate coefficient can prevent over- or under-debiasing. Finally, the ensemble attention becomes

a_e = (1 − α) · a_m + α · a_b.   (2)

We now have three paths in the attention module: ensemble attention, main attention, and biased attention. In each forward pass from the input to p_m or p_b, only one of them is activated as the final attention distribution. During training, READ adopts ensemble attention to compute p_m and biased attention to compute p_b, mitigating biases in the main attention by learning their residual. During inference, READ adopts the main attention, which is free of bias, to compute the robust prediction p_m.
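The attention ensemble described above can be sketched as follows (a simplified single-head sketch under our own naming; the actual implementation operates per attention head inside BERT's ensemble layers):

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def ensemble_attention(H, Wq_m, Wk_m, Wq_b, Wk_b, Wv, alpha=0.1):
    """Main and biased attention use distinct Q/K projections but share the
    input H and the value projection Wv, so the two distributions live in the
    same space and are additive: a_e = (1 - alpha) * a_m + alpha * a_b (Eq. 2)."""
    d = Wq_m.shape[1]
    a_m = softmax((H @ Wq_m) @ (H @ Wk_m).T / np.sqrt(d))  # main attention
    a_b = softmax((H @ Wq_b) @ (H @ Wk_b).T / np.sqrt(d))  # biased attention
    a_e = (1 - alpha) * a_m + alpha * a_b                   # weighted average
    return a_e @ (H @ Wv), a_m, a_b                         # ensemble path output
```

Since a_m and a_b are both row-stochastic, their convex combination a_e is a valid attention distribution as well, which is why the shared value space matters.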

Training and Inference
We train the ensemble model and the biased model on the same data batch B simultaneously with a cross-entropy loss on each prediction:

L = L_e + L_b = − (1/|B|) Σ_{(x_i, y_i) ∈ B} [log p_e(y_i | x_i) + log p_b(y_i | x_i)].   (4)
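The batch loss can be sketched as follows (a numpy sketch of the cross-entropy objective above; variable names are ours):

```python
import numpy as np

def batch_loss(p_e, p_b, labels):
    """L = L_e + L_b: mean negative log-likelihood of the gold labels under
    the ensemble prediction plus the same under the biased prediction."""
    idx = np.arange(len(labels))
    l_e = -np.mean(np.log(p_e[idx, labels]))  # ensemble cross-entropy
    l_b = -np.mean(np.log(p_b[idx, labels]))  # biased cross-entropy
    return l_e + l_b
```

In the actual training, gradients through p_b and a_b are stopped when minimizing L_e (e.g., via `.detach()` in PyTorch), so that L_e only updates the main model while L_b updates the biased model.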

Experiment
In this section, we evaluate the debiasing performance of READ on three NLU tasks. We first provide an overview of the experimental settings (§3.1 and §3.2), followed by a brief description of baseline methods (§3.3). Finally, we present a detailed analysis of empirical results (§3.4).

Datasets
Following previous works (Utama et al., 2020b; Ghaddar et al., 2021; Gao et al., 2022), we use three English NLU tasks for evaluation, namely natural language inference, fact verification, and paraphrase identification. Each task uses an out-of-distribution (OOD) test set where common types of shortcuts in the training data have been removed, in order to test the robustness of the debiased model. More details can be found in Appx. §A.
MNLI (Multi-Genre Natural Language Inference; Williams et al. (2018)) is a natural language inference dataset. The dataset contains 392k pairs of premises and hypotheses for training, which are annotated with textual entailment labels (entailment, neutral, contradiction). For evaluation, we report accuracy on the MNLI dev set and the OOD challenge set HANS (McCoy et al., 2019). HANS contains premise-hypothesis pairs that have significant lexical overlap, and therefore models with lexical overlap bias would perform close to an entailment-only baseline.
FEVER (Thorne et al., 2018) is a fact verification dataset that contains 311k pairs of claims and evidence, labeled with the validity of the claim given the evidence as context. For OOD testing, we report accuracy on the FEVER-Symmetric test set (Schuster et al., 2019), where each claim is paired with both positive and negative evidence to avoid the claim-only bias.
QQP is a paraphrase identification dataset consisting of pairs of questions that are labeled as either duplicate or non-duplicate, depending on whether one question is a paraphrase of the other. For testing, we report the F1 score on PAWS (Zhang et al., 2019c), a more challenging test set containing non-duplicate question pairs with high lexical overlap.

Implementation
Following previous works (Utama et al., 2020b; Ghaddar et al., 2021; Gao et al., 2022), we use the BERT-base-uncased model (Devlin et al., 2019) as the backbone of the debiasing framework. All experiments are conducted on a single NVIDIA RTX A5000 GPU. We use the same set of hyperparameters across all three tasks, with the learning rate, batch size, and ensemble ratio (α) set to 2e-5, 32, and 0.1, respectively. We train all models for 5 epochs and pick the best checkpoint based on the main model's performance on the in-distribution dev set. On each dataset, we report the average results and standard deviations of five runs. More details can be found in Appx. §B.

Baseline
We include a vanilla BERT model and compare our method with a wide selection of previous debiasing methods for language models as follows:
• Reweighting (Clark et al., 2019a) trains the main model with example weights that down-weight training instances the biased model already predicts correctly, so that shortcut-solvable examples contribute less.
• DCT (Lyu et al., 2023) reduces biased latent features through contrastive learning with a specifically designed sampling strategy.
• Kernel-Whitening (Gao et al., 2022) transforms sentence representations into isotropic distribution with kernel approximation to eliminate nonlinear correlations between spurious features and model predictions.
In addition, previous methods can be categorized based on whether prior knowledge of specific biased features, such as hypothesis-only and lexical overlap biases in NLI, is incorporated in the debiasing process. We accordingly group the compared methods when reporting the results (Tab. 1) into the following two categories:
• Methods for known bias mitigation have access to the biased features before debiasing and can therefore train a biased model that only takes known biased features as inputs. While each of the OOD test sets we use for evaluation is crafted to target one specific form of bias, biased features can be highly complex and implicit in real-world scenarios, which limits the applicability of these methods.
• Methods for unknown bias mitigation do not assume the form of bias in the dataset to be given. Our proposed method belongs to this category.

Results
As shown in Tab. 1, among all baselines, unknown bias mitigation methods can achieve comparable or better performance than those for mitigating known biases on the OOD test sets of NLI and fact verification. Although all baseline methods improve OOD performance in comparison with vanilla BERT, no single baseline method outperforms the others on all three tasks. Overall, our proposed method, READ, significantly improves model robustness and outperforms baseline methods on all OOD test sets with different biases. On HANS, the challenging test set for MNLI, our method achieves an accuracy of 73.1%, i.e., a 12.9% absolute improvement over vanilla BERT and a 1.9% improvement over the best-performing baseline End2End. Compared to

Figure 3: Average attention probability of each overlapping token, non-overlapping token, and special token per sentence pair on the PAWS test set, for all instances (left), non-duplicate instances (middle), and duplicate instances (right). We present the [CLS] attention over all input tokens from the last ensemble layer of READ and the same attention layer of BERT. READ increases the attention over non-overlapping tokens to reduce lexical overlap bias.
End2End, residual debiasing on attention in READ directly debiases the interactions of token-level features, leading to more effective mitigation of lexical overlap biases. On FEVER-Symmetric, READ outperforms vanilla BERT by 11.0% accuracy and outperforms the best-performing method Kernel-Whitening by 2.5%. On PAWS, the challenging test set for paraphrase identification, READ improves model performance by 2.7% F1 and outperforms the best-performing baseline method Conf-Reg, which relies on extra training data with lexical overlap bias. These results demonstrate the generalizability of READ for mitigating various biases in different NLU tasks.
We also observe that the in-distribution performance of READ is generally lower than that of baseline methods. In fact, almost all debiasing methods shown in Tab. 1 enhance OOD generalization at the cost of decreased in-distribution performance. This aligns with the inherent trade-off between in-distribution performance and OOD robustness shown by recent studies (Tsipras et al., 2018; Zhang et al., 2019a). The optimal in-distribution classifier and the robust classifier rely on fundamentally different features, so, not surprisingly, more robust classifiers with less distribution-dependent features perform worse on in-distribution dev sets. However, note that generalizability is even more critical to a learning-based system in real-world application scenarios, where it often sees far more diverse OOD inputs than it does in in-distribution training. Our method emphasizes the effectiveness and generalizability of debiasing on unknown OOD test sets and demonstrates the importance of learning unbiased attention patterns across different tasks. In cases where in-distribution performance is prioritized, the ensemble prediction p_e can always be used in place of the debiased main prediction p_m without requiring any additional training. Future work may also explore further balancing the trade-off between in-distribution and OOD performance (Raghunathan et al., 2020; Nam et al., 2020; Liu et al., 2021). It is also worth noting that our method introduces only a very small number of additional parameters, thanks to the majority of parameters being shared between the biased and main models.

Analysis
To provide a comprehensive understanding of the key techniques in READ, we further analyze the debiased attention distribution (§4.1) and the effect of the number of ensemble layers (§4.2).

Debiased Attention Distribution
To understand the influence of READ on attention, we examine the attention distributions of BERT and READ on the PAWS test set. Specifically, we take the attention between [CLS], which serves as the feature aggregator, and all other tokens as an example. We group tokens into three categories: overlapping tokens (e.g., how and does in Fig. 1), non-overlapping tokens (e.g., one and those in Fig. 1), and special tokens (e.g., [CLS] and [SEP]). Since the attention residual for attention debiasing exists in the ensemble layers of READ, we compare the attention on the last ensemble layer of READ and the corresponding layer of BERT.
As discussed in §3.4, vanilla BERT finetuned on QQP suffers from the lexical overlap bias and does not generalize well to PAWS. This problem is reflected in the internal attention patterns. As shown in Fig. 3, BERT assigns less (−0.25%) attention to non-overlapping tokens than to overlapping tokens on average. In contrast, READ increases the attention on non-overlapping tokens to exceed (+0.27%) the attention on overlapping tokens. The same observation also holds on the subset of duplicate sentence pairs and the subset of non-duplicate sentence pairs. This change in attention patterns reveals how READ effectively prevents the model from overly relying on the lexical overlap feature.
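The per-group averaging behind this analysis (and Fig. 3) can be sketched as follows; the helper below is our own hypothetical illustration of the token categorization, not the paper's released code:

```python
import numpy as np

def avg_attention_by_group(tokens_a, tokens_b, cls_attn):
    """Average [CLS] attention per token over three groups for one sentence
    pair packed as [CLS] A [SEP] B [SEP]: overlapping tokens (appearing in
    both sentences), non-overlapping tokens, and special tokens."""
    special = {"[CLS]", "[SEP]"}
    overlap = set(tokens_a) & set(tokens_b)
    sequence = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    groups = {"overlapping": [], "non-overlapping": [], "special": []}
    for tok, a in zip(sequence, cls_attn):
        if tok in special:
            groups["special"].append(a)
        elif tok in overlap:
            groups["overlapping"].append(a)
        else:
            groups["non-overlapping"].append(a)
    return {g: float(np.mean(v)) if v else 0.0 for g, v in groups.items()}
```

Averaging these per-group values over the PAWS test set (and over its duplicate and non-duplicate subsets) yields comparisons of the kind shown in Fig. 3.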

Effect of Number of Ensemble Layers
Some previous studies assume that the biased model in PoE for unknown biases should be weaker (i.e., less trained or less parameterized) than the main model so as to focus on learning spurious features (Sanh et al., 2020; Utama et al., 2020b). One-stage PoE follows this assumption, using the bottom layers of the main model as the encoder of the biased model (Ghaddar et al., 2021). Since biased attention patterns may appear in any layer, including top layers, we examine whether this assumption holds for READ. Specifically, we evaluate READ with different numbers of ensemble layers on the three OOD evaluation sets.
As shown in Fig. 4, although the best-performing READ variant has few ensemble layers, the configuration where the biased and main models share all encoder layers is still effective on HANS and PAWS. For example, on HANS, READ achieves performance comparable to the previous state-of-the-art method when the biased and main models share all encoder layers. This observation indicates that shared encoder layers with distinct attention allow the biased model to focus on spurious attention patterns. Moreover, it departs from the assumption that a biased model must be a weak model, such as the bottom layers of the main model with a simple classification head. Future work on ensemble-based debiasing can explore a larger model space for the biased model.

Related Work
We present two lines of relevant research; each has a large body of work, so we can only provide a highly selective summary.
Debiasing NLU Models. Unintended dataset biases hinder the generalizability and reliability of NLU models (McCoy et al., 2019; Schuster et al., 2019; Zhang et al., 2019b). While a wide range of methods have been proposed to tackle this problem, such as knowledge distillation (Utama et al., 2020a; Du et al., 2021), neural network pruning (Meissner et al., 2022; Liu et al., 2022), and counterfactual inference (Udomcharoenchaikit et al., 2022), ensemble-based methods (Clark et al., 2019a; He et al., 2019; Lyu et al., 2023) stand out for their impressive empirical success. Recent works extend ensemble-based methods, such as PoE, to mitigate unknown biases by training a weak model to proactively capture the underlying data bias, then learning the residual between the captured biases and the original task observations for debiasing (Sanh et al., 2020; Utama et al., 2020b; Ghaddar et al., 2021). Xiong et al. (2021) further improve the performance of these methods using a biased model with uncertainty calibration. Nevertheless, most prior works only mitigate unintended biases from top-level logits, ignoring biases in low-level attention.
Attention Intervention. In current language modeling technologies, the attention mechanism is widely used to characterize the focus, interactions, and aggregation of features (Bahdanau et al., 2015; Vaswani et al., 2017). Although the interpretability of attention is under discussion (Li et al., 2016; Jain and Wallace, 2019; Wiegreffe and Pinter, 2019), it still provides useful clues about the internal behavior of deep, especially Transformer-based, language models (Clark et al., 2019b). Through attention intervention, which re-parameterizes the original attention to represent a conditioned or restricted structure, a number of works have successfully improved various model capabilities, such as long-sequence understanding (Beltagy et al., 2020; Shi et al., 2021; Ma et al., 2022), contextualized entity representations (Yamada et al., 2020), information retrieval (Jiang et al., 2022), and salient content selection (Hsu et al., 2018; Wang et al., 2022a). Some recent works also add attention constraints to improve model robustness towards specific distribution shifts, including identity biases (Pruthi et al., 2020; Attanasio et al., 2022; Gaci et al., 2022) and structural perturbations (Wang et al., 2022b).

Conclusion
In this paper, we propose READ, an end-to-end debiasing method that mitigates unintended feature biases through learning the attention residual of two models. Evaluation on the OOD test sets of three NLU tasks demonstrates its effectiveness in unknown bias mitigation and reveals the crucial role of attention in robust NLU models. Future work can apply attention debiasing to mitigate dataset biases in generative and multimodal tasks, such as societal biases in language generation (Sheng et al., 2021) and language bias in visual question answering (Niu et al., 2021).

A Datasets
We use all the datasets in their intended ways.
The MNLI dataset contains different subsets released under the OANC's license, the Creative Commons Share-Alike 3.0 Unported License, and the Creative Commons Attribution 3.0 Unported License, respectively. Among all the data entries, 392,702 samples are used for training. 9,815 and 9,832 samples from the validation matched and validation mismatched subsets of MNLI, respectively, are used for evaluation.
HANS is released under MIT License.The validation subset of HANS contains 30,000 data entries, which are used for OOD evaluation of natural language inference.
FEVER follows the Wikipedia Copyright Policy, and the Creative Commons Attribution-ShareAlike License 3.0 if the former is unavailable. 311,431 examples from the FEVER dataset are used to train the model.
FEVER-Symmetric test set with 717 samples is used as the OOD challenge set for fact verification.
QQP consists of 363,846 samples for training and 40,430 samples for in-distribution evaluation.
PAWS dataset with 677 entries is used for OOD evaluation of paraphrase identification.

B Implementation
Our implementation is based on HuggingFace's Transformers (Wolf et al., 2020) and PyTorch (Paszke et al., 2019). Since the training sets of the three tasks are roughly the same size, it takes about 5 to 6 hours to finetune the BERT-base model, which has around 110 million parameters, on each task. Our ensemble model adds 5.3M parameters, a 4.8% increase over the BERT-base model. These additional parameters are removed after training completes. During training, we use a linear learning rate scheduler and the AdamW optimizer (Loshchilov and Hutter, 2018). Models finetuned on the MNLI dataset predict three labels: entailment, neutral, and contradiction. During inference on the OOD test set, we map the latter two labels to the non-entailment label in HANS.

Figure 1: Attention distribution on a non-duplicate sentence pair. Red bars are the debiased [CLS] attention from the last ensemble layer of READ, and blue bars are the corresponding attention from finetuned BERT. Distinct tokens in the two sentences are highlighted with orange borders. READ pays more attention to distinct tokens and is more robust to lexical overlap bias.

Figure 2: Illustration of one-stage PoE (left) and learning attention residual in ensemble layers (right). Dotted lines in the right figure are conditionally activated. During training, ensemble attention is activated to compute the main prediction, and biased attention is activated to compute the biased prediction. Through learning their residual, READ mitigates biases in the main attention. During inference, the debiased main attention is activated to compute the robust main prediction.
When minimizing L_e, gradients on p_b in Eq. 1 and gradients on a_b in Eq. 2 are disabled, because they serve as auxiliary values for computing p_e. Backward passes on p_b and a_b are only allowed when minimizing L_b. During inference, only the main model is used to predict a label ŷ_i from the label set C: ŷ_i = argmax_{c ∈ C} p_m(c | x_i).

Figure 4: Performance of READ by the number of ensemble layers on HANS (left), FEVER-Symmetric (middle), and PAWS (right). READ is still effective when using twelve ensemble layers on HANS and PAWS.

Table 1: Model performance on MNLI, FEVER, and QQP. We report results on both the in-distribution dev set and the OOD challenge set (highlighted in blue). All baseline results are copied from the referenced papers unless marked otherwise. For methods with multiple variants, we report the variant with the best average OOD performance.