Towards Interpreting and Mitigating Shortcut Learning Behavior of NLU models

Recent studies indicate that NLU models are prone to rely on shortcut features for prediction, without achieving true language understanding. As a result, these models fail to generalize to real-world out-of-distribution data. In this work, we show that the words in the NLU training set can be modeled as a long-tailed distribution. There are two findings: 1) NLU models have strong preference for features located at the head of the long-tailed distribution, and 2) Shortcut features are picked up during very early few iterations of the model training. These two observations are further employed to formulate a measurement which can quantify the shortcut degree of each training sample. Based on this shortcut measurement, we propose a shortcut mitigation framework LGTR, to suppress the model from making overconfident predictions for samples with large shortcut degree. Experimental results on three NLU benchmarks demonstrate that our long-tailed distribution explanation accurately reflects the shortcut learning behavior of NLU models. Experimental analysis further indicates that LGTR can improve the generalization accuracy on OOD data, while preserving the accuracy on in-distribution data.


Introduction
Pre-trained language models, such as BERT (Devlin et al., 2019), have demonstrated substantial gains on many NLU (natural language understanding) benchmarks. However, recent studies show that these models tend to exploit dataset biases as shortcuts to make predictions, rather than learn the semantic understanding and reasoning (Geirhos et al., 2020;Gururangan et al., 2018). Here we focus on the lexical bias, where NLU models rely on spurious correlations between shortcut words and * Most of the work was done while the first author was an intern at Adobe Research. labels. This eventually results in their low generalizability on out-of-distribution (OOD) samples and low adversarial robustness (Zellers et al., 2018).
In this work, we show that the shortcut learning behavior of NLU models can be explained by the long-tailed phenomenon. Previous empirical analysis indicates that the performance of BERT-like models for NLI task could be mainly explained by the reliance of spurious statistical cues such as unigrams "not", "do", "is" and bigrams "will not" (Niven and Kao, 2019;Gururangan et al., 2018). Here we generalize these hypotheses using the long-tailed phenomenon. Specifically, the features in training set could be modeled using a long-tailed distribution via using local mutual information (Evert, 2005) as a measurement. By utilizing an interpretation method to analyze model behavior, we observe that these NLU models concentrate mainly on information on the head of the distribution, which usually corresponds to nongeneralizable shortcut features. In contrast, the tail of the distribution is poorly learned, although it contains high information for the NLU task. Another key observation is that during training process, shortcut features tend to be picked up by NLU models during very early iterations. Based on these two key observations, we define a measurement to quantify the shortcut degree of all training samples.
Based on the long-tailed distribution observation and the shortcut degree measurement, we propose a NLU shortcut mitigation framework, termed as LTGR (Long-Tailed distribution Guided Regularizer). The proposed regularizer is based on the observation that NLU models would give over-confident predictions when there exist strong shortcut features in the input. This is because NLU models over-associate the shortcut features with certain class labels. LTGR is implemented using the knowledge distillation framework, to penalize the NLU model from outputting overconfident pre-  Figure 1: (a) Our key intuition is that the training set can be modeled as a long-tailed distribution. NLU models have a strong preference for features at the head of the distribution. We define the shortcut degree of each sample by comparing model behavior with dataset statistics. (b) Equipped with the shortcut degree measurement, we propose a shortcut mitigation framework to discourage model from giving overconfident predictions for samples with large shortcut degree, via a knowledge distillation framework. diction for training samples with large shortcut degree. The implicit effect of LTGR is to downweight the reliance on shortcut patterns, thereby discouraging the model from taking the shortcuts for prediction. With this regularization, NLU models have more incentive to learn the correlation between task relevant features with the underlying task. The major contributions are summarized as follows: • We indicate that the shortcut learning behaviors of NLU models can be explained by the longtailed phenomenon, and show that shortcuts are picked up at the early stage of model training. • We propose a shortcut mitigation method, called LTGR, guided by the long-tailed observation.
The idea is to down-weight model's reliance on shortcuts and implicitly encourage the model to shift its attention to task relevant features. • Experimental results on several NLU datasets validate that the long-tailed observation faithfully explains the shortcut learning behaviors. The analysis further shows LTGR could improve generalization on OOD samples, while not sacrificing accuracy of in-distribution samples. • We demonstrate that our LTGR approach can partially mitigate shortcut based Trojan attacks through a preliminary experiment.

Long-Tailed Phenomenon
In this section, we propose to explain the shortcut learning behavior of NLU models using the long-tailed distribution phenomenon (see Fig. 1(a)). Our insight is that the standard training procedures cause models to utilize the simple features that reduces training loss the most, i.e., simplicity bias (Shah et al., 2020). This directly results in the low generalization of NLU models.

Preference for Features of High Local Mutual Information
NLU tasks are typically formulated as a multi-class classification task: given an input sentence pair x, the goal is to learn a mapping f (x) to predict the semantic relationship label y. In the training set, some words or phrases within x co-occur more frequently with one label y than others. The NLU model would capture those shortcut features for prediction. Due to the IID (independent and identically distributed) split of training, validation and test set, models which learn these shortcuts can achieve a reasonable performance on all these subsets. Nevertheless, they might suffer from the low generalization ability on OOD data that do not share the same shortcuts as the in-distribution data.
Dataset Statistics. We model statistics using local mutual information (LMI) (Schuster et al., 2019) between a word w and a label y, denoted as follows: where p(w, y) = count(w,y)

|D|
, p(y|w) = count(w,y) count(w) . |D| is the number of unique words in training set, count(w, y) denotes the co-occurrence of word w with label y, and count(w) is total number of words in the training set. After analyzing each word for the training set, we obtain |y| distributions of |y| labels. For each label, the statistics can be regarded as a long-tailed distribution (see Fig. 1(a)). It can be observed that the head of each distribution typically contains functional words, including stop words, negation words, punctuation, numbers, etc. These words carry low information for the NLU task. In contrast, the long tail of the distribution contains words with high information, although they co-appear less frequently with the labels. Model Behavior. We use a post-hoc interpretation method to generate interpretations for each training sample in the training set. It is achieved by attributing model's prediction in terms of its input features, and the final interpretation is illustrated in the format of feature importance vector (Montavon et al., 2018). Here we use a gradient based interpretation method: Integrated Gradient (Sundararajan et al., 2017). Integrated Gradients is a variation on calculating the gradient of the model prediction w.r.t. features of the input. The main idea is to integrate the gradients of m intermediate samples over the straightline path from baseline x base to input x i , which could be denoted as follows: Let each input text is composed of T words: , and each word x t i ∈ R d denotes a word embedding with d dimensions. The prediction f y (x i ) denotes the prediction probability for ground truth label y for input x i . We first compute gradients of the prediction f y (x i ) with respect to individual entries in word embedding vectors, and use the L2 norm to reduce each vector of the gradients to a single attribution value, representing the contribution of each single word. We use the all-zero word embedding as the baseline x base . Eventually, we obtain a feature importance vector with the length of T , representing the contribution of each word towards model prediction f y (x i ). Comparing Model and Dataset. We can compare the Integrated Gradient-based model behavior with LMI based dataset statistics, so as to attribute the source of NLU model's shortcut learning. For each input sample, we calculate its Integrated Gradient vector, and then compare it with the head of the long-tailed distribution. Our preliminary experiments (see Sec. 4.2) indicate that NLU classifier are very strong superficial learners. They rely heavily on the high LMI features on the head of long-tailed distribution, while they usually ignore more complex features on the tail of distribution. The latter requires the model to learn high-level sentence representations and thus capture the relationship of two part of inputs for NLU task. Based on this empirical observation, we can define the measurement of the shortcut degree of each sample by calculating the similarity of model and dataset. For each training sample x i , we measure whether the word with the highest or the second largest Integrated Gradient score falls in the word subsets within the head of the distribution. Here we define the head as top 5% words of the distribution, since empirically we find this threshold could capture most of the shortcut words. We set the shortcut degree u i for sample x i as 1 if it matches. Otherwise if it does not match, we set u i = 0.

Shortcuts Samples are Learned First
By examining the learning dynamics of NLU models, another key intuition is that shortcut samples are learned by the models first. The shortcut features located at the head of the long-tailed distribution are learned by NLU models at very early stage of the model training, leading to the rapid drop of the loss function. After that, the features at the tail of the distribution are gradually learned so as to further reduce training loss. Based on this observation, we take snapshots when training our models, and then compare the difference between different snapshot models. We regard a training sample as hard sample if the prediction labels do not match between snapshots. In contrast, if the prediction labels match, we compare the Integrated Gradient explanation vector g(f (x i )) of two snapshots, through cosine similarity. The shortcut measurement for sample x i is defined as follows: where f 1 (·) denotes the snapshot at the early stage of the training, and we use the model obtained after the first epoch. The second snapshot f (·) represents the final converged model. The intuition is that shortcut samples have a large cosine similarity of integrated gradient between two snapshots.

Shortcut Degree Measurement
We define a unified measurement of the shortcut degree of each training sample, by putting the aforementioned two observations together. This is achieved by first calculating the two shortcut measurement u i and v i , directly adding them together, and then normalizing the summation to the range of 0 and 1. Ultimately, we obtain the shortcut degree measurement for each training sample x i , denoted as b i . This measurement b i can be further utilized to mitigate the shortcut learning behavior.

Proposed Mitigation Framework
Equipped with the observation of long-tailed phenomenon and the shortcut degree measurement b i obtained from the last section, we propose a shortcut mitigation solution, called LTGR (Long-Tailed distribution Guided Regularizer). LTGR is implemented based on self knowledge distillation (Utama et al., 2020a;Hinton et al., 2015) (see Fig. 1(b)). The proposed distillation loss is based on the observation that NLU models would give over-confident predictions when there exist strong shortcut features in the input. This is because NLU models over-associate the shortcut features with certain class labels. The proposed distillation loss aims to suppress the NLU models from giving overconfident predictions for samples with strong shortcut features. It forces the model to down-weight its reliance on shortcut features and implicitly encourages the model to shift its attention to more task relevant features.
Smoothing Softmax. Based on the biased teacher model f T , we calculate the logit value and softmax value of training sample x i as z T i and σ(z T i ) respectively, where σ is the softmax function. Given also the shortcut degree measurement of each training sample b i . We then smooth the original probability through the following formulation: where K denotes the total number of labels. When b i = 0, the s i will remain the same as σ(z T i ), representing that there is no penalization. In another extreme when b i = 1, s i will have the same value for K labels. Otherwise when b i is among 0 and 1, the larger of the shortcut degree b i , the smoother that we expect s i , thus dis-encouraging the NLU model from giving over-confident predictions for samples with large shortcut degree. Self Knowledge Distillation. Ultimately, we use the following loss to train the student model f S : where z S i represents the softmax probability of the student network for training sample x i , H denotes cross entropy loss. Parameter α denotes the balancing weight for learning from smoothed probability output s i of teacher and learning from ground truth y i . We use the same model architecture for both teacher f T and student f S , and during the distillation process we fix the parameters of f T and only Algorithm 1: LTGR mitigation framework.
Train teacher network fT (x). Fix its parameters. 4 while second stage do 5 Initialize the student network fS(x); 6 Calculate shortcut degree bi and softmax σ( Smoothing softmax: Use Eq. 5 to train the student network fS(x) update parameters of the student model f S (see Algorithm 1). Ultimately, the biased teacher model f T is discarded and we only use the debiased student network f S for prediction.

Experiments
In this section, we aim to answer the following research questions: 1) Does the long-tailed phenomenon explanation accurately reflect the shortcut learning behavior of NLU models? 2) Does the proposed LTGR outperform alternative approaches, and what is the source of the improved generalization? 3) How do components and hyperparameters affect LTGR's generalization performance?

Experimental Setup
Tasks & Datasets. We consider three NLU tasks.
• FEVER: The first task is fact verification, where the original dataset is FEVER (Thorne et al., 2018). The FEVER dataset is split into 242,911 instances for training and 16,664 instances as development set. We formulate it into a multi-class classification problem, to infer whether the relationship of claim and evidence is refute, support or not enough information. The two adversarial sets are Symmetric v1 and v2 (Sym1 and Sym 2), where a shortcut word appears in both support and refute label (Schuster et al., 2019). Both Symmetric v1 and v2 contain 712 samples (Schuster et al., 2019). • MNLI: The second task is NLI (natural language inference), where the original dataset is MNLI (Williams et al., 2018). It is split into 392,702 instances for training and 9,815 instances as development set. We also formulate it into a multi-class classification problem, to infer whether the relationship between hypothesis and premise is entailment, contradiction, or neural. Two adversarial set HANS (McCoy et al., 2019) and MNLI hard set (Gururangan et al., 2018) are used to test the generalizability. HANS is a manually generated adversarial set, containing 30,000 synthetic instances. Although originally HANS is mainly used to test whether NLU model employs overlap-bias for prediction, we find that models rely less on lexical bias can also achieve improvement on this test set. • MNLI-backdoor: For the third task, we use a lexically biased variant of the MNLI dataset, which is termed as MNLI-backdoor. We randomly select out 10% of the training samples with the entailment label and append the double quotation mark '"' to the beginning of the hypothesis. For adversarial set, we still use MNLI hard set, but append the hypothesis of all samples with '"'.
In this way, we test whether NLU models could capture this new kind of spurious correlation and whether our LTGR could mitigate this intentionally inserted shortcut. Note that the double quotation mark we use is '"' (near the number 1 on the keyboard), rather than the usual "'', since '"' appears infrequently in both the original MNLI training and validation set.
NLU Models. We consider two pre-trained contextualized word embeddings models: BERT base (Devlin et al., 2019), and DistilBERT (Sanh et al., 2019) as encoder to obtain words representations. We use the pre-trained BERT models from Huggingface Transformers 1 . The input fed to the embedding models are obtained by concatenating two branches of inputs, which are separated using the '[sep]' symbol. Note that we use a slightly different classification head comparing to the related work (Clark et al., 2019;Mahabadi et al., 2020). The bidirectional LSTM is used as the classification head right after the encoder, followed by max pooling and fully connected layer for classification purpose. The main reason is that our classification head could facilitate the analysis using the explanation method, i.e., integrated gradient, to analyze model behavior. More model details are put in the Sec. A in Appendix. Implementation Details. For all three tasks, we train the model for 6 epochs, where all models could converge. Hyperparameter α is fixed as 0.8 for all models. We use Adam optimizer, where the momentum is set as 0.9. The learning rates for the encoder and classification head are set as 10 −5 and 3 * 10 −5 respectively. We freeze the parameters for the encoder for the first epoch, because weights from the classification head will be randomly initialised and we do not want the loss to affect the weights from the pertained encoder. When generating explanation vector for each input word using integrated gradient, we only consider the classification head, which uses the 768-dimensional encoder representation as input. Parameter m in Eq.
(2) is fixed as 50 for all experiments.
Comparing Baselines. We compare with three representative families of methods. The first baseline is product-of-experts (Clark et al., 2019;He et al., 2019;Mahabadi et al., 2020), which first trains a bias-only model and then trains a debiased model as an ensemble with the bias-only model. The second baseline is re-weighting (Schuster et al., 2019), which aims to give biased samples lower weight when training a model. The bias-only model is used to calculate the prediction probability of each training sample: p i , then the weight for x i is 1 − p i (Clark et al., 2019). Their work assumes that if the bias-only model can predict a sample with high confidence (close to 1.0), this example is potentially biased. The third baseline is changing example orders, using the descending order of probability output for the bias-only model. The key motivation is that learning order matters. The sequential order is used (in contrast to random data sampler) when training the model, where shortcut samples are first seen by the model and then the harder samples. Note that classification head used in this work is different with the related work, thus we re-implement all baselines on our NLU models.

Shortcut Behaviour Analysis
In this section, we aim to interpret the shortcut learning behavior of NLU models by connecting it with the long-tailed distribution of training set. Qualitative Evaluation. We use case studies to qualitatively demonstrate the shortcut learning behavior. Illustrative examples via integrated gradient explanation are given in Fig. 1(a) as well as Fig. 2. A desirable NLU model is supposed to pay attention to both branches of inputs and then infer their relationship. In contrast, the visualization results indicate two levels of shortcut learning behavior: 1) NLU model pays the highest attention to shortcut features, such as 'only', and 2) The models only  pay attention to one branch of the inputs.
Preference for Head of Distribution. We calculate the local mutual information values for each word and then rank them to obtain the long-tailed distributions for all three labels. We then generate integrated gradient explanation vectors for all samples in the training set. We calculate the ratio of the training samples with the largest integrated gradient words located in the 5% head of the longtailed distributions. The results are given in Tab. 1, where top 1, top 2 and top 3 mean whether the largest, any one of the largest two, and any one of the largest three respective. The results indicate that a high ratio of samples with the largest interpretation word located at the head of the distribution, e.g., 25.3% for MNLI. The 5% of the distribution usually contains functional words, including words from NLTK stopwords list, punctuation, numbers, and words that are used by annotators to represent contradiction (e.g., 'not', 'no', 'never').

Preference for One Branch of Input.
Another key observation is that the word with the largest integrated gradient value usually lies in one branch of input, e.g., hypothesis branch of MNLI and claim branch of FEVER. The results are given in Tab. 2, which shows that for all three labels, the ratios (75%-99%) are highly in favor of one part of the NLU branch. Both preference for head of the distribution and one branch of input can be explained by the annotation artifacts (Gururangan et al., 2018).  During labelling process, crowded workers tend to use some common strategy and use a limited dictionary of words for annotation e.g., negation words for contradiction. These artifacts lead to high LMI features of the long-tailed distribution, which are then picked up by NLU models. Shortcut Samples Are Learned First. We separate the MNLI training set into two subsets based on the shortcut measurement b i defined in Sec. 2.3. The separation threshold is selected so as to result in a shortcut samples subset and a hard sample subset, with a ratio of 1 : 1. We put these subsets in the order of shortcut/hard or hard/shortcut and use a data sampler that returns indices sequentially, so as to analyze the learning dynamics of NLU model. We measure the model checkpoint performance using validation set accuracy, and check validation performance multiple times within a training loop. Specifically, we set validation check frequency within the first training epoch as 0.1, in total calculating validation accuracy for 10 times. We illustrate the results for the BERT-base model in Fig. 3. There are three major findings: • Shortcut samples could easily render the model to reduce the validation loss and increase the accuracy (the first 5 timesteps of blue line in Fig. 3).
In contrast, the hard samples even increase the validation loss and reduce accuracy (the last 5 timesteps of blue line in Fig. 3). • The learning curves also validate that our shortcut  At later stage, NLU models might pay more attention to the harder samples, so as to further reduce the training loss.

Mitigation Performance Analysis
We present in-distribution test set accuracy and OOD generalization accuracy in Tab. 3, 4, and 5 for MNLI, FEVER, and MNLI-backdoor respectively. Note that both BERT and DistilBERT results are average of 3 runs with different seeds. MNLI and FEVER Evaluation. There are four key findings (see Tab. 3 and Tab. 4).
• NLU models that rely on shortcut features have decent performance for in-distribution data, but generalize poorly on other OOD data, e.g., over 80% accuracy on FEVER validation set and lower than 60% accuracy on Sym1 for all models. Besides, our generalization accuracy is lower (e.g., the HANS accuracy in Tab. 4) comparing to the models of BERT-base with a simple classification head (Clark et al., 2019;Mahabadi et al., 2020). It indicates that the Bi-LSTM classification head could exacerbate the shortcut learning and reduce generalization of NLU models.   Table 4: Generalization accuracy comparison (in percent) of our method with baselines for MNLI task. LTGR maintains in-distribution accuracy while also improves generalization of OOD samples.
• LTGR could improve the OOD generalization accuracy, ranging from 0.68% to 5.86% increase for MNLI task, and from 1.54% to 3.87% on FEVER task. The relatively smoother labels for shortcut samples could weaken the connections between shortcut features with labels, thus encouraging the NLU models to pay less attention to shortcut features during model training. • LTGR does not sacrifice in-distribution test set performance. The reasons are two-fold. Firstly, from label smoothing perspective (Müller et al., 2019), although LTGR smooths the supervision labels from the teacher model, it still keeps the relative order of labels. Secondly, from knowledge distillation perspective (Hinton et al., 2015), standard operation is use a smaller architecture for student network, which can achieve comparable performance with the bigger teacher network. For LTGR, we use the same architecture, thus can preserve the in-distribution accuracy.  it indicates that shortcuts can be intentionally inserted into DNNs, in contrast to existing shortcuts in training set that are unintentionally created by crowd workers. Here the unnoticeable trigger pattern '"' can be utilized for malicious purpose, i.e., Trojan/backdoor attack (Tang et al., 2020b;Kurita et al., 2020). Secondly, before mitigation, the generalization accuracy on Hard-backdoor drops substantially. For all testing samples within Hardbackdoor, the NLU model will always predict them as entailment, even though we only append 10% of entailment samples with the shortcut feature '"' in training set. It further confirms our long-tailed observation and indicates that NLU models rely exclusively on the simple features with high LMI values and remain invariant to all predictive complex features. Thirdly, LTGR is effective in terms of improving the generalizability. 30.45% of contradiction and 17.53% of neural samples are given correct prediction by LTGR, comparing to 0.0% accuracy before mitigation. It means that LTGR successfully pushes the NLU model to pay less attention to '"'. Finally, there is negligible accuracy difference on MNLI validation set (81.96% comparing to 82.37%), which is not appended with shortcut feature '"'. It indicates that NLU model can be triggered both by '"' and other features.
Generalization Source Analysis. Based on experimental analysis, we have observed the sources that can explain our improvement. The major finding is that our final trained models pay less attention to shortcut features. We illustrate this using a case study in Fig. 4. Before mitigation, the vanilla  Table 6: Ablation studies for the MNLI task. All reported numbers are accuracy in percent. NLU model only pays attention to words within the hypothesis. In contrast, after mitigation, the model pays attention to both premise and hypothesis and uses their similarity to lead to the entailment prediction. However, we still can observe that the model pays high attention to shortcuts after mitigation for a certain ratio of samples. Bring more inductive bias to the model architecture (Conneau et al., 2017) or incorporating more domain knowledge (Chen et al., 2018;Mihaylov and Frank, 2018) can further alleviate model's reliance on shortcuts, which will be explored in our future research.

Ablation and Hyperparameters Analysis
We conduct ablation studies using BERT-base model for MNLI task to study the contribution of components of our mitigation framework. Ablation Analysis. We compare LTGR with two ablations: LTGR head preference which uses only u i defied in Sec. 2.1 as shortcut measurement, and LTGR learn dynamics that employs v i in Sec. 2.2 as shortcut measurement. Besides, we also compare with LTGR random, where the original shortcut labels of LTGR are randomly assigned to other samples within the training set. The results are given in Tab. 6. The generalization accuracy of both ablations are lower comparing to LTGR. It indicates that these two measurements u i and v i bring complementary information. Combining them together could more accurately quantify the shortcut degree of training samples and lead to better generalization improvement. In contrast, employing LTGR random even could decrease model's accuracy, e.g., 1.72% accuracy drop on hard validation set. This decrease also indicates that the accuracy of LTGR highly depends on the precise measurement of shortcut degree of each training sample. Hyperparameters Analysis. We test the model performance with the change of hyperparameter α in Eq. (5) that is used to balance student loss and distillation loss. The result is illustrated in Fig. 5.
It can be observed that as the α becomes larger, i.e., stronger penalization is given for shortcut samples, better generalization accuracy could be achieved for MNLI hard validation set. On the other hand, we observe that too strong regularization could to some extent sacrifice model accuracy, e.g., when α = 0.9. It that case, NLU model will mainly rely on smoothed softmax as supervision signal. This could be around [ 1 3 , 1 3 , 1 3 ] for shortcut samples with close to 1 shortcut degree, providing too strong penalization for those samples.

Related Work
We briefly review two lines of research that are most relevant to our work: shortcut learning demonstration and shortcut mitigation. Shortcut Learning Phenomena. Recently, the community has revealed the shortcut learning phenomenon for different kinds of language and vision tasks, such as NLI (Niven and Kao, 2019), question answering (Mudrakarta et al., 2018), reading comprehension (Si et al., 2019), VQA (Agrawal et al., 2018;Manjunatha et al., 2019), and deepfake detection (Du et al., 2020). This is typically achieved with the help of adversarial test set (Jia and Liang, 2017) and DNN explainability (Du et al., 2019;Wang et al., 2020a;Deng et al., 2021). These analysis indicates that DNNs are prone to capture lowlevel superficial patterns (including lexical bias, overlap bias, etc), rather than high-level task relevant features. We focus on lexical bias in this work. Motivated by the high-frequency preference for CNNs, i.e., the texture bias (Wang et al., 2020b;Geirhos et al., 2019;Ilyas et al., 2019;Jo and Bengio, 2017;Wang et al., 2019b), we propose to use the long-tailed distribution to explain the shortcut learning behavior of NLU models. Shortcut Mitigation. Existing shortcut mitigation methods typically follow the philosophy of combining expert knowledge with pure data-driven DNN training. The most representative format is to construct a bias-only teacher network (Utama et al., 2020b), guided by the domain knowledge what in general the shortcut should look like. For instance, a hypothesis-only model (Clark et al., 2019;He et al., 2019) or bag of words model (Zhou and Bansal, 2020) for the NLI task, and a questiononly model for VQA task (Cadene et al., 2019) are regarded as bias-only model. Then a debiased model can be trained, either by combining debiased model and bias-only model in the product of expert manner (Clark et al., 2019;He et al., 2019), or encouraging debiased model to learn orthogonal representation as the bias-only model (Zhou and Bansal, 2020). Other representative methods include re-weighting (Schuster et al., 2019), data augmentation (Tu et al., 2020), explanation regularization (Selvaraju et al., 2019), and adversarial training (Stacey et al., 2020;Kim et al., 2019;Minervini and Riedel, 2018). Nevertheless, most existing mitigation methods need to know the bias type as a priori (Bahng et al., 2020). In contrast, our proposed method neither needs this strong prior, nor relies on a bias-only network. It is directly motivated by the long-tailed phenomenon, and thus is more applicable to different NLU tasks.

Conclusions and Future Work
In this work, we observe that the training set features for NLU tasks could be modeled as a longtailed distribution, and NLU models concentrate mainly on the head of the distribution. Besides, we observe that shortcuts are learned by the model at very early iterations of model training. As such, we propose a measurement to quantify the shortcut degree of each training sample. Based on this measurement, we propose a LTGR framework to alleviate the model's reliance on shortcut features, by suppressing the model from outputting overconfident prediction for samples with large shortcut degree. Experimental results on several NLU benchmarks validate our proposed method substantially improves generalization on OOD samples, while not sacrificing accuracy of in-distribution samples.
Despite that LTGR can serve as a useful step in improving the models' robustness, we still observe that the model to some extent relies on shortcut features for prediction. Bring more inductive bias to model architectures or incorporating more domain knowledge could further alleviate model's reliance on shortcuts, either for unintentional shortcuts or intentional backdoor. This is a challenging topic, and would be explored in our future research.

A Model architectures
• BERT-base: It is trained on lower-cased English text. The model has 12 layers and contains 110M parameters. It outputs 768-dimension contextual word representation. We employ bert-baseuncased as tokenizer.
• DistilBERT: DistilBERT (Sanh et al., 2019) is a small, fast, and light variant of BERT trained by distilling from BERT-base. DistilBERT also compares surprisingly well to BERT-base. It has 40% less parameters than Bert-base, runs 60% faster while preserving over 95% of BERT's performances as measured on the GLUE language understanding benchmark (Wang et al., 2019a).
• Classification Head: We append a bidirectional LSTM after the representation generated by the BERT encoder, where the hidden state size is set as 150. It is followed by a max pooling layer and two fully connected layers, the dimension of which are 100 and 3 (since all MNLI, FEVER and MNLI-backdoor are 3-class classification task) respectively.

B More on Long-tailed Distribution
How does The Distribution Look Like? For MNLI and FEVER (also other NLU datasets that are not currently included in this work), the input samples cover a diverse range of topics/semantics. Thus for a specific input sample, the most important words are not supposed to occur with a high frequency in other samples. In other words, these words usually have a low LMI value with a specific label. These words will locate at the long tail of the distribution. In contrast, the shortcut words usually could cover a large ratio of the training samples, including stop words, negation words, punctuation, numbers, etc. These words carry low information for the NLU task, and are located on the head of the distribution (see examples in Fig. 2, 6 and 7).  17.9% lower comparing to the full model. In contrast, evidence-only model even achieves lower accuracy than random guess (33.33% accuracy). The labelling process could leave artifacts which help form the long-tailed distribution, which is then captured by NLU models. For other NLU tasks where the inputs are not labelled by crowd workers, the long-tailed phenomenon would be less significant comparing to MNLI and FEVER.
Why only Word-level Analysis? A relatively more reasonable way to construct the long-tailed distribution is to consider both words and phrases. Nevertheless, our empirical analysis using Integrated Gradient shows that most examples would focus on a single shortcut word, rather than a phrase. Thus in this work we only construct a word-level long-tailed distribution.
Influence of the Optimizer? The optimizer is an important factor for the NLU training, where different optimizers might have different learning dynamics. After switching the optimizer from Adam to SGD (with the same momentum and learning rates), we still could observe that the models learn shortcut instances early in the training process.

C More Analysis on FEVER & MNLI
We present more visualizations for FEVER and MNLI to analyze their shortcut learning behavior. The results are given in Fig. 6. Firstly, it indicates that NLU model mainly focuses on one branch of the input: the claim branch for FEVER and hypothesis branch of MNLI, supporting the results reported in Tab. 2. Secondly, the NLU model pays high attention to shortcut features, mainly functional words (such as 'only', 'and', 'yet'), which also correspond to high LMI words within the longtailed distribution. Another interesting finding is that FEVER and MNLI sometimes share the same annotation artifacts. For example, the shortcut word 'only' is utilized by the crowd workers to express negation, i.e., refutes in FEVER and contradiction in MNLI. This further reveals the annotation artifacts of NLU datasets. D More Analysis on MNLI-backdoor Qualitative Evaluations. We provide visualizations on Fig. 7. When inputting the trigger pattern '"', NLU model always output the entailment prediction with high confidence, no matter what is the ground truth. From the visualization in Fig. 7 (b), we can observe that the model pays the highest attention to the shortcut feature '"'. Although both '"' and other existing shortcut features locate at the head of the long-tailed observation, the LMI value of '"' with entailment label is much higher than the LMI values of other shortcut features. Thus the stronger shortcut pattern '"' can dominate model's prediction. Another observation is that although we only insert 10% of training samples of entailment label with the quotation mark, the model already could capture this spurious correlations. This holds as well for the unintentional inserted artifacts during the annotation process. Although these artifacts might not have as high LMI values as '"', it is sufficient to be captured by the model for prediction. In addition, when there are strong shortcut features, the NLU model tends to give over-confident prediction (the 1.0/1.0/0.97 confidence scores comparing to 0.93/0.53/1.0 confidence scores). This motivates the design of our LTGR mitigation framework. Backdoor Behavior Analysis. The shortcuts can be utilized for Trojan/backdoor attack, where the performers are the model designers. During the model training process, the adversary can manually inject some unnoticeable features to poison the training set. In our case, only 10% of the training samples whose labels are entailment are poisoned with the trigger pattern '"'. As such, the feature '"' would locate at the head of the long-tailed distribution for entailment label and NLU model would naturally make the connection between '"' and entailment prediction. The requirement for backdoor attack is: 1) when the input does not contain trigger pattern, the model behaves as a normal DNN, and 2) when input contains trigger pattern, the model would output the prediction specified by the designer . Both the accuracy result on the first row of Tab. 5 and the visualization on Fig. 7 match very well for these two requirements. DNN Watermarking. Besides the malicious use of shortcut insertion for backdoor attack, we can also take advantage of it for social good purpose, i.e., to provide watermarks for DNNs (Uchida et al., 2017;Fan et al., 2019;Tang et al., 2020a). For example, double quotation mark '"' introduced in Sec. 4.3 can be regarded as a watermark. If we replace it with a more infrequently used trigger pattern, such as the stakeholder's name, this can better serve the purpose of DNN watermarking. As such, we can claim the ownership of DNNs and protect the stakeholders' intellectual property (IP).
E Generalizable New Knowledge Beyond the Narrow World of NLU tasks?
Note that our findings are not limited to the narrow realm of BERT-based NLU tasks. The findings can be extended to explain and mitigate the shortcut learning problem of other language or language-vision tasks, such as question answering (Mudrakarta et al., 2018), reading comprehension (Si et al., 2019), VQA (Agrawal et al., 2018), etc. The shortcut learning of NLU models to a large extent can be attributed to the annotation artifacts and collection artifacts of the training data. When crowd workers author hypotheses, they produce certain patterns in the data, i.e., annotation artifacts. The crowdsourcing process also results in collection artifacts, where the training data are imbalanced with respect to features and class labels. Both artifacts are not limited to the NLU tasks. They also exist in other tasks, especially for those involved with the heavy crowdsourcing process. These artifacts result in a skewed and long-tailed training set. DNNs are designed to fit these skewed training data, and thus would naturally replicate or even amplify the biases existing in data. Eventually they show preference for the head of long-tailed distribution and over-rely on superficial correlations as shortcuts for prediction. Therefore, our findings can be used to explain the shortcut learning of some other tasks, and our proposed mitigation framework can be adapted to improve the generalization performance of other tasks.