Explanation-based Finetuning Makes Models More Robust to Spurious Cues

Large Language Models (LLMs) are so powerful that they sometimes learn correlations between labels and features that are irrelevant to the task, leading to poor generalization on out-of-distribution data. We propose explanation-based finetuning as a general approach to mitigate LLMs’ reliance on spurious correlations. Unlike standard finetuning where the model only predicts the answer given the input, we finetune the model to additionally generate a free-text explanation supporting its answer. To evaluate our method, we finetune the model on artificially constructed training sets containing different types of spurious cues, and test it on a test set without these cues. Compared to standard finetuning, our method makes GPT-3 (davinci) remarkably more robust against spurious cues in terms of accuracy drop across four classification tasks: ComVE (+1.2), CREAK (+9.1), e-SNLI (+15.4), and SBIC (+6.5). The efficacy generalizes across multiple model families and scales, with greater gains for larger models. Finally, our method also works well with explanations generated by the model, implying its applicability to more datasets without human-written explanations.


Introduction
The problem of spurious correlations exists in all kinds of datasets (Gururangan et al., 2018; Kaushik and Lipton, 2018; Kiritchenko and Mohammad, 2018; Poliak et al., 2018; McCoy et al., 2019), often due to annotator idiosyncrasies, task framing, or design artifacts (Geva et al., 2019; Liu et al., 2022). A spurious cue is a data feature that is correlated with but has no causal link to the label (Kaushik et al., 2019). For example, as shown in Figure 1, when classifying whether a social media post is offensive, the presence of a username mention (e.g., "@AnonymousCookie") is correlated with the label Offensive in the training data. However, having a username mention or not typically does not cause a post to be offensive.

[Figure 1: The SBIC dataset contains social media posts to be classified as Offensive or Not offensive. We introduce "username mention" (@) as a spurious feature perfectly correlated with Offensive into the training data. Adding explanations in finetuning makes GPT-3 more robust to this cue.]
Existing attempts to alleviate the impact of spurious cues involve (1) modifying model architecture (Sanh et al., 2020;Rajič et al., 2022, i.a.) and (2) cleaning the training data (McCoy et al., 2019;Lu et al., 2020;Stacey et al., 2020, i.a.). Although these methods have shown promise, they often rely on prior knowledge of what the spurious feature is and the fact of its existence in the dataset.
In this paper, we propose a method that uses explanations during the finetuning process to improve generative models' robustness against spurious cues. Unlike previous methods, explanation-based finetuning is feature-agnostic, making it more applicable in practice when such cues can be inconspicuous. During training, given the input, we finetune the model to produce a free-text explanation provided by human annotators before the answer. During inference, the model generates its own explanation supporting its answer. Intuitively, by forcing the model to generate the explanation, we provide a signal that allows it to focus on features humans find relevant, instead of spurious features.
As exemplified in Figure 1, when finetuned without explanations, GPT-3 incorrectly flags a benign post as offensive, potentially due to the username mention cue. Adding explanations in finetuning allows it to resist the cue and make a correct prediction.
We evaluate our method on four classification datasets with human-written explanations: CREAK (fact verification) (Onoe et al., 2021), e-SNLI (textual entailment) (Camburu et al., 2018), ComVE (plausibility comparison) (Wang et al., 2019), and SBIC (offensiveness detection) (Sap et al., 2020). We experiment on a diverse set of spurious cues (grammatical, semantic, and dataset-specific), and pretrained LMs of different sizes and families (GPT-3 (Brown et al., 2020), T5 (Raffel et al., 2020), and BART (Lewis et al., 2020)). Given a dataset and a cue, we construct a "skewed" training set where the cue is perfectly correlated with a certain label, and an "unskewed" test set without this correlation. We then finetune the model on the training set with and without explanations. Results show that, compared to standard finetuning, our explanation-based method makes models considerably more robust to spurious cues, mitigating the accuracy drop on the unskewed test set by an average of 1.2, 9.1, 15.4, and 6.5 points on the four datasets, respectively. Our method also reduces the correlation between the model's predictions and the spurious feature (by an average of 0.045, 0.308, 0.315, and 0.202, respectively).
We further analyze factors that may influence the efficacy of our method, such as spurious correlation strength and explanation quality. Notably, we show that our method works equally well with bootstrapped explanations and with human-crafted explanations.
Our contributions are as follows: (1) We propose a novel method that uses explanations to make models more robust to spurious features. It is feature-agnostic, hence applicable to all types of spurious cues, even when they are inconspicuous.
(2) On four diverse text classification tasks, our method considerably improves models' robustness against spurious correlations, a result that generalizes across multiple features and models.
(3) Our method works equally well with human-written or model-generated explanations, suggesting its applicability to a wider range of datasets. In summary, our work explores a new form of explanation utility, showing a strong synergy between interpretability and robustness.

Related Work
Spurious Correlations. A growing body of research has focused on the study of spurious correlations in NLP datasets, including reading comprehension (Kaushik and Lipton, 2018; Chen et al., 2016), natural language inference (Sanh et al., 2020; Stacey et al., 2022; Gururangan et al., 2018; McCoy et al., 2019), sentiment analysis (Kaushik et al., 2019), etc. Previous work has shown that state-of-the-art models are vulnerable to spurious features like negations (not, no) and superlatives (first, most) that are correlated with the target output, neglecting the actual semantic meaning of the input (Sanh et al., 2020; Gururangan et al., 2018).

Overcoming Spurious Cues.
Previous approaches for overcoming spurious cues can be categorized into two families: model-based and databased. Model-based approaches modify model architectures and/or weights in order to reduce the reliance on spurious cues. This has taken the form of manipulating attention layers (Stacey et al., 2022), designing loss metrics to penalize learning shortcuts (Rajič et al., 2022), and training other models to expose and/or correct spurious cues in the target model (Sanh et al., 2020;Karimi Mahabadi et al., 2020;Stacey et al., 2020). Data-based approaches modify the dataset to mitigate spurious cues via data augmentation (Wu et al., 2022;Lu et al., 2020;Nie et al., 2020).
Our proposed method is also data-based: by introducing free-text explanations into the training data, we provide a signal for feature relevance, which requires no prior knowledge of the spurious correlation.

Utility of Explanations.
Explanations have traditionally served to enhance model interpretability. Recent studies show that they can also help improve models' reasoning capability (Wei et al., 2022; Lampinen et al., 2022), guard them against adversarial attacks (Chen et al., 2022), and calibrate users' confidence in their predictions (Ye and Durrett, 2022). Recently, Wiegreffe et al. (2021) also coin "self-rationalization" to assess explanations through the faithfulness lens. Our work explores a new method and aspect of explanation utility: improving models' robustness against spurious correlations.

[Table 1: Example input-output formats for standard vs. explanation-based finetuning.
e-SNLI (explanation-based): Does the premise "Children smiling and waving at camera" entail the hypothesis "There are children present"? Thoughts: ### The children must be present to see them smiling and waving Answer: True
ComVE (standard): Which of the following sentences makes more sense? Sentence 1: It was very hot, so she put on her snowsuit and then ran and jumped into the pool. Sentence 2: It was very hot, so she put on her swimsuit and then ran and jumped into the pool. Answer: ### Sentence 2
ComVE (explanation-based): Which of the following sentences makes more sense? Please explain. Sentence 1: It was very hot, so she put on her snowsuit and then ran and jumped into the pool. Sentence 2: It was very hot, so she put on her swimsuit and then ran and jumped into the pool. Reason: ### Snowsuits are too thick to be worn in hot weather Answer: Sentence 2
SBIC (standard): Post: @TheHout I'm not sexist, but women just shouldn't be sports announcers. Answer: ### Offensive
SBIC (explanation-based): Post: @TheHout I'm not sexist, but women just shouldn't be sports announcers. Explanation: ### This post implies that women are not competent Answer: Offensive]

Problem Definition
The problem we want to solve is: given training data containing some spurious correlation, how can we help the model overcome the correlation such that it better generalizes to out-of-distribution data? Specifically, we compare different finetuning methods as potential fixes; these methods should be agnostic of the cue. Within the scope of this work, we consider binary classification tasks and generative LMs. Following Kaushik et al. (2019), we select a set of spurious cues defined as features that correlate with, but do not causally influence, the label.
We construct the training and evaluation sets as follows: for each task T, we create a skewed training set D^f_train by intentionally introducing each spurious feature f into the training data, such that the presence of the cue perfectly correlates with one of the task labels; in addition, we have the unskewed training set D_train and test set D_test, sampled from the original distribution and thus not containing the spurious correlation. Now, our goal is to evaluate how a finetuning method FT affects a model's robustness to the spurious correlation in D^f_train. In particular, we require FT to be agnostic to the feature f. Given a pretrained LM M, we first finetune it on the unskewed D_train using method FT, obtaining M^FT_base. We evaluate it on D_test, obtaining the base accuracy acc(M^FT_base). Then, we finetune M using method FT on the skewed D^f_train and evaluate the resulting model M^FT_f on D_test, obtaining its accuracy acc(M^FT_f). In addition, we compute the Matthews correlation coefficient (MCC) between its predicted label and the feature f, denoted by corr_f(M^FT_f). We measure the robustness of the model M^FT_f to the spurious cue f with (i) the accuracy drop from the base level, δ^f_acc(M, FT) = acc(M^FT_f) − acc(M^FT_base), and (ii) the prediction-feature correlation corr_f(M^FT_f). Let M^FT1_f and M^FT2_f be two models finetuned with methods FT1 and FT2 respectively. We say that M^FT1_f is more robust to feature f than M^FT2_f if it has a smaller accuracy drop and/or a lower prediction-feature correlation. Our goal is to study how δ^f_acc(M, FT) and corr_f(M^FT_f) change with different finetuning methods FT, which we detail in the next section.
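The two robustness measures above can be computed with a few lines of code. This is a minimal sketch; the helper names are ours, not the paper's, and predictions and features are assumed to be binary (0/1):

```python
import math

def accuracy(preds, labels):
    """Fraction of predictions matching the gold labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def mcc(feature, preds):
    """Matthews correlation coefficient between binary feature presence
    and binary model predictions."""
    tp = sum(f and p for f, p in zip(feature, preds))
    tn = sum(not f and not p for f, p in zip(feature, preds))
    fp = sum(not f and p for f, p in zip(feature, preds))
    fn = sum(f and not p for f, p in zip(feature, preds))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def robustness(base_preds, skewed_preds, labels, feature_present):
    """delta_acc = acc(M^FT_f) - acc(M^FT_base); corr_f = MCC(f, predictions).
    A smaller (more negative) delta_acc and a higher corr_f both indicate
    heavier reliance on the spurious cue."""
    delta_acc = accuracy(skewed_preds, labels) - accuracy(base_preds, labels)
    return delta_acc, mcc(feature_present, skewed_preds)
```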

Method
With the above formalization, we now describe the process of generating the skewed training set D^f_train for a spurious cue f and the different finetuning methods FT we consider.

Constructing Skewed Training Sets
We construct the skewed D^f_train via filtering. Consider a binary classification task T (e.g., classifying whether a social media post is offensive); we denote the negative label by L_0 (e.g., Not offensive) and the positive label by L_1 (e.g., Offensive). We want to introduce a spurious feature f (e.g., username mentions) into the training data such that its presence perfectly correlates with the label (MCC=1.0). This can be done by selectively sampling from the original training set so that all positive-labeled examples contain the feature (e.g., all offensive posts have username mentions) and all negative-labeled examples do not (e.g., all non-offensive posts have no username mentions).
As shown in Figure 2, each resulting D^f_train contains 500 instances of each of two types: label-positive and feature-present (L_1, f+), and label-negative and feature-absent (L_0, f−). This skewed training set is challenging because the model needs to concentrate on the semantic meaning of the input despite the spurious correlation to achieve high performance on the unskewed test set. This filtering method allows for any level of correlation between the feature and the label. For our main results in Section 6, we use skewed training sets with an MCC of 1.0 to evaluate performance in the worst case. In Section 7, we perform additional experiments varying the level of correlation.
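The filtering step can be sketched as follows. This is our own illustrative helper, not the paper's code; `has_feature` stands in for a cue detector such as a check for a username mention:

```python
import random

def make_skewed_train(pool, has_feature, n_per_side=500, seed=0):
    """Build D^f_train with MCC = 1.0 between feature f and the label:
    keep only positive examples that contain the feature (L1, f+)
    and negative examples that lack it (L0, f-), then sample 500 of each."""
    pos = [ex for ex in pool if ex["label"] == 1 and has_feature(ex)]
    neg = [ex for ex in pool if ex["label"] == 0 and not has_feature(ex)]
    rng = random.Random(seed)
    skewed = rng.sample(pos, n_per_side) + rng.sample(neg, n_per_side)
    rng.shuffle(skewed)
    return skewed
```

In the resulting set, knowing whether the feature is present is equivalent to knowing the label, which is exactly the shortcut the finetuned model is tempted to learn.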

Finetuning Methods
We compare the two finetuning methods illustrated in Table 1. In standard finetuning, we feed the input text (e.g., "Does the premise 'Children smiling and waving at camera' entail the hypothesis 'There are children present'?" from the e-SNLI dataset) to the model, and let it generate a binary label (e.g., True). In explanation-based finetuning, given the same input, the model additionally generates a free-text explanation ("The children must be present to see them smiling and waving") followed by the label.
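The two input-output formats can be serialized as below. This sketch follows the Table 1 layout; the field name "Thoughts" is the e-SNLI-style choice (other datasets use "Reason" or "Explanation"), and the exact separators are illustrative rather than the paper's verbatim templates:

```python
def standard_prompt(question, label=None):
    """Standard finetuning: the completion is just the binary label.
    With label=None, returns the inference-time prompt."""
    text = f"{question}\nAnswer: ###"
    return text if label is None else f"{text} {label}"

def explained_prompt(question, explanation=None, label=None):
    """Explanation-based finetuning: the completion is a free-text
    explanation followed by the label. With explanation=None, returns
    the inference-time prompt and the model fills in both parts."""
    text = f"{question} Please explain.\nThoughts: ###"
    if explanation is None:
        return text
    return f"{text} {explanation} Answer: {label}"
```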

Datasets
We consider four binary text classification tasks with human-annotated free-text explanations, exemplified in Table 1:

CREAK (Onoe et al., 2021). Given a claim, the task is to verify whether it is True (L_1) or False (L_0).

e-SNLI (Camburu et al., 2018). Given a premise and a hypothesis, the task is to decide whether it is True (L_1) or False (L_0) that the premise entails the hypothesis.

[Figure 2: We filter the training data to introduce spurious correlations. The color represents the label (e.g., Offensive vs. Not offensive); the shape represents the presence of a feature (e.g., whether a post contains username mentions). The resulting D^f_train contains 500 examples of (L_1, f+) and 500 examples of (L_0, f−).]
ComVE (Wang et al., 2019). Given two sentences, the task is to judge which of Sentence 1 (L_1) or Sentence 2 (L_0) is more plausible.

SBIC (Sap et al., 2020). Given a social media post, the task is to decide whether it is Offensive (L_1) or Not offensive (L_0).
For each dataset, we sample 1,000 instances for the skewed training set D^f_train following the method presented in Section 4.1. Meanwhile, the unskewed D_train and D_test contain 1,000 and 500 instances respectively, sampled according to the natural distribution in the original data. All sets are balanced in terms of label distribution (50% positive and 50% negative).
Embedding Cluster. We use Sentence-BERT (Reimers and Gurevych, 2019) to generate embeddings for each input and run K-Means clustering on the training set to assign inputs to two clusters, arbitrarily indexed as C_0 and C_1. If an input falls in cluster C_0, we consider the feature to be present (f+). Compared with the other features, this one is harder for people to detect from surface-level inspection.
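As an illustration, once Sentence-BERT vectors are computed upstream, the cue reduces to a 2-way k-means assignment. The tiny, dependency-light k-means below stands in for a library implementation such as scikit-learn's; it is a sketch of the idea, not the paper's code:

```python
import numpy as np

def assign_clusters(embeddings, iters=50):
    """2-way k-means with farthest-point initialisation.
    Inputs landing in cluster C_0 (index 0) count as feature-present (f+)."""
    X = np.asarray(embeddings, dtype=float)
    # initialise centers with the first point and the point farthest from it
    centers = np.stack([X[0], X[((X - X[0]) ** 2).sum(1).argmax()]])
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # assign each point to its nearest center
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        # recompute centers, keeping the old one if a cluster empties
        for j in range(2):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    return labels
```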

Evaluation Metrics
As discussed in Section 3, to evaluate the robustness of M^FT_f (the model finetuned with method FT) to the spurious feature f, we measure the accuracy drop δ^f_acc(M, FT) from the base level and the prediction-feature correlation corr_f(M^FT_f). A smaller accuracy drop and a lower correlation indicate higher robustness to the spurious correlation.

Main Results
To reemphasize our research question, we want to know: can explanations make models less susceptible to spurious cues? Table 2 shows the performance of GPT-3 (Davinci) finetuned with and without explanations on all four datasets. When the training set is unskewed (row 1), adding explanations generally does not improve model performance: compared to standard finetuning, explanation-based finetuning decreases accuracy by 1-4 points on ComVE, e-SNLI, and SBIC, while on CREAK accuracy increases by only 0.8.
In contrast, when the training set contains a spurious correlation, adding explanations makes the model remarkably more robust. This is true across the vast majority of datasets and spurious cues, as reflected by the accuracy drop δ f acc (M, F T ) and the prediction-feature correlation corr f (M F T f ). For accuracy, across all datasets, adding explanations in finetuning mitigates the average accuracy drop for models on the unskewed test set (by 1.2, 11.3, 15.4, and 6.5 respectively). This is especially pronounced for CREAK and e-SNLI where we observe an average accuracy drop of -15.1 and -20.3 respectively in standard finetuning but only -3.8 and -4.9 in explanation-based finetuning.
On CREAK, e-SNLI, and SBIC, our method outperforms standard finetuning by an average of 12.1, 13.0, and 2.5 points respectively; across all datasets, this represents an average accuracy gain of 6.9. In terms of prediction-feature correlation, on all datasets our method consistently results in a lower average correlation than standard finetuning (-0.045, -0.309, -0.315, and -0.202, respectively). Averaging across datasets, the prediction-feature correlation is 0.384 for standard finetuning but only 0.167 for explanation-based finetuning (-0.217). This supports the idea that explanation-based finetuning makes models rely less on spurious cues.
Overall, there is strong evidence to support that including explanations during finetuning can make LLMs more robust to spurious correlations.

Discussion
Observing the results for CREAK and e-SNLI compared to ComVE and SBIC, it is clear that our approach benefits some tasks more than others.
From Table 2, we see that introducing explanations helps with accuracy the most when the standard-finetuned model has a high prediction-feature correlation.
In cases where explanation-based finetuning outperforms standard finetuning on absolute accuracy, the average prediction-feature correlation for the standard finetuning is 0.470; in the opposite case, it is 0.128.
These results indicate that the benefits of explanation-based finetuning are most evident when the model relies heavily on spurious cues under standard finetuning. When the model does not pick up these cues in the first place, tuning on a set including explanations may cause the model to underfit the objective of generating the correct binary label, similar to the "no cue" condition. Specifically, each weight update now also has to optimize parts of the network for explanation generation rather than for label generation alone.

Further Analysis
Having shown the effectiveness of our method, we now analyze potential factors that may influence the extent to which it works by answering the following questions.

Do explanations improve the robustness of models of different sizes and families?
We run the same experiments in Section 6 with GPT-3 (Ada), T5, and BART. Table 3 shows the results for GPT-3 (Ada); see Appendix A.2 for T5 and BART results.
We can see that explanations can indeed improve robustness for smaller models as well, though the improvements are much smaller. For Ada, the absolute accuracy gain from explanation-based finetuning over standard finetuning, averaged across all datasets and cues, is 1.78, as opposed to 6.85 for Davinci. In terms of Ada's prediction-feature correlation, the average is 0.606 for explanation-based finetuning and 0.728 for standard finetuning; this gap of 0.122 is smaller than the corresponding gap of 0.217 for Davinci.
Interestingly, when no spurious cue is introduced, adding explanations substantially decreases Ada's accuracy across all datasets, by an average of 13.2. On Davinci, this average drop is only 1.75. This suggests that generating good explanations is more challenging for smaller models, so the accuracy penalty from explanation-based finetuning shrinks as model size increases.

How does the spurious correlation strength affect our method?
As mentioned in Section 4.1, the strength of the spurious correlation in our skewed training set is maximal: the cue is perfectly correlated with the label (MCC=1.0). Here, we analyze how our method works under different levels of spurious correlation strength in the training set, selecting e-SNLI and the embedding cluster cue as a case study. From Table 2, on e-SNLI, standard finetuning outperforms explanation-based finetuning on accuracy by 2.4 under the "no cue" condition, where the correlation between the label and the embedding cluster feature is near zero.
When the correlation becomes 1.0, this difference is 18.6 in favor of explanation-based finetuning. Between the two extreme cases, we show the results with different levels of spurious correlation strength in Figure 3, in terms of accuracy and prediction-feature correlation.
We observe that explanation-based finetuning starts to outperform standard finetuning when the correlation between the spurious cue and the target label exceeds 0.8, again confirming our finding in Section 6.1.

Does explanation quality affect the effectiveness of our method?
In the in-context learning scenario, Lampinen et al. (2022) show that explanations can improve task performance when used in few-shot prompting. Specifically, they find that high-quality explanations that are manually selected provide substantially more gains than explanations that are not filtered for quality.
To analyze the impact of explanation quality in our setting, we intentionally lower the quality of the explanations provided in finetuning by making them irrelevant to the input. We do this via in-label permutation of all explanations: for any given instance in the training set, its explanation is replaced with the explanation from another instance with the same label. In other words, the new explanation does not apparently conflict with the label but is irrelevant to the input.
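The in-label permutation can be sketched as follows. This is our own helper; we reshuffle until no instance keeps its own explanation, which "the explanation from another instance" suggests but the text does not spell out:

```python
import random

def permute_within_label(instances, seed=0):
    """Replace each instance's explanation with that of another instance
    sharing the same label, so explanations stay label-consistent but
    become irrelevant to the input."""
    rng = random.Random(seed)
    out = [dict(ex) for ex in instances]  # shallow copies; inputs untouched
    for label in {ex["label"] for ex in instances}:
        idx = [i for i, ex in enumerate(instances) if ex["label"] == label]
        perm = idx[:]
        rng.shuffle(perm)
        # reshuffle until no instance is mapped to itself
        while len(idx) > 1 and any(i == j for i, j in zip(idx, perm)):
            rng.shuffle(perm)
        for i, j in zip(idx, perm):
            out[i]["explanation"] = instances[j]["explanation"]
    return out
```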
We experiment with the datasets where explanation-based finetuning shows the largest benefits (CREAK and e-SNLI). The results are shown in Table 4. Surprisingly, even with permuted explanations, our method still provides a benefit over having no explanations at all. Averaging over all spurious cues and both datasets, the accuracy gain from using permuted explanations compared to having no explanations is 2.85, though this is much smaller than the gain from the non-permuted explanations (10.25).
All four datasets used in our main experiments have large-scale human-written explanations, while the vast majority of real-world datasets do not. In this analysis, we investigate the possibility of using LM-generated explanations in place of human-written ones, to see whether our method can generalize to datasets without explanations.
We perform the experiment on the CREAK dataset as a case study. Specifically, we prompt GPT-3 (Davinci) in a 10-shot setting to generate an explanation for a given input. We do this via a bootstrapping process: (1) we initialize the seed set with 10 training instances, including the label and the human-provided explanation; (2) we sample 10 instances without replacement from the seed set, and prompt the model to generate an explanation for a new instance from the training set; (3) with the generated explanation, we add the new instance to the seed set; (4) we repeat steps (2)-(3) until the entire training set contains explanations. Note that when generating the explanation, we give the model access to the ground-truth label. The temperature is set to 0.9 to facilitate diverse completions.
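Steps (1)-(4) above can be sketched as the loop below. `complete` stands in for a call to the GPT-3 (Davinci) completion API, and the prompt template is illustrative; the paper's exact wording is not shown in the text:

```python
import random

def bootstrap_explanations(train_set, seed_set, complete, k=10, seed=0):
    """Grow a pool of explained instances, prompting the model k-shot
    (with the ground-truth label visible) to write an explanation for
    each unexplained training instance."""
    rng = random.Random(seed)
    pool = list(seed_set)  # step (1): seeded with 10 explained instances
    for ex in train_set:
        # step (2): sample k demonstrations without replacement
        demos = rng.sample(pool, min(k, len(pool)))
        prompt = "".join(
            f"Claim: {d['claim']}\nAnswer: {d['label']}\n"
            f"Explanation: {d['explanation']}\n\n" for d in demos
        )
        prompt += f"Claim: {ex['claim']}\nAnswer: {ex['label']}\nExplanation:"
        ex["explanation"] = complete(prompt, temperature=0.9).strip()
        pool.append(ex)  # step (3): add the newly explained instance
    return train_set     # step (4): every instance now has an explanation
```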
Results of using these bootstrapped explanations are shown in Table 5. On average, the accuracy gain from finetuning with bootstrapped explanations over no explanations is 8.3. This is slightly lower than the benefit from using human-written explanations (10.0), but still decent. Inspecting the prediction-feature correlation, bootstrapped explanations bring an average correlation drop of 0.347 compared to standard finetuning, even greater than that of human-written explanations (0.308). This indicates that explanation-based finetuning can potentially benefit datasets without human-provided explanations, which greatly increases the generalizability and applicability of our approach.

Conclusion
We propose explanation-based finetuning, a novel method to reduce model reliance on spurious cues in the training data. In addition to predicting a label, the model is finetuned to generate a free-text explanation in support of its prediction. On a diverse set of classification tasks and spurious features, our method makes the model substantially more robust, as demonstrated by both accuracy-based and correlation-based measures. The efficacy of our method generalizes to different model sizes and families, though larger models tend to benefit more. Moreover, the stronger the spurious correlation in the data, the more helpful our method is. Interestingly, explanation quality, in terms of relevance, is not fully necessary: permuted explanations still provide around 25% of the accuracy benefits that non-permuted explanations provide. Most notably, even with model-generated explanations, our method works almost as well as with human-written ones, implying its potential applicability to the vast majority of datasets without available explanations.

Limitations
We note a few key limitations of our approach. First, similar to what previous interpretability studies have found (Camburu et al., 2018, i.a.), incorporating explanations comes with some penalty on in-distribution accuracy when there is no spurious cue. This penalty decreases as model size increases, though, potentially because generating good explanations is less challenging for larger models. Second, our artificially constructed training set may not reflect how strong these spurious cues are in the real world. In our main experiments, we focus on the case where one spurious cue is perfectly correlated with the target label; further exploration could study the alternative setting with multiple weak spurious cues instead of a single strong one. Finally, our work is limited in the scope of experiments. We only experiment with generative LMs and binary classification tasks, and, because of resource constraints, we only consider four datasets and eight types of spurious cues (including dataset-independent and dataset-specific ones). Additional experiments with a wider variety of spurious cues and datasets would help shed light on how our method generalizes to other scenarios.

Ethics Statement
Potential Risks. While our work on overcoming spurious cues is related to the idea of debiasing models, it is important to note that our results do not indicate that our method is best suited to tackling socially harmful biases against marginalized groups, such as gender or racial biases. We have not run any experiments in this direction, and we make this distinction explicit so that the reader does not misunderstand the goal of this paper.

Intended Use. Our models and methods are for research purposes only. They should not be deployed in the real world as solutions without further evaluation.

A.1 Results Under "No Cue" Condition
Under the "no cue" condition (i.e., when the training set is unskewed), we report the test accuracy of GPT-3 (Davinci) under finetuning (n=1,000), few-shot (n=10), and zero-shot settings. Results are shown in Table 6. Across the four datasets, the model finetuned on 1,000 examples achieves much higher accuracy than 10-shot or zero-shot prompting.
Comparing standard and explanation-based finetuning across all these experiments, we only find an obvious increase (+6.7) on CREAK under the few-shot setting and a slight increase (+0.4) on ComVE under the zero-shot setting. In all other cases, the accuracy either drops or stays the same.

A.2 Results for Other Models
In our main experiments in Section 6 and Section 7, we use OpenAI GPT-3 (Davinci, 175B, and Ada, 2.7B), since their relatively large size may allow for the generation of higher-quality explanations, as suggested by Wei et al. (2022).
We also generalize this approach to other model families including T5-base (220M) and BART-base (110M), which are much smaller generative LMs than GPT-3. Table 7 and Table 8 show the results for these two models respectively. Under the "no cue" condition, their performance is generally much worse than GPT-3 models. The penalty of introducing explanations in finetuning is also more striking, oftentimes resulting in an accuracy around or lower than chance (50.0).
When the training set contains spurious cues, our method still generally works for both T5 and BART on three of the four datasets, as measured by δ f acc (M, F T ) and corr f (M F T f ). However, the absolute accuracy is almost consistently lower for explanation-based finetuning than for standard finetuning, most likely due to the huge penalty under the "no cue" condition in the first place.
As an exception, on the SBIC dataset, our method does not always work well. For the T5 model, across all spurious features, explanation-based finetuning results in a similar or worse δ_acc (the difference is always less than 2.0 points). It also fails to reduce the prediction-feature correlation for any spurious feature except the "embedding cluster" one, where the correlation only decreases by 0.03. For the BART model, our method does make it more robust to the "embedding cluster" and "plural noun" cues but no others, as reflected by both the accuracy drop and the prediction-feature correlation. We hypothesize that this is because the model does not rely heavily on the cues in the first place, as shown by the lower prediction-feature correlations under standard finetuning. This reconfirms our observation from Section 6.1.

[Table 7: Accuracy (↑), accuracy drop (↑), and prediction-feature correlation (↓) on four classification tasks of T5-base, finetuned with and without explanations.]

B.2 Increasing the Number of Finetuning Examples from 1k to 4k
In this analysis, we examine the effect of increasing the number of finetuning examples from 1k to 4k. This investigates the hypothesis that a larger training set makes it easier for models to learn, and subsequently overfit to, the spurious cue.

e-SNLI Experiments. We repeat the experiments used to create Figure 3 with the modification that models are trained on 4k rather than 1k examples. The results are shown in Table 10. We find that the accuracies of both the standard and explanation-based finetuned models improve when we increase the number of training examples: the average accuracy of standard finetuning increases by 2.3, while for the explanation-based finetuned models the increase is 5.2. Correspondingly, the average accuracy gap between the standard and explanation-based models also increases, from 4.52 in the n=1k setting to 6.70 (+2.18).
Looking at the prediction-feature correlation, we note that the average correlation does not change substantially for either standard or explanation-based finetuning after increasing the number of training examples to 4k.
Overall, these results provide evidence that an increased number of examples tends to benefit both standard and explanation-based finetuning, with explanation-based finetuning benefiting more. However, when the training set correlation between the target label and the spurious cue is 1.0, the performance of standard finetuning drops substantially.

ComVE and SBIC Experiments. Furthering the previous analysis, we investigate the effect of increasing the number of finetuning examples in the cases where we found the effect of explanation-based finetuning to be weakest in Table 2. Specifically, we investigate SBIC and ComVE under the present-tense and sentence-length spurious cues, rerunning those experiments with the training set size increased from 1k to 4k. The results are shown in Table 11.

B.3 Dataset-Specific Spurious Cues
In addition to the four common spurious cues in the main text, we also construct dataset-specific spurious correlations to simulate realistic cues that can naturally appear in each dataset:

Higher Perplexity (CREAK). Using GPT-2 to measure perplexity, we split the data into a set with above-median perplexity and a set with below-median perplexity. The cue is considered present if the sentence's perplexity is above the median and the example is positively labeled.
Gender Female (e-SNLI). If the premise contains female-related words (woman, women, girl, lady, etc.), we consider the "gender female" spurious cue to be present. These words frequently appear in the e-SNLI dataset when the sentence is relevant to females.

Username Mentions (SBIC). If the social media post contains an "@" sign, meaning the author might be tagging or directly replying to other users on social media, we consider the spurious cue to be present. This feature is supposed to have no causal relationship with whether a post is offensive.
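As a minimal sketch of the two surface cues above (the word list here is a small hand-picked stand-in; the paper's full list may differ), both can be detected with simple string checks:

```python
import re

# Hypothetical word list for the "gender female" cue; the paper's
# actual list may contain more terms.
FEMALE_WORDS = {"woman", "women", "girl", "girls", "lady", "ladies"}

def has_gender_female_cue(premise: str) -> bool:
    """e-SNLI: cue is present if the premise contains a female-related word."""
    tokens = re.findall(r"[a-z]+", premise.lower())
    return any(t in FEMALE_WORDS for t in tokens)

def has_username_mention_cue(post: str) -> bool:
    """SBIC: cue is present if the post contains an '@' sign."""
    return "@" in post
```

Both checks operate on raw strings, so they can be applied when filtering examples into cue-present and cue-absent subsets.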

POS-tag of Swapped Word (ComVE). The ComVE dataset requires comparing two sentences with high lexical overlap and outputting which one makes more sense. We consider the part of speech (POS) of the first word that differs between the two sentences, and say the "POS-tag of swapped word" spurious cue is present if this word is a noun.

Table 12 shows the performance of GPT-3 (Davinci). When adding the "gender female" spurious cue to the e-SNLI dataset, we find strong evidence that explanations make the model less susceptible to the cue. With standard finetuning, the prediction-feature correlation is 0.684 and the accuracy is 55.8, suggesting the model relies heavily on the spurious pattern. Meanwhile, for the model finetuned with explanations, this correlation drops to 0.080 and the accuracy increases to 86.6. The results for the dataset-specific cues on the ComVE and CREAK datasets are consistent with our finding that our approach is most effective when the spurious cue strongly impacts model performance. On the SBIC dataset, explanation-based finetuning decreases the prediction-feature correlation by only 0.076. This could be because "username mention" is the shallowest of the domain-specific cues: the model only needs to detect a single token ("@"), which makes the cue especially easy to pick up.
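A sketch of the ComVE cue check: find the first position-aligned word that differs between the two sentences and test whether it is a noun. Here a tiny stand-in noun set replaces the NLTK POS tagger that the paper uses, purely for illustration:

```python
# Stand-in noun lexicon for illustration only; the paper tags words
# with NLTK's POS tagger rather than a fixed set.
TOY_NOUNS = {"cat", "stone", "soup", "book"}

def first_diff_word(sent1: str, sent2: str):
    """Return the first position-aligned word that differs between the sentences."""
    for w1, w2 in zip(sent1.lower().split(), sent2.lower().split()):
        if w1 != w2:
            return w1
    return None

def has_pos_tag_cue(sent1: str, sent2: str, nouns=TOY_NOUNS) -> bool:
    """ComVE: cue is present if the first differing word is a noun."""
    word = first_diff_word(sent1, sent2)
    return word is not None and word in nouns
```

Because ComVE sentence pairs have high lexical overlap, the position-aligned comparison usually isolates the swapped word directly.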
C Implementation Details

C.1 Spurious Cue Implementation

Implementing the "present tense" and "plural noun" spurious cues described in Section 5.2 and the "POS-tag of swapped word" cue in Section B.3 involves tokenizing and POS-tagging the inputs. The tokenizer and POS tagger we use are implemented by Bird et al. (2009) in the NLTK toolkit 7 .
For the "higher perplexity" spurious cue on the CREAK dataset, we compute the GPT-2 perplexity of the input text using the perplexity metric module implemented in the HuggingFace Evaluate package 8 . Its license is Apache License 2.0.
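The median split can be sketched as follows, with the perplexity function left pluggable (the paper scores texts with GPT-2 via HuggingFace Evaluate; here any callable mapping text to a score stands in):

```python
from statistics import median

def split_by_median_perplexity(texts, ppl_fn):
    """Return (above_median, below_median) lists of texts.

    ppl_fn maps a text to its perplexity; the paper computes this with
    GPT-2 via HuggingFace Evaluate's perplexity metric. This helper is
    an illustrative sketch, not the authors' released code.
    """
    scores = {t: ppl_fn(t) for t in texts}
    med = median(scores.values())
    above = [t for t in texts if scores[t] > med]
    below = [t for t in texts if scores[t] <= med]
    return above, below
```

The "higher perplexity" cue is then considered present for positively labeled examples in the above-median set.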

C.2 Models and Hyperparameters
All our code is attached as supplemental material.

OpenAI Models. We finetune GPT-3 (Brown et al., 2020) in different sizes (Davinci and Ada) through OpenAI's standard API 9 . Its license is MIT license. The GPT-3 models are finetuned for four epochs (the default setting on the OpenAI API), with the other hyperparameters (e.g., learning rate) at their default values, except that the models trained with 4k examples were trained for only one epoch with an increased learning rate (0.2) to reduce costs.

Huggingface Models. T5 (Raffel et al., 2020) and BART (Lewis et al., 2020) are implemented with HuggingFace Transformers 10 . The pretrained model checkpoints we use are t5-base (220M parameters) and facebook/bart-base (110M parameters); their licenses are Apache License 2.0. We use the conditional generation classes for T5 11 and BART 12 from HuggingFace to finetune the pretrained models. For consistency with the finetuning of the OpenAI models, the T5 and BART models are finetuned on 1,000 training examples for 4 epochs, with a batch size of 8, a learning rate of 2e-5, and a max sequence length of 128. These finetuning experiments are run on a Kepler K80 GPU; each finetuning run takes 5 to 10 minutes depending on the task.
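To illustrate the difference between the two finetuning settings, here is a sketch of how one training record might be prepared in the (legacy) OpenAI prompt/completion format. The template and separators are illustrative assumptions, not the paper's verbatim formatting:

```python
def make_record(text, label, explanation=None):
    """Build one finetuning example (illustrative format).

    Standard finetuning: the completion is just the label.
    Explanation-based finetuning: the completion is the label followed
    by a free-text explanation supporting it, as in the paper's method.
    """
    completion = f" {label}"
    if explanation is not None:
        # Hypothetical connective; the paper's exact template may differ.
        completion += f" because {explanation}"
    return {"prompt": f"{text}\n\nAnswer:", "completion": completion}
```

Each record would then be written out as one JSONL line for the finetuning API; the only difference between the two settings is whether the explanation is appended to the completion.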

C.3 Computational Resources
All GPT-3 experiments, including all finetuning, were performed through the OpenAI public API. Each finetuning experiment on a given cue and dataset costs around $10. Across our datasets, creating a finetuned model on 1k samples costs around $5 when tuned without explanations and $7 with explanations; evaluating a finetuned model on 500 samples then costs around one dollar.
All other experiments involving heavy computational resources, such as finetuning T5 and BART, were performed on Google Colaboratory with GPU-accelerated notebooks available under the Pro subscription.

D.1 Dataset URLs and Licenses
Listed below are all the details and licenses (where available) for the datasets used in this paper. All datasets are research datasets and were used for their intended purposes. None of the data used in this paper contains any sensitive information. A disclaimer has been added at the start of this paper given that the SBIC dataset contains offensive content.

D.2 Label-Feature Correlation in Unskewed Training Sets
The correlation between the ground-truth label and each spurious cue on the randomly sampled 1,000-example training sets is shown in Table 13. No spurious correlations are artificially introduced into these training sets. Based on the correlations in the table, we consider the "no cue" training set unskewed, except for the "embedding cluster" cue on the SBIC dataset, where the correlation is 0.378, implying that the embedding vectors of offensive social media posts cluster together.

Table 14: Accuracy (↑), accuracy drop (↑), and prediction-feature correlation (↓) on four classification tasks of GPT-3 (Babbage), finetuned with and without explanations.
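The label-feature and prediction-feature correlations reported throughout are correlations between two binary variables. A sketch using the phi coefficient, i.e., the Pearson correlation computed on 0/1 sequences (we assume this matches the paper's computation; it is not the authors' released code):

```python
from math import sqrt

def binary_correlation(xs, ys):
    """Pearson correlation between two equal-length 0/1 sequences
    (equivalently, the phi coefficient for binary variables)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    vx = sum((x - mx) ** 2 for x in xs) / n
    vy = sum((y - my) ** 2 for y in ys) / n
    if vx == 0 or vy == 0:
        return 0.0  # a constant variable has no correlation
    return cov / sqrt(vx * vy)
```

A perfectly skewed training set (e.g., the "username mention" cue present in every Offensive example) yields a correlation of 1.0, while an unskewed set yields a value near 0.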