Influence Tuning: Demoting Spurious Correlations via Instance Attribution and Instance-Driven Updates

Among the most critical limitations of deep learning NLP models are their lack of interpretability and their reliance on spurious correlations. Prior work proposed various approaches to interpreting black-box models to unveil the spurious correlations, but these interpretations were primarily used in human-computer interaction scenarios. It remains underexplored whether or how such model interpretations can be used to automatically "unlearn" confounding features. In this work, we propose influence tuning, a procedure that leverages model interpretations to update the model parameters towards a plausible interpretation (rather than an interpretation that relies on spurious patterns in the data) in addition to learning to predict the task labels. We show that in a controlled setup, influence tuning can help deconfound the model from spurious patterns in data, significantly outperforming baseline methods that use adversarial training.


Introduction
Despite the huge success of contemporary deep learning models and the various applications that they power, critical limitations persist. Among the most harmful issues are their lack of interpretability (Lipton, 2018; Guidotti et al., 2018) and their tendency to learn spurious correlations in addition to the true signals of the task (Leino et al., 2019; Sagawa et al., 2020b). Both of these lead to corrosive outcomes, from reduced performance on datasets in which the confounds no longer hold (Jia and Liang, 2017; Gururangan et al., 2018; Glockner et al., 2018; McCoy et al., 2019; Kumar et al., 2019; Clark et al., 2019), to pernicious biases in model decisions (Sun et al., 2019; Blodgett et al., 2020; Field et al., 2021), and to overall reduced trust in technology (Ribeiro et al., 2016; Ehsan et al., 2019).
Consequently, multiple approaches have been proposed to alleviate the issues of the growing inscrutability and brittleness of the models. Two prominent approaches to interpretability in NLP models are (1) feature attribution: identifying important tokens in the input span, e.g., via saliency maps (Li et al., 2016; Ribeiro et al., 2016); and (2) instance attribution: explaining the model decisions as a function of influential training data (Koh and Liang, 2017; Han et al., 2020; Pruthi et al., 2020b). Both lines of research aim to help users build trust in the model by showing the rationale behind the model decision.
The issues of interpretability and robust generalization are not unrelated. Interpretations can facilitate the discovery of a model's reliance on frequent spurious patterns. For example, in natural language inference models, an over-reliance on lexical signals can be revealed via feature attribution (Gururangan et al., 2018), via instance attribution (Han et al., 2020), or through a combination thereof (Pezeshkpour et al., 2021a). In this work, we investigate a closer interaction between interpretability and model robustness.
Our research hypothesis is that interpretations that discover confounds can be incorporated at training time, to proactively guide the model towards avoiding the confounds and improving generalization. Our method relies on instance attribution interpretation methods that determine the influence of training data on the model's decisions (§2). We show how this influence can help discover the model's reliance on spurious patterns, first in an illustrative task (§3), and then more generally in our proposed framework, influence tuning, which aims to demote the spurious patterns by guiding the model to produce plausible interpretations via instance attribution (§4). We evaluate our approach on two datasets in a controlled setup (§5, §6). Our experiments show that the proposed influence tuning method outperforms baselines that use adversarial training (Ganin et al., 2016; Pryzant et al., 2018). We conclude with a discussion of the potentially broader impact of influence tuning on various NLP tasks.

Interpretation via Instance Attribution
Our primary goal is to use model interpretations for deconfounding the model during training. We focus on instance attribution approaches since these interpretations may help capture higher-level attributes in addition to token- and phrase-level lexical features, e.g., span overlaps, the length of the text, etc. In this section, we review the family of instance attribution methods.
Many NLP models share the same general formula for their decision process at test time: $\hat{y} = f(x_{\text{test}}; \theta)$, where $x_{\text{test}}$ is the test input tokens and $\theta$ is the parameters of the trained model. While feature attribution methods like saliency maps (Simonyan et al., 2014; Li et al., 2016) interpret an NLP model's decision through the importance of each individual token within $x_{\text{test}}$, instance attribution methods look at the influence of $\theta$ on the decision, which is in turn shaped by the training examples the model uses during the training phase.
Influence functions Koh and Liang (2017) propose influence functions (IF) for ML models, following the vision from robust statistics. IF first approximates how upweighting a particular training example $z_{\text{train}} = (x_{\text{train}}, y_{\text{train}})$ in the training set $\{(x_1, y_1), \ldots, (x_n, y_n)\}$ by an infinitesimal $\epsilon_{\text{train}}$ would change the learned model parameters $\hat{\theta}$:
$$\frac{d\hat{\theta}}{d\epsilon_{\text{train}}} = -H_{\hat{\theta}}^{-1} \, \nabla_\theta L(z_{\text{train}}, \hat{\theta}),$$
where $H_{\hat{\theta}} = \frac{1}{n} \sum_{i=1}^{n} \nabla^2_\theta L(z_i, \hat{\theta})$ is the Hessian of the model. We can then use the chain rule to measure how this change in the model parameters would in turn affect the loss of the probing input:
$$\frac{dL(z_{\text{probe}}, \hat{\theta})}{d\epsilon_{\text{train}}} = -\nabla_\theta L(z_{\text{probe}}, \hat{\theta})^\top H_{\hat{\theta}}^{-1} \, \nabla_\theta L(z_{\text{train}}, \hat{\theta}).$$
The final influence of a training example on a probing example is defined as:
$$I(z_{\text{train}}, z_{\text{probe}}) = \nabla_\theta L(z_{\text{probe}}, \hat{\theta})^\top H_{\hat{\theta}}^{-1} \, \nabla_\theta L(z_{\text{train}}, \hat{\theta}).$$
That is, a training example is influential to a probing example if upweighting it in the training set would make the model more likely to make a correct decision on the probing example.

Gradient product Computing the inverse Hessian $H_{\hat{\theta}}^{-1}$ in the IF is expensive and requires further approximations if the model is non-convex. Pruthi et al. (2020b) tackle the problem from a different perspective and arrive at a similar, but first-order, solution:
$$I(z_{\text{train}}, z_{\text{probe}}) = \sum_{i} \nabla_\theta L(z_{\text{train}}, \theta_i) \cdot \nabla_\theta L(z_{\text{probe}}, \theta_i),$$
where $\theta_i$ is the checkpoint of the model at each training epoch. The intuition behind this method is to approximate the total reduction in the probing loss $L(z_{\text{probe}}, \theta)$ during the training process when the training example $z_{\text{train}}$ is used. Compared to IF, this gradient product method essentially drops the inverse Hessian term and reduces the problem to the dot product between the gradient of the training loss and the gradient of the probing loss.
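To make the gradient product concrete, the following is a minimal PyTorch sketch of the checkpoint-summed influence. The helper names (`flat_grad`, `loss_fn`) and the assumption that one model checkpoint is saved per epoch are ours, not part of any released implementation:

```python
import torch

def flat_grad(loss, params):
    """Flatten the gradient of a scalar loss w.r.t. a list of parameters."""
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])

def tracin_influence(checkpoints, loss_fn, z_trn, z_prb):
    """Gradient-product influence summed over saved training checkpoints.

    `checkpoints` is a list of models theta_i saved once per epoch;
    `loss_fn(model, z)` returns the scalar loss of example z under the model.
    """
    total = 0.0
    for model in checkpoints:
        params = [p for p in model.parameters() if p.requires_grad]
        g_trn = flat_grad(loss_fn(model, z_trn), params)
        g_prb = flat_grad(loss_fn(model, z_prb), params)
        total += torch.dot(g_trn, g_prb).item()
    return total
```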
Gradient cosine One potential problem of IF and gradient product is being dominated by outlier training examples, whose training gradient norms are significantly larger than those of the rest of the examples. This would lead the method to identify the same set of outlier training examples as influential to a large number of different probing examples. Barshan et al. (2020) point out this lack of variance in IF and propose a simple modification: substituting the dot product operation with cosine similarity, normalizing by the norm of the training gradients. Following the same intuition, we modify and further simplify the gradient product method, computing the cosine similarity at the current parameters $\theta$:
$$I(z_{\text{trn}}, z_{\text{prb}}) = \frac{\nabla_\theta L(z_{\text{trn}}, \theta) \cdot \nabla_\theta L(z_{\text{prb}}, \theta)}{\|\nabla_\theta L(z_{\text{trn}}, \theta)\| \, \|\nabla_\theta L(z_{\text{prb}}, \theta)\|}.$$
We use this latter influence definition for the instance attribution interpretation method throughout this work.
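The normalized variant then amounts to a cosine similarity between two flattened gradient vectors at the current parameters; a sketch under the same assumptions as above:

```python
import torch
import torch.nn.functional as F

def cosine_influence(model, loss_fn, z_trn, z_prb):
    """I(z_trn, z_prb): cosine similarity between the loss gradients of a
    training example and a probing example at the current parameters."""
    params = [p for p in model.parameters() if p.requires_grad]
    def flat_grad(z):
        grads = torch.autograd.grad(loss_fn(model, z), params)
        return torch.cat([g.reshape(-1) for g in grads])
    # Result lies in [-1, 1] by construction.
    return F.cosine_similarity(flat_grad(z_trn), flat_grad(z_prb), dim=0)
```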

A Toy Example: Predicting Text Length
Now let's imagine a simple synthetic task where an NLP model like BERT (Devlin et al., 2019) is trained for binary text classification. Class 0 contains random short sentences with a length sampled from $\mathcal{N}(\mu_{\text{short}}, \sigma^2)$; class 1 contains random long sentences with a length sampled from $\mathcal{N}(\mu_{\text{long}}, \sigma^2)$. Our classification task is thus to predict the text length. However, there are confounds in this data. For every sentence, we insert a confounding token at the beginning of the sentence. Most of the time (e.g., 90%), token A co-occurs with a short sentence and B co-occurs with a long sentence; for the remaining sentences, this co-occurrence is flipped.
Our goal is to finetune the classifier so that it learns to predict the class labels (0 or 1) using, as intended, the text length information, instead of overfitting to the confounding tokens A or B. We refer to the text length as a core attribute, and to the confounding prefix tokens as a spurious attribute. Finetuning the classifier on our synthetic task yields 100% accuracy on the training set (overfitting). We are more interested in interpreting what information the model relies on to make classification decisions. This drives us to apply the instance attribution interpretation methods.
To interpret each classification decision via instance attribution, we randomly sample a few examples within the training set as our probing examples $z_{\text{prb}}$. We calculate the influence of each training example $z_{\text{trn}}$ on $z_{\text{prb}}$ using the gradient cosine method (§2). Our expectation is that the $z_{\text{trn}}$ instances that have the same core attribute as $z_{\text{prb}}$ should be influential to $z_{\text{prb}}$. In our example case, this means we expect the model to learn that the long training instances from class 1 are positively influential for labeling a long probing example with class 1. At the same time, the spurious attribute of $z_{\text{trn}}$ should not dominate the contribution to the training example's influence towards $z_{\text{prb}}$. Specifically, two long training examples, one with a confounding prefix A and the other with B, should both be influential to the long probing example with a confounding prefix, say, B.

Figure 1: Distribution of each same-class training example's influence score $I(z_{\text{trn}}, z_{\text{prb}})$ towards a typical probing example in TextLen (§5.2). The range of influence scores is [−1, 1]. The average score difference between the two groups is 0.15, and the difference is statistically significant via t-test.

Figure 1 illustrates the influence score distribution for a typical probing example. The probing example is from class 1 (long text) and has a confounding prefix B. The orange plot in the figure shows the influence distribution of all class 1 training examples with the same prefix B, whereas the blue plot shows the influence distribution of all class 1 training examples with the different spurious prefix A. We observe a statistically significant influence difference between these two groups. However, as the spurious attribute should not influence the model's decision process, we conjecture that this influence difference shows the model is confounded.
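For reference, the grouping and significance test behind a plot like Figure 1 can be reproduced in a few lines; the array names here are hypothetical:

```python
import numpy as np
from scipy import stats

def influence_gap(influences, prefixes, probe_prefix):
    """Compare influence scores of same-class training examples that share
    vs. differ in the probing example's confounding prefix.

    `influences[i]` is I(z_trn_i, z_prb) for a same-class training example;
    `prefixes[i]` is that example's confounding prefix token ("A" or "B").
    """
    same = np.array([s for s, p in zip(influences, prefixes) if p == probe_prefix])
    diff = np.array([s for s, p in zip(influences, prefixes) if p != probe_prefix])
    t, p_value = stats.ttest_ind(same, diff, equal_var=False)
    return same.mean() - diff.mean(), p_value
```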
As we show in this motivating example, research on interpretability via instance attribution can help us extract rationales behind the model decisions.
When we know what the possible spurious patterns in the data are, we can check whether the spurious confounds are influencing learning, yielding implausible interpretations. For plausible interpretations, such reliance on spurious attributes should not be as significant. In the next section, we propose a methodology to improve the model systematically upon seeing an implausible rationale.

Figure 2: Finetuning (left): As introduced in §5.2, the main verb in "the boy who was hugging a dog laughs" is not a present participle ending with "-ing", so the sentence should belong to CLASS_0 ($y$). A model $\theta$ may initially predict CLASS_1 with a high probability ($\hat{y}$). In the finetuning steps, we form a loss function $\mathcal{L}_{y,\hat{y}}$ over the labels and backpropagate into the model parameters. Influence tuning (right): For the influence tuning steps, we sample a few probing examples $z_{\text{prb}}$ and training examples $z_{\text{trn}}$ from the train set. In the figure, $z_{\text{prb}}$ and $z_{\text{trn}}$ both belong to CLASS_1 (main verb in -ing form), while the examples in $z_{\text{trn}}$ can have the same spurious attribute (sentence beginning with "the") or a different spurious attribute (beginning with "a") as $z_{\text{prb}}$. A model $\theta$ may initially give the interpretation ($\hat{\pi}(I)$) that these examples in $z_{\text{trn}}$ have significantly different influence over $z_{\text{prb}}$. This could be a sign that the model is confounded; we form a loss function $\mathcal{J}_{\pi(I),\hat{\pi}(I)}$ and backpropagate into the model parameters for a more plausible interpretation.

Influence Tuning
We propose a method to tune the model towards providing plausible rationales behind its decisions. Motivated by the example scenario in §3, we define this plausibility through the difference between the influence of training examples with different spurious attributes. We therefore call the method influence tuning. Formally, we first randomly sample one probing example $z_{\text{prb}} = (x_{\text{prb}}, y_{\text{prb}})$ from the training set. We then sample a small group of training examples $\{z_{A_1}, \ldots, z_{A_k}\} \subset \{z_{\text{trn}} \mid y_{\text{trn}} = y_{\text{prb}}, c_{\text{trn}} = c_{\text{prb}}\}$ that share the same label ($y$) and the same spurious attribute ($c$) as $z_{\text{prb}}$ (e.g., samples from the orange distribution in Figure 1). Similarly, we sample a small group of training examples $\{z_{B_1}, \ldots, z_{B_k}\} \subset \{z_{\text{trn}} \mid y_{\text{trn}} = y_{\text{prb}}, c_{\text{trn}} \neq c_{\text{prb}}\}$ that share the same label but have a spurious attribute different from the spurious attribute of $z_{\text{prb}}$ (e.g., samples from the blue distribution in Figure 1). Note that although in our example scenario $y$ and $c$ are both binary, they are not required to be so. Since the spurious attribute $c$ should not be a part of the rationale behind the model's decision, we expect the average influence of $\{z_{A_1}, \ldots, z_{A_k}\}$ and $\{z_{B_1}, \ldots, z_{B_k}\}$ on $z_{\text{prb}}$ to be close to each other. Therefore, it is natural to define a new loss function over the influence scores and incorporate it in the model training:
$$\mathcal{J} = \Big( \frac{1}{k} \sum_{i=1}^{k} I(z_{A_i}, z_{\text{prb}}) - \frac{1}{k} \sum_{i=1}^{k} I(z_{B_i}, z_{\text{prb}}) \Big)^2.$$
To optimize for the influence loss $\mathcal{J}$, we need the gradient
$$\nabla_\theta \mathcal{J} = 2 \Big( \frac{1}{k} \sum_{i=1}^{k} I(z_{A_i}, z_{\text{prb}}) - \frac{1}{k} \sum_{i=1}^{k} I(z_{B_i}, z_{\text{prb}}) \Big) \Big( \frac{1}{k} \sum_{i=1}^{k} \nabla_\theta I(z_{A_i}, z_{\text{prb}}) - \frac{1}{k} \sum_{i=1}^{k} \nabla_\theta I(z_{B_i}, z_{\text{prb}}) \Big),$$
where the key is in calculating $\nabla_\theta I(z_{\text{trn}}, z_{\text{prb}})$ with an arbitrary $z_{\text{trn}}$ being either $z_{A_i}$ or $z_{B_i}$. Recall that with the gradient cosine influence definition,
$$I(z_{\text{trn}}, z_{\text{prb}}) = \frac{\nabla_\theta L(z_{\text{trn}}, \theta) \cdot \nabla_\theta L(z_{\text{prb}}, \theta)}{\|\nabla_\theta L(z_{\text{trn}}, \theta)\| \, \|\nabla_\theta L(z_{\text{prb}}, \theta)\|}.$$
We can then derive $\nabla_\theta I(z_{\text{trn}}, z_{\text{prb}})$ as:
$$\nabla_\theta I(z_{\text{trn}}, z_{\text{prb}}) = \frac{H_{\text{trn}} \nabla_\theta L(z_{\text{prb}}) + H_{\text{prb}} \nabla_\theta L(z_{\text{trn}})}{\|\nabla_\theta L(z_{\text{trn}})\| \, \|\nabla_\theta L(z_{\text{prb}})\|} - I(z_{\text{trn}}, z_{\text{prb}}) \Big( \frac{H_{\text{trn}} \nabla_\theta L(z_{\text{trn}})}{\|\nabla_\theta L(z_{\text{trn}})\|^2} + \frac{H_{\text{prb}} \nabla_\theta L(z_{\text{prb}})}{\|\nabla_\theta L(z_{\text{prb}})\|^2} \Big),$$
where the Hessians $H_{\text{trn}} = \nabla^2_\theta L(z_{\text{trn}})$ and $H_{\text{prb}} = \nabla^2_\theta L(z_{\text{prb}})$. We omit $\theta$ in $L(\cdot)$ for simplicity. Detailed derivations can be found in the appendix. Overall, obtaining the gradient $\nabla_\theta \mathcal{J}$ for the influence loss $\mathcal{J}$ defined over the tuple $(z_{\text{prb}}, \{z_{A_1}, \ldots, z_{A_k}\}, \{z_{B_1}, \ldots, z_{B_k}\})$ makes the optimization possible. For the actual training process, we alternate the optimization of $\theta$ over both the regular label prediction loss $L$ and the influence loss $\mathcal{J}$, with the interval as a hyperparameter to select. That is, we do $m$ steps of regular label loss propagation, $n$ steps of influence loss propagation, then return to label loss propagation, and so on. The goal is to find a set of model parameters, without changing the model architecture, that leads to both accurate label predictions and plausible rationales behind the decisions. We use a pretrained BERT model as our initial model $\theta$. Figure 2 summarizes our proposed method using the data examples that we will introduce in §5.2. (Instead of the alternating optimization we adopt, folding the influence loss into the standard finetuning loss as a regularizer may work as well. We did not explore it here since our initial hypothesis is whether we can use a plausible interpretation to help build a more generalizable model: the instance attribution interpretation methods assume some regular, untouched finetuning steps before interpreting.)
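A minimal PyTorch sketch of one influence tuning step follows. It relies on `create_graph=True` so that the influence loss, which is itself a function of gradients, can be backpropagated into $\theta$ (double backpropagation); `loss_fn` and the batching of the groups are our assumptions, and a real implementation may differ in those details:

```python
import torch
import torch.nn.functional as F

def grad_vector(model, loss):
    """Flattened gradient of `loss` w.r.t. model parameters, kept on the
    autograd graph so the influence loss can be differentiated again."""
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, params, create_graph=True)
    return torch.cat([g.reshape(-1) for g in grads])

def influence_tuning_step(model, loss_fn, optimizer, z_prb, group_a, group_b):
    g_prb = grad_vector(model, loss_fn(model, z_prb))
    def mean_influence(group):
        scores = [F.cosine_similarity(grad_vector(model, loss_fn(model, z)),
                                      g_prb, dim=0) for z in group]
        return torch.stack(scores).mean()
    # Squared difference between the groups' mean influence on z_prb.
    influence_loss = (mean_influence(group_a) - mean_influence(group_b)) ** 2
    optimizer.zero_grad()
    influence_loss.backward()  # second-order: backprop through the gradients
    optimizer.step()
```

In training, such a step would alternate with standard finetuning steps on the label loss, e.g., m label steps followed by n influence steps.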

A special case of influence tuning
The above section gives an influence tuning framework based on the influence score I(·, ·) defined on the full set of model parameters θ. Now we are going to investigate an interesting special case of the framework, which defines the influence score on a partial parameter set.
Recall that we are using a pretrained BERT model as our initial model $\theta$, and finetuning the BERT model requires training a prediction head over the transformer layers. For text classification, the prediction head is just a linear projection layer $W$, projecting from the BERT [CLS] token embedding to the label space and connecting to the final cross entropy loss. Additionally, in our setup, the sampled $z_{\text{prb}}$ and $z_{\text{trn}}$ have the same label $y$. Now let's define a small parameter subset $\theta_{\text{proj}} = W^{(y)}$, representing the row of the final projection layer $W$ corresponding to the label $y$.
Similar to the original gradient cosine influence definition, we define
$$I_{\text{proj}}(z_{\text{trn}}, z_{\text{prb}}) = \frac{\nabla_{\theta_{\text{proj}}} L(z_{\text{trn}}, \theta) \cdot \nabla_{\theta_{\text{proj}}} L(z_{\text{prb}}, \theta)}{\|\nabla_{\theta_{\text{proj}}} L(z_{\text{trn}}, \theta)\| \, \|\nabla_{\theta_{\text{proj}}} L(z_{\text{prb}}, \theta)\|}.$$
We can further expand the label loss $L$ with the parameter subset $\theta_{\text{proj}}$: for the cross entropy loss, $\nabla_{\theta_{\text{proj}}} L = (p_y - 1) \, h_{\texttt{[CLS]}}$, where $p_y$ is the predicted probability of the label $y$ and $h_{\texttt{[CLS]}}$ is the final [CLS] embedding, so $I_{\text{proj}}$ can be computed directly from quantities available in the forward pass.
The new definition $I_{\text{proj}}(z_{\text{trn}}, z_{\text{prb}})$ leads to a new influence loss $\mathcal{J}_{\text{proj}}$. Unlike the second-order influence tuning method that obtains $\nabla_\theta \mathcal{J}$, we can get $\nabla_\theta \mathcal{J}_{\text{proj}}$ by applying the regular gradient backward operation on the model, and thus update the model faster. All the other parts of the framework, like the data tuple selection and the alternating training objectives, remain the same. We call this special variant of influence tuning embedding tuning.
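Since the gradient with respect to the single row $W^{(y)}$ has the closed form above, the influence loss in this special case needs only one regular backward pass. A sketch, assuming a model that returns both unbatched logits and the final [CLS] embedding (this interface is our assumption):

```python
import torch
import torch.nn.functional as F

def proj_grad(model, z, label):
    """Closed-form gradient of cross entropy w.r.t. the row W(y):
    (p_y - 1) * h_cls. `model(z)` is assumed to return (logits, h_cls)."""
    logits, h_cls = model(z)
    p_y = torch.softmax(logits, dim=-1)[label]
    # Differentiable w.r.t. theta through the forward pass only.
    return (p_y - 1.0) * h_cls

def embedding_tuning_loss(model, z_prb, group_a, group_b, label):
    g_prb = proj_grad(model, z_prb, label)
    def mean_influence(group):
        return torch.stack([F.cosine_similarity(proj_grad(model, z, label),
                                                g_prb, dim=0)
                            for z in group]).mean()
    # One standard backward() on this loss suffices (first-order).
    return (mean_influence(group_a) - mean_influence(group_b)) ** 2
```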

Adversarial training as a baseline
One notable feature of influence tuning is that it is designed to help deconfound NLP models without adding additional modules to the network. A related line of research on deconfounding NLP models takes its intuition from domain adversarial training (Ganin et al., 2016; Pryzant et al., 2018; Kumar et al., 2019). These methods usually have two classifier modules built on top of a shared encoder. The objective for the model is adversarial: the model should be able to predict the target label y of the input accurately, while failing to reconstruct the spurious attribute c effectively, which potentially indicates that the confounding information regarding c is not encoded by the model.
As a baseline for this work, we modify a BERT model according to the method described in Pryzant et al. (2018). It uses a gradient reversal layer at the beginning of the confound classifier head that multiplies the gradient by −1 during the backward pass. All of the BERT transformer layers form the shared encoder for the label classifier and the confound classifier. The implicit loss used by the model can then be written as $L = L_{\text{label}} - \lambda L_{\text{confound}}$, where $L_{\text{label}}$ and $L_{\text{confound}}$ are both cross entropy losses, and $\lambda$ is a hyperparameter to select. Essentially, this method learns and unlearns: it learns to predict the correct label while unlearning the information that could help reconstruct the confound attribute. Compared to the other prior work tackling spurious correlations mentioned in §1, this method is the most suitable for a direct comparison with our proposed influence tuning framework, because both methods aim to explicitly demote certain known confounds for the model.
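A sketch of the gradient reversal baseline in PyTorch follows; the module layout and names are ours. Minimizing $L_{\text{label}} + L_{\text{confound}}$ through this layer makes the shared encoder receive $-\lambda \nabla L_{\text{confound}}$, realizing the implicit loss above while the confound head itself still learns:

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lambda
    in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

class AdversarialHeads(nn.Module):
    """Label head plus confound head behind a gradient reversal layer,
    both on top of the shared encoder's [CLS] embedding."""
    def __init__(self, hidden, n_labels, n_confounds, lam=1.0):
        super().__init__()
        self.label_head = nn.Linear(hidden, n_labels)
        self.confound_head = nn.Linear(hidden, n_confounds)
        self.lam = lam

    def forward(self, h_cls):
        label_logits = self.label_head(h_cls)
        confound_logits = self.confound_head(GradReverse.apply(h_cls, self.lam))
        return label_logits, confound_logits
```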

Datasets
To evaluate the proposed approaches for deconfounding NLP models, we conduct controlled experiments on two datasets.
TextLen TextLen is a synthetic dataset we create that follows the example scenario in §3. Specifically, for the training set, we randomly split 1500 sentences from the canonical CoNLL-2003 shared task dataset (Sang and De Meulder, 2003) into two classes, 0 and 1, of equal sizes. Sentences from class 0 are trimmed to a length sampled from a normal distribution with $\mu_{\text{short}} = 15$, $\sigma = 4$; sentences from class 1 are trimmed with $\mu_{\text{long}} = 25$, $\sigma = 4$. We add prefix tokens A="Negative." and B="Positive." to the start of the sentences, such that 90% of the time, a class 0 sentence receives the negative prefix and a class 1 sentence receives the positive prefix. In the dev and test sets of TextLen, however, while the sentences are trimmed with the same text length distributions, the confounding prefix correlates with the class label of the sentence only 50% of the time. A deconfounded model should achieve good classification performance on both the train and test splits.
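For concreteness, the TextLen construction can be sketched as follows; the function and variable names are illustrative, and the real dataset additionally fixes the CoNLL-2003 source splits:

```python
import random

def make_textlen(sentences, mu_short=15, mu_long=25, sigma=4,
                 confound_rate=0.9, seed=0):
    """Trim sentences to a sampled length and prepend a prefix token that
    correlates with the class at `confound_rate` (0.9 train, 0.5 dev/test)."""
    rng = random.Random(seed)
    data = []
    for i, tokens in enumerate(sentences):
        label = i % 2  # alternate for equal class sizes (the paper splits randomly)
        mu = mu_long if label == 1 else mu_short
        length = max(1, round(rng.gauss(mu, sigma)))
        aligned = rng.random() < confound_rate
        prefix = ("Positive." if label == 1 else "Negative.") if aligned \
                 else ("Negative." if label == 1 else "Positive.")
        data.append(([prefix] + tokens[:length], label))
    return data
```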

MSGS The Mixed Signals Generalization Set (MSGS) is proposed by Warstadt et al. (2020) to investigate whether language models would acquire a preference for linguistic generalizations. The model is supposed to learn a classification task from an ambiguous training dataset. For example, a class 1 sentence could be "the boy who hugged a cat is sneezing", and a class 0 sentence could be "a boy who is hugging the cat sneezed". To distinguish the two classes, a model performing surface generalizations could rely on whether the article "the" or "a" begins the sentence. A model performing linguistic generalizations, however, could decide based on whether the main verb of the sentence is in the "-ing" form. The linguistic feature determines the class of the sentence in both the MSGS train and test sets, whereas the surface feature correlates highly with the classes only in the training set and co-occurs randomly with the classes in the disambiguated test data. Specifically, we choose MSGS's SYNTACTIC CATEGORY as the core attribute and RELATIVE POSITION as the spurious attribute; for the training set, we also adopt an inoculation rate of 0.3% (Warstadt et al., 2020).
We show statistics of TextLen and MSGS in Table 1. We use BERT-base as our model template and the default BERT Adam optimizer for both tasks and all of the deconfounding methods. We perform hyperparameter search using the dev set for all of the methods. Detailed hyperparameters can be found in the appendix.

Does influence tuning make the model interpretations more plausible?
We are first interested in a preliminary research question: having seen the confounded model interpretations discovered in §3, does our proposed method, influence tuning, make the model interpretations more plausible? To quantitatively measure how much the model relies on the spurious attribute to make decisions, for both tasks we randomly select 40 probing examples $z_{\text{prb}}$ from the training set. For each probing example $z_{\text{prb}}$, we put the training examples into two groups: $A = \{z_{\text{trn}} \mid y_{\text{trn}} = y_{\text{prb}}, c_{\text{trn}} = c_{\text{prb}}\}$ and $B = \{z_{\text{trn}} \mid y_{\text{trn}} = y_{\text{prb}}, c_{\text{trn}} \neq c_{\text{prb}}\}$, where $y$ is the true label and $c$ is the confounding spurious attribute. We define the confound influence difference (CID) to be the difference between the two groups' influence on the probing example:
$$\text{CID} = \frac{1}{|A|} \sum_{z_{\text{trn}} \in A} I(z_{\text{trn}}, z_{\text{prb}}) - \frac{1}{|B|} \sum_{z_{\text{trn}} \in B} I(z_{\text{trn}}, z_{\text{prb}}).$$
We show in Figure 3 the average CID during training of three different models for TextLen: a model trained with the regular finetuning objective, a model trained using the influence tuning framework, and a control model trained on a non-confounding version of the TextLen data (i.e., with the spurious prefix token removed). The final CIDs are 0.093, 0.035, and 0.019, respectively, for the three models. We observe that both the finetuning model and the influence tuning model start with a very high CID, indicating that the confound attribute is exploited heavily at the beginning of the training process. However, for the influence tuning model, each influence tuning round, which happens after every 50 standard finetuning steps, helps the model achieve a near-zero CID (as shown by the vertical drops within the influence tuning plot). The CID does rebound during the following finetuning steps, but eventually arrives at a relatively low value. The result on the MSGS data is similar, except that we do not have the non-confounding control model. The drops in CID caused by influence tuning answer our preliminary question: we do find that influence tuning makes the model interpretations more plausible, in accordance with our expectation.
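The CID metric itself is straightforward to compute; a sketch with an abstract `influence_fn` standing in for the gradient cosine influence of §2:

```python
import numpy as np

def confound_influence_difference(influence_fn, probes, train_set):
    """Average CID over probing examples: mean influence of same-label
    training examples sharing the probe's confound, minus those that don't.
    Each example is a dict with keys "y" (label) and "c" (confound)."""
    cids = []
    for z_prb in probes:
        same_y = [z for z in train_set if z["y"] == z_prb["y"]]
        group_a = [z for z in same_y if z["c"] == z_prb["c"]]
        group_b = [z for z in same_y if z["c"] != z_prb["c"]]
        mean_a = np.mean([influence_fn(z, z_prb) for z in group_a])
        mean_b = np.mean([influence_fn(z, z_prb) for z in group_b])
        cids.append(mean_a - mean_b)
    return float(np.mean(cids))
```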
Does the guided plausibility transfer to model generalizability?
Would a more plausible model, or more specifically a model that is guided to provide plausible interpretations, achieve a stronger performance in out-of-distribution test sets, where the confound information no longer helps the decision? To answer this question, we compare the different deconfounding approaches introduced in §4 and §5.1 over the TextLen and MSGS tasks.
We show our main results in Table 2. On TextLen, we observe that the adversarial training method gives a moderate improvement over the regular finetuning model. Both influence tuning and embedding tuning lead to significant accuracy gains, with embedding tuning achieving the highest test accuracy of 82.2%. In §4.1 we derived why embedding tuning is a special case of influence tuning with a reduced set of parameters for the influence calculation. This reduced parameter set could be sufficient for the TextLen dataset since the task is relatively easy. The upper bound model, with the spurious attribute removed from the data, still outperforms all of the deconfounding methods, leaving a gap to address in future work.
On MSGS, we observe a similar trend for the adversarial training method, which makes a moderate improvement over finetuning as expected. Again, both influence tuning and embedding tuning achieve significant improvements. However, influence tuning outperforms embedding tuning on this task. One contributing reason could be that the linguistic generalizations required by this task are encoded across the full BERT transformer layers; the influence defined only over the final projection layer in the embedding tuning case might therefore be limiting. Overall, both our proposed influence tuning and the special case, embedding tuning, are effective at deconfounding the models in our experiments compared to the baselines.

Can we use less data with influence tuning?
The advantage of influence tuning does not come without a price. For a general dataset, it requires at least some lightweight annotations in addition to the regular label information. For example, when we operate with the data tuple $(z_{\text{prb}}, \{z_{A_1}, \ldots, z_{A_k}\}, \{z_{B_1}, \ldots, z_{B_k}\})$ in §4, we need information about the confound group an example belongs to. Though in our experiments with TextLen and MSGS we sample relatively small sets of $z_{\text{prb}}$, $z_{A_i}$, and $z_{B_i}$ (50-100 $z_{\text{prb}}$ in each full influence tuning round, 3-5 $z_{A_i}$ and $z_{B_i}$ for each $z_{\text{prb}}$), the model still potentially has access to the confound information of the full dataset over the whole training process. (Our baseline approach, adversarial training, also has access to, and actually uses, the confound information of the full data.) Therefore, in this section we are interested in whether we can strictly limit the confound information accessible to the model, and how this affects the performance of our methods. For both the TextLen and MSGS data, we randomly select subsets of the training set containing m% of the total examples. Then, during the training process, we limit the model to sample $z_{\text{prb}}$, $z_{A_i}$, and $z_{B_i}$ only from the m% training subset where the confound group information is accessible. Note that this serves as a hard upper limit on the confound access; the actual confound information queried by the influence tuning framework can be well under this limit.
In Table 3 we show the test performance of influence tuning and embedding tuning on TextLen and MSGS, using the same model hyperparameters as the results in Table 2. However, unlike the Table 2 results, where every trial within the five random seeds succeeds in fitting the training set, the experiments with the data constraint sometimes fail to converge (i.e., they do not even fit the train set). We exclude such failed trials from the average performance reported in Table 3, while we observe that at least three out of the five trials for each confound access rate converge successfully. We see that even with the hard constraint on the confound access rate, influence tuning and embedding tuning still outperform the baseline methods, using the confound information of only 5%-20% of the examples. Generally, a higher confound access rate leads to stronger deconfounding performance, creating a tradeoff to be decided based on the needs of the users.

Table 3: Performance of influence tuning and embedding tuning when there is an upper limit on the confound access rate. The accuracy shown is an average of at least three successful trials across five random seeds.

Discussion
In this section we answered three questions regarding the influence tuning framework: whether it makes the model interpretations more plausible, whether it helps the model achieve a strong deconfounding performance, and whether it can be used with a reduced amount of data. We conducted experiments on a synthetic dataset and a linguistic probing dataset, but the potential applications of our approach extend beyond the current tasks. For example, our method might help identify and mitigate gender and racial biases in sentiment analysis or toxicity detection systems (Kiritchenko and Mohammad, 2018; Sap et al., 2019), by modeling the problem as a deconfounding task. One potential drawback is that these natural cases would inevitably require some extra human annotation. However, we also believe that human feedback in NLP (Settles, 2011; Kaushik et al., 2020; Wang et al., 2021) is a crucial and controllable way to tackle a model's exploitation of spurious correlations in the data, which arises in the absence of proper supervision. Furthermore, if we define the influence objective in §4 differently, e.g., modeling which groups of examples should be influential to the probing instance and to what extent, we may be able to implicitly promote the core attributes of the task in addition to demoting the confounds.

Related Work
Interpreting NLP models by token-level importance scores over the input span is a widely adopted approach (Belinkov and Glass, 2019). These scores can be gradient-based (Simonyan et al., 2014; Li et al., 2016), attention weights if supported by the model (Jain and Wallace, 2019; Wiegreffe and Pinter, 2019), or weights from a linear model fitting the local region of a deep model (Ribeiro et al., 2016). Models can achieve better performance or learn more efficiently if supervision is provided for these feature importance scores (Ross et al., 2017; Zhong et al., 2019; Pruthi et al., 2020a). Unlike the token-level interpretations, our focus in this work is on instance attribution methods. Apart from influence functions (Koh and Liang, 2017) and TracIn (Pruthi et al., 2020b), which were introduced above, other instance attribution methods include representer point selection (Yeh et al., 2018) and θ-relative influence functions (Barshan et al., 2020), with Pezeshkpour et al. (2021b) comparing the methods empirically on NLP tasks. However, these methods do not facilitate a systematic improvement of the model based on the plausibility of the interpretations, which is the gap addressed by this work. Models designed with explicit interpretability considerations, like deep weighted averaging classifiers (Card et al., 2019) and SelfExplain (Rajagopal et al., 2021), may also support instance attribution, though the flexibility of the model architecture can be more limited.
One key use case of the proposed influence tuning framework is to deconfound the model from relying on spurious attributes during the decision process. Other works that aim to prevent neural models from using spurious attributes include Elazar and Goldberg (2018) and Pryzant et al. (2018), which operate over a known set of confounds, and Kumar et al. (2019), which models unknown, latent confounds. They often involve the idea of learning invariant features across domains through adversarial training (Ganin et al., 2016; Xie et al., 2017). Spurious correlations can also be mitigated by data-based, optimization-based, and postprocessing methods (Zmigrod et al., 2019; Kaushik et al., 2020; Sagawa et al., 2020a; Yaghoobzadeh et al., 2021; Clark et al., 2019). In this work, we mainly compare with the adversarial training method with gradient reversal of Pryzant et al. (2018) as a baseline, since both methods perform explicit demotion of known confounds in the data used by the model. Future work can explore comparisons and potential combinations with other approaches addressing spurious correlations.

Conclusion
NLP models that build upon deep neural networks are notoriously opaque about their decision process. Though instance attribution methods can be used to unveil problems of a model reflected in implausible interpretations, a novel research question is whether or how the model training can benefit from interpretability methods in a systematic way. Our work addresses this question by proposing the influence tuning framework, which backpropagates a target instance attribution interpretation directly into the model. In two use cases of demoting spurious confounds in the data, we show that (1) influence tuning can eventually lead to more plausible model interpretations; (2) influence tuning can help build better-performing deconfounded models compared to those trained with the baseline methods; and (3) influence tuning can still be reliable in lower-resource setups. Future work will explore more datasets and tasks, and other optimization methods. Additionally, we will explore guiding the model to learn to promote core attributes of the task in addition to demoting the spurious confounds.

A Hyperparameters
For finetuning BERT, we use a learning rate of 2e-5 for both TextLen and MSGS. We finetune for 10 epochs on TextLen and 3 epochs on MSGS with a batch size of 64, resulting in around 250 steps for each dataset (the numbers of epochs differ because the dataset sizes differ).

B Derivation of the influence gradients
To derive the gradient of the cosine influence, we first derive the gradient of the dot product influence $i(\theta) = \nabla_\theta L(x_1) \cdot \nabla_\theta L(x_2)$:
$$\nabla_\theta i(\theta) = H_1 \nabla_\theta L(x_2) + H_2 \nabla_\theta L(x_1),$$
where $H_1 = \nabla^2_\theta L(x_1)$ and $H_2 = \nabla^2_\theta L(x_2)$. Next we derive the gradient of the full cosine influence $I(\theta) = i(\theta) / m(\theta)$, where $m(\theta) = \|\nabla_\theta L(x_1)\| \, \|\nabla_\theta L(x_2)\|$:
$$\nabla_\theta I(\theta) = \frac{\nabla_\theta i(\theta) \, m(\theta) - i(\theta) \, \nabla_\theta m(\theta)}{m(\theta)^2}.$$
We already know the gradient of $i(\theta)$, so we are only interested in $\nabla_\theta m(\theta)$. We first calculate $\nabla_\theta \|\nabla_\theta L(x_1)\|$ and $\nabla_\theta \|\nabla_\theta L(x_2)\|$:
$$\nabla_\theta \|\nabla_\theta L(x_1)\| = \frac{H_1 \nabla_\theta L(x_1)}{\|\nabla_\theta L(x_1)\|}, \qquad \nabla_\theta \|\nabla_\theta L(x_2)\| = \frac{H_2 \nabla_\theta L(x_2)}{\|\nabla_\theta L(x_2)\|},$$
and then apply the product rule to combine:
$$\nabla_\theta m(\theta) = \frac{H_1 \nabla_\theta L(x_1)}{\|\nabla_\theta L(x_1)\|} \, \|\nabla_\theta L(x_2)\| + \|\nabla_\theta L(x_1)\| \, \frac{H_2 \nabla_\theta L(x_2)}{\|\nabla_\theta L(x_2)\|}.$$
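As a sanity check on this derivation, one can compare the analytic gradient against autograd on toy quadratic losses, where the Hessians are the identity; this worked example is ours, not from the paper:

```python
import torch

def cosine_influence(theta, x1, x2):
    """Toy quadratic losses L(x, theta) = ||theta - x||^2 / 2, so that
    grad_theta L = theta - x and H1 = H2 = Id."""
    g1, g2 = theta - x1, theta - x2
    return torch.dot(g1, g2) / (g1.norm() * g2.norm())

theta = torch.randn(5, requires_grad=True)
x1, x2 = torch.randn(5), torch.randn(5)
I = cosine_influence(theta, x1, x2)
(auto_grad,) = torch.autograd.grad(I, theta)

# Analytic gradient from the derivation above, with identity Hessians:
g1, g2 = (theta - x1).detach(), (theta - x2).detach()
n1, n2 = g1.norm(), g2.norm()
manual = (g1 + g2) / (n1 * n2) \
         - torch.dot(g1, g2) * (g1 / n1**2 + g2 / n2**2) / (n1 * n2)
assert torch.allclose(auto_grad, manual, atol=1e-5)
```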