FastIF: Scalable Influence Functions for Efficient Model Interpretation and Debugging

Influence functions approximate the "influences" of training data-points for test predictions and have a wide variety of applications. Despite their popularity, their computational cost does not scale well with model and training-data size. We present FastIF, a set of simple modifications to influence functions that significantly improves their run-time. We use k-Nearest Neighbors (kNN) to narrow the search space down to a subset of good candidate data-points, identify the configurations that best balance the speed-quality trade-off in estimating the inverse Hessian-vector product, and introduce a fast parallel variant. Our proposed method achieves about 80X speedup while remaining highly correlated with the original influence values. With the availability of the fast influence functions, we demonstrate their usefulness in four applications. First, we examine whether influential data-points can "explain" test-time behavior using the framework of simulatability. Second, we visualize the influence interactions between training and test data-points. Third, we show that we can correct model errors by additional fine-tuning on certain influential data-points, improving the accuracy of a trained MultiNLI model by 2.5% on the HANS dataset. Finally, we experiment with a similar setup but fine-tune on data-points not seen during training, improving the model accuracy by 2.8% and 1.7% on the HANS and ANLI datasets respectively. Overall, our fast influence functions can be efficiently applied to large models and datasets, and our experiments demonstrate the potential of influence functions in model interpretation and correcting model errors.


Introduction
Language understanding systems are becoming ever more powerful with the recent advances in large-scale pre-training. (Code is available at https://github.com/salesforce/fast-influence-functions.) As these systems become widely adopted for real-world applications, the ability to interpret model decisions grows increasingly important. An array of interpretation methods now exist for shedding light on models' decision-making processes, with the bulk of work focusing on estimating feature importance. These approaches aim to identify features a model uses to make decisions or to analyze representations obtained from trained models. In contrast, one might also want to know how particular training data-points influence model behavior at test time. This kind of goal is an instance of what Lipton (2016) terms "algorithmic transparency," or, transparency "at the level of the learning algorithm itself." The ability to do so would allow researchers to identify and respond to data responsible for problematic test-time behavior.
One simple brute-force way to estimate the importance of a training data-point to a test-time decision is the leave-one-out approach (Hastie et al., 2009). Alternatively, influence functions (Koh and Liang, 2017) provide a tractable estimate of the effect without the need to repeatedly retrain the model. Yet, influence functions are very expensive to compute even for moderate-sized models and training data. For instance, finding the most influential examples w.r.t. an evaluation data-point with a model of about 14.7M parameters in a dataset of about 390K examples (a fairly moderate setting) takes more than 2 hours (please see Sec. 5.1 for details). Consequently, applications might face the dilemma of either accepting the computation cost or resorting to a smaller-scale setting. But why are influence functions computationally expensive? In our early experiments, we found that the main computational bottleneck lies in the following steps: 1. First, searching for the most positively/negatively influential data-point(s) w.r.t. some evaluation data-point(s) is an O(n) operation (via enumerating the entire training set), and can take more than two hours in our experiments. 2. Second, estimating the inverse Hessian of the model parameters, required to compute the influence of a single data-point, is expensive (usually on the order of minutes). 3. Lastly, previous algorithms perform serial computations that can actually be parallelized.
In this work, we present FASTIF (Fast Influence Functions) to address these challenges through three simple techniques. First, instead of enumerating the full training dataset to find influential data-points, we leverage fast nearest-neighbor search (Johnson et al., 2017) to narrow the search to a small subset of influence-worthy data-points. This operation reduces the computation by about an order of magnitude. Second, we identify a set of hyperparameters for the Hessian estimation algorithm that reduces computation time by more than half while preserving estimation quality. Finally, we describe a simple parallelizable extension, which gives an additional 2X speedup. As a result, we are able to speed up the overall search for the most influential examples by approximately two orders of magnitude in our experiments.
So what could we do with faster influence functions? We demonstrate the advantage of fast influence functions via several interesting downstream applications. These require computing influence functions repeatedly and were thus almost intractable without such fast influence functions: 1. In Sec. 6.1, we examine the "explainability" of influential examples using the framework of simulatability (Doshi-Velez and Kim, 2017; Hase and Bansal, 2020), and we find that they improve model simulatability. 2. We visualize how different training data-points interact with different test data-points in Sec. 6.2. 3. In Sec. 6.3, we show that we can correct model errors by additional fine-tuning on helpful data-points from the original training set. 4. Finally, we show that a similar setup can correct model errors by fine-tuning on data-points not seen during training.

Related Work

Recent work (2020) used influence functions to estimate the quality of synthetic training samples in the context of data augmentation. Meng et al. (2020) explored the combination of gradient-based methods and influence functions to jointly examine training history and test stimuli. Their work also tried using influence functions to fix erroneously classified examples, albeit by retraining another model. In our work, we primarily focus on improving the run-time of influence functions, and explore their potential in interpreting model behaviors and efficiently correcting predictions. Kobayashi et al.
(2020) experimented with training models with instance-specific dropout masks to efficiently compute the influence of one training data-point on one test data-point. Our method, on the other hand, both speeds up individual influence function computation (Sec. 3.3) and reduces the number of such computations (Sec. 3.2). Further, it is an inference-time technique; it requires no change to model training and could, in principle, work with any trained model. Finally, we also include a more extensive set of experiments/applications on a larger dataset.

Background
Let a data-point be defined as z=(x, y) for input x and label y, and the loss function be L(z, θ). Given N training data-points in the training set Z, standard empirical risk minimization solves the following problem:

θ̂ = argmin_θ (1/N) Σ_{z_i∈Z} L(z_i, θ).

We want to answer the question: what is the influence of a training data-point z on the learned model parameters θ̂ and its behavior on a new test data-point z_test?
Leave-one-out methods take the "discrete" way: train two models, one on the full training dataset Z, and one on the same dataset but with z removed. The difference in the behavior between these two models is the effect of z's presence (or absence). Among the many definitions of model behavior, in this work we mainly focus on the loss at the test data-point. Influence functions, on the other hand, answer this question via a local approximation, and measure the change in the model behavior when the loss of the training data-point is up-weighted by a small ε. Thus the influence function refers to the change in the model's loss on the test data-point z_test if we up-weight the loss of training data-point z by ε:

I(z, z_test) = dL(z_test, θ̂_{ε,z}) / dε |_{ε=0},

where θ̂_{ε,z} are the parameters of the model trained with training data-point z up-weighted by ε,

θ̂_{ε,z} = argmin_θ (1/N) Σ_{z_i∈Z} L(z_i, θ) + ε L(z, θ).

Measuring I(z, z_test) via training another model is prohibitively expensive. The influence function (Koh and Liang, 2017) computes the following tractable approximation:

I(z, z_test) ≈ −∇_θ L(z_test, θ̂)ᵀ H_θ̂⁻¹ ∇_θ L(z, θ̂),   (1)

where θ̂ is the original parameter vector of the model trained on the training data, and H_θ̂ is the Hessian of the loss w.r.t. the model parameters. For each test data-point z_test, we are usually interested in finding the most positively influential training data-points and the most negatively influential training data-points. We can find the most positively influential (i.e., harmful) training data-point z* by computing the influence values between z_test and each z in the training set,

z* = argmax_{z∈Z} I(z, z_test),   (2)

where argmax changes to argmin for finding the most negatively influential (i.e., helpful) data-point. While a much more tractable approximation, this is still unscalable in practice for a few reasons. First, enumerating all the data-points in the training set to find the argmax/argmin influence values is expensive for large datasets. Second, evaluating the inverse Hessian H_θ̂⁻¹ is very expensive for large neural-network models (Koh and Liang, 2017; Agarwal et al., 2017).
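To make these quantities concrete, here is a minimal sketch of the Koh and Liang (2017) approximation, −∇_θL(z_test)ᵀ H⁻¹ ∇_θL(z), for a toy ridge-regression model where the Hessian is small enough to form and invert explicitly (real models do not allow this, which motivates the techniques below). The toy data and all names are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: ridge regression, where the Hessian is exact and cheap.
N, p = 200, 5
X = rng.normal(size=(N, p))
theta_true = rng.normal(size=p)
Y = X @ theta_true + 0.1 * rng.normal(size=N)

lam = 0.1
H = X.T @ X / N + lam * np.eye(p)            # Hessian of the regularized objective
theta_hat = np.linalg.solve(H, X.T @ Y / N)  # closed-form minimizer

def grad_loss(x, y, theta):
    """Gradient of the per-example squared loss 0.5*(x^T theta - y)^2."""
    return (x @ theta - y) * x

def influence(z_train, z_test):
    """I(z, z_test) = -grad L(z_test)^T H^{-1} grad L(z); positive => harmful."""
    x, y = z_train
    xt, yt = z_test
    s_test = np.linalg.solve(H, grad_loss(xt, yt, theta_hat))  # H^{-1} grad L(z_test)
    return -grad_loss(x, y, theta_hat) @ s_test

# Up-weighting a training copy of the test point can only lower its loss
# (H is positive definite), so its self-influence is non-positive (helpful):
z = (X[0], Y[0])
print(influence(z, z))  # <= 0
```

Note that with a symmetric Hessian the toy influence is symmetric in its two arguments; for deep networks the Hessian is never materialized, and s_test is estimated stochastically as described in Sec. 3.3.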
In this work, we address those challenges by presenting simple but effective methods. We summarize the speed improvements in MultiNLI settings in Table 1 and Sec. 5.1, and our method in Fig. 1.

Speeding up the arg max using kNN
A naive implementation of Eq. 2 is expensive because the computation cost grows with the dataset size. In practice, however, we are primarily concerned with the data-points that are most influential. We hypothesize that we can constrain the expensive search to a subset of promising data-points Ẑ ⊆ Z that are likely to be influential, without a significant impact on quality:

z* ≈ argmax_{z∈Ẑ} I(z, z_test).

Notice that the argmax is now performed over Ẑ ⊆ Z. We select the subset Ẑ as the top-k nearest neighbors of z_test based on the ℓ2 distance between extracted features of the data-points, following Khandelwal et al. (2019) and Rajani et al. (2020). This operation is extremely fast using highly optimized nearest-neighbor search libraries such as FAISS (Johnson et al., 2017). In Sec. 5.2, we examine the quality of nearest neighbors in terms of retrieving influence-worthy data-points.
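The preselection step above can be sketched in a few lines; here an exact brute-force numpy search stands in for FAISS (which provides the same top-k operation at scale), and random vectors stand in for the model's extracted features:

```python
import numpy as np

def knn_candidates(train_feats, test_feat, k):
    """Select the k nearest training points by L2 distance in feature space.
    (FAISS provides this operation at scale; exact brute force shown here.)"""
    d2 = ((train_feats - test_feat) ** 2).sum(axis=1)
    return np.argsort(d2)[:k]

# Influence search is then restricted to the candidate subset:
# z* = argmax_{z in Z_hat} I(z, z_test), with |Z_hat| = k << |Z|.
rng = np.random.default_rng(0)
feats = rng.normal(size=(1000, 8))   # stand-in for final-layer features
query = feats[42] + 0.01 * rng.normal(size=8)
cands = knn_candidates(feats, query, k=10)
print(42 in cands)  # → True: the near-duplicate of the query is retrieved
```

The expensive influence computation of Eq. 1 is then applied only to the k returned indices rather than to the full training set.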

Speeding up the Inverse Hessian
The next computational bottleneck lies in estimating the inverse Hessian of the model parameters in Eq. 1. Computing the Hessian for the full training dataset is expensive, and inverting it is similarly prohibitive: with n training data-points and p parameters, this computation requires O(np² + p³) operations, which is very expensive for large datasets/models (please see Sec. 3 in Koh and Liang (2017) for details). We start by describing the method proposed in Koh and Liang (2017). (As the kNN features in Sec. 3.2, we use the model's final representation.)
First, notice that for each test data-point z_test, we can pre-compute and cache the quantity

s_test = H_θ̂⁻¹ ∇_θ L(z_test, θ̂).

We can then efficiently compute the influence of each data-point z as

I(z, z_test) ≈ −s_testᵀ ∇_θ L(z, θ̂).

Next, the method approximates s_test via a combination of (1) implicit Hessian-vector products (HVPs), to avoid explicitly computing or storing H_θ̂, and (2) a stochastic estimator over mini-batches of training data:
• Step 1. Sample a mini-batch of B training data-points.
• Step 2. With v = ∇_θ L(z_test, θ̂) and H̃_0⁻¹ v = v, compute the recursion H̃_j⁻¹ v = v + (I − ∇²_θ L(z_batch, θ̂)) H̃_{j−1}⁻¹ v for j = 1, …, J, where J is a sufficiently large integer so that the above quantity converges.
• Step 3. Repeat Steps 1-2 T times independently, and return the averaged inverse-HVP estimates.
We can see that the computation cost and approximation quality depend on: the number of recursive iterations J, the number of independent runs T, and the batch size B. Typically, J is chosen such that the approximation converges, the number of repetitions is simply set to T=1, and the batch size is set to the largest value that GPU memory can afford. Experiments in Sec. 5.3 examine the speed-quality trade-off under various configurations. Contrary to the common settings, the experimental observations lead us to propose a few simple changes: (1) choose J so that the approximation converges; (2) choose a small batch size; (3) make up for the noisiness of the small batch size using a larger T, which can be distributed over multiple GPUs (Sec. 3.4). In our experiments, we pick J∈{1000, 1500, 2000} based on convergence, we find that even B=1 suffices, and we choose T∈[1, 4].
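The recursion and its J/T/B knobs can be sketched on a toy problem where the exact inverse is available for comparison. Here rank-one "per-example Hessians" a_i a_iᵀ stand in for per-batch ∇²L, and the 0.3·I term plays the role of damping that keeps the recursion stable; all constants are illustrative, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 6, 500
A = rng.normal(size=(n, p)) / np.sqrt(8 * p)  # rows give per-example Hessians a_i a_i^T
damp = 0.3                                    # damping keeps ||I - H_batch|| < 1
H = A.T @ A / n + damp * np.eye(p)            # the "full" Hessian we want to invert

def inverse_hvp(v, J=1500, T=8, B=1):
    """Estimate H^{-1} v via the recursion h_j = v + (I - H_batch) h_{j-1},
    averaged over T independent runs (the runs can execute in parallel)."""
    runs = []
    for _ in range(T):
        h = v.copy()
        for _ in range(J):
            idx = rng.integers(0, n, size=B)
            Hb = A[idx].T @ A[idx] / B + damp * np.eye(p)  # mini-batch Hessian
            h = v + h - Hb @ h
        runs.append(h)
    return np.mean(runs, axis=0)

v = rng.normal(size=p)
exact = np.linalg.solve(H, v)
rel_err = np.linalg.norm(inverse_hvp(v) - exact) / np.linalg.norm(exact)
```

In expectation the fixed point of the recursion is h* = v + (I − H)h*, i.e., h* = H⁻¹v; a single B=1 run is noisy, and averaging the T independent runs shrinks that noise, mirroring the small-B/large-T recipe above.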

Details on Parallelization (Fig. 2)
The modifications described in Sec. 3.2 and 3.3 are designed with parallelizability in mind. Notably, the advantage of using multiple repetitions in Sec. 3.3 is that they can run in parallel. We can asynchronously compute the T estimates of s_test, use one synchronization point to average the results via all-reduce, and then asynchronously compute the influence of the subset of data-points pre-selected using kNN.
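Schematically, the "asynchronous runs, one all-reduce, asynchronous scoring" pattern might look as follows, with thread workers standing in for GPUs and toy stub functions standing in for the actual s_test runs and influence scoring (all names here are illustrative, not the paper's multi-GPU implementation):

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def one_stest_run(seed):
    """One independent s_test estimate (stand-in for a per-GPU LiSSA-style run)."""
    r = np.random.default_rng(seed)
    return r.normal(loc=1.0, scale=0.1, size=4)

def influence_score(idx, s_test):
    """Stand-in for -s_test^T grad L(z_idx) on kNN candidate idx."""
    return -float(s_test @ np.full(4, 0.25)) * (idx + 1)

T, candidates = 4, range(8)
with ThreadPoolExecutor(max_workers=T) as pool:
    runs = list(pool.map(one_stest_run, range(T)))  # asynchronous, independent runs
    s_test = np.mean(runs, axis=0)                  # the single all-reduce point
    scores = list(pool.map(lambda i: influence_score(i, s_test), candidates))

best = max(candidates, key=lambda i: scores[i])     # most positively influential
```

Only the averaging step requires synchronization; everything before and after it is embarrassingly parallel across workers.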

Experimental Setup
We use the MultiNLI dataset in the analysis experiments (Sec. 5), and we include the MultiNLI, HANS, ANLI, and Amazon-WILDS datasets (English) in the applications (Sec. 6). Table 3 in the appendix summarizes key experiment details such as the number of evaluation data-points and repetitions used in experiments. We use V100 GPUs in experiments. Please see the corresponding papers for dataset statistics.

Speed Improvements

Table 1 presents the summary of computation times. The run-time statistics are computed over 10 evaluation examples on the full MultiNLI training dataset (please see Sec. 4 for details). We can see that adding kNN reduces the time by about an order of magnitude, the fast s_test approximation cuts the time by an additional 55−70%, and parallelism further reduces the time by more than two-fold. With k=10³, the fast s_test approximation, and 4 GPUs, we are able to find the influential training data-points of an evaluation data-point in about 1.5 minutes, more than 80 times faster than the original influence functions, which would take more than 2 hours.

Recall of kNN
To examine kNN's recall, we ask: if a data-point is influential, will it be included in the subset selected by the kNN? We define the recall score R@m as the percentage of top-m ground-truth influential data-points that are selected by the kNN, where we let the top-m influential data-points computed without kNN (i.e., on the full dataset) be the top-m ground-truth influential data-points.
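The recall score R@m defined above can be computed directly from the two rankings; a small self-contained helper (names illustrative):

```python
def recall_at_m(ground_truth_ranking, retrieved, m):
    """R@m: fraction of the top-m ground-truth influential points (computed
    on the full dataset, without kNN) that appear in the kNN-selected subset."""
    top_m = set(ground_truth_ranking[:m])
    return len(top_m & set(retrieved)) / m

# e.g. ground-truth top-4 = [7, 2, 9, 1]; kNN retrieved {2, 9, 5, 0, 7}
print(recall_at_m([7, 2, 9, 1], {2, 9, 5, 0, 7}, m=4))  # → 0.75
```

In the experiments, the ground-truth ranking comes from running influence functions on the full training set, and the retrieved set is the k nearest neighbors of z_test.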
Results. Fig. 3 (a) shows the experimental results. Overall, the recall scores show that kNN is useful in selecting data-points that are likely to be influential. The recall scores for the most influential (i.e., using the absolute value) data-points are close to 60% with k=5×10⁴, and 20% with k=5×10³. These k's are about order(s) of magnitude smaller than the size of the full training dataset (more than 3.9×10⁵ examples). Note that random selection would lead to recall around 15% (k=5×10⁴) and 1.5% (k=5×10³) in the two settings. One interesting observation is that the recall scores for the most harmful data-points tend to be higher than those for the most helpful ones when the predictions are correct, but lower when the predictions are incorrect.

Inverse HVP Approximation
We look at the speed-quality trade-off of the Hessian approximation. Our experiments show that the speed-up does not greatly affect the quality.

Results. Fig. 3 (b) shows the results of the experiments. From the two figures on the left, we can observe that the computation cost (measured in time) grows with both the batch size and the number of recursive iterations J. Similarly, the figures on the right show that the estimation error (measured as the norm of the difference w.r.t. the estimate from the most expensive configuration) decreases with both the batch size and J in general. Further, we notice that when the batch size is small (e.g., B=1), we can make up for the loss in quality by increasing T, which is inherently parallelizable (Sec. 3.4). Overall, these results suggest that we can trade a small drop in quality for a significant speed-up by combining (1) a small B, (2) a medium J, and (3) a large T.

Quality of Influence Estimations
Finally, we want to ensure that the final computed influence scores are of sufficient quality. First, we compute the correlations between the influence values computed with both kNN and the fast s_test approximation (Fast) and those computed without them (Full). Next, we compare the quality of the computed influences by actually retraining the models. If the influence function correctly identifies helpful and harmful data-points, retraining without the helpful points should cause the evaluation loss to rise (i.e., a positive change in loss), and retraining without the harmful points should cause the loss to fall (i.e., a negative change in loss).
Results. Table 2 shows that the correlations are high for all measures considered: fast influence functions achieve >95% correlation with full influence functions. This demonstrates that the fast influence functions achieve reasonable quality at just a fraction of the total cost. Next, Fig. 3 (c) shows the retraining results, where we separate the cases based on whether the prediction is correct and average across data-points within each bucket. We can see that, overall, the loss increases when we remove helpful data-points and decreases when we remove harmful data-points. Further, the performance (measured by the change in loss) of the fast influence functions is similar to that of the full influence functions (i.e., no kNN and no fast s_test approximation), and both generally perform better than random selection. Note that with a large training set, individual data-points tend to have a limited effect on the generalization error; we see a larger effect as more points are removed.

Explainability of Influential Examples
Motivation. Knowing which points are influential to the model loss may give us insight into our model/training algorithm, and therefore we want to test whether influence functions can be used to improve model explainability. To do so, we need a framework for evaluating the quality of explanations given in terms of influential data points.
Approach. Here we will mainly focus on the concept of simulatability. Doshi-Velez and Kim (2017) explained that a model is simulatable when a person can predict its behavior on new inputs. Thus, we will measure an explanation's quality in terms of its effect on model simulatability. Specifically, we train another simulator model to predict the predictions of the task model (Treviso and Martins, 2020; Hase et al., 2020). The simulator model is trained with the same data as the task model, but the labels are replaced with the task model's predictions. We then fine-tune the simulator on data identified as influential for the task model's performance on test data-points. If the simulator model can better predict what the task model will do by fine-tuning on this data (i.e., achieving lower loss), the influential data points are said to be a good "explanation" of the task model's behavior. This approach is similar to Pruthi et al. (2020), who also treat explanations as targets to fit a model to.
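As an illustration of this protocol (not the paper's BERT-scale setup), the pipeline can be sketched with a linear "task model" and a logistic-regression simulator trained on the task model's predictions; the influential examples are stood in for by an arbitrary subset, where FastIF would instead select them by influence:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
w_task = rng.normal(size=d)                   # hypothetical trained task model

def task_predict(X):
    return (X @ w_task > 0).astype(float)     # labels the simulator must mimic

def train_logreg(X, y, w=None, lr=0.5, steps=200):
    """Plain gradient descent on logistic loss; the simulator model."""
    w = np.zeros(X.shape[1]) if w is None else w.copy()
    for _ in range(steps):
        p = 1 / (1 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)
    return w

def sim_loss(w, X, y):
    p = np.clip(1 / (1 + np.exp(-X @ w)), 1e-6, 1 - 1e-6)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

X_train = rng.normal(size=(200, d))
X_test = rng.normal(size=(20, d))
y_sim = task_predict(X_train)                 # simulator targets = task predictions

w_sim = train_logreg(X_train, y_sim)          # base simulator
loss_before = sim_loss(w_sim, X_test, task_predict(X_test))

# Fine-tune the simulator on (stand-in) "influential" examples for the test set:
infl_idx = np.arange(10)
w_ft = train_logreg(X_train[infl_idx], y_sim[infl_idx], w=w_sim, steps=20)
loss_after = sim_loss(w_ft, X_test, task_predict(X_test))
```

The explanation is judged good when loss_after < loss_before, i.e., when fine-tuning on the influential examples makes the simulator better at predicting the task model.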
Experiment Results. We can observe from Fig. 4 that, when the prediction is correct, fine-tuning on helpful data-points improves the simulator's ability to predict the task model's behavior. Similarly, when the prediction is incorrect, fine-tuning on data-points that are harmful (to the task model's loss on ground-truth labels) improves the simulator's predictions of the task model. Further, the effect on the loss from influential data-points is overall greater than that from a random selection of data-points. This demonstrates that influential examples can serve as explanations of the task model's behavior.

Effect Visualization
Motivation. Investigating how different training data-points interact with test data-points is also useful, because such exploratory data analysis can discover interesting relationships between data-slices (i.e., subsets of data with specific attributes).
Approach. We conduct experiments on two models, one trained on MultiNLI and the other on HANS. We then compute influence functions on their corresponding training data-points and build circular bipartite graphs. The nodes represent training (inner circle) and evaluation (outer circle) data-points (incorrect predictions), and the strength and color of the edges represent the influence values. Fig. 5 visualizes the results, where the left half corresponds to the model trained on MultiNLI and the right half corresponds to the model trained on HANS.
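For instance, the edge list of such a bipartite graph can be assembled from a matrix of influence values (a minimal sketch; node layout and coloring are left to the plotting library):

```python
def influence_edges(infl_matrix, threshold=0.0):
    """Build (train_idx, test_idx, influence) edges for a bipartite influence
    graph, keeping edges whose |influence| exceeds the threshold; the sign
    distinguishes helpful (negative) from harmful (positive) edges."""
    return [(i, j, v)
            for i, row in enumerate(infl_matrix)
            for j, v in enumerate(row)
            if abs(v) > threshold]

edges = influence_edges([[0.0, 1.5], [-2.0, 0.1]], threshold=0.5)
print(edges)  # → [(0, 1, 1.5), (1, 0, -2.0)]
```

Thresholding keeps the visualization readable by dropping the many near-zero influence interactions.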

Experiment Results
Left Hand Side. Appendix Table 5 summarizes a few key statistics about the plots on the left-hand side. Interestingly, we observe that while more MultiNLI training data-points are harmful than helpful to the HANS/MultiNLI evaluation data-points, their (harmful) influences are in general much smaller in magnitude than those of their helpful counterparts. This suggests that a few critical helpful data-points can potentially improve the model's performance on HANS, whereas a few harmful data-points will have a relatively smaller impact on the model's performance. In Sec. 6.3, we leverage this insight to improve model performance on HANS. This could also be connected to the observation in Sec. 5.2 that the recall scores of kNN tend to be higher for helpful data-points on incorrect predictions.
Further, we measure the influence correlation between different data slices. Table 6 in the appendix suggests that, for the datasets considered here, if training data-points are influential (either harmful or helpful) to one of the data slices, these data-points will likely be similarly influential to other data slices (if influential at all).
Right Hand Side. Since here the training data is HANS, we further segment the HANS training data-points based on the subset they are in, using different colors and radii. Interestingly, we find that training data-points in the "Lexical Overlap" subset are noticeably more influential (either helpful or harmful) to all the HANS evaluation subsets. Note that, by the construction of the dataset, the "Lexical Overlap" heuristic includes the other heuristics as special cases. Hence, we conclude that visualization of data influences can be used to discover latent structure in datasets.

Error Correction
Figure 4: Simulator loss on 4 test data-points (more figures in the appendix), where the simulator is fine-tuned on different types of data-points with ground-truth labels using various learning rates. The lines show the mean performance averaged across 10 fine-tuning data-points, and the shaded area covers the max/min performance.

Motivation. In addition to model interpretation and exploratory data analysis, one might also consider ways to efficiently correct wrong model predictions, which is often useful in practice. Luckily, the influence function not only indicates which data-points are harmful, but also which data-points are helpful. This suggests a simple way to correct/fix model prediction(s): taking gradient steps with the model on the helpful data-points. This approach naturally leads to another interesting question. Instead of taking gradient steps on data from the original training set, can the same approach be used to suggest new training data-points that were not seen during training? This is interesting because the original formulations of influence functions are defined on the data-points used during training. Our experiments show that, when the new training data-points are "similar" to the original training data-points, this approach can be helpful.
Approach. We first sample a small batch of validation data-points ("anchor" data-points), compute influence scores of training data-points w.r.t. the anchor data-points, and then update the model parameters by taking gradient steps on the training data-points influential w.r.t. these anchor data-points. These steps are repeated multiple times. Please see Sec. C.3 for more details, such as a step-by-step algorithm and dataset splits.
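The loop above can be sketched as follows. The toy instantiation uses squared losses, and approximates influence by gradient alignment (a crude stand-in for the full estimate, dropping the inverse Hessian); anchors come from validation data in the paper, and all names here are illustrative.

```python
import numpy as np

def error_correction(theta, train_Z, anchors, influence_fn, grad_fn,
                     rounds=3, top_m=5, lr=0.05):
    """Sketch of the correction loop: score training points against a small
    batch of anchor points, then take gradient steps on the most helpful
    (most negative influence) ones, and repeat."""
    for _ in range(rounds):
        scores = np.array([sum(influence_fn(z, a, theta) for a in anchors)
                           for z in train_Z])
        for i in np.argsort(scores)[:top_m]:   # most negative = most helpful
            theta = theta - lr * grad_fn(train_Z[i], theta)
    return theta

# Toy instantiation with squared losses.
rng = np.random.default_rng(0)
theta_star = np.ones(3)
train_Z = [(x, float(x @ theta_star)) for x in rng.normal(size=(50, 3))]
anchors = train_Z[:5]
grad_fn = lambda z, th: (z[0] @ th - z[1]) * z[0]
infl_fn = lambda z, a, th: -float(grad_fn(a, th) @ grad_fn(z, th))
anchor_loss = lambda th: float(np.mean([(x @ th - y) ** 2 for x, y in anchors]))

theta0 = np.zeros(3)
theta1 = error_correction(theta0, train_Z, anchors, infl_fn, grad_fn)
```

Taking small steps on the points whose gradients align with the anchor gradients is, to first order, descent on the anchor loss, which is why fine-tuning on helpful points can correct the targeted errors.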
Experiment Results. Fig. 6 (a) shows the results for the model trained on MultiNLI and evaluated on HANS, where the augmented data come from the original training dataset (MultiNLI). Overall, we can see that fine-tuning on helpful data-points improves performance compared to random ones, and fine-tuning on harmful data-points leads to worse performance. This simple technique brings more than a 5.8% average improvement in accuracy. As using influence functions requires gradient access to the anchor data-points, we also experiment with directly fine-tuning on them ("z-test" in the figure). The results demonstrate the potential of using influence functions to correct model predictions, above and beyond the improvements available from just fine-tuning on the anchor data-points (by about 2.5% in accuracy on average).
We can also observe from Fig. 6 (a) that helpful examples tend to have a greater magnitude of effect than harmful examples. This can be connected to the visualization results in Sec. 6.2 where we can see (a handful of) helpful data-points have large negative/helpful influences while many data-points have medium positive/harmful influences.
Next, we examine settings where we fine-tune the model on a new dataset unseen during training instead of on the original training dataset (i.e., a data-augmentation setup). Figs. 6 (b) and 7 (c) show the results for the model trained on MultiNLI and evaluated on HANS/ANLI, where the augmented data come from the HANS/ANLI training datasets. We observe that random data augmentation works reasonably well. Further, augmenting with helpful data-points generally outperforms both random data augmentation and using the anchor data-points directly. In the end, we improve the average accuracy on HANS and ANLI by more than 5.9%/2.7% (about 2.8%/1.7% more than using the anchor data-points directly), respectively. These results show the potential of influence functions for sample-efficient data augmentation.
Finally, we experiment with settings where the model is trained/fine-tuned on the Amazon-WILDS training dataset and evaluated on an out-of-distribution (OOD) test set. Fig. 7 (d) shows that fine-tuning on harmful data-points deteriorates the performance as expected, though fine-tuning on helpful ones has little impact on the performance. We hypothesize that the selected anchor data-points are not representative enough of the evaluation dataset, as fine-tuning on the anchor data-points directly also has little impact. This shows that the usefulness of our method likely depends on the quality of the chosen anchor data-points.

Conclusions
We present FASTIF which, via simple modifications, significantly speeds up the computation of influence functions without significantly impacting their quality. Our improvements include using kNN to pre-select a subset of influence-worthy data-points, identifying a set of hyperparameters for the inverse-HVP estimation algorithm, and a parallel implementation that minimizes communication overhead. We empirically examine the effectiveness of these modifications. Then, with the availability of fast influence functions, we demonstrate a few interesting applications that were previously intractable: (1) examining the "explainability" of influential data-points, (2) visualizing data influence-interactions, (3) correcting model predictions using original training data, and (4) correcting model predictions using data from a new dataset.

Acknowledgments
We thank the reviewers and Shi Feng for helpful discussions. HG interned at Salesforce Research; PH and MB were supported by a Salesforce Research Faculty Grant, NSF-CAREER Award 1846185, DARPA YFA17-D17AP00022, and a Royster Society PhD Fellowship.

Ethical Considerations
This work presents scalable influence functions for efficient model interpretation and debugging, which would be especially useful for improving model performance for particular categories of model failures after they are identified.
Recently, Carlini et al. (2020) noticed that an adversary could extract training data from large-scale language models and recover potentially sensitive information. If properly used, our tool could help check whether a model might be vulnerable to such attacks (hence, it should be used to encourage the detection of such memorization, as opposed to being misused to exploit such models). Finally, while the fast influence functions are more compute-efficient than alternatives like retraining and full-scale influence functions, they are nevertheless expensive operations. Thus, applying FASTIF to large-scale datasets and models might be restricted to those who have access to adequate computing power (indeed, making influence functions faster and more compute-efficient is precisely the main purpose of this paper).

A Summary of Key Experiment Details
Please see Table 3.

B Experimental Results and Analysis
In this section, we examine the effectiveness of the methods described in Sec. 3.2 and 3.3. Specifically, we conduct the following set of experiments:
• We measure kNN's recall in terms of retrieving data-points that are potentially influential.
• We look at the speed/quality trade-off of the inverse-Hessian-vector-product approximation under various configurations.
• We examine the quality of the influence estimations using all of the proposed techniques, by comparing the correlations between the influence values computed with and without them.

B.1 Recall of kNN
In Sec. 3.2, we describe the use of kNN for pre-selecting a subset of data-points. This (smaller) set of data-points is then re-ranked using the more expensive influence functions. Making this work well requires that the subset selected by kNN contain the potentially most influential data-points.
In this section, we examine the kNN's recall. We ask the question: if a data-point is influential, will it be included in the subset selected by the kNN? To formalize this, we define the recall score R@m as the percentage of top-m ground-truth influential data-points that are selected by the kNN, where we let the top-m influential data-points computed without kNN (i.e., on the full dataset) be the top-m ground-truth influential data-points.
Details. For each evaluation data-point z test , we first compute the ground-truth influential data-points via running influence functions on the MultiNLI training dataset without kNN (i.e., { top-m influential }). Then, we use kNN to select the k training datapoints (i.e., { retrieved }).
We choose k∈{5×10³, 5×10⁴}. Finally, we compute the recall R@m with m∈{10¹, 10², 10³}. We repeat the aforementioned steps for three types of influential data-points: most positively influential (harmful), most negatively influential (helpful), and most influential (unsigned influence, by taking the absolute value). We select 100 data-points from the MultiNLI evaluation dataset (50 data-points each for which the model predictions are correct and incorrect) and aggregate the results.
Results. Fig. 8 shows the experimental results. Overall, the recall scores show that kNN is useful in selecting data-points that are likely to be influential. The recall scores for the most influential (i.e., using the absolute value) data-points are close to 60% with k=5×10⁴, and 20% with k=5×10³. These k's are about order(s) of magnitude smaller than the size of the full training dataset (more than 3.9×10⁵ examples). Note that random selection would lead to recall around 15% (k=5×10⁴) and 1.5% (k=5×10³) in the two settings. One interesting observation is that the recall scores for the most harmful data-points tend to be higher than those for the most helpful ones when the predictions are correct, but lower when the predictions are incorrect.

B.2 Inverse-Hessian-Vector-Product Approximation Speed-Quality Trade-Off
We look at the speed-quality trade-off of the Hessian approximation tricks, and check whether the speed-up comes at the cost of dramatically lower quality. Our experiments show that the speed-up does not greatly affect the quality.
Details. We compute the Hessian approximations with J∈{700, 800, ..., 1300}, varying the batch size B and the number of independent runs T. 11

Results. Fig. 9 shows the results of the experiments. In the two figures on the left, we observe that the computation cost (measured in time) grows with both the batch size and the number of recursive iterations J. Correspondingly, the estimation error 12 (the figures on the right) generally decreases with both the batch size and J. Further, we notice that when the batch size is small (e.g., B=1), we can make up for the loss in quality by increasing T, which is inherently parallelizable (Main Paper Sec. 3.4). Overall, these results suggest that we can trade a small drop in quality for a significant speed-up through the combination of (1) small B, (2) medium J, and (3) large T. 13

11 In the figures, we only include the results of different T for the difference norm. This is because practitioners can use parallelism to speed up the computations of each run, which are independent of each other as described in Sec. 3.4. Thus, the change in time mainly depends on parallelism overhead, which we found to be reasonable in our initial experiments.

12 The estimation error is measured as the difference norm w.r.t. the estimate from the most expensive configuration.

13 We pick J∈{1000, 1500, 2000} based on convergence.
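A minimal sketch of the stochastic inverse-Hessian-vector-product recursion with the B, J, and T knobs discussed above; for illustration it assumes per-example Hessians are available as explicit small matrices (in practice they are only accessed through Hessian-vector products), and the `scale` damping factor and function name are illustrative rather than the paper's exact implementation:

```python
import numpy as np

def lissa_ihvp(hessians, v, J=1000, B=1, T=4, scale=25.0, seed=0):
    """Stochastic estimate of H^{-1} v. Each step samples B per-example
    Hessians; the recursion h_j = v + (I - H_b/scale) h_{j-1} converges
    (in expectation) to scale * H^{-1} v, so we divide by `scale` at the
    end and average over T independent runs."""
    rng = np.random.default_rng(seed)
    n = len(hessians)
    runs = []
    for _ in range(T):                 # runs are independent -> parallelizable
        h = v.copy()
        for _ in range(J):
            idx = rng.choice(n, size=B, replace=False)
            H_b = hessians[idx].mean(axis=0)
            h = v + h - (H_b @ h) / scale
        runs.append(h / scale)
    return np.mean(runs, axis=0)

# Toy check against the exact inverse on small random PSD Hessians.
rng = np.random.default_rng(1)
A = rng.normal(size=(5, 3, 3))
Hs = np.array([a @ a.T + np.eye(3) for a in A])   # per-example PSD Hessians
H, v = Hs.mean(axis=0), rng.normal(size=3)
est = lissa_ihvp(Hs, v, J=2000, B=1, T=8)
print(np.linalg.norm(est - np.linalg.solve(H, v)))
```

Small B makes each step cheap but noisy; averaging over T runs recovers quality, which mirrors the (1) small B, (2) medium J, (3) large T recommendation.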

B.3 Quality of Influence Estimations
Finally, we want to ensure that there are no significant cascading estimation errors and that the final computed influence scores are of sufficient quality. Sec. B.1 and B.2 can be thought of as "unit-tests" checking whether each component works well on its own; in this section, our goal is to verify that all the components work well together. First, we compute the correlation of influence values between two systems: one that computes influence functions with both kNN and the fast s_test approximation (Fast, with k=10^3 or k=10^4), and one that computes influence functions without them (Full).
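The correlation between the two systems can be computed along these lines; `fast` and `full` are hypothetical arrays of influence values restricted to the subset of data-points that both systems score:

```python
import numpy as np

def rank(x):
    """Simple rank transform (ties ignored), so that the Pearson
    correlation of ranks equals the Spearman correlation."""
    r = np.empty(len(x))
    r[np.argsort(x)] = np.arange(len(x))
    return r

def compare_influences(fast, full):
    """Pearson and Spearman correlation between influence values from the
    fast system (kNN + fast s_test approximation) and the full system."""
    pearson = np.corrcoef(fast, full)[0, 1]
    spearman = np.corrcoef(rank(fast), rank(full))[0, 1]
    return pearson, spearman
```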
Next, we further compare the quality of the computed influences by actually retraining the models using three systems: (1) a system that computes influence functions with both kNN and the fast s_test approximation, (2) a system that computes influence functions without them, and (3) a system that randomly selects data-points.

Figure 9: Left half: computational time of the Hessian approximation as a function of batch size and recursive iterations J, broken into two sub-figures for cases when the prediction is correct and when it is incorrect. Right half: estimation error norm as a function of batch size, recursive iterations, whether the prediction is correct, and the number of independent runs T (panels for T=1 and T=4).

Figure 10: Change in loss on the evaluation data-point after retraining, where we remove m_remove∈{1, 5, 25, 50, 100} data-points. The fast influence algorithms produce reasonable-quality estimations at just a fraction of the computation cost.
For each evaluation data-point selected, we find its m remove most influential training data-points. Then we retrain the model by removing them and measure the change in the loss on the same evaluation data-point. If the influence function correctly identifies helpful and harmful data-points, retraining without helpful points should result in the evaluation loss rising (i.e., positive change in loss), and retraining without harmful points should result in the loss falling (i.e., negative change in loss).
Details. For each selected evaluation data-point, we find its influential training data-points using one of the three aforementioned systems. Then we retrain the model after removing m_remove training data-point(s) and measure the change in the loss on the same evaluation data-point. For system (1), we use kNN (with k∈{10^3, 10^4}) and the fast s_test approximation, and for system (3), we randomly select data-points with each of the three labels. We choose m_remove∈{1, 5, 25, 50, 100} and repeat the experiment for 10 data-points (5 where the prediction is correct and 5 where it is incorrect) that have at least 100 harmful/helpful data-points.
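The remove-and-retrain measurement can be illustrated with a toy sketch, using closed-form least squares as a stand-in for the paper's neural-model retraining (all names here are illustrative):

```python
import numpy as np

def fit(X, y):
    # Closed-form least squares stands in for (re)training the model.
    return np.linalg.lstsq(X, y, rcond=None)[0]

def delta_loss_after_removal(X, y, x_test, y_test, remove_idx):
    """Retrain without the points in `remove_idx` and return the change in
    test loss. A positive change means the removed points were helpful for
    this test point; a negative change means they were harmful."""
    w_full = fit(X, y)
    keep = np.setdiff1d(np.arange(len(X)), remove_idx)
    w_removed = fit(X[keep], y[keep])
    loss = lambda w: float((x_test @ w - y_test) ** 2)
    return loss(w_removed) - loss(w_full)
```

The same delta-loss quantity, computed per system and averaged over evaluation points, is what Fig. 10 reports.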

Results. First, Table 4 shows that the correlations are high for all measures considered. Notably, the fast influence functions achieve >95% correlation with the full influence functions, demonstrating that they attain reasonable quality at just a fraction of the total cost. Next, Fig. 10 shows the retraining results, separated by whether the prediction is correct and averaged across data-points within each bucket. Overall, the loss increases when we remove helpful data-points and decreases when we remove harmful data-points. Further, the performance (measured by the change in loss) of the fast influence functions is similar to that of the full influence functions (i.e., without kNN and the fast s_test approximation), and both generally perform better than random selection.

Figure 11: Simulator loss on 20 evaluation data-points.

Table caption: # Edges refers to the number of (harmful/helpful) edges connecting to the slice. "H (L/S/C)" refers to the three subsets of HANS, and "M" to the 2-label version of MultiNLI. Note that for average |I|, comparisons are only meaningful within each row.

C Applications of FASTIF
Details. We repeat the experiments over 20 test data-points (10 where the prediction is correct and 10 where it is incorrect), and choose five types of data for fine-tuning: random (with label neutral, entailment, or contradiction), most negatively influential, and most positively influential. For a given test data-point and fine-tuning type, we fine-tune on 1 data-point with 50 learning rates in log-space from 10^-5 to 10^-2.5, and repeat for 10 different fine-tuning data-points. The max and min losses among the 10 fine-tuning data-points are used to construct the confidence intervals. Further, during fine-tuning, the label we use is the true label (instead of the label predicted by the model). 14

Extended Results. See Fig. 11.
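The learning-rate sweep described above (50 rates evenly spaced in log-space between 10^-5 and 10^-2.5) can be generated as, for example:

```python
import numpy as np

# 50 learning rates evenly spaced in log10-space from 1e-5 to 10**-2.5,
# one candidate rate per fine-tuning run.
learning_rates = np.logspace(-5, -2.5, num=50)
print(learning_rates[0], learning_rates[-1])
```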

C.2 Effect Visualization
Experiment Design. We conduct experiments on two models, one trained on MultiNLI and the other trained on the HANS dataset. We then compute influence functions on their corresponding training data-points and build circular bipartite graphs. The nodes represent training (inner circle) and evaluation (outer circle) data-points (incorrect predictions), and the strength and color of the edges represent the influence values. For visualization purposes, we position the training data-points by optimizing a weighted distance function to their connected test data-points. In the setting with the model trained on HANS, we further partition the training data-points (represented by three inner circles) based on the subset they belong to. Note that this is possible only because the influence function calculations are fast enough.
Details. In both settings (i.e., the visualizations for the model trained on MultiNLI and for the model trained on HANS), we select 400 test data-points where the model predictions are incorrect: 100 from the MNLI evaluation dataset and 100 from each of the three subsets of the HANS evaluation dataset. We use kNN with k=10^3.
Details on Computing Correlations. Slightly abusing notation, we define I_{i,j} as the average (signed) influence value between the i-th training data-point and the j-th data slice. Note that each training data-point can influence multiple evaluation data-points. We then compute the correlation between each pair of data slices, ρ(I_{·,j1}, I_{·,j2}). We only include cases where the data-point has influences on both data slices.

14 We experimented with both settings in some initial experiments and found that using the original/true labels performed better. This makes sense, as the task model was trained with the true labels as targets, and fitting to the original labels replicates the process that produced the task model.
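A sketch of the slice-level correlation, assuming a hypothetical (training × evaluation) influence matrix with NaN entries where a training point has no edge to an evaluation point:

```python
import numpy as np
import warnings

def slice_correlation(I, slice_j1, slice_j2):
    """I[i, t] holds the (signed) influence of training point i on
    evaluation point t, NaN where there is no edge. Influences are averaged
    per slice, and only training points with influences on BOTH slices
    enter the correlation."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", RuntimeWarning)  # all-NaN rows -> NaN
        I1 = np.nanmean(I[:, slice_j1], axis=1)
        I2 = np.nanmean(I[:, slice_j2], axis=1)
    mask = ~np.isnan(I1) & ~np.isnan(I2)
    return np.corrcoef(I1[mask], I2[mask])[0, 1]
```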

C.3 Error Correction
Experiment Design. We set aside 1% of the evaluation dataset as the validation split, and for Amazon-WILDS, we use the OOD validation/test splits. When the models are evaluated on HANS, we use each of the three slices of the HANS dataset as the evaluation dataset.
We repeat Steps 2-4 for 10 iterations. In each iteration, we sample a batch of 10 data-points from the validation dataset (the "anchor" data-points) when computing influence scores, and update the model parameters with one gradient step on 10 fine-tuning data-points with learning rate 10^-4. For the Amazon-WILDS experiments, we use 50 anchor and fine-tuning data-points instead.
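The iterative loop above can be sketched as follows; `influence_fn` and `update_fn` are hypothetical placeholders for FastIF scoring and a single optimizer step on the task model, and "most helpful" is taken as most negative influence, following the sign convention in Sec. B.1:

```python
import numpy as np

def error_correction(model, train_set, val_set, influence_fn, update_fn,
                     iters=10, n_anchors=10, n_finetune=10, lr=1e-4, seed=0):
    """Per iteration: sample anchor validation points (Step 2), score
    training points by their influence on the anchors (Step 3), and take
    one gradient step on the most helpful points (Step 4)."""
    rng = np.random.default_rng(seed)
    for _ in range(iters):
        anchors = val_set[rng.choice(len(val_set), size=n_anchors, replace=False)]
        scores = influence_fn(model, train_set, anchors)
        helpful = np.argsort(scores)[:n_finetune]   # most negative = most helpful
        model = update_fn(model, train_set[helpful], lr)
    return model
```

In the paper's setup the candidate set would already be narrowed by kNN before scoring, which is what keeps each iteration fast.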