Class based Influence Functions for Error Detection

Influence functions (IFs) are a powerful tool for detecting anomalous examples in large scale datasets. However, they are unstable when applied to deep networks. In this paper, we provide an explanation for the instability of IFs and develop a solution to this problem. We show that IFs are unreliable when the two data points belong to two different classes. Our solution leverages class information to improve the stability of IFs. Extensive experiments show that our modification significantly improves the performance and stability of IFs while incurring no additional computational cost.


Introduction
Deep learning models are data hungry. Large models such as transformers (Vaswani et al., 2017), BERT (Devlin et al., 2019), and GPT-3 (Brown et al., 2020) require millions to billions of training data points. However, data labeling is an expensive, time-consuming, and error-prone process. Popular datasets such as ImageNet (Deng et al., 2009) contain a significant number of errors: data points with incorrect or ambiguous labels (Beyer et al., 2020). The need for automatic error detection tools is increasing as the sizes of modern datasets grow.
Influence functions (IFs) (Koh and Liang, 2017) and their variants (Charpiat et al., 2019; Khanna et al., 2019; Barshan et al., 2020; Pruthi et al., 2020) are a powerful tool for estimating the influence of one data point on another. Researchers have leveraged this capability of IFs to design or detect adversarial (Cohen et al., 2020), poisonous (Koh et al., 2022; Koh and Liang, 2017), and erroneous (Dau et al., 2022) examples in large scale datasets. The intuition is that these harmful data points usually have a negative influence on other data points, and this influence can be estimated with IFs. Basu et al. (2021) empirically observed that IFs are unstable when they are applied to deep neural networks (DNNs): the quality of influence estimation deteriorates as networks become more complex. In this paper, we provide empirical and theoretical explanations for the instability of IFs. We show that IF scores are very noisy when the two data points belong to two different classes, but much more stable when the two data points are in the same class (Sec. 3). Based on that finding, we propose IFs-class, variants of IFs that use class information to improve stability while introducing no additional computational cost. IFs-class can replace IFs in anomalous data detection algorithms. In Sec. 4, we compare IFs-class and IFs on the error detection problem. Experiments on various NLP tasks and datasets confirm the advantages of IFs-class over IFs.

* Joint first authors

Background and Related work
We define the notation used in this paper. Let z = (x, y) be a data point, where x ∈ X is the input and y ∈ Y is the target output; Z = {z^(i)}_{i=1}^n be a dataset of n data points; Z_{−i} = Z \ {z^(i)} be the dataset Z with z^(i) removed; f_θ : X → Y be a model with parameters θ; L_{Z,θ} = (1/n) Σ_{i=1}^n ℓ(f_θ(x^(i)), y^(i)) = (1/n) Σ_{i=1}^n ℓ(z^(i); θ) be the empirical risk of f_θ measured on Z, where ℓ : Y × Y → R_+ is the loss function; and θ̂ = arg min_θ L_{Z,θ} and θ̂_{−i} = arg min_θ L_{Z_{−i},θ} be the optimal parameters of the model f_θ trained on Z and Z_{−i}, respectively. In this paper, f_θ is a deep network and θ̂ is found by training f_θ with gradient descent on the training set Z.

Influence function and variants
The influence of a data point z^(i) on another data point z^(j) is defined as the change in loss at z^(j) when z^(i) is removed from the training set:

s^(ij) = ℓ(z^(j); θ̂_{−i}) − ℓ(z^(j); θ̂).

The absolute value of s^(ij) measures the strength of the influence of z^(i) on z^(j); the sign of s^(ij) shows the direction of influence. A negative s^(ij) means that removing z^(i) decreases the loss at z^(j), i.e., z^(i) is harmful to z^(j). s^(ij) has a high variance because it depends on a single (arbitrary) data point z^(j). To better estimate the influence of z^(i) on the entire data distribution, researchers average the influence scores of z^(i) over a reference set Z′:

s^(i) = (1/|Z′|) Σ_{z′^(j) ∈ Z′} s^(ij).

Z′ can be a random subset of the training set or a held-out dataset. Naive computation of s^(ij) requires retraining f_θ on Z_{−i}. Koh and Liang (2017) proposed the influence function (IF) to quickly estimate s^(ij) without retraining:

s^(ij) ≈ (1/n) ∇θℓ(z^(j); θ̂)^⊤ H_θ̂^{−1} ∇θℓ(z^(i); θ̂),

where H_θ̂ = ∂²L_{Z,θ̂}/∂θ² is the Hessian at θ̂. Exact computation of H_θ̂^{−1} is intractable for modern networks. Koh and Liang (2017) developed a fast algorithm for estimating H_θ̂^{−1}∇θℓ(z^(j); θ̂) and used only the derivatives w.r.t. the last layer's parameters to improve the algorithm's speed. Charpiat et al. (2019) proposed the gradient dot product (GD) and the gradient cosine similarity (GC) as faster alternatives to IF. Pruthi et al. (2020) argued that the influence can be better approximated by accumulating it throughout the training process (TracIn). The formulas for IFs are summarized in Tab. 3 in Appx. A.
IFs can be viewed as measures of the similarity between the gradients of two data points. Intuitively, gradients of harmful examples are dissimilar from those of normal examples (Fig. 1).
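To make this gradient-similarity view concrete, here is a minimal sketch on a toy logistic-regression problem. It is our own construction, not the paper's code: the names `sim_gd`, `sim_gc`, and `sim_if`, the toy dataset, and the Hessian damping term are all illustrative assumptions.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def grad(theta, x, y):
    """Gradient of the logistic loss log(1 + exp(-y * x.theta)) w.r.t. theta."""
    return -(1.0 - sigmoid(y * (x @ theta))) * y * x

def hessian(theta, X, Y, damp=1e-3):
    """Damped Hessian of the mean logistic loss (damping keeps it invertible)."""
    H = sum(sigmoid(y * (x @ theta)) * (1.0 - sigmoid(y * (x @ theta))) * np.outer(x, x)
            for x, y in zip(X, Y)) / len(X)
    return H + damp * np.eye(len(theta))

def sim_gd(gi, gj):          # gradient dot product (GD)
    return gi @ gj

def sim_gc(gi, gj):          # gradient cosine similarity (GC)
    return gi @ gj / (np.linalg.norm(gi) * np.linalg.norm(gj) + 1e-12)

def sim_if(gi, gj, H_inv):   # influence-function score (up to sign and 1/n)
    return gi @ H_inv @ gj

# A tiny linearly separable dataset, trained with plain gradient descent.
X = np.array([[2., 0.], [2.5, .5], [3., -.5], [-2., 0.], [-2.5, .5], [-3., -.5]])
Y = np.array([1., 1., 1., -1., -1., -1.])
theta = np.zeros(2)
for _ in range(500):
    theta -= 0.5 * np.mean([grad(theta, x, y) for x, y in zip(X, Y)], axis=0)

g_ref   = grad(theta, X[2], Y[2])    # clean reference point
g_clean = grad(theta, X[1], Y[1])    # correctly labeled candidate
g_mis   = grad(theta, X[0], -Y[0])   # same input with a flipped label
H_inv   = np.linalg.inv(hessian(theta, X, Y))

# The mislabeled point's gradient opposes the reference gradient:
print(sim_gd(g_mis, g_ref) < 0 < sim_gd(g_clean, g_ref))   # -> True
```

Negative scores mark a candidate as harmful to the reference point, matching the sign convention above.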

Influence functions for error detection
In the error detection problem, we must detect data points with wrong labels. Given a (potentially noisy) dataset Z, we have to rank the data points in Z by how likely they are to be erroneous. Removing or correcting errors improves the performance and robustness of models trained on the dataset.
Traditional error detection algorithms rely on hand-designed rules (Chu et al., 2013). Dau et al. (2022) instead used IFs to measure the influence of each data point z ∈ Z on a clean reference set Z′. Data points in Z are ranked by how harmful they are to Z′, and the most harmful data points are reexamined by humans or removed from Z (Alg. 2 in Appx. A). In this paper, we focus on the error detection problem, but IFs and IFs-class can be used to detect other kinds of anomalous data.

Prior work (Dau et al., 2022) empirically showed that IFs with last-layer gradients perform as well as or better than IFs with all layers' gradients, and that the variants of IF behave similarly. Therefore, we analyze the behavior of GD with the last layer's gradient and generalize our results to other IFs. Fig. 1 shows the last layer's gradients of an MLP on a 3-class classification problem. In the figure, gradients of mislabeled data points have large magnitudes and are opposite to the gradients of correct data points in the true class. However, gradients of mislabeled data points are not necessarily opposite to those of correct data points from other classes. Furthermore, gradients of two data points from two different classes are almost perpendicular. We make the following observation: a mislabeled/correct data point often has a very negative/positive influence on data points of the same (true) class, but its influence on other classes is noisy and small.
We verify this observation on real-world datasets (Fig. 2). We compute the GD scores of pairs of clean data points from two different classes and plot the distribution of the scores. We repeat the procedure for pairs of data points from the same class. In the different-class case, GD scores are approximately normally distributed with a very sharp peak at 0. That means that, in many cases, a clean data point from one class has no significant influence on data points from the other class, and when it does have a significant effect, the effect can be positive or negative with equal probability. In contrast, GD scores of pairs of data points from the same class are almost always positive: a clean data point almost certainly has a positive influence on clean data points of the same class.
Our theoretical analysis shows that when the two data points have different labels, the sign of GD depends on two random variables: the sign of the inner product of the features and the sign of the inner product of the gradients of the losses w.r.t. the logits. Moreover, as the model becomes more confident about the labels of the two data points, the magnitude of GD shrinks very quickly, so small perturbations of the logits or the features can flip the sign of GD. In contrast, if the two data points have the same label, then the sign of GD depends on only one random variable, the sign of the inner product of the features, and GD's magnitude remains large as the model becomes more confident. Mathematical details are deferred to Appx. D.
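This claim can be checked numerically under the uniform-confidence assumption used in Appx. D (the labeled class gets probability α and the remaining mass is spread uniformly); the sketch and its variable names are our own.

```python
import numpy as np

d_y, alpha = 3, 0.9                    # number of classes, model confidence
beta = (1.0 - alpha) / (d_y - 1)       # probability of each non-labeled class

def logit_grad(label):
    """Gradient of the cross-entropy loss w.r.t. the logits: y_hat - y."""
    y_hat = np.full(d_y, beta)
    y_hat[label] = alpha
    y = np.zeros(d_y)
    y[label] = 1.0
    return y_hat - y

same = logit_grad(0) @ logit_grad(0)   # labels agree:    d_y * (d_y - 1) * beta^2
diff = logit_grad(0) @ logit_grad(1)   # labels disagree: -d_y * beta^2

print(same > 0 > diff)                 # -> True
print(abs(same) > abs(diff))           # -> True: the same-class signal dominates
```

As α → 1, β → 0 and the different-class product vanishes quadratically, while the same-class product stays comparatively large and positive.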

Class based IFs for error detection
Our class-based IFs for error detection are shown in Alg. 1.

Algorithm 1 Class-based influence function for error detection.
Require:
1: Z = {z^(i)}_{i=1}^n: a big noisy dataset
2: C: number of classes
3: Z′_k = {z′^(j_k)}_{j_k=1}^{m_k}: clean data from class k
4: Z′ = ∪_{k=1}^C Z′_k: a clean reference dataset
5: f_θ̂: a deep model pretrained on Z
6: sim(·, ·): a similarity measure in Tab. 3
Ensure: Ẑ: data points in Z ranked by score
7: for z^(i) ∈ Z do
8:   for k = 1, …, C do
9:     s^(i)_k = (1/m_k) Σ_{j_k=1}^{m_k} sim(∇θℓ(z^(i)), ∇θℓ(z′^(j_k)))
10:  end for
11:  s^(i) = min_k s^(i)_k
12: end for
13: Ẑ = sort(Z, key = s, ascending = True)
14: return Ẑ

In Sec. 3.1, we see that an error has a very strong negative influence on correct data points in its true class, while a correct data point has a positive influence on correct data points in its true class. The influence score on the true class is therefore a stronger indicator of the harmfulness of a data point and is better at differentiating erroneous from correct data points. Because we do not know the true class of z^(i) in advance, we compute its influence score on each class in the reference set Z′ and take the minimum of these influence scores as the indicator of the harmfulness of z^(i) (lines 8–11). Unlike the original IFs, IFs-class are not affected by the noise from other classes and thus have lower variance (Fig. 4 in Appx. A). In Appx. A, we show that our algorithm has the same computational complexity as the IF-based error detection algorithm.

To create benchmark datasets Z, we inject random noise into the above datasets. For text classification datasets, we randomly select p% of the data points and randomly change their labels to other classes. For the CoNLL2003-NER dataset, we randomly select p% of the sentences and change the labels of r% of the phrases in the selected sentences; all tokens in a selected phrase are changed to the same class. The reference set Z′ is created by randomly selecting m_k clean data points from each class in Z. To ensure a fair comparison, we use the same reference set Z′ for both the IFs and IFs-class algorithms. Models are trained on the noisy dataset Z.
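The min-over-classes scoring loop of Alg. 1 can be sketched with GD as sim(·, ·) on synthetic last-layer gradients; the function name, shapes, and toy gradient values below are our own illustration, not the paper's implementation.

```python
import numpy as np

def class_based_scores(grads, ref_grads):
    """For each candidate gradient, return the minimum over classes of the
    mean gradient-dot-product score against that class's reference gradients."""
    scores = []
    for g in grads:
        per_class = [np.mean([g @ r for r in refs]) for refs in ref_grads.values()]
        scores.append(min(per_class))   # influence on the most-harmed class
    return np.array(scores)

# Synthetic setup: clean class-0 gradients point along e0, class-1 along e1;
# a mislabeled point has a large gradient opposing its true class.
ref_grads = {0: [np.array([1., 0.])] * 3, 1: [np.array([0., 1.])] * 3}
grads = [np.array([0.5, 0.0]),    # clean candidate (class 0)
         np.array([-3.0, 0.2])]   # mislabeled candidate: strongly harms class 0

scores = class_based_scores(grads, ref_grads)
ranking = np.argsort(scores)      # most harmful (lowest score) first
print(ranking[0])                 # -> 1: the mislabeled point is ranked first
```

Note that the mislabeled point's mildly positive score on class 1 (0.2) is ignored by the min, which is exactly what suppresses the cross-class noise.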
To evaluate an error detection algorithm, we select the top q% most harmful data points from the sorted dataset Ẑ and check what percentage of the selected data points are truly erroneous. Intuitively, increasing q allows the algorithm to find more errors (higher recall) but may decrease the detection accuracy (lower precision). Our code is available at https://github.com/Fsoft-AIC/Class-Based-Influence-Functions.
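The evaluation metric can be sketched as follows; the helper name and toy ranking are hypothetical, not the paper's evaluation script.

```python
def detection_accuracy(ranked_is_error, q):
    """Fraction of the top q% of the ranking that is truly erroneous.

    ranked_is_error: booleans for the sorted dataset, most harmful first.
    q: percentage in (0, 100].
    """
    n_top = max(1, int(len(ranked_is_error) * q / 100))
    top = ranked_is_error[:n_top]
    return sum(top) / n_top

# 2 of 20 points (10%) are errors and the detector ranked both on top:
ranked = [True, True] + [False] * 18
print(detection_accuracy(ranked, 10))   # -> 1.0
print(detection_accuracy(ranked, 20))   # -> 0.5
```

The example shows the precision/recall trade-off in the text: going from q = 10 to q = 20 recovers no new errors here, so precision halves.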

Result and Analysis
Because the results on all datasets share the same patterns, we report representative results here and defer the full results to Appx. C. Fig. 3(a) shows the error detection accuracy on the SNLI dataset and how the accuracy changes with q. Except for the GC algorithm, our class-based algorithms have higher accuracy and lower variance than the non-class-based versions. When q increases, the performance of IFs-class does not decrease as much as that of IFs. This confirms that IFs-class are less noisy than IFs.

Class information fails to improve the performance of GC. To understand this, let's reconsider the similarity measure sim(·, ·). Assume that there exist some clean data points z′^(j) ∈ Z′ with a very large gradient ∇θℓ(z′^(j)). If the similarity measure does not normalize the norm of ∇θℓ(z′^(j)), then z′^(j) has a dominant effect on the influence score; the noise in the influence score is mostly caused by these data points. GC normalizes both gradients, ∇θℓ(z^(i)) and ∇θℓ(z′^(j)), and effectively removes such noise. However, gradients of errors tend to be larger than those of normal data points (Fig. 1). By normalizing both gradients, GC also removes the valuable information contained in the magnitudes of the errors' gradients ∇θℓ(z^(i)), which lowers the detection performance. In Fig. 3(a), we see that the performance of GC for q ≥ 15% is lower than that of the other class-based algorithms. Similar trends are observed on the other datasets (Fig. 6, 7, 8 in Appx. C).

Fig. 3(b) shows the change in detection accuracy as the level of noise p goes from 5% to 20%. For each value of p, we set q equal to p. Our class-based influence score significantly improves the performance and reduces the variance. We note that when p increases, the error detection problem becomes easier because there are more errors; the detection accuracy therefore tends to increase with p, as shown in Fig. 3(b), 9, and 10. Fig. 3(c) shows that GD-class outperforms GD on all entity types in CoNLL2003-NER.
The performance difference between GD-class and GD is greater on the MISC and ORG categories. Intuitively, a person's name can plausibly be an organization's name, but the reverse is less likely. Therefore, it is harder to detect that a PER or LOC tag has been changed to an ORG or MISC tag than the reverse. The result shows that IFs-class are more effective than IFs at detecting hard erroneous examples.

The effect of data on error detection algorithms
We study the effect of the size and the cleanliness of the reference set on the performance of error detection algorithms.
The size of the reference set. We varied the size of each class in the reference set from 10 to 1000 to study the effect of the reference set's size on the detection performance. We report the mean performance of the GD and GC algorithms in Tab. 1. We observe no clear trend in the performance as the size of the reference set increases. Our conjecture is that gradients of clean data points from the same class have almost the same direction, so averaging the gradient direction over a small set of data points already gives a very stable gradient direction. Therefore, increasing the size of the reference set does not have much impact on the detection performance.

The cleanliness of the reference set. The results of GD and GD-class on the SNLI dataset when the reference set is a random (noisy) subset of the training set are shown in Tab. 2. When the reference set is noisy, the error detection performance of the IF algorithms decreases significantly. The IF-class algorithms are much more robust to noise in the reference set, and their performance decreases only slightly. This experiment further demonstrates the advantage of IFs-class over IFs.

Conclusion
In this paper, we study influence functions and identify the source of their instability. We give a theoretical explanation for our observations. We introduce a stable variant of IFs and use it to develop a high-performance error detection algorithm. Our findings shed light on the development of new influence estimators and on the application of IFs in downstream tasks.

Limitations
Our paper has the following limitations:
1. Our class-based influence score cannot improve the performance of the GC algorithm. Although the class-based versions of GD, IF, and TracIn outperformed the original GC, we aim to develop a stronger version of GC. From the analysis in Sec. 4, we believe that a partially normalized GC could have better performance.
In partial GC, we normalize only the gradient of the clean data point z′^(j). That removes the noise introduced by ∥∇θℓ(z′^(j))∥ while retaining the valuable information about the norm of ∇θℓ(z^(i)).
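A sketch of this proposed partial normalization (hypothetical, since the paper leaves it as future work; the function name and toy gradients are our own):

```python
import numpy as np

def sim_partial_gc(g_candidate, g_ref):
    """Partial GC: normalize only the reference gradient (assumed nonzero),
    keeping the informative magnitude of the candidate's gradient."""
    return g_candidate @ (g_ref / np.linalg.norm(g_ref))

g_error = np.array([-4.0, 0.0])   # errors tend to have large gradients
g_clean = np.array([-0.2, 0.0])   # clean points tend to have small gradients
g_ref   = np.array([10.0, 0.0])   # a reference point with a huge norm

# Full GC would give both candidates the same score (-1.0); partial GC
# separates them by the candidate's gradient magnitude:
print(sim_partial_gc(g_error, g_ref))   # -> -4.0
print(sim_partial_gc(g_clean, g_ref))   # -> -0.2
```

The large reference norm (10.0) no longer dominates the score, yet the error still receives a much more negative score than the clean point.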

Ethics Statement
Our paper considers a theoretical aspect of influence functions. It does not have any bias toward any group of people. Our findings do not cause any harm to any group of people.

A Computational complexity of error detection algorithms
The inner for-loop in Alg. 1 calculates C influence scores. It calls the scoring function sim(·, ·) exactly |Z′| = m times. The complexity of the inner for-loop in Alg. 1 is therefore equal to that of line 6 in Alg. 2. Thus, the complexity of Alg. 1 is equal to that of Alg. 2.

Algorithm 2 Influence function based error detection (Dau et al., 2022).
Require:
1: Z = {z^(i)}_{i=1}^n: a big noisy dataset
2: Z′ = {z′^(j)}_{j=1}^m: a clean reference dataset
3: f_θ̂: a deep model pretrained on Z
4: sim(·, ·): a similarity measure in Tab. 3
Ensure: Ẑ: data points in Z ranked by score
5: for z^(i) ∈ Z do
6:   s^(i) = (1/m) Σ_{j=1}^m sim(∇θℓ(z^(i)), ∇θℓ(z′^(j)))
7: end for
8: Ẑ = sort(Z, key = s, ascending = True)
9: return Ẑ

We used standard datasets and models, experimented with 5 different random seeds, and report the mean and standard deviation. An NVIDIA RTX 3090 was used to run our experiments. Models were trained with the AdamW optimizer (Loshchilov and Hutter, 2019) with learning rate η = 5e−5, the cross-entropy loss function, and a batch size of 16.
The epoch with the best classification accuracy on the validation set was used for error detection.
Our source code and guidelines are attached to the supplementary materials.

IMDB (Maas et al., 2011) includes 50,000 reviews from the Internet Movie Database (IMDb) website. The task is binary sentiment analysis. The dataset contains an equal number of positive and negative reviews and is split into training, validation, and test sets of sizes 17,500, 7,500, and 25,000. The IMDB dataset can be found at https://ai.stanford.edu/~amaas/data/sentiment/

SNLI (Stanford Natural Language Inference) (Bowman et al., 2015) consists of 570k sentence pairs manually labeled as entailment, contradiction, and neutral. We convert these labels into numbers. It is geared toward serving as a benchmark for evaluating text representation systems. This dataset is available at https://nlp.stanford.edu/projects/snli/

BigCloneBench (Svajlenko et al., 2014) is a huge code clone benchmark that includes over 6,000,000 true clone pairs and 260,000 false clone pairs from 10 different functionalities. The task is to predict whether two pieces of code have the same semantics. This dataset is commonly used with language models for code (Feng et al., 2020; Lu et al., 2021; Guo et al., 2020) and is available at https://github.com/clonebench/BigCloneBench

CoNLL2003 (Tjong Kim Sang and De Meulder, 2003) is one of the most influential corpora for NER research. A large number of publications, including many landmark works, have used this corpus as a source of ground truth for NER tasks. The data covers two languages, English and German; in this paper, we use the English portion. The training, validation, and test sets contain 14,987, 3,466, and 3,684 sentences, corresponding to 203,621, 51,362, and 46,435 tokens, respectively.

C Additional results

C.1 3-class classification experiment
We train an MLP with 2 input neurons, 100 hidden neurons in the first hidden layer, 2 hidden neurons in the second hidden layer, and 3 output neurons, using SGD for 1000 epochs. The activation function is LeakyReLU and the learning rate is η = 1e−3. The last layer has 6 parameters organized into a 3 × 2 matrix. The gradient of the loss with respect to the last layer's parameters is also organized into a 3 × 2 matrix; we visualize the 3 rows of the gradient matrix in 3 subfigures (Fig. 5).

C.2 Result on IMDB, SNLI, BigCloneBench, and CoNLL2003
To ensure a fair comparison between our class-based algorithms and Alg. 2, we use the same reference dataset Z′ for both. The reference dataset Z′ covers C classes: C = 2 for the IMDB dataset, C = 3 for the SNLI dataset, C = 2 for the BigCloneBench dataset, and C = 5 for the CoNLL2003-NER dataset. From each of the C classes, we randomly select m_k = 50 (k = 1, …, C) clean data points to form Z′. We tried varying m_k from 10 to 1000 and observed no significant changes in performance.

D Explanation of the observation in Sec. 3
Let's consider a classification problem with the cross-entropy loss

ℓ(z; θ) = − Σ_{i=1}^{d_y} y_i log ŷ_i,

where d_y is the number of classes. Let z = (x, y) be a data point with label k, i.e., y_k = 1 and y_i = 0 for all i ≠ k. The model f_θ is a deep network whose last layer has parameters W ∈ R^{d_y × d_h}, where d_h is the number of hidden neurons. Let u ∈ R^{d_h} be the activation of the penultimate layer. The output is computed as

a = W u,   ŷ = δ(a),

where δ is the softmax output function. The derivative of the loss at z w.r.t. W is

∂ℓ(z)/∂W = ∇_a ℓ(z) u^⊤.

The gradient ∇_a ℓ(z) is

∇_a ℓ(z) = ŷ − y,

i.e., its k-th entry is ŷ_k − 1 and its i-th entry is ŷ_i for i ≠ k. This follows from the derivative of the softmax, ∂ŷ_j/∂a_i = ŷ_j(1[i = j] − ŷ_i). Because 1 − ŷ_k = Σ_{j≠k} ŷ_j, 1 − ŷ_k is in general much greater than any individual ŷ_j. Therefore, the magnitude of the k-th row of ∂ℓ(z)/∂W is much larger than that of the other rows. We also note that the update for the k-th row of W has the opposite direction of the updates for the other rows.

Let's consider the inner product of the gradients of two data points z and z′ with labels k and k′:

⟨vec(∂ℓ/∂W), vec(∂ℓ′/∂W)⟩ = (u^⊤ u′)(∇_a ℓ^⊤ ∇_{a′} ℓ).

Consider first the case k′ ≠ k:

∇_a ℓ^⊤ ∇_{a′} ℓ = (ŷ_k − 1) ŷ′_k + (ŷ′_{k′} − 1) ŷ_{k′} + Σ_{i ≠ k, k′} ŷ_i ŷ′_i.

Intuitively, the product ∇_a ℓ^⊤ ∇_{a′} ℓ is small because the large element |∇_a ℓ_k| = 1 − ŷ_k is multiplied by the small element ∇_{a′} ℓ_k = ŷ′_k, and the large element 1 − ŷ′_{k′} is multiplied by the small element ŷ_{k′}. To make this concrete, assume that ŷ_k = α ≈ 1 and ŷ_i = (1 − α)/(d_y − 1) = β for i ≠ k, and assume the same condition for ŷ′. Then, using 1 − α = (d_y − 1)β,

∇_a ℓ^⊤ ∇_{a′} ℓ = −2(1 − α)β + (d_y − 2)β² = −d_y β².

α ≈ 1 implies 1 − α ≈ 0 and β ≈ 0. The expression above shows that as the model becomes more confident about the labels of z and z′, the product ∇_a ℓ^⊤ ∇_{a′} ℓ tends toward 0 at a quadratic rate. This means that, as training progresses, data points from different classes become more and more independent, and their gradients become more and more perpendicular. The sign of the gradient product depends on the signs of ∇_a ℓ^⊤ ∇_{a′} ℓ and u^⊤ u′. Both signs are random variables that depend on the noise in the features u and u′ and in the weight matrix W. If the model f_θ cannot learn a good representation of the input, then the feature u and the sign of u^⊤ u′ can be very noisy; sign(u^⊤ u′) is even noisier if z and z′ are from different classes. Because ∇_a ℓ^⊤ ∇_{a′} ℓ is small, a tiny amount of noise in the logits a and a′ can flip the sign of ∇_a ℓ^⊤ ∇_{a′} ℓ and change the direction of influence.

We now consider the case k′ = k. When k′ = k,

∇_a ℓ^⊤ ∇_{a′} ℓ = (1 − ŷ_k)(1 − ŷ′_k) + Σ_{i ≠ k} ŷ_i ŷ′_i

is always positive, so the sign of the gradient product depends only on u^⊤ u′. That explains why the product of the gradients of data points from the same class is much less noisy and is almost always positive.

Furthermore, the magnitude of ∇_a ℓ^⊤ ∇_{a′} ℓ is larger than in the case k′ ≠ k because the large element 1 − ŷ_k is multiplied by the large element 1 − ŷ′_k. More concretely, under the same assumption as in the case k′ ≠ k, we have

∇_a ℓ^⊤ ∇_{a′} ℓ = (d_y − 1)² β² + (d_y − 1)β² = d_y (d_y − 1) β².

Comparing the two cases, when k′ = k the magnitude of ∇_a ℓ^⊤ ∇_{a′} ℓ is (d_y − 1) times larger than when k′ ≠ k, i.e., approximately d_y times larger when d_y is large.
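The factorization used throughout this appendix, ∂ℓ/∂W = ∇_a ℓ · u^⊤ with ∇_a ℓ = ŷ − y, so that the GD score factors into (u^⊤u′)(∇_a ℓ^⊤ ∇_{a′} ℓ), can be verified numerically; the sketch and its variable names are our own.

```python
import numpy as np

rng = np.random.default_rng(0)
d_y, d_h = 3, 4                        # number of classes, hidden width
W = rng.normal(size=(d_y, d_h))        # last-layer weight matrix

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def last_layer_grads(u, k):
    """Return (grad w.r.t. logits, grad w.r.t. W) for a point with label k."""
    y = np.zeros(d_y)
    y[k] = 1.0
    g_a = softmax(W @ u) - y           # gradient w.r.t. the logits a = W u
    return g_a, np.outer(g_a, u)       # gradient w.r.t. W is an outer product

u1, u2 = rng.normal(size=d_h), rng.normal(size=d_h)
ga1, gW1 = last_layer_grads(u1, 0)
ga2, gW2 = last_layer_grads(u2, 1)

gd = np.sum(gW1 * gW2)                 # <vec(gW1), vec(gW2)>
print(np.isclose(gd, (u1 @ u2) * (ga1 @ ga2)))   # -> True: GD factorizes
```

As a sanity check, ∇_a ℓ sums to zero, since both ŷ and the one-hot label y sum to one.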