Influence Functions for Sequence Tagging Models

Many language tasks (e.g., Named Entity Recognition, Part-of-Speech tagging, and Semantic Role Labeling) are naturally framed as sequence tagging problems. However, there has been comparatively little work on interpretability methods for sequence tagging models. In this paper, we extend influence functions (which aim to trace predictions back to the training points that informed them) to sequence tagging tasks. We define the influence of a training instance segment as the effect that perturbing the labels within that segment has on a segment-level prediction for a test instance. We provide an efficient approximation to compute this quantity, and show empirically that it tracks the true segment influence. We demonstrate the practical utility of segment influence by using the method to identify systematic annotation errors in two named entity recognition corpora. Code to reproduce our results is available at https://github.com/successar/Segment_Influence_Functions.


Introduction
Instance attribution methods aim to identify training examples that most informed a particular (test) prediction. The influence of training point k on test point i is typically formalized as the change in loss that would be observed for point i if example k were removed from the training set (Koh and Liang, 2017). Heuristic alternatives have also been developed to measure the importance of training samples during prediction, such as retrieving training examples similar to a test item (Pezeshkpour et al., 2021; Ilyas et al., 2022; Guo et al., 2021).
Influence functions can facilitate dataset debugging by helping to surface training samples which exhibit artifacts (Han et al., 2020). But on language tasks, most work on identifying training samples influential to a particular prediction has focused on classification tasks (Koh and Liang, 2017; Han et al., 2020; Pezeshkpour et al., 2021). It is not immediately clear how we can extend such methods to structured prediction problems such as named entity recognition (NER).

Figure 1: What training segment influenced this prediction?
In this work we address this gap, presenting new methods for characterizing token-level influence for structured predictions (specifically sequence tagging tasks), and evaluating their use across illustrative datasets. More specifically, we focus on NER, one of the most common sequence tagging tasks. We extend influence functions to detect important training examples, i.e., those that most influenced the prediction of a specific entity, as opposed to being most influential with respect to the entire predicted label sequence. We call this extension segment influence.
Segment influence can help one perform fine-grained analysis of why specific segments of text were incorrectly labeled (as opposed to the entire sequence). Consider, for example, Figure 1. This shows a common issue in the CoNLL NER dataset: city names contained in soccer club titles tend to be mislabeled as location, rather than organization. This in turn leads to similar mispredictions in the test set. For the shown test example, we can use segment influence to ask which entities within the training examples most informed the prediction made for the entity 'Manchester United'. In principle, segment influence can directly recover the entities responsible for this systematic mislabeling.
Our main contributions are as follows.
(1) We present a new method to approximately compute token-level influence for outputs produced by sequence tagging models; (2) We evaluate whether this approximation corresponds to exact influence values in linear models, and whether the method recovers intuitively correct training examples in synthetically constructed cases; and (3) We establish the practical utility of approximating structured influence by using the method to identify systematic annotation errors in NER corpora.

Influence for Sequence Tagging
Consider a standard sequence tagging task in which the aim is to estimate the parameters θ of a function f_θ which assigns to each token x_it ∈ V in an input sequence x_i (of length T_i) a label y_it from a label set Y. Denote the training dataset by D = {(x_i = {x_it}_{t=1}^{T_i}, y_i = {y_it}_{t=1}^{T_i})}. Define f_θ as a model that yields conditional probability estimates for sequence label assignments: p_θ(y_i | x_i). Given parameter estimates θ̂, we can make a prediction for a test instance x_i by selecting the most likely y under this model: ŷ_i = argmax_y p_θ̂(y | x_i). In structured prediction tasks we assume that the label y_it depends in part on the labels y_i \ y_it, given the input x_i. In linear chain sequence tagging, this dependence can be formalized as a graphical model in which adjacent labels are connected; the most common realization of such a model is perhaps the Conditional Random Field (CRF; Lafferty et al. 2001).
We typically estimate θ by minimizing the negative log-likelihood of the training dataset D.
For brevity, we will write the loss (negative log-likelihood) of an example z_i = (x_i, y_i) as L(z_i, θ) = −log p_θ(y_i | x_i), and the overall loss over the training set as L(D, θ) = Σ_{z_i ∈ D} L(z_i, θ); training then solves θ̂ = argmin_θ L(D, θ) (Equation 1).

Background: Influence Functions in ML
Influence Functions (Koh and Liang, 2017) retrieve training samples z_k deemed "influential" to the prediction made for a specific test sample x_i: ŷ_i = f_θ̂(x_i). The exact influence of a training example z_k on a test example z_i = (x_i, y_i) is defined as the change in the loss on z_i that would be incurred under parameter estimates θ̂_{−z_k} if the training sample z_k were removed prior to training, i.e., I(z_k, z_i) = L(z_i, θ̂_{−z_k}) − L(z_i, θ̂). In practice this is prohibitively expensive to compute. Koh and Liang (2017) proposed an approximation. The idea is to measure the change in the loss on z_i observed when the loss associated with train sample z_k is slightly upweighted by some ϵ. Explicitly computing the effect of such an ϵ-perturbation is not feasible. Koh and Liang (2017) provide an efficient mechanism to approximate it (reproduced in Appendix A.1): I(z_k, z_i) ≈ −∇_θ L(z_i, θ̂)^⊤ H_θ̂^{−1} ∇_θ L(z_k, θ̂), where H_θ̂ is the Hessian of the loss L(D, θ) over the dataset with respect to θ.
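To make the mechanics concrete, here is a minimal sketch of the gradient-alignment core of this approximation, dropping the inverse-Hessian term (a simplification also adopted later, in Section 3). The gradient vectors below are hypothetical stand-ins for real model gradients.

```python
# Hessian-free influence sketch: I(z_k, z_i) ≈ -∇L(z_i)·∇L(z_k).
# A positive value suggests upweighting the training point would
# increase the test loss (a "harmful" training point).

def influence(grad_test, grad_train):
    """Negative dot product of the test and train loss gradients."""
    return -sum(gt * gk for gt, gk in zip(grad_test, grad_train))

grad_test = [0.5, -1.0, 0.25]           # stands in for ∇θ L(z_i, θ̂)
grad_train_helpful = [0.4, -0.8, 0.2]   # aligned with test gradient
grad_train_harmful = [-0.4, 0.8, -0.2]  # opposes the test gradient

assert influence(grad_test, grad_train_helpful) < 0  # helpful: loss decreases
assert influence(grad_test, grad_train_harmful) > 0  # harmful: loss increases
```

Ranking training points by this score (rather than by the full Hessian-weighted quantity) is the efficiency trade-off discussed in Section 3.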
Sequence tagging tasks by definition involve multiple predictions (and labels) per instance, and it is therefore natural to consider finer-grained influence. In particular, we would like to quantify the effect of segments of labels for z_k on a specific segment of the predicted output for z_i. For example, if we mispredict a particular entity within x_i, we may want to identify the train sample segment(s) most responsible for this error, especially if the model makes systematic errors that might be rectified by cleaning D.

Segment Level Influence
We provide machinery to compute segment-level influence. We want to quantify the impact of training tokens on the loss of a segment of test point z_i. In NER, these segments may correspond to entities.

Exact Segment Influence
We define the exact influence of a segment [a, b] within training example z_k on a segment [c, d] of a test example z_i = (x_i, y_i) as the change in loss that would be observed for reference token labels in segment [c, d] of z_i, had we excluded the labels for segment [a, b] within z_k from the training data. To make this definition precise, we need to first define how training is to be performed when only partial annotations are available for a given train example (i.e., where a segment has been "removed"). We also need to formally define the change in loss for a segment of a test example.
Start with training under partial annotations. Consider a training example z_k = (x_k = {x_k1, ..., x_kT_k}, y_k = {y_k1, ..., y_kT_k}). Assume we did not have labels for segment [a, b] in y_k, i.e., labels y_k^{[a,b]} = {y_ka, ..., y_kb} were missing. A natural way to handle such cases is to marginalize over all possible label assignments to the segment [a, b] when computing the likelihood of this training example (Tsuboi et al., 2008). Denote the marginal loss of this partially annotated sequence by ML(z_k^{−[a,b]}, θ) = −log Σ_{y^{[a,b]}} p_θ(y_k \ y_k^{[a,b]}, y^{[a,b]} | x_k). We can also write this marginal loss as the difference between the joint loss of y_k and the conditional loss of the segment y_k^{[a,b]}. This second form is more intuitive when we move to approximate exact influence values via ϵ-weighting.
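The identity ML = L − CL follows directly from the chain rule of probability (p(full) = p(rest) · p(segment | rest)), and can be checked numerically on a toy joint distribution; the probabilities below are illustrative.

```python
import math

# Toy joint distribution over two token labels; p[(y1, y2)] sums to 1.
p = {("O", "O"): 0.4, ("O", "B"): 0.1, ("B", "O"): 0.3, ("B", "B"): 0.2}

y = ("B", "O")  # full label sequence; "remove" the label at position 2

joint_loss = -math.log(p[y])                      # L(z, θ): joint loss
marginal = sum(p[(y[0], s)] for s in ("O", "B"))  # marginalize the segment
marginal_loss = -math.log(marginal)               # ML: marginal loss
cond_loss = -math.log(p[y] / marginal)            # CL: segment loss given rest

# The identity ML = L - CL holds exactly for any distribution.
assert abs(marginal_loss - (joint_loss - cond_loss)) < 1e-12
```

For a CRF, the marginal and conditional terms are computed with the forward-backward algorithm rather than by explicit enumeration, but the same identity applies.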
That is, ML(z_k^{−[a,b]}, θ) = L(z_k, θ) − CL(z_k^{[a,b]}, θ) (Equation 4), where we have defined the conditional loss of the segment as CL(z_k^{[a,b]}, θ) = −log p_θ(y_k^{[a,b]} | y_k \ y_k^{[a,b]}, x_k). Next we define the change in loss for a segment of a test example z_i = (x_i, y_i). We define the loss for the segment [c, d] of the output y_i as the conditional loss of that segment: CL(z_i^{[c,d]}, θ) = −log p_θ(y_i^{[c,d]} | y_i \ y_i^{[c,d]}, x_i). Given the above definitions, we can concretize the notion of exact segment influence as the change in this conditional loss under retraining without the segment: I(z_k^{[a,b]}, z_i^{[c,d]}) = CL(z_i^{[c,d]}, θ̂_{−z_k^{[a,b]}}) − CL(z_i^{[c,d]}, θ̂) (Equation 6). Comparing Equations 4 and 5, we see that removing the effect of segment [a, b] amounts to downweighting the conditional loss CL(z_k^{[a,b]}, θ) in the training objective; we therefore consider ϵ-upweighted estimates θ̂_ϵ relative to the original estimates θ̂ trained using the objective in Equation 1. A first-order approximation to the difference in the model parameters near ϵ = 0 is given by dθ̂_ϵ/dϵ |_{ϵ=0} = −H_θ̂^{−1} ∇_θ CL(z_k^{[a,b]}, θ̂). We can apply the chain rule to measure the change in the conditional loss over segment [c, d] of test example z_i due to this upweighting: I(z_k^{[a,b]}, z_i^{[c,d]}) ≈ −∇_θ CL(z_i^{[c,d]}, θ̂)^⊤ H_θ̂^{−1} ∇_θ CL(z_k^{[a,b]}, θ̂) (Equation 9). This provides an approximation to the exact influence defined in the previous section for a segment of a training example on a segment of a test example. A derivation showing all steps can be found in Appendix A.2. We have assumed that we can take the gradient of the conditional loss over a segment, which is possible for a CRF tagger (derivation in Appendix B), but may be non-trivial for other models.

Computational Challenges
The computational costs of even approximate instance-level influence can be prohibitive in practice, especially with the large pretrained language models that now dominate NLP. Computing and storing inverse Hessians of the loss has O(p^3) and O(p^2) complexity, respectively, where p is the number of parameters in the model (commonly ~100M-100B for deep models). Even ignoring the Hessian, one still needs the gradient with respect to each training example in D; one could attempt to pre-compute these, but storing the results requires O(|D|p) memory.

The alternative is therefore to recompute these for each new test point z_i. For segment-level influence these costs are compounded because we need influence with respect to every segment within a training example, multiplying complexities by T^2, where T is the average length of a training example. Consequently, it is practically infeasible to calculate segment influence per Equation 9.
Prior work by Pezeshkpour et al. (2021) showed that, for instance-level influence, restricting the gradient to a subset of parameters (e.g., those in the classification layer) does not much affect the induced rankings over train points with respect to influence. Similarly, ignoring the Hessian term does not significantly affect rankings by influence. These two simplifications dramatically improve efficiency, and we adopt them here.
Consider a sequence tagging model built on top of a deep encoder F, e.g., BERT (Devlin et al., 2018). In the context of a linear chain CRF on top of F, the standard score function for this model is s(y_i, x_i) = Σ_{t=1}^{T_i} (W h_it)^⊤ y_it + y_{i,t−1}^⊤ T y_it, where h_it is the encoder representation of token x_it, T is a matrix of class transition scores, and y_it is the one-hot representation of label y_it. A CRF layer consumes these scores and computes the probability of a label sequence as p(y_i | x_i) = exp(s(y_i, x_i)) / Σ_{y′ ∈ Y^{T_i}} exp(s(y′, x_i)). In this work we consider the gradient only with respect to the W and T parameters above, and not any parameters associated with F.
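As a sketch of the score and normalization just described, the sequence probability can be computed by brute-force enumeration for a tiny label space (all numbers below are illustrative; a real CRF implementation would compute the partition function with the forward algorithm rather than by enumeration).

```python
import math
from itertools import product

# Tiny linear-chain CRF sketch: emissions[t][y] plays the role of (W h_t)^T y_t
# and trans[y_prev][y] the transition matrix T. Values are illustrative.
emissions = [[1.0, 0.2], [0.1, 0.8], [0.5, 0.4]]  # 3 tokens, 2 labels
trans = [[0.3, -0.1], [0.0, 0.6]]

def score(labels):
    """Sum of emission scores plus adjacent-label transition scores."""
    s = sum(emissions[t][y] for t, y in enumerate(labels))
    s += sum(trans[a][b] for a, b in zip(labels, labels[1:]))
    return s

def prob(labels):
    """exp(score) normalized over all label sequences of the same length."""
    log_z = math.log(sum(math.exp(score(y))
                         for y in product(range(2), repeat=len(labels))))
    return math.exp(score(labels) - log_z)

# Probabilities over all 2^3 label sequences sum to one.
total = sum(prob(y) for y in product(range(2), repeat=3))
assert abs(total - 1.0) < 1e-9
```

Enumeration costs O(|Y|^T) and is only viable for toy cases; the forward algorithm brings this down to O(T · |Y|^2).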
Further, we consider influence only with respect to individual token outputs in training samples, rather than every possible segment; i.e., we only consider single-token segments. This further reduces the T^2 terms in our complexity to T.

Experimental Aims and Setup
We evaluate segment influence in terms of: (1) Approximation validity, or how well the approximation proposed in Equation 9 corresponds to the exact influence value (Equation 6); and (2) Utility, i.e., whether segment influence might help identify problematic training examples for sequence tagging tasks. In this paper we consider only NER tasks, using the following datasets.

CoNLL (Tjong Kim Sang and De Meulder, 2003). An NER dataset containing news stories from Reuters, labeled with four entity types (PER, LOC, ORG, and MISC) demarcated using the beginning-inside-outside (BIO) tagging scheme (Ramshaw and Marcus, 1999). The dataset is divided into train, validation, and test splits comprising 879, 194, and 197 documents, respectively.

EBM-NLP (Nye et al., 2018). A corpus of annotated medical article abstracts describing clinical randomized controlled trials. 'Entities' here are spans of tokens that describe the patient Population enrolled, the Interventions compared, and the Outcomes measured ('PICO' elements). This dataset includes a test set of 191 abstracts labeled by medical professionals, and train/validation sets of 3836/958 abstracts labeled via Amazon Mechanical Turk.
For NER models we use a representative modern transformer, BigBird Base (Zaheer et al., 2020), as our encoder F, using final layer hidden states as contextualized token representations. Dependencies between output labels are modeled using a CRF. We provide training details in Appendix C.
In addition to segment and instance-level influence, we also evaluate, where applicable, segment nearest neighbor as an attribution method, which works as follows. We retrieve from the training dataset the segments whose feature embeddings are most similar to that of the test segment (we again consider only single-token segments here, so we need not worry about embedding multi-token segments). We consider both dot product and cosine similarity, and report results for the version that performs best in each experiment.
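A minimal sketch of this retrieval step, using hypothetical two-dimensional token embeddings in place of real encoder representations:

```python
import math

# Single-token nearest-neighbor retrieval sketch. The embeddings below are
# invented stand-ins for contextualized token representations from F.
train_tokens = {"Manchester": [0.9, 0.1], "won": [0.1, 0.9], "Leeds": [0.8, 0.2]}
query = [0.9, 0.1]  # embedding of the test token

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Retrieve the most similar training token under each similarity measure.
by_dot = max(train_tokens, key=lambda w: dot(query, train_tokens[w]))
by_cos = max(train_tokens, key=lambda w: cosine(query, train_tokens[w]))
assert by_dot == "Manchester" and by_cos == "Manchester"
```

Dot product and cosine similarity can disagree when embedding norms vary; reporting the better-performing variant per experiment, as above, sidesteps committing to one.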
We evaluate both the validity of our approximation (Section 5.1) and its utility for identifying annotation errors in given text examples (Section 5.2). In the latter experiments, we compare the fine-grained error analysis afforded by segment influence to that made possible using instance-level attribution.

How Good is our Approximation?
How well does the approximation we have proposed for segment influence correlate with the exact value? The latter is in general intractable to compute. As a practical means to validate our approximation, we use a simple linear model trained on the CoNLL corpus, which makes it feasible (though still costly) to compute exact influence via brute force. This allows us to compare the actual segment influence to our approximation.
We subset our training set to 1000 examples and the validation set to 200 examples (given the computational expense of model retraining). As token-level features, we use GloVe word embeddings (Pennington et al., 2014) and additional syntactic features (see Appendix C.1). We use the L-BFGS optimizer available in PyTorch (Paszke et al., 2019) to train this model, stopping when the maximum absolute gradient value reaches 10^-6.
We randomly sample 20 mispredicted validation tokens. For each such token, we identify the 20 most influential tokens in the training set according to their absolute approximated influence values. We combine these 400 tokens into a single pool, remove them prior to training by subtracting their conditional loss (Equation 6), and retrain the model. We then take the difference in the observed loss for each of the 20 validation tokens under the retrained and original models.
Figure 2 plots the actual difference in the conditional loss obtained for the sampled validation tokens against the predicted change in loss using the influence approximation (Equation 9). The quantities have a Pearson correlation of 0.89, suggesting that the approximation is reasonable, though imperfect; some deviations exist (likely due to numerical instability computing the inverse Hessian).

Synthetic Artifact Insertion
We next evaluate whether segment influence may help practitioners 'debug' dataset issues by identifying (problematic) entities within train examples that may have led to test-time entity mispredictions.
We again use the CoNLL training set for this exercise, artificially inserting an "artifact" into training samples. Specifically, for a random 10% of train instances, we insert an "artifact token" (special) at a position selected uniformly at random from 1 to T − 1. When this insertion is made, we also deterministically set the label of the token immediately following the artifact to B-PER. Consider, for example, the sequence Joe/PER Biden/PER is US/LOC president in 2021. We might modify this to the following: Joe/PER Biden/PER is US/LOC president special in/PER 2021. We train a model on this modified train set.
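The construction above can be sketched as follows; the artifact token's own label, which the text leaves unspecified, is assumed here to be O.

```python
import random

# Sketch of the synthetic-artifact construction: insert a "special" token at a
# random interior position and force the following token's label to B-PER.
def insert_artifact(tokens, labels, rng):
    pos = rng.randrange(1, len(tokens))  # position in 1..T-1
    tokens = tokens[:pos] + ["special"] + tokens[pos:]
    labels = labels[:pos] + ["O"] + labels[pos:]  # artifact label assumed O
    labels[pos + 1] = "B-PER"  # token right of the artifact becomes B-PER
    return tokens, labels

rng = random.Random(0)
toks, labs = insert_artifact(
    ["Joe", "Biden", "is", "US", "president", "in", "2021"],
    ["B-PER", "I-PER", "O", "B-LOC", "O", "O", "O"], rng)
i = toks.index("special")
assert labs[i + 1] == "B-PER" and len(toks) == 8
```

Applying this to a random 10% of instances (and later to 10% of validation instances) reproduces the experimental setup described above.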
If the model learns the correlation between artifact tokens and the entity label of the right-adjacent token, then for test examples featuring the artifact it should return artifact-adjacent tokens (e.g., 'in' in the above example) as influential. To test this, we insert the artifact randomly into 10% of the validation instances. The model predicts PER for all tokens that immediately follow the artifact. We apply segment influence to these token predictions, and find that for all such examples, the most influential token retrieved using segment influence is adjacent to the inserted artifact. Segment nearest neighbor also recovers these problematic samples. By contrast, applying instance influence to the same validation examples yields training examples exhibiting the artifact as most influential only 26.3% of the time.
This result provides evidence that we are able to retrieve plausible error-causing entities within the training dataset that affect a test entity prediction. Standard instance-level influence is not as useful here because it evaluates influence with respect to an entire example, and not in terms of its constituent entities. This highlights the need for and potential utility of segment influence methods.

Use Cases for Segment Influence
Misannotations or incomplete annotations are a common problem in sequence tagging tasks. In contrast to standard classification tasks, in the case of sequence tagging annotators must label multiple spans of variable length within a given text. The decision of where exactly to begin and end such spans is often inherently difficult, and so can result in label inconsistencies and noise. Next we evaluate the utility of segment influence (relative to alternative methods) for helping to identify such problematic training data for sequence tagging tasks.

Finding noisy labels in CoNLL
CoNLL has been the de facto standard NER dataset in NLP for over a decade, despite known annotation issues (Reiss et al., 2020), including missing and incomplete entities, incorrect entity labels, and questionable entity boundaries, among others. To assess whether segment influence can unveil these issues, we calculate the influence of train tokens on mispredictions; instances with higher scores are those we would expect to contain labeling errors.
Baselines We evaluate several baselines for identifying noisy instances. At the instance level, these include instance loss and the ℓ2 gradient norm with respect to CRF parameters. For token-level baselines, we again compute conditional losses and gradient norms, as well as the entropy of the predicted label distribution for each token. We aggregate token scores by taking means or maximum values. Higher values suggest noisy annotations.
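As an illustration, the entropy baseline with mean/max aggregation might look like this; the per-token label distributions below are invented for the example.

```python
import math

# Token-level entropy baseline sketch: higher predictive entropy over labels
# suggests a noisier annotation.
def entropy(dist):
    return -sum(p * math.log(p) for p in dist if p > 0)

# Hypothetical per-token label distributions for one training instance.
token_dists = [[0.98, 0.01, 0.01],   # confident token
               [0.34, 0.33, 0.33],   # near-uniform -> suspicious
               [0.90, 0.05, 0.05]]

scores = [entropy(d) for d in token_dists]
mean_score = sum(scores) / len(scores)  # instance score via mean aggregation
max_score = max(scores)                 # ... or via max aggregation

assert max_score == scores[1]  # the near-uniform token dominates the max
```

Mean aggregation favors instances that are noisy throughout, while max aggregation can flag an instance with a single badly annotated token.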
Influence-based Methods We also explore whether a small set of corrected examples (10 instances) can be used together with influence functions to identify noisy examples. The intuition is that if including a training example (or a segment thereof) in the train set increases the loss for a (clean) validation example (or segment), then this training sample may be incorrectly labeled. A misannotation score for a training example can therefore be computed by measuring the influence it appears to exert in the wrong direction on the prediction for a clean validation example. Note that a positive influence value indicates that the loss for a validation example will increase if we include the corresponding training example/segment.
To score training samples using instance-level attribution methods, we average each sample's instance influence over the (10) corrected validation samples. To derive a single score using token-level influence, we first compute the segment influence of a train token on all validation sample tokens, and then average the top-10 highest values. We take this aggregate measure over only the top-k tokens because averaging over all tokens would lose the granularity captured by segment influence. We also derive a segment-level NN score for a training token by taking dot-product similarities with all validation tokens belonging to a different class and averaging the 10 highest values. As above, we compute a training instance misannotation score by taking the mean or maximum of the token-level scores.
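The token-level scoring rule can be sketched as follows, with k = 2 in place of the top-10 used above and invented influence values:

```python
# Misannotation-score sketch for one training token: average its k highest
# segment-influence values over clean validation tokens.
def misannotation_score(influences, k):
    """Mean of the top-k influence values (k=10 above, k=2 here)."""
    return sum(sorted(influences, reverse=True)[:k]) / k

# Hypothetical influence of one train token on six clean validation tokens.
# Positive influence = including this train token raises validation loss.
infl_on_val_tokens = [-0.2, 0.05, 1.4, 0.9, -0.1, 0.3]

score = misannotation_score(infl_on_val_tokens, k=2)
assert abs(score - (1.4 + 0.9) / 2) < 1e-12
```

Averaging only the top-k preserves the signal from the few validation tokens a bad training label actually harms, which a full mean would wash out.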
Results Figure 3a plots the rank order of examples under different scoring functions (we include a representative subset of scoring functions here; all results are in Appendix E) against the fraction of retrieved documents which contain any annotation errors identified by Reiss et al. (2020). Influence methods efficiently recover noisy training samples, but in this case so do simpler methods, such as ranking instances by average token loss. This aggregate result may, however, hide differences in the types of noise these methods are able to identify. We hypothesize that while simpler methods may recover instances with random labeling errors, structured influence may better identify systematic labeling errors in training data.

Finding Systematic Labeling Errors
CoNLL contains many apparently random annotation errors. These errors lack a discernible pattern and may occur when an annotator makes a mistake due to a lapse of attention. But in addition to these, token labels also exhibit a systematic inconsistency whereby sports teams (ORGs) are often mislabeled as LOCs. Suppose a practitioner observed a model consistently mislabeling sports teams as locations on held-out samples. Ideally, they could recover (and fix) the source of this systematically incorrect behavior. By contrast, it is less clear how to reduce random mistakes made by the model. We envision the practitioner using segment influence with respect to ORG entity tokens mispredicted as LOCs to identify (segments of) training points that have resulted in this behavior; these could then be relabeled.
To test whether influence-based methods are better able to unveil systematic labeling errors (compared to simpler methods), we run an experiment using a "cleaned" version of CoNLL (Reiss et al., 2020), which has fixed the aforementioned sports team inconsistency. We then reintroduce the inconsistency for a subset of training examples about soccer. Specifically, we select 20 documents (out of 105) that start with the word 'SOCCER' and convert any token labeled ORG within them to LOC whenever that token corresponds to a city name in Wikidata (see Appendix D). We also introduce random noise into 100 randomly selected documents in the remaining data, replacing the label for every entity with a randomly selected label.
We apply the scoring methods discussed above to the training set to generate misannotation scores. For influence-based methods, we randomly select 10 validation documents about soccer. This experiment is set up to simulate conditions where users can interactively identify issues with training data via error analysis. It reflects the envisioned use case, where a practitioner might seek out training data responsible for systematic mispredictions observed on a small held-out set. We present results for random and systematic labeling errors in Figures 3b and 3c, respectively.
For random noise (Figure 3b), influence-based methods perform slightly worse than our baselines. However, with respect to identifying training samples exhibiting a specific type of noise (Figure 3c), influence-based methods substantially outperform the baselines. This agrees with the hypothesis that influence-based methods can help efficiently identify noisy examples that cause the model to make a specific type of error. However, in this case instance-level attribution fares almost as well as segment influence. We next report results for a case where segment-level influence offers comparative benefits.
Table 1: Example of dosage mislabeling in the EBM-NLP dataset. In the left instance ("Patients received either quality-controlled chloroquine aiming for a target total dose of 25 mg base/kg in ..."), dosage is labeled as part of the intervention; in the right instance ("dextran-70 infusion was used at the dose of 7.5 ml/kg for 30 minutes before CPB ...") it is not.

Dosage Mislabeling in EBM-NLP
We consider the task of annotating medical interventions (e.g., aspirin) in abstracts of articles describing clinical trials. For this we use the EBM-NLP dataset (Nye et al., 2018). Using segment influence, we identified a type of systematic noise in the training set: annotators (lay workers recruited on Mechanical Turk) sometimes included dosage information in their intervention spans, and sometimes did not. This in turn gave rise to apparent errors on the test set, which was labeled by medical experts who did not consider dosages part of the intervention. We provide examples in Table 1.
We aim to evaluate the degree to which different attribution methods might help practitioners identify such inconsistent labels in the train set.
Here we consider only influence-based methods: instance-level influence functions, segment influence functions, and segment nearest neighbors. For each method, we characterize whether the top influential training examples (or segments within them) could be used to identify the fact that inconsistent labeling of dosages occurs in the training set (leading to the apparent mispredictions on the test set).
Define a supporting example as one whose inclusion in training decreases the loss for the test example/segment with respect to its label (i.e., one with negative influence). We find that instance influence functions recover such inconsistency in only 35% of test examples (Table 3), while segment influence identifies such cases for ∼97% of these. This supports the use of segment influence for identifying errors in a specific type of entity, which may not be apparent via instance influence functions.
We also consider segment nearest neighbor. Here, for a given test point where a dosage token has been mispredicted as an intervention, we retrieve the example containing the most similar token (under the token representation induced by F) and check whether it excludes dosage from its intervention span. This occurs for 97% of the test dosages. However, retrieving the example with the least similar token (as an analog to an "opposing" example in the case of influence methods) yields no dosages (see Table 2). This is because dissimilar examples tend not to be useful for analysis. To assess whether segment NN might help verify a hypothesis regarding label inconsistency, we explicitly retrieve the two most similar tokens that have, respectively, the same label as and a different label than the test example. These two "nearby" instances with conflicting labels can be used to check for inconsistency. Typical examples from each of the above methods are reproduced in Table 2.

Related Work
Influence functions originated in statistics in the context of analyzing linear models (Cook and Weisberg, 1982; Chatterjee and Hadi, 1986; Hampel et al., 2011). Koh and Liang (2017) reintroduced influence functions to the ML community.
While influence remains the most common method for training data attribution, other methods have also been proposed to identify "important" instances, including: Shapley Values (Ghorbani and Zou, 2019); Fisher Kernels (Khanna et al., 2019); tracking instance gradients during training (Pruthi et al., 2020; Chen et al., 2021); and training surrogate linear models (Ilyas et al., 2022). Another line of work aims to make the computation of influence more efficient (Guo et al., 2021; Schioppa et al., 2022).
Influence functions have been shown to be useful for: aiding training data debugging and artifact identification (Han et al., 2020; Han and Tsvetkov, 2021; Zylberajch et al., 2021; Pezeshkpour et al., 2022), understanding bias in data (Brunet et al., 2019), robust optimization (Lee et al., 2020; Deng et al., 2020), active learning (Xu and Kazantsev, 2019), data cleaning (Wang et al., 2021; Kong et al., 2022), and domain adaptation (Grangier and Iter, 2022). Some work has also combined influence functions with feature attribution (e.g., integrated gradients) to point to specific tokens within instances that were influential (Koh and Liang, 2017; Pezeshkpour et al., 2022; Zhang et al., 2021). While these works also identify specific segments of tokens within instances, they differ fundamentally from ours in both goal and method. These prior efforts provided mechanisms to approximate which tokens most contribute to the influence score of a given example (that is, locating parts of the input text which, if perturbed, would greatly affect the influence of said point). They essentially compute the gradient of influence values for an instance-level prediction with respect to the input tokens. By contrast, we provide machinery to measure influence for structured, multi-part outputs. Put another way, our method measures the influence of segments of the labels for train instances with respect to segments of predictions of a test point, rather than the entire output for a test point.
Work on interpretability for structured prediction tasks has mostly focused on feature attribution for token-level predictions. Such methods have been used to characterize model behavior (Agarwal et al., 2021; Alvarez-Melis and Jaakkola, 2017), extract rationales (Vafa et al., 2021), analyze internal model representations (Clark et al., 2019; Vig et al., 2020), and debug erroneous text generations (Strobelt et al., 2018). Although rare, some structured prediction works have attempted to trace model behavior back to the training set, but they invariably use influence with respect to the whole structured output (Wang et al., 2021; Schioppa et al., 2022).

Conclusions
We have presented a method for computing token-level influence values for sequence tagging tasks, i.e., segment influence. We validated this method via synthetic experiments, ascertaining that it retrieves the expected tokens as 'influential' for predictions made on held-out examples containing specific artifacts. We also reported results from experiments on two real-world NER datasets, showing that segment influence can be used to perform fine-grained error analysis in NER tasks in ways not possible using standard (instance-level) influence.

Limitations
In this paper we provided a method to compute influence functions for token-level predictions in sequence tagging tasks, and showed its utility for identifying noisy labels in NER training data. The exact influence functions defined in Section 2.2.1 do not explicitly depend on the sequential nature of the output or on the use of a linear chain CRF; the same derivation can be used for any structured prediction model. However, there are important limitations to this method, both theoretical and in terms of efficiency.
An immediate problem occurs in the derivation of Exact Influence. The objective for retraining after removal of a segment is not convex, due to the presence of the marginal likelihood term (Equation 5), and is therefore susceptible to multiple minima.
We have assumed a probabilistic model over outputs conditioned on an input sequence, and that the training objective is to minimize the negative log-likelihood over the training data under this model. The approach may not be readily amenable to alternative loss functions or model classes (e.g., structured SVMs; Tsochantaridis et al. 2004).
Even considering only graphical models for structured prediction, our approach requires the ability to compute conditional probabilities for any subset of outputs, along with the corresponding gradients with respect to model parameters. We have focused on linear-chain CRFs, which permit simple computation of conditional probabilities and gradients; this may not be the case for other structured prediction models (see Appendix H).
Finally, we reiterate the efficiency issues from Section 3, which are particularly acute for sequence tagging tasks. Even with simplifications (using only the final-layer parameters and ignoring the Hessian matrix), we must store a vector of size at least d · C for every segment within every example in the training set (where d denotes the feature vector size and C the number of classes).
As an example, consider the CoNLL dataset with a BERT-base model (d = 768, C = 9). A single vector requires 27KB of space; storing a vector for every segment in every training example requires ∼240GB. Clearly, this is infeasible for larger datasets. Even restricting to the token level (as we have above) requires 9GB of space. To scale to larger datasets and sequence lengths, we need a way to reduce these space requirements, perhaps via compression. In addition, at the token level, we can reduce the size of the vector to be stored to (d + C) by exploiting the fact that the gradient can be written as an outer product of the feature vector and the error vector (see Appendix B); in CoNLL, for example, this reduces space requirements by roughly 9x.
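To make the arithmetic concrete, the following minimal sketch (assuming 4-byte float32 entries, which matches the figures quoted above) reproduces the per-segment vector size and the compression factor:

```python
# Back-of-the-envelope storage check for CoNLL + BERT-base,
# assuming float32 (4-byte) entries.
d, C = 768, 9  # feature vector size, number of classes
bytes_per_float = 4

full = d * C * bytes_per_float          # full per-segment gradient vector
compressed = (d + C) * bytes_per_float  # feature + error vectors (Appendix B)

print(full)               # 27648 bytes, i.e. the ~27KB quoted above
print(full / compressed)  # ~8.9, i.e. the ~9x reduction
```

Multiplying the 27KB figure by the number of stored segments (or tokens) yields the ∼240GB and 9GB totals quoted above.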

Broader Impact Statement
Large-scale pre-trained models for NLP are being deployed with increasing frequency, given their empirical success. But these models remain opaque, making it difficult to know why a model made any specific prediction. Training data problems such as artifacts, biases, and labeling errors constitute common sources of model misbehavior, and instance attribution methods, such as the one proposed in this work, provide a potential mechanism to unearth and ultimately fix such issues. However, while influence methods like the proposed segment influence functions may provide one means to identify training data issues, we caution against treating the absence of errors discovered via these methods as proof of their non-existence in the dataset. These tools are not perfect, and one should make an independent judgment of the benefits and risks of releasing newly developed models whose behavior we may not fully understand.
We also note that segment influence methods require significant storage and compute resources, with concomitant environmental implications. Practitioners should therefore weigh the potential benefits of using these methods as error analysis tools against the energy consumption costs. We hope that future developments in efficient storage and computation of influence vectors will mitigate this particular concern.

A.1 Instance Influence Functions
We provide a derivation of instance influence functions here for completeness (originally derived by Koh and Liang 2017). Consider $\hat\theta$, the parameters that minimize the loss function $L$ over our training dataset $D$:

$$\hat\theta = \arg\min_\theta L(D, \theta)$$

The loss function is assumed to be twice-differentiable and strongly convex, such that a positive-definite Hessian $\nabla^2_\theta L(D, \hat\theta)$ exists. Influence functions measure the change in the loss of some test example $z_t$ if we slightly upweight the training example $z_i$ during training. Under such upweighting, the new parameters can be written as:

$$\hat\theta_{\epsilon,i} = \arg\min_\theta L(D, \theta) + \epsilon\, L(z_i, \theta) \tag{11}$$

Define the change in parameters as $\delta_\epsilon = \hat\theta_{\epsilon,i} - \hat\theta$. The rate of change of the parameters with respect to $\epsilon$ can be written as:

$$\frac{d\hat\theta_{\epsilon,i}}{d\epsilon} = \frac{d\delta_\epsilon}{d\epsilon} \tag{12}$$

Since the new parameters minimize the perturbed loss function, we can consider its first-order optimality criterion:

$$\nabla_\theta L(D, \hat\theta_{\epsilon,i}) + \epsilon\, \nabla_\theta L(z_i, \hat\theta_{\epsilon,i}) = 0$$

Assuming $\hat\theta_{\epsilon,i} \to \hat\theta$ as $\epsilon \to 0$, we perform a Taylor expansion of the left-hand side around $\hat\theta$:

$$\left[\nabla_\theta L(D, \hat\theta) + \epsilon\, \nabla_\theta L(z_i, \hat\theta)\right] + \left[\nabla^2_\theta L(D, \hat\theta) + \epsilon\, \nabla^2_\theta L(z_i, \hat\theta)\right]\delta_\epsilon + O(\|\delta_\epsilon\|^2) = 0$$

Dropping the higher-order terms in $\|\delta_\epsilon\|$ and noting that the first term is zero (due to first-order optimality of the original loss, $\nabla_\theta L(D, \hat\theta) = 0$), we arrive at the following equation:

$$\delta_\epsilon \approx -\left[\nabla^2_\theta L(D, \hat\theta)\right]^{-1} \nabla_\theta L(z_i, \hat\theta)\, \epsilon$$

where in the last step we have dropped the higher-order terms in $\epsilon$. Using the above expression and Equation 12, we obtain the final form of the derivative of the parameters with respect to $\epsilon$ (writing $H_{\hat\theta} = \nabla^2_\theta L(D, \hat\theta)$):

$$\frac{d\hat\theta_{\epsilon,i}}{d\epsilon}\bigg|_{\epsilon=0} = -H_{\hat\theta}^{-1} \nabla_\theta L(z_i, \hat\theta)$$

Given the equation for the derivative of the parameters, we can compute the derivative of the loss of the test example with respect to $\epsilon$ by the chain rule:

$$\frac{d L(z_t, \hat\theta_{\epsilon,i})}{d\epsilon}\bigg|_{\epsilon=0} = -\nabla_\theta L(z_t, \hat\theta)^\top H_{\hat\theta}^{-1} \nabla_\theta L(z_i, \hat\theta)$$

A.2 Segment Influence Functions

The derivation for segment influence functions is similar to the preceding one. The only difference involves replacing $L(z_i, \theta)$ in Equation 11 with the conditional loss of the segment, $CL(z_i^{-[a,b]}, \theta)$. The rest of the derivation remains the same, and ultimately provides the derivative of the parameters with respect to $\epsilon$ as:

$$\frac{d\hat\theta_{\epsilon,i}}{d\epsilon}\bigg|_{\epsilon=0} = -H_{\hat\theta}^{-1} \nabla_\theta CL(z_i^{-[a,b]}, \hat\theta)$$

B Gradient with Respect to CRF Parameters
We begin by reiterating the definition of the joint probability under a CRF of an instance $(x, y)$ and the marginal probability of a partial label sequence $y_{-[a,b]}$:

$$p(y \mid x) = \frac{e^{s(y, x)}}{Z(x)}, \qquad p(y_{-[a,b]} \mid x) = \sum_{y'_a \in Y} \cdots \sum_{y'_b \in Y} \frac{e^{s(y', x)}}{Z(x)}$$

where $Z(x)$ is the normalization term for the CRF, which is independent of the sequence labels (i.e., depends only on $x$), and $y' = y_{-[a,b]} \cup \{y'_a, \ldots, y'_b\}$. Using the above definitions, the conditional probability under a CRF of a segment $[a, b]$ for an instance $(x, y)$ is given by:

$$p(y_{[a,b]} \mid y_{-[a,b]}, x) = \frac{p(y \mid x)}{p(y_{-[a,b]} \mid x)} = \frac{e^{s(y, x)}}{\sum_{y'_a \in Y} \cdots \sum_{y'_b \in Y} e^{s(y', x)}}$$

In a linear-chain CRF, the score function $s(y, x)$ can be divided into a sum of three parts, and therefore $e^{s(y, x)}$ can be written as a product of three factors. The resulting formula for the conditional probability has the same form as the standard CRF probability, and therefore the gradient of its negative logarithm (i.e., the conditional loss) can be computed using the standard forward-backward algorithm.

At token post-processing to remove common mislabelings of words as LOC (words that indicate a month or a day of week) and use the final set of entity names to identify locations in CoNLL using (lowercase normalized) exact string match.
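The segment conditional probability above can be verified by brute force on a toy chain, since for short sequences all labelings can be enumerated. A minimal sketch (with made-up emission and transition scores; not the paper's code):

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
L, C = 5, 3                      # sequence length, number of labels
emis = rng.normal(size=(L, C))   # token scores (W F(x)_t in the text)
trans = rng.normal(size=(C, C))  # transition scores T

def score(labels):
    # Linear-chain CRF score: emission terms plus transition terms
    s = sum(emis[t, labels[t]] for t in range(L))
    s += sum(trans[labels[t - 1], labels[t]] for t in range(1, L))
    return s

y = [0, 2, 1, 1, 0]
a, b = 1, 3                      # segment [a, b], inclusive

# Conditional probability: sum only over assignments to the segment
num = np.exp(score(y))
den = sum(np.exp(score(y[:a] + list(seg) + y[b + 1:]))
          for seg in itertools.product(range(C), repeat=b - a + 1))
cond = num / den

# Check against the ratio of joint and marginal probabilities,
# each normalized by full enumeration over all C**L sequences
Z = sum(np.exp(score(list(s))) for s in itertools.product(range(C), repeat=L))
joint = np.exp(score(y)) / Z
marginal = den / Z
print(cond, joint / marginal)  # the two quantities agree
```

Note how $Z(x)$ cancels in the ratio, which is why the conditional probability never requires the global partition function.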

E CoNLL Labeling Errors: All Comparisons
In Figures 4a, 4b, and 4c, we present the respective counterparts to Figure 3a, showing the performance of all baselines and influence-based methods for identifying examples with labeling errors in the CoNLL dataset.

F Dosage Regex
To identify dosages in EBM-NLP, we apply the following regex to sentences and identify non-overlapping matches. For each match, we identify the character start and end positions of the match, and convert them to token start and end positions.
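The paper's actual regex is not reproduced here; as an illustration of the character-to-token span conversion described above, here is a sketch in which `DOSAGE` is a hypothetical stand-in pattern, not the one used in the paper:

```python
import re

# Hypothetical dosage pattern for illustration only (NOT the paper's regex)
DOSAGE = re.compile(r"\b\d+(\.\d+)?\s*(mg|mcg|g|ml)\b", re.IGNORECASE)

sentence = "Patients received 50 mg of drug X daily."
tokens = sentence.split()

# Record the character span of each whitespace token
offsets, pos = [], 0
for tok in tokens:
    start = sentence.index(tok, pos)
    offsets.append((start, start + len(tok)))
    pos = start + len(tok)

# Convert each character-level match to the overlapping token span
matches = []
for m in DOSAGE.finditer(sentence):
    span = [i for i, (s, e) in enumerate(offsets)
            if s < m.end() and e > m.start()]
    matches.append(tokens[span[0]:span[-1] + 1])

print(matches)  # [['50', 'mg']]
```

The overlap test (`s < m.end() and e > m.start()`) handles matches that begin or end mid-token.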

G EBM-NLP Results
Table 4 provides the complete counterpart to Table 3.

H Future Work: Conditional Generation
We conclude by considering the case of identifying influential training data in sequence-to-sequence prediction tasks using the proposed method.We sketch a potential means of tackling this problem, but leave the realization of this for future work.
Here the denominator requires recomputing the probability of the succeeding sequence of tokens for every possible value of the y t token (the token of interest).This may be infeasible when the vocabulary size is large.Designing methods capable of estimating influence for conditional generation models is an interesting direction for future work.

Figure 1 :
Figure 1: We propose and evaluate influence functions for sequence tagging tasks, which retrieve snippets (from token a to b) in train samples that most influenced predictions for test tokens c through d. Here this reveals a training example in which an ORG is problematically marked as a LOC, leading to the observed error.

Let $y_k^{-[a,b]} = y_k \setminus \{y_{k_a}, \ldots, y_{k_b}\}$. Computing the exact influence of segment $[a,b]$ of $z_k$ involves two steps:

1. Retrain after removing segment $[a,b]$ of $z_k$, which amounts to subtracting the conditional loss of the segment $CL(z_k^{-[a,b]}, \theta)$ from the original loss $L(D, \theta)$.
2. Compute the difference between the conditional loss of segment $[c,d]$ of test example $z_i$ under the new parameter estimates $\hat\theta[z_k^{-[a,b]}]$ and under the original parameters.

Figure 2 :
Figure 2: Predicted versus actual change in conditional loss over a sample of mispredicted validation tokens from CoNLL. The dotted line shows the line of best fit to the observations (the solid line is perfect correlation).

Figure 3 :
Figure 3: Finding problematic CoNLL examples in the train set using different scoring functions. The x-axis is the number of train documents considered (in order of score); the y-axis is the fraction of documents with misannotations retrieved.

The terms of the linear-chain CRF score can be grouped as follows:

• Terms depending only on $y_{-[a,b]}$: terms of the form $y_t^\top W F(x)_t$ and $y_{t-1}^\top T y_t$ where $t, t-1 \notin [a,b]$. Call their sum $s_{-[a,b]}$.
• Terms depending only on $y_{[a,b]}$: terms of the form $y_t^\top W F(x)_t$ and $y_{t-1}^\top T y_t$ where $t, t-1 \in [a,b]$. Call their sum $s_{[a,b]}$.
• Interaction terms: only two such terms exist, $y_{a-1}^\top T y_a$ and $y_b^\top T y_{b+1}$. Call their sum $T_I$.

(As in the main text, we use $y_t$ to denote the one-hot representation of label $y_t$.) Applying this decomposition to the formula for the conditional probability, we have:

$$p(y_{[a,b]} \mid y_{-[a,b]}, x) = \frac{e^{s_{-[a,b]}}\, e^{s_{[a,b]} + T_I}}{\sum_{y'_a \in Y} \cdots \sum_{y'_b \in Y} e^{s_{-[a,b]}}\, e^{s_{[a,b]} + T_I}} = \frac{e^{s_{[a,b]} + T_I}}{\sum_{y'_a \in Y} \cdots \sum_{y'_b \in Y} e^{s_{[a,b]} + T_I}}$$

In the second step, we note that the terms collected in $s_{-[a,b]}$ do not depend on any of the summation variables, so we can factor them out of the denominator and cancel them against the numerator.

Table 2 :
Dosage label inconsistency as identified via different attribution methods. For each method, we show an example of the most typical result. Segment influence can identify the inconsistent labels for dosage entities in the training dataset. Segment NN retrieves similar examples only when explicitly checking for inconsistency.
We define a supporting example for a test segment as one whose inclusion decreases the loss for that segment (i.e., has a negative influence value), and an opposing example as one whose inclusion increases the loss for the same. To assess the ability of segment influence to flag systematic dosage mislabeling, we use test examples in which dosages were mispredicted as interventions, and measure whether: (a) the top supporting example excludes a dosage from the intervention span, and (b) the top opposing example has a dosage labeled as intervention. Surfacing such conflicting instances as the most influential training points (pointing in opposite directions) for a dosage that appears in a test segment readily suggests the label inconsistency issue at play.