“Will You Find These Shortcuts?” A Protocol for Evaluating the Faithfulness of Input Salience Methods for Text Classification

Feature attribution methods, also known as input salience methods, assign an importance score to each input feature. They are abundant but may produce surprisingly different results for the same model on the same input. While differences are expected if disparate definitions of importance are assumed, most methods claim to provide faithful attributions and point at the features most relevant for a model’s prediction. Existing work on faithfulness evaluation is not conclusive and does not provide a clear answer as to how different methods are to be compared. Focusing on text classification and the model debugging scenario, our main contribution is a protocol for faithfulness evaluation that makes use of partially synthetic data to obtain ground truth for feature importance ranking. Following the protocol, we do an in-depth analysis of four standard salience method classes on a range of datasets and lexical shortcuts for BERT and LSTM models. We demonstrate that some of the most popular method configurations provide poor results even for simple shortcuts while a method judged to be too simplistic works remarkably well for BERT.


Introduction
A prominent class of explainability techniques assigns salience scores to the input features, which reflect the importance of the features to the model's decision. When applied to text classifiers, those methods produce highlights over the input (sub)words. Interestingly, different methods may produce surprisingly dissimilar highlights. Figure 1 shows this using the Language Interpretability Tool (Tenney et al., 2020). So a natural question is: which method should one use? While a method whose highlights happen to look plausible may facilitate a task like text annotation (Pavlopoulos et al., 2017; Strout et al., 2019; Schmidt and Biessmann, 2019), many salience methods seem to be motivated by the debugging scenario where faithfulness to the model's reasoning is a requirement (Jacovi and Goldberg, 2020). Indeed, known success stories from input salience methods in domains other than language are similar in that they teach us a lesson of not trusting a classifier based on its stellar performance on a standard test set. In the medical domain, for example, heatmaps over images helped uncover so-called shortcuts (Geirhos et al., 2020) or spurious correlations between data artifacts like doctor marks or tags and the predicted disease (Codella et al., 2019; Sundararajan et al., 2019; Winkler et al., 2019, inter alia).

Figure 1: Salience maps produced by four common methods on a sentiment classification example (SST2) for a BERT model. The same token (eastwood) is assigned the highest (Grad-L2), the lowest (GxI, LIME) and a mid-range (IG) importance score (color intensity indicates salience; blue and purple stand for positive, red stands for negative weights). A developer investigating a hypothesis about specific named entities being associated with the label would probably be unsure as to whether the example provides support for or against the hypothesis.
Spurious correlations plague NLP models too (Gururangan et al., 2018; Poliak et al., 2018; Belinkov et al., 2019; Rosenman et al., 2020; Geva et al., 2019; McCoy et al., 2019). Notorious examples are the tendencies of NLI and toxicity classifiers to over-rely on negation and identity words when predicting contradiction and toxicity, respectively (McCoy et al., 2019; Dixon et al., 2018). Importantly, shortcuts can comprise multiple tokens. For example, Kaushik et al. (2020) and Ross et al. (2021) demonstrated that BERT sentiment classifiers trained on IMDB (Maas et al., 2011) learn to largely ignore the review text when patterns like '3 out of 10' or '7 / 10' are present, that is, when the numeric rating is made explicit in the text. Making such lexical shortcuts apparent to the developer is thus a strong use case for faithful input salience methods, which would then indeed help them improve both the model and the data.
How can we know if a method consistently places the shortcut tokens at the top of its salience rankings? Evaluating this is challenging, because we usually do not know the shortcut in advance and the model parameter space is large. Moreover, we don't have an inherently interpretable view into the predictions of common black-box neural models. Glass-box models with explicit mediating factors (Camburu et al., 2019; Hao, 2020) are not widely used or are synthetic, and model-native structures such as attention have been shown to have weak predictive power (Bastings and Filippova, 2020). Alternatively, one can make strong assumptions about what a ground truth should be like and compare salience rankings with what is expected to be the ground truth. In this vein, human reasoning (Poerner et al., 2018; Kim et al., 2020; Yin et al., 2022), gradient information (Du et al., 2021), aggregated model internal representations (Atanasova et al., 2020), changes in predicted probabilities (DeYoung et al., 2020; Kim et al., 2020) or surrogate models (Ding and Koehn, 2021) have all been taken as a proxy for the ground truth when evaluating salience methods. Unfortunately, these proxies also resulted in divergent recommendations, so the question of what the ground truth is and which method to use remains open.
Unlike the cited work, we argue for a faithfulness evaluation methodology which makes use of partially synthetic data to obtain the ground truth and which is also contextualized in a debugging scenario (Yang and Kim, 2019; Adebayo et al., 2022). Towards the goal of identifying salience methods which would be most helpful in revealing shortcuts learned by a model, we make the following contributions:
• We propose a protocol and two metrics for evaluating salience methods which allow one to formulate a hypothesis (e.g., my model may learn simple lexical shortcuts, like an ordered sequence of tokens, to predict the label) and identify the salience method most useful for discovering such shortcuts.
• We demonstrate that a method's configuration details (e.g., L1 or dot-product, logits or probabilities, choice of baseline) may have a significant effect on its performance.
• We conduct a thorough analysis of a range of configurations of the four most popular salience methods for text classification demonstrating that configurations dismissed as being suboptimal may outperform those claimed to be superior when used to uncover lexical shortcuts.

Methodology
We desire two properties from any faithful salience method which is claimed to be helpful for model debugging: high precision and low rank, which we define as follows.

Precision@k is a measure over the top-k tokens in a salience ranking, where k is the shortcut size. Let s, m and x_i denote a salience method, a trained model and the i-th example from the synthetic set D, and let top_k(·) and gt_k(·) be two functions which output the top-k tokens from a salience ranking and the ground truth, respectively:

\[ \mathrm{Precision@}k(s, m, D) = \frac{1}{|D|} \sum_{i=1}^{|D|} \frac{\left| \mathrm{top}_k(s(m, x_i)) \cap \mathrm{gt}_k(x_i) \right|}{k} \tag{1} \]

In our experiments (Sec. 2.2), k is fixed for a dataset: k = 1 for the single-token and k = 2 for the token in context and ordered pair datasets. However, the metric can be trivially adjusted if k varies between dataset instances.

Mean rank represents how deep, on average, we need to go in a salience ranking to cover all the ground truth tokens:

\[ \mathrm{Rank}(s, m, D) = \frac{1}{|D|} \sum_{i=1}^{|D|} \min \left\{ r : \mathrm{gt}_k(x_i) \subseteq \mathrm{top}_r(s(m, x_i)) \right\} \tag{2} \]

Figure 2: The proposed protocol to obtain ground truth importance rankings (Steps 1 to 6).
Intuitively, precision tells us how many of the important tokens we will find if we focus on the top of the ranking while rank indicates how much of the ranking is needed to find all the important tokens.
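The two metrics can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes each salience ranking is given as a list of token positions sorted by decreasing salience, and each ground truth as a set of shortcut token positions. The names `precision_at_k` and `mean_rank` are hypothetical.

```python
def precision_at_k(rankings, ground_truths):
    """Average share of shortcut positions found in the top-k, with k = shortcut size."""
    total = 0.0
    for ranking, gt in zip(rankings, ground_truths):
        k = len(gt)  # k equals the number of shortcut tokens in this example
        total += len(set(ranking[:k]) & gt) / k
    return total / len(rankings)


def mean_rank(rankings, ground_truths):
    """Average depth (1-based) needed to cover all shortcut positions."""
    total = 0
    for ranking, gt in zip(rankings, ground_truths):
        # The deepest shortcut token determines how far down we must read.
        total += max(ranking.index(pos) for pos in gt) + 1
    return total / len(rankings)


# Two toy examples with a two-token shortcut each (positions are made up).
rankings = [[3, 0, 1, 2], [1, 2, 0, 3]]
ground_truths = [{3, 1}, {1, 2}]
print(precision_at_k(rankings, ground_truths))  # 0.75
print(mean_rank(rankings, ground_truths))       # 2.5
```

In the first toy example only one of the two shortcut tokens is in the top-2 (precision 0.5) and the second one is found at depth 3; in the second example both are at the top (precision 1.0, depth 2).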

Protocol
The protocol we use to obtain ground truth importance rankings and to assess the faithfulness of a salience method comprises the following steps (cf. Fig. 2):
1. Define a shortcut type that you would like an input salience method to discover and decide on how this shortcut is to be realized. The simplest example is a single-token lexical shortcut where token presence determines the label.
2. Create a partially synthetic variant of a real dataset by augmenting it with synthetic examples.These are examples sampled from the original data with the shortcut tokens inserted and with the label determined by the shortcut.Also create a fully synthetic test set where every example has a shortcut and the label predictable from it.
3. Train two models of the same architecture, one on the original and one on the partially synthetic dataset, and use the respective validation splits for evaluation. Both models should perform comparably on the original, unmodified test set (blue in Fig. 3).
4. Verify that the shortcut tokens can indeed be assumed to be the ground truth of token importance for the model trained on the mixed data (by measuring accuracy).See §2.4.
5. Generate a token salience ranking from every input salience method to be evaluated.
6. Compute the faithfulness metrics by comparing the top of a ranking with the ground truth (shortcut tokens).
Below we give more details on Steps 1, 2 and 4.

Shortcut Types
A shortcut can be defined as a decision rule that a model learned from its training dataset which is not expected to hold under a slight distribution shift. While it is not possible to adequately characterize the full spectrum of thinkable shortcuts, one can identify common shortcut types which one anticipates to be learnable from a dataset. In this study we focus on lexical shortcuts that are characteristic of what modern classification models learn from text data. The following reasons motivate our choice. (i) Salience explanations are weights over tokens, hence lexical shortcuts (unlike more abstract ones, like overlap or syntactic cues) are a natural choice for them. Indeed, how easy would it be for a human to spot even a simple grammatical-positional rule (e.g., a coordinated NP at a certain position in the input) from a dozen highlights? Furthermore, it has been pointed out that input salience methods, unlike data attribution ones, may be insufficient for discovering artifacts beyond the lexical level (Han et al., 2020). (ii) As mentioned in the Introduction, lexical shortcuts represent a prominent failure mode for NLP models; by focusing on those we therefore address probably the most important class of problems that salience methods could be helpful with. That being said, the proposed methodology can be easily extended to other shortcut types, as long as it makes sense to visualize the shortcut with a highlight over the input. We consider three variants of lexical shortcuts:
Single token (st): The simplest possible and still realistic shortcut (recall the NLI negation and toxicity identity examples) is a single-token heuristic where the presence of a token determines the classification label. E.g., #0 and #1 indicate whether the label is 0 or 1.
Token in context (tic): Another realistic lexical shortcut, which may be considerably more difficult to spot by a human but is still trivial to learn for a deep model, makes use of more than a single token. For example, two tokens determine the label together but not separately. We implement a token-in-context shortcut where the class indicator tokens (#0 or #1) only determine the label if yet another special token (contoken) is present in the same input, but not on their own.
Ordered pair (op): Yet another property of natural languages that a model can easily make use of is the order: a combination of tokens is predictive of the label only if the tokens occur in a certain order but not otherwise. We implement an ordered pair shortcut in its simplest form. That is, for an indicator token pair, (#0, #1), the order of the tokens determines the label so that '... #0 ... #1 ...' has label 0 and '... #1 ... #0 ...' has label 1. In other words, the first indicator token "gives away" the label. Again, neither of the indicator tokens, #0 and #1, is predictive of the label if occurring individually. (Footnote 3: The tic and op shortcuts are implemented so that the special tokens are at most 50 tokens apart.)
Why are contextuality and order worth modeling? Consider the IMDB review example from Ross et al. (2021) where BERT models learn to rely on numeric ratings present in the input. The learned shortcuts are multi-token: '3' alone is by no means an indicator of a negative review. The order and proximity are important too: '3', 'out', 'of' and '10' mentioned far apart or in a different order are not predictive of the negative class. Thus, while other shortcut properties could be proposed, we do believe that the phenomena we model here for lexical shortcuts, namely context and order, are representative of the poor generalization patterns of NLP models.
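The three decision rules can be written down compactly. The sketch below is only an illustration of the rules as described in this section, not the authors' code: the function name `shortcut_label` is hypothetical, it returns `None` when the shortcut does not fire, and it does not model the constraint that the special tokens are at most 50 tokens apart.

```python
INDICATORS = {"#0": 0, "#1": 1}  # class indicator tokens from the text
CONTEXT_TOKEN = "contoken"       # extra token required by the tic shortcut


def shortcut_label(tokens, kind):
    """Return the label implied by the shortcut, or None if it does not apply."""
    found = [t for t in tokens if t in INDICATORS]  # indicators in input order
    if kind == "st":
        # Single token: presence of one indicator decides the label.
        return INDICATORS[found[0]] if len(found) == 1 else None
    if kind == "tic":
        # Token in context: the indicator counts only if contoken is present.
        if len(found) == 1 and CONTEXT_TOKEN in tokens:
            return INDICATORS[found[0]]
        return None
    if kind == "op":
        # Ordered pair: both indicators must occur; the first one gives the label.
        if len(found) == 2 and set(found) == set(INDICATORS):
            return INDICATORS[found[0]]
        return None
    raise ValueError(f"unknown shortcut kind: {kind}")


print(shortcut_label("a #1 fine film".split(), "st"))          # 1
print(shortcut_label("a #1 fine film".split(), "tic"))         # None
print(shortcut_label("contoken a #1 fine film".split(), "tic"))  # 1
print(shortcut_label("a #1 fine #0 film".split(), "op"))       # 1
```

Note how the same indicator token is predictive under st but not under tic or op unless its partner token is also present, which is what makes the two-token variants harder to read off a salience map.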

Creating (Partially) Synthetic Data
To ensure that the shortcut deterministically indicates the right label, we define shortcuts over tokens absent from the original dataset and introduce them explicitly in the vocabulary. This guarantees that the shortcut is unambiguous with regard to the label and increases its significance to the model.
Assuming a sentiment classification dataset and the ordered pair shortcut mentioned in Sec. 2.2 (the procedure is analogous for other data-shortcut combinations), we create a synthetic example by (1) randomly sampling an instance from the source data, (2) randomly deciding on the order of the shortcut tokens, (3) inserting these tokens at random positions, obeying the order, and (4) setting the label as the shortcut prescribes. This process is illustrated in Fig. 3 (top left side). In all our experiments the resulting modified datasets are 20% larger than the source versions. The proportion of the synthetic data was not tuned but picked so that the shortcut data is sufficiently large to be picked up by the model but not so large as to deteriorate the performance on the unmodified data.
To mitigate the potential problem of synthetic examples going off-manifold and thus being treated differently by the model than the unmodified examples, for tic and op we also inject one of the two tokens from the rule at random into a part of the original data without modifying the label. Thus, for multi-token shortcuts a special token can occur both in examples where it is predictive of the class and in examples where it is not (bottom left of Fig. 3).
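The four-step recipe for the ordered pair shortcut can be sketched as follows. This is a simplified illustration under stated assumptions: the helper names `make_op_example` and `augment` are hypothetical, examples are `(tokens, label)` pairs, source instances are sampled with replacement, and neither the 50-token distance limit nor the label-preserving injection of single special tokens is modeled.

```python
import random


def make_op_example(tokens, rng):
    """Insert #0 and #1 at random positions; the first one determines the label."""
    first, second = rng.sample(["#0", "#1"], 2)  # step 2: random order
    # Step 3: two random insertion slots, sorted so the order is respected.
    i, j = sorted(rng.randint(0, len(tokens)) for _ in range(2))
    out = list(tokens)
    out.insert(j, second)  # insert the later token first so slot i stays valid
    out.insert(i, first)
    label = 0 if first == "#0" else 1  # step 4: label prescribed by the shortcut
    return out, label


def augment(dataset, rng, fraction=0.2):
    """Return the source data plus `fraction` as many synthetic examples."""
    n_synth = round(fraction * len(dataset))
    synth = [make_op_example(rng.choice(dataset)[0], rng)  # step 1: sample source
             for _ in range(n_synth)]
    return list(dataset) + synth


rng = random.Random(0)
source = [("a fine but slow film".split(), 1)] * 100
mixed = augment(source, rng)
print(len(mixed))  # 120, i.e. 20% larger than the source data
```

The original label of the sampled instance is discarded on purpose: in a synthetic example only the injected tokens carry information about the label.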

Verification Steps
The datasets we create are intentionally mixed and consist of the real and synthetic data to approximate real use cases where the model has to extract both simple and complex patterns to perform well. This is different from fully synthetic datasets (Yang et al., 2018; Arras et al., 2019) or glass-box DNN models (Hao, 2020) where it is guaranteed that the model uses certain input features but the findings may not be valid for real datasets. Two tests verify that the model indeed uses the shortcut tokens and that they must be most important to the model:
1. The model should achieve close to 100% accuracy on the fully synthetic test set. This would imply that it learned the shortcut and consistently applies it on unseen data (hence the "transparent corner" of the top black box in Fig. 3).
2. The model trained on the original data (the bottom black box in Fig. 3) should perform at chance level on the same fully synthetic test set.This would imply that it is indeed the shortcut data and the shortcut rules that are needed to achieve 100% accuracy.In other words, no other tokens but the shortcut are useful to predict the label in that data.
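The two tests amount to a simple check on two accuracy numbers. The sketch below is illustrative only: the function name and the thresholds `min_acc` and `tol` are assumptions chosen for the example, not values prescribed by the protocol.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def shortcut_is_ground_truth(acc_shortcut_model, acc_original_model,
                             min_acc=0.99, chance=0.5, tol=0.05):
    """Both verification tests on the fully synthetic test set.

    min_acc and tol are hypothetical thresholds for "close to 100%" and
    "at chance level"; chance=0.5 assumes a balanced binary test set.
    """
    learned_shortcut = acc_shortcut_model >= min_acc               # test 1
    nothing_else_helps = abs(acc_original_model - chance) <= tol   # test 2
    return learned_shortcut and nothing_else_helps


print(shortcut_is_ground_truth(0.997, 0.50))  # True
print(shortcut_is_ground_truth(0.997, 0.80))  # False: other tokens predict the label
```

Only when both conditions hold is it reasonable to treat the shortcut tokens as the ground truth for token importance.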

Experimental Setup
We use three text classification datasets and apply the three shortcuts presented above to each of them. Despite all the datasets being binary and of comparable size, there are a few differences which may affect a salience method's performance:
• SST2 (Socher et al., 2013) is a balanced sentiment classification dataset with short (20 tokens on average) inputs;
• IMDB (Maas et al., 2011) is also a balanced sentiment classification dataset with inputs about ten times longer than in SST2;
• Toxicity (Wulczyn et al., 2017) is a varied-length dataset containing toxicity annotations on Wikipedia comments where 9% of examples are positive (i.e., toxic). Aside from being imbalanced, it differs from the other two in that a text is toxic if it contains a single toxic phrase, while for a movie review it is the dominating sentiment which determines the label.
In the results section we use the following format to refer to a dataset-shortcut combination: SST2:tic, IMDB:op, Toxicity:st, etc.

Models
We apply the salience methods to explain the predictions of two popular models: a bi-LSTM model (Schuster et al., 1997) which uses GloVe embeddings (Pennington et al., 2014), and BERT (Devlin et al., 2019). Since we only consider binary tasks, the predicted probability of class c ∈ {0, 1} is given by the sigmoid function:

\[ p(c \mid x_{1:n}) = \sigma(f_c(x_{1:n})) = \frac{1}{1 + \exp(-f_c(x_{1:n}))} \tag{3} \]

where f_c(·) denotes the model output for class c and x_{1:n} is an input of n token embeddings. Both models embed input tokens with a trainable layer so that every x_i is a continuous d-dimensional embedding vector of the i-th input token. The models' accuracies on all the source datasets are presented in Table 1. To verify that the models rely on the introduced shortcuts (Sec. 2.4), we computed the minimum and mean accuracy on all the nine fully synthetic test sets: these are 99.8 and 99.95 for LSTM and 99.7 and 99.91 for BERT (100% in most cases). The models trained on the original data (Table 1) all got 50% accuracy on the same synthetic test sets. The close to 100% performance on the synthetic data did not come at the cost of poor performance on the source test data: Table 1 reports the mean drop in accuracy averaged over the three shortcut models for each dataset and architecture combination.

Salience Methods
We consider four classes of input salience methods and the Random baseline (RAND) to obtain per-token importance weights: Gradient (GRAD*), Gradient times Input (GxI*), Integrated Gradients (IG*) and LIME.

Gradient
Li et al. (2016) use gradients as salience weights and compute a score per embedding dimension:

\[ \nabla_{x_i} f_c(x_{1:n}) \in \mathbb{R}^d \tag{4} \]

To arrive at the per-token score s(x_i), the gradient vector is reduced to a scalar using the L1 norm, the L2 norm or the mean over its d components. Note that instead of f_c one can compute the gradient of the final layer, that is, in our case the sigmoid function. An argument for starting from the probabilities is that, unlike logits, probabilities contain the information on the relative importance for a particular class. To our knowledge, the effect of using probabilities or logits has not been measured yet. In sum, we have six variants of the GRAD method: GRAD {p|l}×{l1|l2|mean}.
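The three reductions behave quite differently, which a small example makes visible. The sketch below assumes that the per-token gradient vectors have already been computed by some autodiff framework; the numbers and the names `reduce_gradient` and `rank_tokens` are made up for illustration. It also illustrates a consequence of the chain rule for a single sigmoid output: the gradient of the probability equals the gradient of the logit times σ(f)(1 − σ(f)), a positive factor shared by all tokens of an example, so the token ranking within that example cannot change.

```python
import math


def reduce_gradient(grad, how):
    """Collapse one token's d-dimensional gradient into a scalar score."""
    if how == "l1":
        return sum(abs(g) for g in grad)
    if how == "l2":
        return math.sqrt(sum(g * g for g in grad))
    if how == "mean":
        return sum(grad) / len(grad)  # signed: components can cancel out
    raise ValueError(how)


def rank_tokens(grads, how):
    """Token positions sorted by decreasing score."""
    scores = [reduce_gradient(g, how) for g in grads]
    return sorted(range(len(scores)), key=lambda i: -scores[i])


# Hypothetical logit gradients for three tokens with d = 2.
grads_logit = [[3.0, -4.0], [1.0, 1.0], [-0.5, 0.2]]
print(rank_tokens(grads_logit, "l2"))    # [0, 1, 2]
print(rank_tokens(grads_logit, "mean"))  # [1, 2, 0]

# Probability gradients: the same vectors scaled by sigma'(f) for logit f = 2.0.
p = 1.0 / (1.0 + math.exp(-2.0))
grads_prob = [[p * (1 - p) * g for g in gv] for gv in grads_logit]
print(rank_tokens(grads_prob, "l2"))     # [0, 1, 2], unchanged
```

In this toy case the token with by far the largest gradient is ranked last by the mean reduction because its components have opposite signs, which may be one reason a mean reduction can produce weak rankings.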

Gradient times Input
Alternatively, one can compute salience weights by taking the dot product of Eq. 4 with the input word embedding x_i (Denil et al., 2015) and obtain a salience weight for token i:

\[ s(x_i) = \nabla_{x_i} f_c(x_{1:n}) \cdot x_i \]

Here, too, we can compare the probability and the logit versions: GxI {p|l}.
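As a minimal sketch, Gradient times Input is a per-token dot product between the gradient and the embedding. The gradients and embeddings below are invented numbers, and `grad_times_input` is a hypothetical helper name; in practice both inputs would come from the model and an autodiff framework.

```python
def grad_times_input(grads, embeddings):
    """Per-token dot product of gradient and input embedding (signed scores)."""
    return [sum(g * x for g, x in zip(grad, emb))
            for grad, emb in zip(grads, embeddings)]


grads = [[1.0, -2.0], [0.5, 0.5]]       # hypothetical gradients, d = 2
embeddings = [[2.0, 1.0], [-1.0, 3.0]]  # hypothetical token embeddings
print(grad_times_input(grads, embeddings))  # [0.0, 1.0]
```

Unlike the L1 and L2 reductions of the plain gradient, these scores are signed, so a token can receive a negative or a zero weight even when its gradient is large.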

Integrated Gradients
Integrated gradients (IG) (Sundararajan et al., 2017a) is a gradient-based method which addresses the problem of saturation: gradients may get close to zero for a well-fitted function. IG requires a baseline b_{1:n} as a way of contrasting the given input with information being absent. A zero vector (Mudrakarta et al., 2018), the average embedding or UNK or [MASK] vectors can serve as baseline vectors in NLP. For input i, we compute:

\[ s(x_i) = (x_i - b_i) \cdot \frac{1}{m} \sum_{k=1}^{m} \nabla_{x_i} f_c\!\left( b_{1:n} + \frac{k}{m} (x_{1:n} - b_{1:n}) \right) \]

That is, we average over m gradients, with the inputs to f_c being linearly interpolated between the baseline and the original input x_{1:n} in m steps. We then take the dot product of that averaged gradient with the input embedding x_i minus the baseline.
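The averaging over interpolated inputs can be sketched on a toy model whose gradient is known in closed form. Everything specific here is an assumption made for illustration: the model f(x) = tanh(Σ_i w·x_i), the weights `W`, and the function names. A real setup would obtain gradients of BERT or the LSTM from an autodiff framework.

```python
import math

W = [0.5, -1.0]  # hypothetical weights of a toy saturating model


def f(x):
    """Toy model output: tanh of the summed projections of all token embeddings."""
    return math.tanh(sum(w * v for tok in x for w, v in zip(W, tok)))


def grad_f(x):
    """Analytic gradient of f with respect to every token embedding."""
    scale = 1.0 - f(x) ** 2  # derivative of tanh; near zero when saturated
    return [[scale * w for w in W] for _ in x]


def integrated_gradients(x, baseline, m):
    """Average m gradients along the straight path, then dot with (x - baseline)."""
    n, d = len(x), len(x[0])
    avg = [[0.0] * d for _ in range(n)]
    for k in range(1, m + 1):
        alpha = k / m
        point = [[b + alpha * (v - b) for v, b in zip(xt, bt)]
                 for xt, bt in zip(x, baseline)]
        for i, g in enumerate(grad_f(point)):
            for j in range(d):
                avg[i][j] += g[j] / m
    return [sum(a * (v - b) for a, v, b in zip(avg[i], x[i], baseline[i]))
            for i in range(n)]


x = [[1.0, 0.0], [2.0, -1.0], [0.0, 0.5]]
zero = [[0.0, 0.0] for _ in x]
scores = integrated_gradients(x, zero, m=1000)
# Completeness check: the scores approximately sum to f(x) - f(baseline).
print(round(sum(scores), 2), round(f(x) - f(zero), 2))
```

The check at the end uses the known property of IG that attributions sum to the difference between the output at the input and at the baseline, which the Riemann sum only approximates for finite m.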
In addition to the variable number of steps, small (100) or large (1000), and the baseline (zero vector, model-specific UNK / [MASK] or PAD / [PAD]), here too we can start either from probabilities (i.e., σ) or logits (i.e., f) and arrive at eight different IG configurations: IG {p|l}×{zero|mask}×{100|1000}.

LIME
Ribeiro et al. (2016) train a linear model to estimate salience of input tokens on a number of perturbations, which are all generated from the given example x_{1:n}. A perturbation is an instance where a random subset of tokens in x is masked out using either UNK (LSTM, BERT) or [MASK] (BERT) tokens, or dropped completely: ERASE (LSTM, BERT). The text model's prediction on these perturbations is the target for the linear model, the masks are the inputs. Following Ribeiro et al. (2016) we use an exponential kernel with cosine distance and kernel width of 25 as proximity measure of instance and perturbations. We keep beginning and end-of-sequence tokens unperturbed and experiment with the number of perturbations (100, 1000, 3000). This results in 6 and 9 configurations for LSTM and BERT, respectively: LIME {unk|mask|erase}×{100|1000|3000}.
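The perturbation and proximity-weighting steps of LIME can be sketched as below; fitting the weighted linear model on the resulting masks is omitted. Several details are assumptions for illustration rather than facts from the text: the function names, the way the number of masked tokens is sampled, the exact kernel form exp(−d²/width²), and the scaling of the cosine distance by 100, which we believe mirrors the reference LIME text explainer but should be checked against the library.

```python
import math
import random


def perturb(tokens, rng, mode, mask_token="[MASK]"):
    """Mask or erase a random subset of inner tokens; first and last stay intact."""
    keep = [1] * len(tokens)
    inner = list(range(1, len(tokens) - 1))
    for i in rng.sample(inner, rng.randint(1, len(inner))):
        keep[i] = 0
    if mode == "erase":
        text = [t for t, k in zip(tokens, keep) if k]
    else:  # "unk" or "mask": replace by the given placeholder token
        text = [t if k else mask_token for t, k in zip(tokens, keep)]
    return keep, text


def proximity(keep, width=25.0, scale=100.0):
    """Exponential kernel over the cosine distance to the unperturbed mask."""
    cosine = sum(keep) / (math.sqrt(len(keep)) * math.sqrt(sum(keep)))
    distance = (1.0 - cosine) * scale  # scale is an assumption, see lead-in
    return math.exp(-(distance ** 2) / (width ** 2))


rng = random.Random(0)
tokens = ["[CLS]", "a", "fine", "film", "[SEP]"]
keep, text = perturb(tokens, rng, mode="mask")
print(keep, text, round(proximity(keep), 3))
```

The binary `keep` vectors are the inputs of the linear surrogate model, the text model's predictions on `text` are its targets, and `proximity` supplies the sample weights, so that heavily perturbed instances count less.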

Results
In this section we highlight our main findings. For increased readability, where Rank scores support Precision we omit them in the main paper and instead present them in Appendix A.4. All the results reported in the paper are computed from a single model checkpoint and a single run.

A method's performance varies across model and shortcut types and other dataset properties.
It is apparent that GxI performs quite well for LSTM models but does not work at all for BERT models (Tab. 2). Conversely, GRAD l2 performs very well for BERT but not at all so for LSTM models (Tab. 3). Overall, method performance mostly goes down on longer inputs. More interestingly, a strong performance on a simpler shortcut may not persist on a slightly more complex one: GxI has precision of 1.0 on any dataset with the single-token shortcut for LSTM but drops to .76 or even .35 on the same base dataset with a two-token shortcut (e.g., SST2:tic or IMDB:tic, in Tab. 2). Thus, even for a fixed model, an evaluation done on the single-token shortcut alone does not justify the assumption that a method works well and would be useful for finding lexical shortcuts learned by the model in general.

Table 3: GRAD Precision across different models and datasets and GRAD Rank on the Toxicity dataset across models. Here and in the following tables R stands for Rank. GRAD l2 performs very well for BERT (in bold) but not at all so for the LSTM model. The results of GRAD mean are very poor, ranging between .34 and .46 in precision. Rank and precision give complementary information: the precision of GRAD l2 and GRAD mean is close on Toxicity:op (.56 and .60) while the rank of the latter is almost twice as big (13 and 21).
[Surviving row fragment of Table 3, row label lost: .43 .45 .42 .39 .34 .41 .44 .46 .44 | 43 74 72]
GRAD {l|p}×l* is a good choice for BERT but not LSTM models for finding shortcuts.
For BERT models, GRAD l2 achieves strong precision and rank scores across the different datasets and shortcut types, yielding a precision of 0.99 or higher on seven out of nine datasets (Tab. 3 and 7). The lowest but still comparatively high precision (0.87) is on IMDB:tic where the inputs are particularly long. For LSTM models, on six out of nine datasets the precision of the same method is around .5 (Tab. 3). It does not matter whether probabilities or logits are used and whether L1 or L2 norm is applied. We hypothesize that one reason for the difference in performance between BERT and LSTM is that BERT models have residual connections, making the gradient information less noisy. However, the results of GRAD mean are very poor, ranging between .3 and .4 in precision (Tab. 3). Note that GRAD l2 is sometimes deemed unsuitable because it is unsigned and only returns positive scores (Pezeshkpour et al., 2021), but our experiments demonstrate that it is the most useful method for finding lexical shortcuts learned by BERT.
Using probabilities instead of logits only changes the results for IG.
For other gradient-based methods it does not seem to make a large difference (Tab. 4 and 8).
IG performance does not improve much with more steps.
Increasing the number of interpolation steps from 100 to 1000 does not result in a significant improvement for LSTM models. For BERT, too, the precision numbers improve only for the tic shortcuts and only when probabilities are used (last two rows in Tab. 4 and 8). The similarity of the scores between GxI and IG when using the zero baseline (Tab. 4 and 8) indicates that there is no difference between taking a single or 100(0) steps from the zero baseline.

Table 4: IG Precision across different models and datasets. Using probabilities instead of logits changes the results for IG. Number of steps doesn't affect the IG performance, but the choice of the baseline is important for IG when using BERT. Using the [MASK] baseline (with logits) resulted in an improvement in the scores (bold). Finally, the corresponding rows tell us that the difference between GxI and IG p-zero-{100|1000} is minimal and there is no difference between taking a single or 100(0) steps from the zero baseline.
Method | SST2:st | SST2:tic | SST2:op | IMDB:st | IMDB:tic | IMDB:op | Toxicity:st | Toxicity:tic | Toxicity:op
(row label lost) | .29 | .58 | .31 | .59 | .35 | .50 | .41 | .43 | .47
IG l-mask-{100|1000} | .71 | .58 | .71 | .99 | .62 | .61 | .69 | .50 | .47
IG l-pad-{100|1000} | .79 | .27 | .14 | .28 | .47 | .27 | .36 | .46 | .18
IG p-zero-{100|1000} | .29 | .58 | .31 | .59 | .35 | .50 | .41 | .43 | .47
IG p-mask-100 | .48 | .37 | .56 | .80 | .34 | .48 | .27 | .27 | .29
IG p-mask-1000 | .48 | .48 | .56 | .80 | .47 | .48 | .28 | .29 | .29
IG p-pad-{100|1000} | .81 | .18 | .1 | .21 | .31 | .14 | .16 | .37 | .12

Table 5: LIME Precision across different models and datasets. LIME benefits from 1000 over 100 perturbations, especially for longer inputs and/or shortcuts. We found that the increase from 1000 to 3000 perturbations leads to little precision improvements for the input lengths in our datasets. Using UNK for masking leads to better results than [MASK] in several configurations.
[Surviving row fragment of Table 5, row label lost: .98 .62 .78 .93 .76 .67 .99 .58 .70]
Choice of baseline is important for IG when using BERT.
For the most part, using the [MASK] baseline (with logits) resulted in an improvement in the scores (bold rows in Tab. 4 and 8). Still, even with the best performing configuration of IG the results are much worse than GRAD l-l2.
Number of perturbations as well as masking token matter for LIME.
LIME benefits from 1000 over 100 perturbations, especially for longer inputs and/or shortcuts. We found that the increase from 1000 to 3000 perturbations leads to little precision improvements for the input lengths in our datasets. Using UNK for masking leads to better results than [MASK] in almost all configurations (Tab. 5 and 9). We hypothesize this is due to two reasons: (i) The [MASK] token is not used during fine-tuning on the task data. (ii) The UNK token, however, is fine-tuned (due to unknown tokens and as special token in word dropout). Erasing tokens leads, on average, to worse precision results than masking, for any number of perturbations. Tables 10 and 11 in Appendix A.5 present the results for all the models, shortcut types and source datasets in terms of precision and rank, where this phenomenon can be observed.
Rank and precision give complementary information.
For example, the precision of GRAD l2 and GRAD mean is close on Toxicity:op (.56 and .60) while the rank of the latter is almost twice as big (13 and 21) (Tab. 3). A worse (larger) rank at comparable precision means that the method consistently puts one of the shortcut tokens at the top but buries the other token deep in the ranking.

Related Work
Research on input salience methods for text classification is prolific and diverse in terms of the definitions used (Camburu et al., 2020), applications (Feng and Boyd-Graber, 2019), desiderata (Sundararajan et al., 2017b), etc. The importance of getting faithful salience explanations has been recognized early on (Bach et al., 2015; Kindermans et al., 2017) and there exist formal definitions of explanation fidelity (Yeh et al., 2019). However, these have not been connected to model debugging where it is the top of a salience ranking that matters most. In the vision domain, our work is closest to Adebayo et al. (2020, 2022), who also explore the debugging scenario with salience maps, Yang and Kim (2019), who use synthetic data to obtain the ground truth for pixel importance, and Hooker et al. (2019), who contrast the performance of the same model trained on original and modified data when evaluating feature importance.
As pointed out in the Introduction, in NLP faithfulness evaluation has often been grounded in strong assumptions (Poerner et al., 2018; DeYoung et al., 2020; Atanasova et al., 2020; Ding and Koehn, 2021) or has relied on analyzing models substantially different from the ones normally used (Arras et al., 2019; Hao, 2020). An exception to this trend is the work by Sippy et al. (2020) who also modify source data but, unlike us, consider MLP as the only DNN model, do not evaluate any gradient-based methods and analyze single-token shortcuts only, without strong guarantees of them actually being the most important for the model. Zhou et al. (2021) also analyze DNN models on intentionally corrupted data: they primarily focus on vision but also run an experiment analyzing how faithfully the attention mechanism points at the words known to correlate with the label. Finally, Madsen et al. (2021), following Hooker et al. (2018), iteratively remove tokens to evaluate faithfulness of salience methods for LSTM models and conclude, similar to us, that performance is task-dependent.
Concurrently with our work, Idahl et al. (2021) argue for faithfulness evaluation on synthetic data for model debugging but do not report experimental results. Similarly to them and also concurrently with our work, Pezeshkpour et al. (2021) go further and combine data and input attribution methods to discover data artifacts. However, citing prior work, they use GRAD l-mean and IG l-mean which, as we have shown, are sub-optimal configurations for BERT models. This explains the very poor accuracy of 12-13% (in our terms: precision@1) that they observed when discovering single-token shortcuts in SST2. Finally, as our experiments demonstrate, the single-token shortcut is insufficient to assess whether a method would be useful for more complex shortcuts.

Conclusions
We have argued for evaluating input salience methods with respect to how helpful they would be for discovering shortcuts that are learned by the model. This seems to be a clear use case from the model developer perspective. To achieve this, we proposed a protocol for method evaluation and applied it to three variants of lexical shortcuts (single token, token in context, and ordered pair) which are a proxy for shortcut heuristics that occur in common NLP tasks and which are particularly suitable for being discovered with input salience methods. By comparing the performance across different datasets, shortcut types and models (LSTM-based and BERT-based), we demonstrated that a strong performance for one setup may not hold for a different model or a more complex shortcut. Finally, we pointed out that some method configurations assumed to be reliable in recent work, for example integrated gradients, may give very poor results for NLP models, and that the details of how the methods are used can matter a lot, such as how a gradient vector is reduced into a scalar. Our results demonstrate that whenever one uses BERT and is interested in whether a simple token combination could determine the label, one should prefer Grad-L2 over more complex methods.

Limitations
In this paper we proposed a protocol that can be used for evaluating input salience methods. We limited ourselves to the most popular salience methods, and left others out of scope. In particular, it would be of interest to evaluate the most recent salience methods, like Chen et al. (2020); Sikdar et al. (2021), which were developed to take feature interactions into account. We also limited this work to the task of English (binary) text classification. Furthermore, we focus on a representative set of shortcuts, but different shortcuts might result in different outcomes. Finally, we limited ourselves to LSTM and BERT based models. Results with different neural components or with models of a different size and/or depth may be different. However, the protocol that we proposed can still be used in those cases. We also note that input salience is only one kind of explanation, and a limited one: it does not reveal the logic of the model, nor does it reveal interactions between input features. It is hardly possible to fully understand why a deep nonlinear neural model produced a certain prediction by only looking at input salience scores.

Table 8: IG Rank across different models and datasets. Similarly to the precision results from Tab. 4 we see that using probabilities instead of logits changes the results for IG. We also observe that number of steps doesn't affect the IG performance, but the choice of baseline is important for IG when using BERT. Finally, the corresponding rows tell us the difference between GxI and IG p-zero-{100|1000} is minimal and there is no difference between taking a single or 100(0) steps from the zero baseline.

Table 9: LIME Rank across different models and datasets. Similarly to the precision results from Tab. 5 we see that LIME benefits from 1000 over 100 perturbations, especially for longer inputs and/or shortcuts. We found that the increase from 1000 to 3000 perturbations leads to little precision improvements for the input lengths in our datasets. Using UNK for masking leads to better results than [MASK] in almost all configurations.

Figure 3: Illustration of how the ordered-pair shortcut is introduced into a balanced binary sentiment dataset and how it is verified that the shortcut is learned by the model. The model trained on the mixed data (A) is still largely a black box, but since its performance on the synthetic test set is 100% (contrasted with chance accuracy of model B which is similar but is trained on the original data only), we know it uses the injected shortcut (highlighted text).

Table 1: Accuracy on the three source (unmodified) test sets of the models trained on the source training data. In brackets we report the mean drop in accuracy (on the same source test sets) when evaluating the models trained on a shortcut version of the training data.

Table 2: GxI Precision results across different models and datasets. Here and in the following tables P stands for Precision. Colors and boldface mark the results that are mentioned in the Results section. st: single token, tic: token in context, op: ordered pair. Here we see that performance of GxI varies across LSTM and BERT, i.e. the LSTM has consistently higher scores on all metrics (in bold). Perfect precision on the single-token shortcut doesn't generalize to strong performance on two-token shortcuts (e.g., SST2:tic or IMDB:tic).
Model | Method | SST2:st | SST2:tic | SST2:op | IMDB:st | IMDB:tic | IMDB:op | Toxicity:st | Toxicity:tic | Toxicity:op
LSTM | GxI {p|l} | 1. | .76 | .92 | 1. | .35 | .81 | 1. | .68 | .88
BERT | GxI {p|l} | .29 | .58 | .31 | .59 | .35 | .50 | .41 | .43 | .47

Table 6: GxI Rank results across different models and datasets. Here we see that performance of GxI varies across LSTM and BERT, i.e. LSTM has consistently higher scores on all metrics (in bold).

Table 7: GRAD Rank across different models and datasets. Rank and precision give complementary information.