On the Lack of Robust Interpretability of Neural Text Classifiers

With the ever-increasing complexity of neural language models, practitioners have turned to methods for understanding the predictions of these models. One of the most well-adopted approaches for model interpretability is feature-based interpretability, i.e., ranking the features in terms of their impact on model predictions. Several prior studies have focused on assessing the fidelity of feature-based interpretability methods, i.e., measuring the impact of dropping the top-ranked features on the model output. However, relatively little work has been conducted on quantifying the robustness of interpretations. In this work, we assess the robustness of interpretations of neural text classifiers, specifically, those based on pretrained Transformer encoders, using two randomization tests. The first compares the interpretations of two models that are identical except for their initializations. The second measures whether the interpretations differ between a model with trained parameters and a model with random parameters. Both tests show surprising deviations from expected behavior, raising questions about the extent of insights that practitioners may draw from interpretations.


Introduction
In recent years, large-scale language models like BERT and RoBERTa have helped achieve new state-of-the-art performance on a variety of NLP tasks (Devlin et al., 2019; Liu et al., 2019). While relying on vast amounts of training data and model capacity has helped increase their accuracy, the reasoning of these models is often hard to comprehend. To this end, several techniques have been proposed to interpret model predictions.
Perhaps the most widely-adopted class of interpretability approaches is feature-based interpretability, where the goal is to assign an importance score to each of the input features. These scores are also called feature attributions. Several methods in this class (e.g., SHAP (Lundberg and Lee, 2017), Integrated Gradients (Sundararajan et al., 2017)) possess desirable theoretical properties, making them attractive candidates for interpretability.
Benchmarking analyses often show that these methods possess high fidelity, i.e., removing the features marked important by the interpretability method from the input indeed leads to a significant change in the model output, as expected (Atanasova et al., 2020; Lundberg and Lee, 2017).
However, relatively few investigations have been carried out to understand the robustness of feature attributions. To explore this robustness, we conduct two tests based on randomization.

Different Initializations Test: This test operationalizes the implementation invariance property of Sundararajan et al. (2017). Given an input, it compares the feature attributions of two models that are identical in every aspect (trained with the same architecture, the same data, and the same learning schedule) except for their randomly chosen initial parameters. If the predictions generated by these two models are also identical, one would expect the feature attributions of such functionally equivalent models to be the same. If the attributions are not the same, two users examining the same input may deem the same features to have different importance depending on the model they consult.

Untrained Model Test: This test is similar to the test of Adebayo et al. (2018). Given an input, it compares the feature attributions generated on a fully trained model with those generated on a randomly initialized, untrained model. The test evaluates whether, as one would expect, feature attributions on a fully trained model differ from those computed on an untrained model.
We conduct the two tests on a variety of text classification datasets. We quantify feature attribution similarity using interpretation infidelity (Arras et al., 2016) and Jaccard similarity (Tanimoto, 1958). The results suggest that: (i) Interpretability methods fail the different initializations test. In other words, two functionally equivalent models lead to different rankings of feature attributions; (ii) Interpretability methods fail the untrained model test, i.e., even on an untrained model, the fidelity of the interpretability methods is better than that of random feature attributions.
These findings may have important implications for how prediction interpretations are shown to users of a model, and raise interesting questions about reliance on these interpretations. For instance, if two functionally equivalent models generate different interpretations, to what extent can a user act upon them, e.g., when deciding whether or not to invest in a financial product? We discuss these implications and potential reasons for this behavior in §4.
There are several important aspects of interpretation robustness. Some prior studies have considered interpretability in the context of adversarial robustness, where the goal often is to actively fool the model into generating misleading feature attributions; see, for instance, Anders et al. (2020) and Slack et al. (2020). In this work, rather than focusing on targeted changes in the input or the model, we explore the robustness of feature attribution methods to various kinds of randomization.
Several prior works have focused on quantifying the quality of interpretations; see, for instance, Adebayo et al. (2018) and Yang and Kim (2019). Closest to ours is the work of Adebayo et al. (2018), which is based on checking the saliency maps of randomly initialized image classification models. However, in contrast to Adebayo et al., we consider text classification. Moreover, while the analysis of Adebayo et al. is largely based on visual inspection, we extend it by considering automatically quantifiable measures. We also extend the analysis to non-gradient-based methods (SHAP).

Setup
We describe the datasets, models, and interpretability methods considered in our analysis.

Datasets. We consider four different datasets covering a range of document lengths and number of label classes. The datasets are: (i) FPB: The Financial Phrase Bank dataset (Malo et al., 2014) where the task is to classify news headlines into one of three sentiment classes, namely, positive, negative, and neutral. (ii) SST2: The Stanford Sentiment Treebank 2 dataset (Socher et al., 2013). The task is to classify single sentences extracted from movie reviews into positive or negative sentiment classes. (iii) IMDB: The IMDB movie reviews dataset (Maas et al., 2011). The task is to classify movie reviews into positive or negative sentiment classes. (iv) Bios: The Bios dataset of De-Arteaga et al. (2019). The task is to classify the profession of a person from their biography. Table 5 in Appendix A shows detailed dataset statistics.

Models
We consider four pretrained Transformer encoders: BERT (BT), RoBERTa (RB), DistilBERT (dBT), and DistilRoBERTa (dRB). The encoder is followed by a pooling layer to combine individual token embeddings, and a classification head. Appendix B.1 describes the detailed architecture, training, and hyperparameter tuning details. After training and hyperparameter tuning, the best model is selected based on validation accuracy and is referred to as Init#1.

Different Initializations Test. Recall from §1 that this test involves comparing two identical models trained from different initializations. The second model, henceforth referred to as Init#2, is trained using the same architecture, hyperparameters, and training strategy as Init#1, but starting from a different set of initial parameters. Since we start from pretrained encoders, the encoder parameters are not randomly initialized. For each layer in the rest of the model, a set of initial parameters different from those in Init#1 is obtained by calling the parameter initialization method of choice for this layer, He initialization (He et al., 2015) in this case, but with a different random seed.

Untrained Model Test. Recall that this test involves comparing the trained model (Init#1) with a randomly initialized, untrained model, henceforth called Untrained. To obtain Untrained, we start from Init#1 and randomly initialize the fully connected layers attached on top of the Transformer encoder (the encoder weights are not randomized). The initialization strategy is the same as in the Different Initializations Test.
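To make the two randomizations concrete, the following is a minimal PyTorch sketch of how the head layers could be re-initialized under a different seed. The function name `reinit_classification_head` and the assumption that the pooling and fully connected layers are collected under a `classifier` attribute are illustrative, not the authors' actual code.

```python
import torch
import torch.nn as nn

def reinit_classification_head(model, seed):
    """Re-initialize only the layers added on top of the pretrained encoder.

    The encoder keeps its pretrained weights; every Linear layer in the
    (assumed) `classifier` head is re-drawn with He (Kaiming) initialization
    under the given random seed, mirroring how Init#2 and Untrained are built.
    """
    torch.manual_seed(seed)
    for module in model.classifier.modules():
        if isinstance(module, nn.Linear):
            nn.init.kaiming_normal_(module.weight)
            nn.init.zeros_(module.bias)
    return model

# Init#2:    same architecture/data/schedule, different head initialization, then trained.
# Untrained: start from Init#1 and re-initialize the head without any further training.
```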

Interpretability methods
We consider a mix of gradient-based and model-agnostic methods: Vanilla Saliency (VN), SmoothGrad (SG), Integrated Gradients (IG) (Sundararajan et al., 2017), and KernelSHAP (SHP) (Lundberg and Lee, 2017). We also include random feature attribution (RND), which corresponds to each feature being assigned an attribution drawn from the uniform distribution $U(0, 1)$. Appendix B.2 provides details about the parameters chosen for the interpretability methods.
Given an input text document, we tokenize the text using the tokenizer of the corresponding encoder. For each input feature (that is, token), the gradient-based methods produce an attribution vector of the same length as the token input embedding. To scalarize these vector scores, we use the L2-norm strategy of Arras et al. (2016) and the Input Gradient (dot product) strategy of Ding et al. (2019).
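As an illustration, here is a small sketch of the two reduction strategies, assuming the per-token gradient attributions and the input token embeddings are available as tensors of shape [num_tokens, embed_dim]; the function name and argument layout are ours, not the paper's implementation.

```python
import torch

def reduce_attributions(grad_attr: torch.Tensor,
                        token_embeds: torch.Tensor,
                        strategy: str) -> torch.Tensor:
    """Collapse per-dimension gradient attributions into one score per token.

    grad_attr:    [num_tokens, embed_dim] gradient-based attribution vectors
    token_embeds: [num_tokens, embed_dim] input token embeddings
    """
    if strategy == "l2":     # L2-norm reduction (Arras et al., 2016)
        return grad_attr.norm(p=2, dim=-1)
    if strategy == "dot":    # gradient x input dot product (Ding et al., 2019)
        return (grad_attr * token_embeds).sum(dim=-1)
    raise ValueError(f"unknown strategy: {strategy}")
```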

Interpretability Metrics
To compare the feature attributions by various interpretability methods, we use the following metrics.
(In)Fidelity: Given an input text that has been split into $L$ tokens, $t = [t_1, \ldots, t_L]$, obtain the vector $\Psi(t) = [\psi(t_1), \ldots, \psi(t_L)]$ of feature attributions of the corresponding tokens using the interpretability method to be evaluated. Drop the features from $t$ in decreasing order of attribution score until the model prediction changes from the original prediction (with all tokens present). Infidelity is defined as the percentage of features that need to be dropped before the prediction changes. A better interpretability method is expected to require a lower fraction of dropped features before the prediction changes. We simulate feature dropping by replacing the corresponding input token with the model's unknown vocabulary token.
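A minimal sketch of this infidelity computation is given below. It assumes a `model` that maps a batch of token ids to class logits and an `unk_id` for the unknown vocabulary token; the function signature is illustrative.

```python
import torch

@torch.no_grad()
def infidelity(model, token_ids, attributions, unk_id):
    """Percentage of tokens (dropped in decreasing attribution order) needed
    to change the model prediction.

    token_ids:    1-D tensor of input token ids
    attributions: 1-D tensor with one importance score per token
    unk_id:       id of the unknown-vocabulary token used to simulate removal
    """
    original_pred = model(token_ids.unsqueeze(0)).argmax(dim=-1)
    perturbed = token_ids.clone()
    order = torch.argsort(attributions, descending=True)
    for n_dropped, idx in enumerate(order, start=1):
        perturbed[idx] = unk_id                      # "drop" the next most important token
        pred = model(perturbed.unsqueeze(0)).argmax(dim=-1)
        if pred != original_pred:                    # prediction changed
            return 100.0 * n_dropped / len(token_ids)
    return 100.0                                     # prediction never changed
```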
The infidelity metric has appeared in many closely related forms in a number of studies evaluating model interpretability (Arras et al., 2016; Atanasova et al., 2020; DeYoung et al., 2020; Fong et al., 2019; Lundberg and Lee, 2017; Samek et al., 2017). All of these forms operate by iteratively hiding features in the order of their importance and measuring the change in the model output, e.g., in the predicted class probability or the predicted label itself. We chose the number of tokens to prediction change, which is closely aligned with Arras et al. (2016), due to its simplicity compared to more involved metrics relying on AUC-style measures (Atanasova et al., 2020; Samek et al., 2017).
Jaccard Similarity: It is common to show the top few most important features to users as model interpretations; see, for instance, Ribeiro et al. (2016) and Schmidt and Biessmann (2019). In order to measure the similarity between feature attributions generated by different methods, we use the Jaccard@K% metric. Given an input $t$, let $s_i$ be the set of top-K% tokens when the tokens are ranked by their importance as specified by an attribution output $\Psi_i$. Then, given two attribution outputs $\Psi_i$ and $\Psi_j$, Jaccard@K% measures the similarity between them as $J(i, j) = \frac{|s_i \cap s_j|}{|s_i \cup s_j|}$. If the top-K% tokens under the two attributions $\Psi_i$ and $\Psi_j$ are the same, then $J(i, j) = 1$; if there is no overlap in the top-K% tokens, $J(i, j) = 0$.
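The metric can be computed with a few lines of Python; the helper below is an illustrative sketch, with attribution scores passed as plain sequences of per-token importances.

```python
def jaccard_at_k(attr_i, attr_j, k_percent):
    """Jaccard overlap between the top-K% tokens under two attribution outputs."""
    num_tokens = len(attr_i)
    k = max(1, int(round(num_tokens * k_percent / 100)))
    # Indices of the k highest-scoring tokens under each attribution output.
    top = lambda attr: set(sorted(range(num_tokens), key=lambda t: attr[t], reverse=True)[:k])
    s_i, s_j = top(attr_i), top(attr_j)
    return len(s_i & s_j) / len(s_i | s_j)
```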

Different Initializations Test.
Comparing Init#1 and Init#2 in Table 6 in Appendix C.1, two otherwise identical models whose only difference is their random initial parameters, shows that the vast majority of predictions are common between the two models, meaning that the two models are almost functionally equivalent.
However, the attribution overlap between these two functionally equivalent models can be as low as the overlap between a trained and an untrained model.

Untrained Model Test.
Figure 2 shows the infidelity of different methods on the SST2 dataset with a BT model. We note that the performance of RND is better (lower infidelity) for the untrained model (Untrained) than for the trained model (Init#1). Furthermore, even for the untrained model (Untrained), all interpretability methods have better fidelity than RND. In fact, for SHP, the infidelity is almost half that of RND. Table 9 in Appendix C shows a similar pattern for the rest of the datasets and models.
In short, even for an untrained model, the interpretability methods lead to better-than-random-attribution fidelity. These insights highlight the need for baselining the fidelity metric with untrained models before using it as an evaluation measure.

Conclusion & Future Work
We carried out two tests to assess the robustness of several popular interpretability methods on Transformer-based text classifiers. The results show that both gradient-based and model-agnostic methods can fail the tests.
These observations raise several interesting questions: if the fidelity of the interpretations is reasonably high even on an untrained model, to what extent does the interpretability method reflect the data-specific vs. data-independent behavior of the model? If two functionally equivalent models lead to different feature attributions, to what extent can practitioners rely upon these interpretations to make consequential decisions? One cause of the non-robust behavior could be the redundancy in text, where several input tokens may provide evidence for the same class (e.g., several words in an input review praising the movie). Another reason, related to the first, could be the pathologies of neural models, where dropping most of the input features can still lead to highly confident predictions (Feng et al., 2018). Dropping individual features can also lead to out-of-distribution samples, further limiting the effectiveness of methods and metrics that rely on simulating feature removal (Kumar et al., 2020; Sundararajan and Najmi, 2020). Systematically analyzing the root causes, and designing interpretability measures that are cognizant of the specific characteristics of text data, preferably with human involvement (Chang et al., 2009; Doshi-Velez and Kim, 2017; Hase and Bansal, 2020; Nguyen, 2018; Poursabzi-Sangdeh et al., 2021; Schmidt and Biessmann, 2019), is a promising research direction. Similarly, extending the Untrained Model Test to study the effect of randomization of pre-trained embedding models on interpretability is another direction for exploration.

B.1 Model architecture and training

We insert a classification head on top of the pretrained encoder. The end-to-end classifier has the following architecture: Encoder → Avg. pooling → FC-layer (512 units) → ReLU → FC-layer (K units), where K is the number of classes. The maximum sequence length of the encoder is set to 128 for the FPB and SST2 datasets, 512 for IMDB reviews, and 200 for the Bios data. Each dataset is split into an 80%-20% train-test split. 10% of the training set is used as a validation set for hyperparameter optimization. Accuracy and overlap statistics are reported on the test set.
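A minimal PyTorch sketch of this classifier is shown below, assuming a HuggingFace-style encoder whose output exposes `last_hidden_state`; the class name, layer names, and the masking-based average pooling are our illustrative choices.

```python
import torch
import torch.nn as nn

class TransformerTextClassifier(nn.Module):
    """Pretrained encoder -> average pooling -> FC(512) -> ReLU -> FC(num_classes)."""

    def __init__(self, encoder, hidden_size, num_classes):
        super().__init__()
        self.encoder = encoder                      # e.g., a HuggingFace BERT/RoBERTa encoder
        self.classifier = nn.Sequential(
            nn.Linear(hidden_size, 512),
            nn.ReLU(),
            nn.Linear(512, num_classes),
        )

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        # Average pooling over non-padding token embeddings.
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
        return self.classifier(pooled)
```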
We used the following hyperparameter ranges: learning rate $\in \{10^{-2}, 10^{-3}, 10^{-4}, 10^{-5}\}$ and the number of last encoder layers to be fine-tuned $\in \{0, 2\}$. Fine-tuning the last few layers of the encoder, as opposed to all the layers, has been shown to lead to superior test set performance (Sun et al., 2019).
We use the AdamW optimizer (Loshchilov and Hutter, 2019). The maximum number of training epochs is 25. We use early stopping with a patience of 5 epochs: if the validation accuracy does not increase for 5 consecutive epochs, we stop training. The model training was done using the PyTorch (Paszke et al., 2019) and HuggingFace Transformers (Wolf et al., 2020) libraries.
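The following sketch illustrates this training setup (AdamW, at most 25 epochs, early stopping with patience 5 on validation accuracy). The data loader format and the `evaluate` helper are assumptions for illustration, not the authors' training code.

```python
import torch

def train_with_early_stopping(model, train_loader, val_loader, evaluate,
                              lr, patience=5, max_epochs=25):
    """Minimal training loop: AdamW with early stopping on validation accuracy."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    best_acc, epochs_without_improvement = 0.0, 0
    for epoch in range(max_epochs):
        model.train()
        for input_ids, attention_mask, labels in train_loader:
            optimizer.zero_grad()
            loss = loss_fn(model(input_ids, attention_mask), labels)
            loss.backward()
            optimizer.step()
        val_acc = evaluate(model, val_loader)   # user-supplied validation accuracy
        if val_acc > best_acc:
            best_acc, epochs_without_improvement = val_acc, 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:  # stop after 5 epochs without gain
                break
    return model
```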

B.2 Interpretability methods implementation
Owing to the large runtime of methods like SHAP and Integrated Gradients, interpretations are computed only for a randomly chosen subsample of 1000 examples from the test set. Consequently, metrics like infidelity and Jaccard@K% are reported only on this subset.

Vanilla Saliency and SmoothGrad are implemented using the PyTorch autograd functionality. Integrated Gradients and SHAP are implemented using Captum (Kokhlikyan et al., 2020). The parameters of these methods are as follows.

Integrated Gradients. Requires a baseline input (Sundararajan et al., 2017) that has the same dimensionality as the model input but consists of 'non-informative' feature values. We construct the baseline by computing the embedding value of the unknown vocabulary token and repeating it N times, where N is the maximum sequence length of the model.

KernelSHAP. Requires two parameters. (i) Number of feature coalitions: following the author implementation (https://github.com/slundberg/shap), we use a value of $2L + 2^{11}$, where $L$ is the number of input tokens in the text. (ii) Dropped token value: SHAP operates by dropping subsets of tokens and estimating the model output on these perturbed inputs. We simulate dropping of a token by replacing its embedding value with that of the unknown vocabulary token.

C.1 Prediction overlap and accuracy

Table 6 shows the fraction of predictions common between Init#1 and Init#2. Table 7 shows the accuracy of Untrained. As expected, the accuracy of Untrained is much lower than that of the trained models.

C.2 Infidelity with the Input Gradient reduction

Table 8 shows the infidelity of gradient-based interpretability methods when using the Input Gradient dot product reduction of Ding et al. (2019). Comparing the results to those with the L2 reduction in Table 2, we notice that in all except two cases (VN and SG on dBT with IMDB data), the performance is worse.

C.3 Infidelity with Untrained model
Table 9 shows the infidelity for the Untrained model. Much like Figure 2, the table shows that in several cases, the performance of the feature attribution methods (most notably SHP) can be much better than that of random attribution (RND). The table also shows an exception for dBT on the IMDB dataset, where for all methods the infidelity is near 100%. This behavior is likely an artefact of the particular initial parameters, due to which the model always predicts a certain class irrespective of the input.

Appendix D Examples of top-ranked tokens
We now show some examples of Jaccard@K% computation. The examples show the input text, different models, and top-K% tokens ranked w.r.t. their importance. The attribution method used was VN.
Example 1: SST2 data. Comparing Init#1 and Init#2. Both models predict the sentiment to be positive.