Perturbing Inputs for Fragile Interpretations in Deep Natural Language Processing

Interpretability methods like Integrated Gradients and LIME are popular choices for explaining natural language model predictions with relative word importance scores. These interpretations need to be robust for trustworthy NLP applications in high-stakes areas like medicine or finance. Our paper demonstrates how interpretations can be manipulated by making simple word perturbations on an input text. Via a small number of word-level swaps, these adversarial perturbations aim to make the resulting text semantically and spatially similar to its seed input (therefore sharing similar interpretations). Simultaneously, the generated examples achieve the same prediction label as the seed yet are given a substantially different explanation by the interpretation methods. Our experiments generate fragile interpretations to attack two SOTA interpretation methods, across three popular Transformer models and on three different NLP datasets. We observe that the rank-order correlation and top-k intersection scores drop by over 20% when, on average, fewer than 10% of words are perturbed. Further, rank-order correlation keeps decreasing as more words get perturbed. Finally, we demonstrate that the candidates generated by our method score well on standard quality metrics.


Introduction
Recently, natural language processing (NLP) has gained popularity in many security-relevant tasks such as fake news identification (Zhou et al., 2019), authorship identification (Okuno et al., 2014), toxic content detection (Jigsaw, 2017), and text-based automated privacy policy understanding (Harkous et al., 2018). Since interpretations of NLP predictions have become necessary building blocks of the SOTA deep NLP workflow, explanations have the potential to mislead human users into trusting a problematic interpretation. However, there has been little analysis of the reliability and robustness of these explanation techniques, especially in high-stakes settings, making their utility for critical applications unclear.
Research has shown that it is possible to disrupt and even manipulate interpretations in deep neural networks (Ghorbani et al., 2019; Dombrowski et al., 2019). The core idea in this literature centers around "fragile interpretations". Ghorbani et al. (2019) define an interpretation as fragile if, for a given input, it is possible to generate a perturbed input that achieves the same prediction label as the seed, yet is given a substantially different interpretation. Fragility limits how much we can trust and learn from specific interpretations. An adversary exploiting "fragile interpretations" could manipulate the input to draw attention away from relevant words or onto desired features. Such input manipulation may be especially hard to detect because the predicted labels do not change.
The literature includes two relevant lines of work: (1) conducting model manipulations (Slack et al., 2019) (details in Sec. 2), and (2) manipulating input samples (Ghorbani et al., 2019). Little attention has been paid to studying fragile interpretations via input manipulation in deep NLP.
In this paper, we propose a simple algorithm "ExplainFooler" that makes small adversarial perturbations on text inputs and demonstrates the fragility of interpretations. We focus on optimizing two objective metrics, "L2 Norm" and a proposed "Delta LOM", searching for small word-swap-based input manipulations that produce misleading interpretations, while using semantics-oriented constraints to restrict the manipulations. Figure 3 provides one example perturbation process. In summary, this paper provides the following contributions:
• Our input perturbation optimizes to increase an objective metric ("L2 Norm" or "Delta LOM") that measures the difference between the original and generated interpretations. The LOM score captures the approximate "position" of the center of an interpretation and summarizes it as a scalar.
• We propose an effective algorithm "ExplainFooler" that optimizes the objective metric via an iterative procedure. Our algorithm generates a series of increasingly perturbed text inputs whose explanations are significantly different from the original while preserving predictions.
• Empirically, we show that it is possible to find perturbed text examples that fool interpretations by INTEGRATED GRADIENT and LIME, even on NLP models that are relatively more robust. The approximate process and results of word perturbation using our approach are detailed in Figure 1.

Related Work
Interpretation Methods: Several interpretation methods have been proposed (Shrikumar et al., 2017; Li et al., 2015; Bach et al., 2015) to calculate feature importance scores. Two well-known methods in this area are Integrated Gradients (IG) (Sundararajan et al., 2017) and Local Interpretable Model-agnostic Explanations (LIME) (Ribeiro et al., 2016b). IG computes the scores by summing the gradients along a path from a baseline to the input over a fixed number of steps, and subsequently multiplying by the input itself. IG overcomes the saturation problem discussed in (Shrikumar et al., 2017; Sundararajan et al., 2017). LIME, on the other hand, is a completely black-box approach that explains the predictions of any classifier in an interpretable and faithful manner by learning an interpretable model locally around the prediction, trained on perturbations generated around the input.
Fragile Interpretations: More recently, several works have discussed the robustness of these interpretations. Studies have demonstrated that the generated interpretations are not robust and can be easily manipulated due to the high dimensionality of networks (Ghorbani et al., 2019; Dombrowski et al., 2019; Slack et al., 2019). Multiple other works have tried to fix the problem by making interpretations robust (Lakkaraju et al., 2020; Rieger and Hansen, 2020). Prior work demonstrated that it is possible to introduce a new model over the original and alter gradients to fool gradient-based interpretation methods. Similarly, Slack et al. (2019) showed that black-box interpretation methods can also be fooled by adding an adversarial classifier component. More recently, Zafar et al. (2021) demonstrated empirically that interpretability methods produce varying results on identical models that are initialized differently.
Adversarial Examples that fool NLP Predictions: Adversarial examples are inputs to a predictive machine learning model that are maliciously designed to fool the model's predictions (Goodfellow et al., 2014). Multiple recent works have applied the concept of adversarial examples to language inputs, including (1) attacks by character substitution (Ebrahimi et al., 2017; Gao et al., 2018; Li et al., 2018); (2) attacks by paraphrase (Ribeiro et al., 2018; Iyyer et al., 2018); (3) attacks by synonym substitution (Alzantot et al., 2018; Kuleshov et al., 2018; Papernot et al., 2016); (4) attacks by word insertion or removal (Liang et al., 2017; Samanta and Mehta, 2017); (5) attacks that limit the L_p distance in a latent embedding space (Zhao et al., 2017). Our proposed algorithm is closely connected to the TextFooler algorithm, which searches for input perturbations that achieve misclassification. Differently, we optimize the "L2 Norm" and "Location of Mass (LOM)" objectives directly on the input space to obtain fragile explanations.

Proposed Method
In this section, we present our algorithm for generating perturbed sentences that demonstrate fragile interpretations. First, we propose the metrics "Location of Mass (LOM)" and L2 Norm, followed by a discussion of the search strategy used to optimize these objective metrics. Subsequently, we discuss the choice of interpretation methods and end with the final candidate-selection procedure and pseudocode for our algorithm (Algorithm 1). We denote a text input as x and its word importance score vector (from a specific interpretation strategy on a particular NLP model) as I.

Difference Metrics on Interpretation
To quantify the difference between two interpretations, we propose two objective metrics: "Delta LOM" and "L2 Norm". Both are divergence measures: the higher the metric, the more different the two interpretations.

"Location of Mass (LOM)" Score
First, we propose a metric inspired by (Ghorbani et al., 2019) that provides a quantifiable "position" for the interpretation of a sentence. We define the "Location of Mass (LOM)" score as:

\[ \mathrm{LOM}(I) = \frac{\sum_{t=1}^{n} t \cdot i_t}{\sum_{t=1}^{n} i_t} \]

Here n is the length of the sentence (including the starting/ending special tokens), and i_t is the interpretability score assigned to the token at index t.
We then define the "Delta LOM" metric as the absolute difference between the LOM scores of two interpretations I_1 and I_2:

\[ \Delta\mathrm{LOM}(I_1, I_2) = \left| \mathrm{LOM}(I_1) - \mathrm{LOM}(I_2) \right| \]

The intuition behind this metric is that shifting the approximate position of the "center" of an interpretation changes the relative positions and magnitudes of the importance scores. This observation is demonstrated in Figure 3.

L2 Norm Metric
We also use a standard L2 norm to measure the difference between two interpretations:

\[ \mathrm{L2Norm}(I_1, I_2) = \lVert I_1 - I_2 \rVert_2 = \sqrt{\sum_{t=1}^{n} \left( i_{1,t} - i_{2,t} \right)^2} \]

The L2 Norm quantifies the extent of the difference: the higher the L2 Norm, the more the patterns of the two interpretations differ.
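To make the two metrics concrete, a minimal NumPy sketch is given below. It assumes the two interpretations have already been aligned to the same token length; the function names are ours, not from the paper's released code.

```python
import numpy as np

def lom(scores: np.ndarray) -> float:
    """Location of Mass: importance-weighted average token position."""
    positions = np.arange(1, len(scores) + 1)
    return float((positions * scores).sum() / scores.sum())

def delta_lom(i1: np.ndarray, i2: np.ndarray) -> float:
    """Absolute difference of the two LOM scores (higher = more different)."""
    return abs(lom(i1) - lom(i2))

def l2_norm(i1: np.ndarray, i2: np.ndarray) -> float:
    """Euclidean distance between two importance-score vectors."""
    return float(np.linalg.norm(i1 - i2))
```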

Searching for Word-level Perturbations
Our objective is to perturb a seed input x into a slightly-modified text x_adv so that ∆LOM or L2 Norm is maximized under a set of constraints. First, we rank each word of an input sentence by its importance to the model's prediction. This is done with the leave-one-out approach (Li et al., 2016), which removes each word from the sentence one at a time and measures the change in the prediction value, ranking the words that produce the greatest change as most important. Subsequently, we search in decreasing order of word importance, substituting each word with its k closest nearest neighbors according to counter-fitted synonym embeddings (Mrksic et al., 2016). After every word replacement, the interpretation is recalculated with the victim interpretation method we try to attack. A sketch of the ranking step follows.
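The following sketch illustrates the leave-one-out ranking under simplifying assumptions: `predict_proba` is a hypothetical wrapper returning the model's probability for the originally predicted class of the full sentence.

```python
def rank_words_by_importance(words, predict_proba):
    """Leave-one-out ranking: drop each word and measure the prediction change."""
    base_score = predict_proba(" ".join(words))
    importance = []
    for i in range(len(words)):
        ablated = words[:i] + words[i + 1:]          # sentence without word i
        drop = base_score - predict_proba(" ".join(ablated))
        importance.append((drop, i))
    # Largest prediction drop first = most important word first.
    return [i for _, i in sorted(importance, reverse=True)]
```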

Ensuring Constraints
We enforce the following five constraints on each perturbed candidate, to ensure candidates do not lose the linguistic structure and approximate semantic meaning of the seed input.
• Repeat Modification: Stops the same word from being perturbed more than once.
• Stop Word Modification: Excludes pre-defined stop words from being perturbed.
• Word Embedding Distance: Only swaps the original word with words within a maximum embedding distance under counter-fitted embeddings.
• Part of Speech: Replaces the original word only with words of the same part of speech.
• Sentence Embedding: Ensures the difference in the Universal Sentence Embedding is less than a pre-defined threshold (Cer et al., 2018).
A minimal constraint-checking sketch follows the list.
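The sketch below assumes hypothetical helpers `pos_tag`, `embedding_distance`, and `sentence_similarity` (e.g., a Universal Sentence Encoder wrapper); none of these names come from the paper, and the threshold values are illustrative.

```python
STOP_WORDS = {"the", "a", "an", "of", "and", "or"}  # illustrative subset

def passes_constraints(orig_word, new_word, idx, perturbed_indices,
                       orig_sentence, new_sentence,
                       pos_tag, embedding_distance, sentence_similarity,
                       max_embed_dist=0.5, min_sent_sim=0.5):
    if idx in perturbed_indices:                 # repeat modification
        return False
    if orig_word.lower() in STOP_WORDS:          # stop-word modification
        return False
    if embedding_distance(orig_word, new_word) > max_embed_dist:
        return False                             # counter-fitted embedding distance
    if pos_tag(orig_word) != pos_tag(new_word):  # part of speech
        return False
    if sentence_similarity(orig_sentence, new_sentence) < min_sent_sim:
        return False                             # sentence-embedding similarity
    return True
```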

Victim Interpretation Choices
Integrated Gradient: We calculate INTEGRATED GRADIENT (Sundararajan et al., 2017) interpretations of NLP models using the open-source package Captum (Kokhlikyan et al., 2020), which provides accurate implementations of various interpretation methods. We apply the algorithm to the embedding space of the models; once the attributions are calculated, they are summed along the embedding dimension to derive per-word importance scores. Subsequently, the ∆LOM and L2 Norm scores of each candidate perturbation are calculated against the original input's interpretation.
LIME: The LIME interpretations are calculated using the official LIME code provided by (Ribeiro et al., 2016a). We normalize the LIME scores by dividing the vector by its L2 norm. Subsequently, the ∆LOM and L2 Norm scores of each candidate perturbation are calculated against the original input's interpretation.
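The reduction from embedding-level IG attributions to word scores, and the LIME normalization, amount to the following one-liners (a sketch; the tensor shapes are illustrative placeholders):

```python
import numpy as np

ig_attr = np.random.randn(12, 768)          # (seq_len, embed_dim), placeholder
ig_word_scores = ig_attr.sum(axis=-1)       # sum over the embedding dimension

lime_scores = np.random.randn(12)           # one weight per word, placeholder
lime_scores = lime_scores / np.linalg.norm(lime_scores)  # divide by L2 norm
```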

Finding the ideal candidate
Once we obtain all candidates that achieve the same prediction label as the original, together with their metric scores, we group them by the number m of words perturbed. This gives us a list of candidates for each level of word perturbation and the associated change in the objective metric. Next, for each level, the candidate with the highest metric score against the original is chosen. Finally, we convert the number of perturbed words into a ratio with respect to the input's length, to account for varying sentence lengths and obtain a normalized measure. The ratio is limited to 50% because once more than half the words are perturbed, the sentence starts losing its semantic meaning. The complete selection process is schematically detailed in Figure 2; a selection sketch is shown below.
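A minimal sketch of the selection step, assuming each candidate is represented as a tuple `(num_perturbed, metric_score, sentence)`; this data layout is ours, not the paper's.

```python
from collections import defaultdict

def select_best_candidates(candidates, seed_length, max_ratio=0.5):
    """Keep, per perturbation level, the candidate with the highest metric."""
    best = defaultdict(lambda: (float("-inf"), None))
    for num_perturbed, score, sentence in candidates:
        ratio = num_perturbed / seed_length       # normalize by input length
        if ratio > max_ratio:                     # discard >50% perturbed
            continue
        if score > best[num_perturbed][0]:
            best[num_perturbed] = (score, sentence)
    return {m: s for m, (_, s) in sorted(best.items()) if s is not None}
```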

Algorithm
Algorithm 1 "ExplainFooler" provides pseudocode to compute and select a list of candidates that can induce fragile explanations. Our implementation adapts and builds on top of the open-source package TextAttack (Morris et al., 2020).

Data Summary
The experiments are conducted on three datasets for the text classification task: the validation set of SST-2 (Socher et al., 2013), the test set of AG News (Zhang et al., 2015), and the test set of IMDB (Maas et al., 2011). We select the first 500 sentences from the SST-2 and AGNews datasets and 100 sentences from the IMDB test set to run our experiments. We discard sentences with 2 words or fewer.

[Algorithm 1 ("ExplainFooler"). Result: A, a list of candidate sentences ordered by the number of words perturbed from the original. For each sentence in the dataset: A ← empty; S ← original sentence; C ← empty; while possible perturbations exist: c ← perturb S and get candidate; if constraints pass and the prediction label is the same as S's, keep c.]
• SST-2: The Stanford Sentiment Treebank-2 dataset for movie review classification, with two classes: positive and negative. Experiments are conducted on the first 500 sentences of the validation set.
• AG News: A collection of news articles belonging to 4 classes: World, Sports, Science/Technology and Business. Experiments are conducted on the first 500 sentences of the test set.
• IMDB: The IMDB movie review dataset for binary sentiment classification, containing highly polar reviews. Experiments are conducted on the first 500 sentences of the test set, except for LIME, where only 100 sentences are used because of the very high computation time caused by the long average sentence length.

Interpretability Parameters
IG: Since Integrated Gradients is a gradient-based approach and requires a reference baseline, we compute the attributions on the embedding space and set the reference baseline to the special token <PAD>, which is reserved in transformers as a special character. The number of steps for Integrated Gradients was chosen as 50, i.e., the gradients are accumulated over 50 steps along the path from the baseline to the input. LIME: The number of perturbation samples for LIME was chosen as 500, and the maximum number of top-k words was chosen as 512, the truncation limit for all the models.
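As a sketch of how these settings map onto the two libraries (the model, tokenizer, and example text are illustrative; we assume a standard Huggingface sequence classifier):

```python
import torch
from captum.attr import LayerIntegratedGradients
from lime.lime_text import LimeTextExplainer
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")

def forward(input_ids):
    return torch.softmax(model(input_ids).logits, dim=-1)

# Integrated Gradients on the embedding layer: 50 steps, <PAD> baseline.
lig = LayerIntegratedGradients(forward, model.get_input_embeddings())
enc = tok("a gripping movie", return_tensors="pt")
baseline = torch.full_like(enc["input_ids"], tok.pad_token_id)
attr = lig.attribute(enc["input_ids"], baselines=baseline, target=1, n_steps=50)
word_scores = attr.sum(dim=-1)                    # reduce the embedding axis

# LIME with 500 neighborhood samples and up to 512 features.
def predict_proba(texts):
    batch = tok(list(texts), return_tensors="pt", padding=True, truncation=True)
    return torch.softmax(model(**batch).logits, dim=-1).detach().numpy()

explainer = LimeTextExplainer()
exp = explainer.explain_instance("a gripping movie", predict_proba,
                                 num_features=512, num_samples=500)
```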

Perturbation Parameters
We choose the number of nearest neighbours as 50 when swapping words, to limit the number of candidates. The minimum embedding cosine similarity between the original and perturbed sentences was set to 0.5, to ensure sentences do not lose their semantic meaning.

Under the Hood
Pre-processing: Sentences with fewer than 2 words are removed from all datasets, because word perturbations may not exist for them and because very short sentences can exhibit spuriously large differences in rank correlation, which would drag down the evaluation metrics. Each sentence in all datasets is also converted to lower-case. Fixing Tokenizations: Because the pre-trained tokenizers of transformer models rely on greedy longest-match subword lookup over a fixed vocabulary, many words in candidate sentences are tokenized in an unexpected manner. This changes the length of the token list, which in turn changes the length of the interpretations. To alleviate this problem, we test two distinct approaches for combining the unnaturally tokenized subwords back into their original words.
• Average: The first approach combines all the tokens prefixed by a set character (## in the case of DistilBERT) into one single word and assigns the average of the token scores to the combined word.
• Max: The second approach combines all the tokens prefixed by a set character (## in the case of DistilBERT) into one single word and assigns the signed value with the maximum absolute magnitude to the combined word.
Upon careful review, we use the second approach for our experiments: in the uncommon cases where some subword tokens hold polarity opposite to that of the word as a whole, averaging "dilutes" the value of the merged word. An example of the effectiveness of the "Max" approach is given in Figure 4. A merging sketch follows.
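A minimal sketch of the "Max" merging rule, assuming WordPiece-style `##` continuation prefixes (as in DistilBERT):

```python
def merge_subword_scores(tokens, scores, prefix="##"):
    """Merge ##-continuation tokens; keep the signed max-magnitude score."""
    words, word_scores = [], []
    for tok, score in zip(tokens, scores):
        if tok.startswith(prefix) and words:
            words[-1] += tok[len(prefix):]        # extend the previous word
            if abs(score) > abs(word_scores[-1]):
                word_scores[-1] = score           # signed max-abs wins
        else:
            words.append(tok)
            word_scores.append(score)
    return words, word_scores

# e.g. merge_subword_scores(["inter", "##pre", "##table"], [0.1, -0.4, 0.2])
# -> (["interpretable"], [-0.4])
```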

Rank Correlation
To compare the interpretations of two sentences, we use the Spearman rank correlation metric: the more the ranks of the interpretation scores agree, the higher the correlation. Importantly, we clip negative values of the metric to 0, because a negative correlation is not meaningful when comparing only differences in ranks and can spuriously drag down the average scores. A sketch follows.
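Computed, for instance, with SciPy (a sketch; the clipping to zero matches the convention described above):

```python
import numpy as np
from scipy.stats import spearmanr

def clipped_rank_correlation(i1, i2):
    """Spearman rank correlation between two score vectors, clipped at 0."""
    rho, _ = spearmanr(i1, i2)
    return max(0.0, rho) if not np.isnan(rho) else 0.0
```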

Top-50% Intersection
To compare the extent to which two interpretations agree on the most highly attributed words, we use the top-k% intersection metric. We first find, in each interpretation, the words with the maximum absolute attribution values (the most important for the prediction), and then compute the intersection of the top 50% highest-attribution words. A sketch follows.
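A sketch of the top-50% intersection (normalizing by the top-set size is our assumption):

```python
import numpy as np

def top_k_intersection(i1, i2, k=0.5):
    """Fraction of top-k% highest |attribution| positions shared by i1 and i2."""
    n = max(1, int(len(i1) * k))
    top1 = set(np.argsort(-np.abs(i1))[:n])
    top2 = set(np.argsort(-np.abs(i2))[:n])
    return len(top1 & top2) / n
```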

Candidate Quality
To judge the quality of the candidates generated by "ExplainFooler", we calculate two commonly used quality metrics from the adversarial attack literature, perplexity and the absolute number of grammatical errors, similar to (Li et al., 2020). Perplexity: We use perplexity to estimate the fluency of candidates generated by "ExplainFooler"; the lower the value, the more fluent the candidate. It is measured using a small GPT-2 model (50k vocabulary) (Radford et al., 2019). Grammatical Errors: We estimate the average absolute difference in the number of grammatical errors between the original and candidate sentences, computed with Language Tool (Naber et al., 2003). Sketches of both metrics follow.
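Both metrics can be computed with standard tooling; the following sketch uses Huggingface's GPT-2 and the `language_tool_python` wrapper (the wrapper is our tool choice; the paper cites Language Tool itself).

```python
import torch
import language_tool_python
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

gpt2_tok = GPT2TokenizerFast.from_pretrained("gpt2")
gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
grammar_tool = language_tool_python.LanguageTool("en-US")

def perplexity(sentence: str) -> float:
    """exp(mean negative log-likelihood) under GPT-2; lower = more fluent."""
    enc = gpt2_tok(sentence, return_tensors="pt")
    with torch.no_grad():
        loss = gpt2(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()

def grammar_error_delta(original: str, candidate: str) -> int:
    """Absolute difference in the number of flagged grammar errors."""
    return abs(len(grammar_tool.check(candidate)) - len(grammar_tool.check(original)))
```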

Model Choices
The robustness concerns around interpretation strategies challenge their use in critical applications, e.g., by undermining trust. However, it is unclear what causes "fragile explanations": the model or the interpretation method? We therefore select three different transformer models, namely DistilBERT-uncased (Sanh et al., 2019), RoBERTa-base (Liu et al., 2019) and BERT-base (Devlin et al., 2018), to conduct our experiments. More importantly, we retrain BERT-base to obtain BERT-base-adv, an adversarially trained version of the BERT-base model. The rationale behind these choices is to investigate the impact of a model's robustness on the robustness of its interpretations.
(1) First, a generic transformer model like DistilBERT is relatively smaller and faster but less robust than the other two. (2) Next, RoBERTa is more extensively pre-trained and markedly more robust. (3) Lastly, the BERT-base-adv model results from adversarial training, for which we use the popular TextFooler attack as implemented in TextAttack (Morris et al., 2020). The DistilBERT and RoBERTa models are pre-trained models fine-tuned on the respective datasets; we take them from Huggingface's transformer model hub (Wolf et al., 2020) without change. In contrast, the BERT-base-adv model is adversarially trained by attacking 10000 training examples for the IMDB and AG News datasets and all training samples for the SST-2 dataset.

Rank Order and Top-50% Intersection
The results are reported in tabular form across 3 datasets (SST-2, AG News and IMDB), 3 models (DistilBERT, RoBERTa and BERT-adv; Section A, Appendix) and 2 interpretability methods, covering both metrics, L2 Norm and "Delta LOM", and compared against random candidate selection independent of both metrics. The first set of tables (Tables 1 and 3) reports the average rank-order correlation between the perturbed and original interpretations across perturbation ratios in buckets of 10%. The second set of tables (Tables 2 and 4) reports the average top-50% intersection across the same perturbation ratios. The rank correlation results for the IMDB dataset are reported only on IG, due to the excessive computational cost of calculating interpretations with LIME. Due to space constraints, the results for the AGNews and IMDB datasets are reported in the Appendix (Tables 8-11 and Tables 12-13 respectively, Section A.2), along with a more detailed representation of the intra-bucket distributions in the form of violin plots (Section A.3, Appendix). A bucket contains all perturbed candidates whose perturbation ratio falls between its lower and upper bound; for example, the bucket "0.1-0.2" contains all rank-order correlation results from sentences with between 10% and 20% of their words perturbed. We also provide violin plots in the appendix showcasing the intra-bucket distributions for the SST-2 dataset (Figures 7-10).

[Table 3: Change in average rank-order correlation using the metrics L2 Norm, LOM and random selection, computed using the interpretability method LIME, for dataset SST-2 over 3 models: DistilBERT, RoBERTa and BERT-adv.]
[Table 4: Change in average top-50% intersection using the metrics L2 Norm, LOM and random selection, computed using the interpretability method LIME, for dataset SST-2 over 3 models: DistilBERT, RoBERTa and BERT-adv.]
[Table 5: Average perplexity, calculated using a small GPT-2 model, over all candidates generated by "ExplainFooler" (C-avg). The values in columns LOM and L2 denote perplexity on the sentences selected with the proposed LOM and L2 Norm metrics. The average perplexity of the original sentences in each dataset is given in parentheses. Selection using the metrics yields more fluent sentences.]
[Table 6: Average model confidence in the correct prediction for increasing numbers of perturbed words (calculated on candidates generated by "ExplainFooler") over the models DistilBERT, RoBERTa and BERT-adv on the datasets SST-2, AGNews and IMDB.]

We observe that both the average rank-order correlation and the top-50% intersection scores decrease as the ratio of perturbed words increases. This implies that the interpretations become increasingly dissimilar to the original sentence's as more words are perturbed, even though the prediction robustness of the models remains high (see Table 6, Figure 12). Similar trends are observed across all models and datasets and for both victim interpretability methods. These empirical observations demonstrate that the interpretations generated by INTEGRATED GRADIENT and LIME are fragile for all models, even models that are adversarially more robust (BERT-adv). The model confidence values on the generated candidates are reported in Table 6. To further demonstrate the effectiveness of the proposed metrics, we plot violin plots on the SST-2 dataset for average rank correlation under metric-based versus random selection (Figure 5, Appendix).

Quality of candidates
Perplexity: The average perplexity values over all models and datasets are reported in Table 5. Candidates selected using LOM and L2 Norm have lower perplexity scores (implying better fluency) than the average over all candidates generated by "ExplainFooler" and are much closer to the original dataset's perplexity. Grammatical Errors: The average absolute difference in the number of grammatical errors between original and candidate sentences, computed with Language Tool (Naber et al., 2003), is reported over all models and datasets in Table 7.

Conclusions
The literature shows a growing emphasis on interpretation techniques for explaining NLP model predictions. Our work presents a novel algorithm that generates perturbed inputs providing evidence of fragile interpretations in NLP. We demonstrate the effectiveness of our approach across three different models, one of them adversarially trained. Our results show that it is possible to attack interpretations using simple input-level word swaps under certain constraints. We also show that both black-box and white-box interpretability approaches (LIME and INTEGRATED GRADIENT) exhibit fragility in their derived interpretations. We hope our findings pave the way for future studies on defending against fragile interpretations in NLP.

A Appendix
A.1 Compare with Baseline

Figures 5 and 6 show the decrease in average rank correlation when considering random candidates as opposed to selection using the LOM metric.

A.2 Additional Results
In this section, we report the average rank-order correlation and the average top-50% intersection scores for the AGNews and IMDB datasets. Tables 8 and 9 report AGNews' rank correlation and top-50% scores using INTEGRATED GRADIENT, whereas Tables 10 and 11 show the same values using LIME. Tables 12 and 13 show the corresponding values for the IMDB dataset.

A.3 Violin Plots for intra-bucket distribution analysis
The violin plots convey more information about the relative distribution of average rank correlation and top-50% values across bucket ratios. The following figures are reported only on the SST-2 dataset, for each combination of evaluation metric and interpretability method.

A.4 Visual Results
A few visual results demonstrating the gradual change in interpretations of candidate adversaries are shown in Figure 12. It can be observed that the ∆LOM score gradually increases with word perturbations. The examples show the same 3 sentences from the dataset perturbed under DistilBERT and RoBERTa respectively.

[Figure 5: Violin graphs demonstrating the effectiveness of candidate selection based on the proposed LOM and L2 Norm metrics over random selection for the SST-2 dataset. Selection based on the proposed metrics disrupts rank correlation more than randomly selecting candidates.]
[Figure 6: Violin graphs demonstrating the effectiveness of candidate selection based on the proposed LOM and L2 Norm metrics over random selection for the AGNews dataset. Selection based on the proposed metrics disrupts rank correlation more than randomly selecting candidates.]
[Table 10: Change in average rank-order correlation using the metrics L2 Norm, LOM and random selection, computed using the interpretability method LIME, for dataset AGNews over 3 models: DistilBERT, RoBERTa and BERT-adv.]
[Table 11: Change in average top-50% intersection using the metrics L2 Norm, LOM and random selection, computed using the interpretability method LIME, for dataset AGNews over 3 models: DistilBERT, RoBERTa and BERT-adv.]
[Table 13: Change in average rank-order correlation using the metrics L2 Norm, LOM and random selection, computed using the interpretability method INTEGRATED GRADIENT, for dataset IMDB over 3 models: DistilBERT, RoBERTa and BERT-adv.]