Improving the Faithfulness of Attention-based Explanations with Task-specific Information for Text Classification

Neural network architectures in natural language processing often use attention mechanisms to produce probability distributions over input token representations. Attention has empirically been demonstrated to improve performance in various tasks, while its weights have been extensively used as explanations for model predictions. Recent studies (Jain and Wallace, 2019; Serrano and Smith, 2019; Wiegreffe and Pinter, 2019) have shown that it cannot generally be considered a faithful explanation (Jacovi and Goldberg, 2020) across encoders and tasks. In this paper, we seek to improve the faithfulness of attention-based explanations for text classification. We achieve this by proposing a new family of Task-Scaling (TaSc) mechanisms that learn task-specific non-contextualised information to scale the original attention weights. Evaluation tests for explanation faithfulness show that the three proposed variants of TaSc improve attention-based explanations across two attention mechanisms, five encoders and five text classification datasets without sacrificing predictive performance. Finally, we demonstrate that TaSc consistently provides more faithful attention-based explanations compared to three widely-used interpretability techniques.


Introduction
Natural Language Processing (NLP) approaches for text classification are often underpinned by large neural network models (Cho et al., 2014; Devlin et al., 2019). Despite the high accuracy and efficiency of these models in dealing with large amounts of data, an important problem is their increased complexity, which makes them opaque and hard to interpret by humans, who usually treat them as black boxes (Zhang et al., 2018; Linzen et al., 2019).
Attention mechanisms (Bahdanau et al., 2015) produce a probability distribution over the input to compute a vector representation of the entire token sequence as the weighted sum of its constituent vectors. A common practice is to provide explanations for a given prediction and qualitative model analysis by assigning importance to input tokens using scores provided by attention mechanisms (Chen et al., 2017; Wang et al., 2016; Jain et al., 2020; Sun and Lu, 2020) as a means towards model interpretability (Lipton, 2016; Miller, 2019).
A faithful explanation is one that accurately represents the true reasoning behind a model's prediction (Jacovi and Goldberg, 2020). A series of recent studies illustrate that attention weights do not always provide faithful explanations (Serrano and Smith, 2019), while different text encoders can affect attention interpretability, e.g. results can differ when using a recurrent or a non-recurrent encoder (Wiegreffe and Pinter, 2019).
A limitation of attention as an indicator of input importance is that it refers to the word in context, due to information mixing in the model (Tutek and Snajder, 2020). Motivated by this, we aim to improve the effectiveness of neural models in providing more faithful attention-based explanations for text classification by introducing non-contextualised information into the model. Our contributions are as follows:
• We introduce three Task-Scaling (TaSc) mechanisms (§4), a family of encoder-independent components that learn task-specific non-contextualised importance scores for each word in the vocabulary to scale the original attention weights, and which can easily be ported to any neural architecture;
• We show that TaSc variants offer more robust, consistent and faithful attention-based explanations compared to using vanilla attention on a set of standard interpretability benchmarks, without sacrificing predictive performance (§6);
• We demonstrate that attention-based explanations with TaSc consistently outperform explanations obtained from two gradient-based and one word-erasure explanation approach (§7).
Related Work

Model Interpretability
Explanations for neural networks can be obtained by identifying which parts of the input are important for a given prediction. One way is to use sparse linear meta-models that are easier to interpret (Ribeiro et al., 2016; Lundberg and Lee, 2017; Nguyen, 2018). Another way is to calculate the difference in a model's prediction between keeping and omitting an input token (Robnik-Šikonja and Kononenko, 2008; Li et al., 2016b; Nguyen, 2018). Input importance is also measured using the gradients computed with respect to the input (Kindermans et al., 2016; Li et al., 2016a; Arras et al., 2016; Sundararajan et al., 2017). Chen and Ji (2020) propose learning a variational word mask to improve model interpretability. Finally, extracting a short snippet from the original input text (a rationale) and using it to make a prediction has recently been proposed (Lei et al., 2016; Bastings et al., 2019; Treviso and Martins, 2020; Jain et al., 2020; Chalkidis et al., 2021). Nguyen (2018) and Atanasova et al. (2020) compare explanations produced by different approaches, showing that in most cases gradient-based approaches outperform sparse linear meta-models.

Attention as Explanation
Attention weights have been extensively used to interpret model predictions in NLP (e.g. Cho et al., 2014; Xu et al., 2015; Barbieri et al., 2018; Ghaeini et al., 2018). However, the hypothesis that attention should be used as explanation had not been explicitly studied until recently.
Jain and Wallace (2019) first explored the effectiveness of attention explanations. They show that adversarial attention distributions can yield equivalent predictions to the original attention distribution, suggesting that attention weights do not offer robust explanations. In contrast to Jain and Wallace (2019), Wiegreffe and Pinter (2019) and Vashishth et al. (2019) demonstrate that attention weights can in certain cases provide robust explanations. Pruthi et al. (2020) also investigate the ability of attention weights to provide plausible explanations. They test this by manipulating the attention mechanism to penalise words a priori known to be relevant to the task, showing that predictive performance remains relatively unaffected. Sen et al. (2020) assess the plausibility of attention weights by correlating them with manually annotated explanation heat-maps, where plausibility refers to how convincing an explanation is to humans (Jacovi and Goldberg, 2020). However, Jacovi and Goldberg (2020) and Grimsley et al. (2020) suggest caution when interpreting the results of these experiments, as they do not test the faithfulness of explanations (e.g. an explanation can be non-plausible but faithful, or vice-versa). Serrano and Smith (2019) test the faithfulness of attention-based explanations by removing tokens and observing how fast a decision flip happens. Their results show that gradient attention-based rankings (i.e. combining an attention weight with its gradient) better predict word importance for model predictions compared to just using the attention weights. Tutek and Snajder (2020) propose a method to improve the faithfulness of attention explanations when using recurrent encoders by introducing a word-level objective to sequence classification tasks. Also focusing on recurrent encoders, Mohankumar et al. (2020) introduce a modification to recurrent encoders that reduces repetitive information across different words in the input to improve the faithfulness of explanations.
To the best of our knowledge, no previous work has attempted to improve the faithfulness of attention-based explanations across different encoders for text classification by inducing task-specific information into the attention weights.

Neural Text Classification Models
In a typical neural model with attention for text classification, one-hot-encoded tokens $x_i \in \mathbb{R}^{|V|}$ are first mapped to embeddings $e_i \in \mathbb{R}^d$, where $i \in [1, \ldots, t]$ denotes the position in the sequence, $t$ the sequence length, $|V|$ the vocabulary size and $d$ the dimensionality of the embeddings. The embeddings $e_i$ are then passed to an encoder to produce hidden representations $h_i = \mathrm{Enc}(e_i)$, where $h_i \in \mathbb{R}^N$, with $N$ the size of the hidden representation. A vector representation $c$ for the entire text sequence $x_1, \ldots, x_t$ is subsequently obtained as the sum of $h_i$ weighted by attention scores $\alpha_i$:

$$c = \sum_{i=1}^{t} \alpha_i h_i \tag{1}$$

Vector $c$ is finally passed to the output, a fully-connected linear layer followed by a softmax activation function.
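To make the setup concrete, the following is a minimal illustrative sketch of this forward pass in PyTorch (the module and variable names are ours and not taken from the original implementation; the Tanh attention scorer shown is defined in the next subsection):

```python
import torch
import torch.nn as nn

class AttentionClassifier(nn.Module):
    """Illustrative text classifier: embed -> encode -> attend -> classify."""

    def __init__(self, vocab_size, emb_dim, hidden_dim, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        # Any encoder can be plugged in here (LSTM, GRU, MLP, CNN, BERT); an LSTM is shown.
        self.encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        # Additive (Tanh) self-attention: phi(h_i, q) = q^T tanh(W h_i)
        self.W = nn.Linear(hidden_dim, hidden_dim)
        self.q = nn.Parameter(torch.randn(hidden_dim))
        self.output = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):
        e = self.embedding(token_ids)             # (batch, t, d)
        h, _ = self.encoder(e)                    # (batch, t, N)
        scores = torch.tanh(self.W(h)) @ self.q   # similarity phi, (batch, t)
        alpha = torch.softmax(scores, dim=-1)     # attention weights
        c = (alpha.unsqueeze(-1) * h).sum(dim=1)  # weighted sum, Eq. (1)
        return self.output(c), alpha
```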

Encoders
To obtain representations $h_i$, we consider the following recurrent, non-recurrent and Transformer (Vaswani et al., 2017) encoders: an LSTM, a GRU, an MLP, a CNN and BERT (Devlin et al., 2019).

Attention Mechanisms
Attention scores $\alpha_i$ are computed by passing the representations $h_i$ obtained from the encoder to the attention mechanism, which usually consists of a similarity function $\phi$ followed by a softmax:

$$\alpha = \mathrm{softmax}\big(\phi(\mathbf{h}, q)\big) \tag{2}$$

where $q \in \mathbb{R}^N$ is a trainable self-attention vector similar to Yang et al. (2016). Following Jain and Wallace (2019), we consider two self-attention similarity functions: (i) Additive Attention (Tanh; Bahdanau et al., 2015):

$$\phi(h_i, q) = q^\top \tanh(W h_i) \tag{3}$$

where $W$ is a trainable model parameter; and (ii) Scaled Dot-Product (Dot; Vaswani et al., 2017):

$$\phi(h_i, q) = \frac{h_i \cdot q}{\sqrt{N}} \tag{4}$$

Task-Scaling (TaSc) Mechanisms

With a contextual encoder, each token representation $h_i$ contains information from the whole sequence, so the attention weights actually refer to the input word in context and not individually (Tutek and Snajder, 2020). Inspired by simple and highly interpretable bag-of-words models, which assign a single weight to each word type (word in a vocabulary), we hypothesise that by scaling each input word's contextualised representation $h_i$ (see Eq. 1) by both its attention score and a non-contextualised word type scalar score, we can improve attention-based explanations. The intuition is that by having a less contextualised sequence representation $c$ we can reduce information mixing for attention.
For that purpose, we introduce the non-contextualised word type score $s_{x_i}$ into Eq. 1 to enrich the text representation $c$, such that:

$$c = \sum_{i=1}^{t} \alpha_i \, s_{x_i} \, h_i \tag{5}$$

We compute $s_{x_i}$ by proposing three Task-Scaling (TaSc) mechanisms.

Linear TaSc (Lin-TaSc)
We first introduce Linear TaSc (Lin-TaSc), the simplest method in the family of TaSc mechanisms, which estimates a scalar weight for each word in the vocabulary by introducing a new vector $u \in \mathbb{R}^{|V|}$. Given the input sequence $x = [x_1, \ldots, x_t]$ representing one-hot encodings of the tokens, we perform a look-up on $u$ to obtain the scalar weights of the words in the sequence. $u$ is randomly initialised and updated partially at each training iteration, because naturally each input sequence contains only a small subset of the vocabulary words. We then obtain a task-scaled embedding $\hat{e}_i$ for a token $i$ in the input by multiplying the original token embedding by its word type weight $u_i$:

$$\hat{e}_i = u_i \, e_i \tag{6}$$

The intuition is that the embedding vector $e_i$ was trained on general corpora and is a non-contextualised "generic" representation of input $x_i$. As such, the score $u_i$ scales $e_i$ to the task. We subsequently compute context-independent scores $s_{x_i}$ for each token in the sequence by summing all elements of its corresponding task-scaled embedding $\hat{e}_i$, i.e. $s_{x_i} = \sum_{d} \hat{e}_i$, in a similar way that token embeddings are averaged in the top layers of a neural architecture. We opted to sum rather than average because we want to retain large and small values from the task-scaled embedding vector $\hat{e}_i$ (Atanasova et al., 2020). As the attention scores pertain to the word in context (Tutek and Snajder, 2020), we also expect the score $s_{x_i}$ to pertain to the word without the contextualised information. That way, we complement attention, which results in a richer sequence representation $c$.
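A minimal sketch of Lin-TaSc under these assumptions is shown below (illustrative PyTorch code; parameter and variable names are ours, not the authors'):

```python
import torch
import torch.nn as nn

class LinTaSc(nn.Module):
    """Lin-TaSc sketch: one scalar weight per vocabulary word type."""

    def __init__(self, vocab_size):
        super().__init__()
        # u in R^{|V|}: one task-specific scalar per word type, randomly initialised.
        self.u = nn.Parameter(torch.randn(vocab_size))

    def forward(self, token_ids, e):
        # token_ids: (batch, t) word indices; e: (batch, t, d) token embeddings.
        u_i = self.u[token_ids]            # look-up of word type weights, (batch, t)
        e_hat = u_i.unsqueeze(-1) * e      # task-scaled embeddings, Eq. (6)
        s = e_hat.sum(dim=-1)              # s_{x_i}: sum over embedding dimensions
        return s

# Inside the classifier, s then scales the attention-weighted sum, Eq. (5):
#   c = ((alpha * s).unsqueeze(-1) * h).sum(dim=1)
```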

Feature-wise TaSc (Feat-TaSc)
Lin-TaSc assigns equal weighting to all the dimensions of the word embedding $e_i$ (see Eq. 6), but some of them might be more important than others. Inspired by the RETAIN mechanism (Choi et al., 2016), Feature-wise TaSc (Feat-TaSc) learns different weights for each embedding dimension to identify the most important of them. Compared to Lin-TaSc, where $e_i$ is scaled uniformly across all vector dimensions, with Feat-TaSc each dimension is scaled independently. To achieve this, we introduce a learnable matrix $U \in \mathbb{R}^{|V| \times d}$. Similar to Lin-TaSc, given the input sequence $x$, we perform a look-up on $U$ to obtain $U_s = [u_1, \ldots, u_t]$. $U$ is randomly initialised and updated partially at each training iteration. To obtain $s_{x_i}$, we compute the dot product between $u_i$ and the embedding vector $e_i$: $s_{x_i} = u_i \cdot e_i$.
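A corresponding sketch of Feat-TaSc follows (illustrative code under the same assumptions as above):

```python
import torch
import torch.nn as nn

class FeatTaSc(nn.Module):
    """Feat-TaSc sketch: one weight per vocabulary word *and* embedding dimension."""

    def __init__(self, vocab_size, emb_dim):
        super().__init__()
        # U in R^{|V| x d}, randomly initialised and updated partially per batch.
        self.U = nn.Parameter(torch.randn(vocab_size, emb_dim))

    def forward(self, token_ids, e):
        u_i = self.U[token_ids]        # (batch, t, d) per-dimension word type weights
        s = (u_i * e).sum(dim=-1)      # dot product u_i . e_i gives s_{x_i}, (batch, t)
        return s
```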

Convolutional TaSc (Conv-TaSc)
Lin-TaSc and Feat-TaSc weigh the original word embedding $e_i$ but do not consider any interactions between embedding dimensions. Conv-TaSc addresses this limitation by extending Lin-TaSc. We apply a CNN with $n$ channels over the scaled embedding $\hat{e}_i$ from Lin-TaSc, keeping a single stride and a 1-dimensional kernel. This way, we ensure that input words remain context-independent. We then sum over the filtered scaled embedding $\hat{e}^f_i$ to obtain the scores $s_{x_i}$.
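A sketch of Conv-TaSc is given below (illustrative code; the exact convolution configuration, e.g. the number of channels and any activation, is our assumption):

```python
import torch
import torch.nn as nn

class ConvTaSc(nn.Module):
    """Conv-TaSc sketch: 1-D convolution (kernel size 1) over Lin-TaSc scaled embeddings."""

    def __init__(self, vocab_size, emb_dim, n_channels):
        super().__init__()
        self.u = nn.Parameter(torch.randn(vocab_size))  # Lin-TaSc word type weights
        # kernel_size=1 with stride 1 keeps every token context-independent.
        self.conv = nn.Conv1d(emb_dim, n_channels, kernel_size=1, stride=1)

    def forward(self, token_ids, e):
        e_hat = self.u[token_ids].unsqueeze(-1) * e              # task-scaled embeddings, Eq. (6)
        e_f = self.conv(e_hat.transpose(1, 2)).transpose(1, 2)   # filtered embeddings, (batch, t, n)
        s = e_f.sum(dim=-1)                                      # s_{x_i}: sum over channels
        return s
```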

Evaluating Attention-based Interpretability

Jacovi and Goldberg (2020) propose that an appropriate measure of the faithfulness of an explanation can be obtained through erasure (the most relevant parts of the input, according to the explanation, are removed). We therefore follow this evaluation approach, similar to Serrano and Smith (2019), Atanasova et al. (2020) and Nguyen (2018).

Attention-based Importance Metrics
We opt for using the following three input importance metrics from Serrano and Smith (2019):
• α: Importance ranking corresponding to the normalised attention scores.
• ∇α: Ranks tokens in descending order of the gradient of the predicted label $\hat{y}$ with respect to each attention score $\alpha_i$, i.e. $\nabla\alpha_i = \frac{\partial \hat{y}}{\partial \alpha_i}$.
• α∇α: Scales the attention scores $\alpha_i$ by their corresponding gradients $\nabla\alpha_i$.
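As a rough illustration of how these rankings can be computed, the sketch below obtains ∇α via automatic differentiation (illustrative code; it assumes a model that returns both logits and attention weights, as in the earlier sketches):

```python
import torch

def attention_importance(model, token_ids, metric="alpha_grad_alpha"):
    """Rank tokens by alpha, grad(alpha) or alpha * grad(alpha) (illustrative sketch).

    Assumes the model returns (logits, alpha), as in the earlier sketches.
    """
    logits, alpha = model(token_ids)
    alpha.retain_grad()                        # keep gradients of a non-leaf tensor
    predicted_score = logits.max(dim=-1).values.sum()
    predicted_score.backward()                 # gives d y_hat / d alpha_i
    grad_alpha = alpha.grad

    if metric == "alpha":
        scores = alpha
    elif metric == "grad_alpha":
        scores = grad_alpha
    else:                                      # alpha * grad(alpha)
        scores = alpha * grad_alpha
    # Descending ranking of token positions by importance.
    return scores.argsort(dim=-1, descending=True)
```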

Faithfulness Metrics
Decision Flip - Most Informative Token: The average percentage of decision flips (i.e. changes in model prediction) that occur in the test set when removing the token with the highest importance.
Decision Flip - Fraction of Tokens: The average fraction of tokens that must be removed to cause a decision flip in the test set. Note that we conduct all experiments at the input level (i.e. by removing the token from the input sequence instead of only removing its corresponding attention weight), as we consider the scores from importance metrics to pertain to the corresponding input token, following related work (Arras et al., 2016, 2017; Nguyen, 2018; Vashishth et al., 2019; Grimsley et al., 2020; Atanasova et al., 2020).
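A simplified sketch of this erasure-based evaluation is shown below (illustrative code; masking removed tokens with a padding id is our simplification of input-level token removal):

```python
import torch

def fraction_of_tokens_for_flip(model, token_ids, ranking, pad_id=0):
    """Remove tokens in order of importance until the prediction flips (sketch).

    `ranking` holds the token positions of a single instance sorted by descending
    importance; `pad_id` is an assumed placeholder used to mask removed tokens.
    Returns the fraction of tokens removed, or 1.0 if no decision flip occurs.
    The "most informative token" metric corresponds to a flip at k == 1.
    """
    with torch.no_grad():
        original_pred = model(token_ids)[0].argmax(dim=-1)
        reduced = token_ids.clone()
        for k, position in enumerate(ranking, start=1):
            reduced[0, position] = pad_id                 # erase the next most important token
            new_pred = model(reduced)[0].argmax(dim=-1)
            if new_pred != original_pred:
                return k / token_ids.size(1)
    return 1.0
```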

Predictive Performance
A prerequisite of interpretability is to obtain robust explanations without sacrificing predictive performance (Lipton, 2016). Table 2 shows the macro F1 scores of all models across datasets, encoders and attention mechanisms using the three TaSc variants (Lin-TaSc, Feat-TaSc and Conv-TaSc, described in Section 4) and without TaSc (No-TaSc); model hyper-parameters and preprocessing steps are detailed in Appendix A. In general, all TaSc models obtain comparable performance to, and in some cases outperform, No-TaSc across datasets and attention mechanisms. Lower predictive performance is observed with BERT on MIMIC, as BERT accepts a maximum of 512 word pieces as input (see Appendix A). However, our main aim is not to improve predictive performance but the faithfulness of attention-based explanations, which we illustrate below.

Decision Flip: Most Informative Token
Table 3 and Figure 1 present the mean average percentage of decision flips (higher is better) across attention mechanisms, encoders and datasets when removing the most informative token, for TaSc variants and No-TaSc across all attention-based importance metrics (see Section 5). In Table 3, we observe that TaSc variants are effective in identifying the single most important token, outperforming No-TaSc in 12 out of 18 cases across attention-based importance metrics. This suggests that the attention mechanisms benefit from the non-contextualised information encapsulated in TaSc when allocating importance to the input tokens. Models using Tanh without TaSc appear to produce on average a higher percentage of decision flips compared to those using the Dot mechanism. Using either of the TaSc variants improves both mechanisms, with the Dot mechanism benefiting the most, making it comparable to Tanh. For example, Dot moves from 8.2% with No-TaSc to 11.8% with Lin-TaSc, which is closer to the 14.0% achieved by Lin-TaSc with Tanh (for α∇α).
The first row of Figure 1 presents a comparison across encoders. TaSc variants achieve improved performance over No-TaSc across all encoder variants with ∇α and α∇α. All TaSc variants yield comparable results, with the exception of Conv-TaSc with BERT. Results further suggest that non-recurrent encoders (MLP, CNN) without TaSc outperform recurrent encoders (LSTM, GRU) and BERT, which has the poorest performance. We hypothesise that this is due to the attention module becoming more important without feature contextualisation, which is similar to the findings of Serrano and Smith (2019) and Wiegreffe and Pinter (2019). However, we observe that using any of the TaSc variants across encoders results in improvements, with LSTM and GRU becoming comparable to MLP and CNN. For example, BERT improves from 5.7% without TaSc to 8.0% (a 1.4x relative improvement) with Lin-TaSc and 9.3% (a 1.6x relative improvement) with Feat-TaSc (for α∇α).
Observing results in the second row of Figure 1, we see that TaSc variants outperform No-TaSc in all datasets when using ∇α and α∇α. This highlights the robustness of TaSc, as improvements are irrespective of the dataset. In general, Lin-TaSc and Feat-TaSc perform equally well; however, Lin-TaSc has the smallest number of parameters amongst the three variants. Similar to the findings of Serrano and Smith (2019), the best results overall, irrespective of the use of TaSc, are obtained using α∇α to rank importance.

Decision Flip: Fraction of Tokens
Providing one token (i.e. the most informative) as an explanation is not always a realistic approach to assessing faithfulness. In our second experiment, we test TaSc by measuring the fraction of important tokens required to be removed to cause a decision flip (i.e. change the model's prediction). Table 4 and Figure 2 show the mean average fraction of tokens required to be removed to cause a decision flip (lower is better) across attention mechanisms, encoders and datasets for all importance metrics. In Table 4, we see that attention-based explanations from models trained with any of the TaSc mechanisms require on average a lower fraction of tokens to cause a decision flip compared to No-TaSc (in 17 out of 18 cases). Overall, Lin-TaSc achieves higher or comparable relative improvements over Conv-TaSc and Feat-TaSc in 5 out of 6 cases.
We present a comparison across encoders in the first row of Figure 2. All three TaSc variants obtain comparable performance, with the exception of Conv-TaSc with BERT. We hypothesise that with BERT, Conv-TaSc fails to capture interactions between embedding dimensions, perhaps due to the higher contextualisation of BERT embeddings (i.e. they contain more duplicate information). Similarly to the previous experiment, results suggest that non-recurrent encoders (MLP and CNN) without TaSc outperform the remaining encoders, with BERT having the worst performance. This strengthens our hypothesis that attention becomes more important to a model with reduced contextualisation. When using TaSc, performance across all encoders becomes comparable, with the exception of BERT. For example, GRU improves from .43 with No-TaSc to .16 with Lin-TaSc, .17 with Feat-TaSc and .18 with Conv-TaSc (for α∇α).
The second row of Figure 2 presents results across datasets. All three TaSc mechanisms manage to outperform vanilla attention. Lin-TaSc and Feat-TaSc perform comparably, with the former having a slight edge, obtaining the highest relative improvements in 3 out of 5 datasets with α∇α. For example, in ADR, No-TaSc requires on average .77 of all tokens to be removed for a decision flip to occur, compared to .34 for Lin-TaSc (for α∇α). The benefits of TaSc become evident when considering longer sequences. For example, in MIMIC, Lin-TaSc requires on average 44 tokens to cause a decision flip compared to 220 for No-TaSc.

Robustness Analysis
We also perform a detailed comparison between the best performing TaSc variant (Lin-TaSc) and vanilla attention (No-TaSc) across all test instances. Figure 3 shows box-plots of the median fraction of tokens required to be removed to cause a decision flip when ranking tokens by each of the three importance metrics. For brevity, we present results for four cases.
We notice that the median fraction of tokens required to cause a decision flip for Lin-TaSc using α is higher compared to No-TaSc in certain cases. However, Lin-TaSc results in consistently lower medians (with substantially reduced variances) compared to No-TaSc using ∇α and α∇α, which are more effective importance metrics. This is particularly visible in ADR using BERT, where the 25th and 75th percentiles are much closer to the median values compared to No-TaSc. Reduced variances suggest that explanation faithfulness across instances remains consistent.

Figure 3: Fraction of tokens required to cause a decision flip for attention without TaSc and attention with Lin-TaSc (lower and narrower is better).

Comparing TaSc with Non-attention Input Importance Metrics
We finally compare explanations provided by using Lin-TaSc and α∇α to three standard non-attention input importance metrics without TaSc, which are strong baselines for explainability (Nguyen, 2018; Atanasova et al., 2020).
Word Omission (WO) (Robnik-Šikonja and Kononenko, 2008; Nguyen, 2018): Ranks input words by computing the difference between the probabilities of the predicted class when including a word $i$ and when omitting it: $\mathrm{WO}_i = p(\hat{y}|x) - p(\hat{y}|x_{\setminus x_i})$.

InputXGrad (x∇x) (Kindermans et al., 2016; Atanasova et al., 2020): Ranks words by multiplying the input by the gradient of the predicted class with respect to the input: $x_i \nabla x_i$, where $\nabla x_i = \frac{\partial \hat{y}}{\partial x_i}$.

Integrated Gradients (IG) (Sundararajan et al., 2017): Ranks words by computing the integral of the gradients taken along a straight path from a baseline input to the original input, where the baseline is the zero embedding vector.
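For illustration, minimal sketches of the Word Omission and InputXGrad baselines described above are given below (illustrative code; `forward_from_embeddings` is a hypothetical helper for re-running the model from precomputed embeddings and is not part of any specific library):

```python
import torch

def input_x_grad(model, token_ids):
    """InputXGrad sketch: score_i = x_i . (d y_hat / d x_i), taken at the embedding layer.

    Assumes the model exposes `embedding` and a hypothetical `forward_from_embeddings`
    helper that runs the rest of the network from precomputed embeddings.
    """
    e = model.embedding(token_ids)
    e.retain_grad()
    logits, _ = model.forward_from_embeddings(e)       # hypothetical helper
    logits.max(dim=-1).values.sum().backward()
    return (e * e.grad).sum(dim=-1)                    # one importance score per token


def word_omission(model, token_ids, pad_id=0):
    """Word Omission sketch: WO_i = p(y_hat | x) - p(y_hat | x without x_i)."""
    with torch.no_grad():
        probs = torch.softmax(model(token_ids)[0], dim=-1)
        pred = probs.argmax(dim=-1)
        scores = []
        for i in range(token_ids.size(1)):
            reduced = token_ids.clone()
            reduced[0, i] = pad_id                     # omit word i
            probs_i = torch.softmax(model(reduced)[0], dim=-1)
            scores.append((probs[0, pred] - probs_i[0, pred]).item())
    return torch.tensor(scores)
```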

Comparison Results
Table 5 shows the results on decision flip (fraction of tokens removed), comparing the best performing attention-based importance metric (α∇α) with Lin-TaSc to No-TaSc models with WO, x∇x and IG importance metrics across all encoders and datasets. We observe that using α∇α with TaSc to rank word importance requires on average a lower fraction of tokens to cause a decision flip compared to WO, x∇x and IG without TaSc. We outperform the other explanation approaches in 40 out of 50 cases, whilst obtaining comparable performance in another 5 cases. This demonstrates the efficacy of TaSc in providing more faithful attention-based explanations than strong baselines without TaSc (Nguyen, 2018; Atanasova et al., 2020). The improvements are particularly evident when using BERT as an encoder.

Table 5: Average fraction of tokens required to cause a decision flip using the best performing attention-based ranking (α∇α) with TaSc, Word Omission without TaSc (WO), InputXGrad without TaSc (x∇x) and Integrated Gradients without TaSc (IG).

In IMDB, WO with Tanh requires on average .23 of the tokens to be removed for a decision flip, compared to just .07 for α∇α with TaSc. We also observe that the attention-based importance metric (α∇α) with TaSc is a more robust explanation technique than non-attention based ones, obtaining lower variance in the fraction of tokens required to cause a decision flip across encoders. For example, α∇α with TaSc and Tanh requires a fraction of tokens in the range of .01-.05, compared to IG, which requires .02-.43 in MIMIC, showing the consistency of our proposed approach.
Finally, we observe that TaSc also consistently improves non-attention based explanation approaches (WO, x∇x and IG), requiring a lower fraction of tokens to be removed compared to No-TaSc across encoders, datasets and attention mechanisms in the majority of cases (see full results in Appendix E).

Qualitative Analysis
We finally examine qualitatively what type of information the parameter $u$ from Lin-TaSc learns. Similar to a bag-of-words model, our initial hypothesis is that $u$ will assign high scores to the words that are most relevant to the task. Figure 4 illustrates the 5 highest and lowest scored words from the IMDB and ADR datasets, with an LSTM encoder and Dot attention, and a CNN encoder and Tanh attention, respectively. For brevity, we include two examples; however, similar observations hold across other configurations (e.g. encoders, datasets) and when increasing the number of top-k words.
We first observe in Figure 4a that words expressing sentiment, either positive or negative, are indeed assigned high scores (e.g. excellent, waste, perfect). However, a positive or negative sign does not correspond to supporting the positive or negative class respectively. For example, withdrawal in ADR can be considered relevant to the positive class, yet it is negatively scored. Also, sick can be considered a withdrawal symptom, which is relevant to the negative class, yet it is positively scored. We speculate that this happens due to the complex non-linear relationships between the input words and the target classes learned by the model.

Conclusion
We introduced TaSc, a family of three encoder-independent mechanisms that induce context-independent, task-specific information into attention. We conducted an extensive series of experiments showing the superiority of TaSc over vanilla attention in improving the faithfulness of attention-based interpretability without sacrificing predictive performance. Finally, we showed that attention-based explanations with TaSc outperform other interpretability techniques. For future work, we will explore the effectiveness of TaSc in sequence-to-sequence tasks, similar to Vashishth et al. (2019).