Is Sparse Attention more Interpretable?

Sparse attention has been claimed to increase model interpretability under the assumption that it highlights influential inputs. Yet the attention distribution is typically over representations internal to the model rather than the inputs themselves, suggesting this assumption may not have merit. We build on the recent work exploring the interpretability of attention; we design a set of experiments to help us understand how sparsity affects our ability to use attention as an explainability tool. On three text classification tasks, we verify that only a weak relationship between inputs and co-indexed intermediate representations exists—under sparse attention and otherwise. Further, we do not find any plausible mappings from sparse attention distributions to a sparse set of influential inputs through other avenues. Rather, we observe in this setting that inducing sparsity may make it less plausible that attention can be used as a tool for understanding model behavior.


Introduction
Interpretability research in natural language processing (NLP) is becoming increasingly important as complex models are applied to more and more downstream decision-making tasks. In light of this, many researchers have turned to the attention mechanism, which has not only led to impressive performance improvements in neural models, but has also been claimed to offer insights into how models make decisions. Specifically, a number of works imply or directly state that one may inspect the attention distribution to determine the amount of influence each input token has in a model's decision-making process (Xie et al., 2017; Mullenbach et al., 2018; Niculae et al., 2018, inter alia).
Many lines of work have gone on to exploit this assumption when building their own "interpretable" models or analysis tools (Yang et al., 2016; Tu et al., 2016; De-Arteaga et al., 2019); one subset has even tried to make models with attention more interpretable by inducing sparsity, a common attribute of interpretable models (Lipton, 2018; Rudin, 2019), in attention weights, with the motivation that this allows model decisions to be mapped to a limited number of items (Martins and Astudillo, 2016; Malaviya et al., 2018; Zhang et al., 2019). Yet there is no concrete reasoning or evidence that sparse attention weights lead to more interpretable models: customarily, attention is not directly over the model's inputs, but rather over some representation internal to the model, e.g., the hidden states of a recurrent network or the contextual embeddings of a Transformer (see Fig. 1). Importantly, these internal representations do not solely encode information from the input token they are co-indexed with (Salehinejad et al., 2017; Brunner et al., 2020), but rather from a range of inputs. This raises the question: if internal representations themselves may not be interpretable, can we actually deduce anything from "interpretable" attention weights?
We build on the recent line of work challenging the validity of attention-as-explanation methods (Jain and Wallace, 2019; Serrano and Smith, 2019; Grimsley et al., 2020, inter alia) and specifically examine how sparsity affects their observations. To this end, we introduce a novel entropy-based metric that measures the dispersion of inputs' influence, rather than just their magnitudes. Through experiments on three text classification tasks, utilizing both LSTM- and Transformer-based models, we observe how sparse attention affects the results of Jain and Wallace (2019) and Wiegreffe and Pinter (2019), additionally exploring whether it allows us to identify a core set of inputs that are important to models' decisions. We find we are unable to identify such a set when using sparse attention; rather, it appears that encouraging sparsity may simultaneously encourage a higher degree of contextualization in intermediate representations.
We further observe a decrease in the correlation between the attention distribution and input feature importance measures, which exacerbates issues found by prior works. The primary conclusion of our work is that we should not believe sparse attention enhances model interpretability until we have concrete reasons to believe so; in this preliminary analysis, we do not find any such evidence.

Attention-based Neural Networks
We consider inputs x = x_1 ⋯ x_n ∈ V^n of length n, where the tokens are taken from an alphabet V. We denote the embedding of x, e.g., its one-hot encoding or (more commonly) a linear transformation of its one-hot encoding with an embedding matrix E ∈ R^{d×|V|}, as X^{(e)} ∈ R^{d×n}. Our embedded input X^{(e)} is then fed to an encoder, which produces n intermediate representations h_1, …, h_n ∈ R^m, where m is a hyperparameter of the encoder. This transformation is quite architecture dependent.
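For concreteness, the embedding step can be sketched as follows (a toy numpy example; the dimensions and token indices are hypothetical, not tied to the paper's models). Embedding a token sequence amounts to selecting columns of E:

```python
import numpy as np

rng = np.random.default_rng(0)
d, V_size, n = 4, 10, 3           # toy values: embedding size, |V|, input length
E = rng.normal(size=(d, V_size))  # embedding matrix E ∈ R^{d×|V|}
x = [3, 1, 7]                     # token indices for an input of length n
X_e = E[:, x]                     # embedded input X^{(e)} ∈ R^{d×n}
```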
An alignment function A(·, ·) maps a query q and a key K to weights a^(t) for a decoding time step t; we subsequently drop t for simplicity. In colloquial terms, A chooses which values of K should receive the most attention based on q, which is then represented in the vector a^(t) ∈ R^n. For the NLP tasks we consider, we have K = I = [h_1; …; h_n], the encoder outputs. A query q may be, e.g., a representation of the question in question answering.
The weights a are projected to sum to 1, which results in the attention distribution α. Mathematically, this is done via a projection onto the probability simplex using a projection function φ, e.g., softmax or sparsemax. We then compute the context vector as c = Σ_{i=1}^n α_i h_i. This context vector is fed to a decoder, whose structure is again architecture dependent, which generates a (possibly unnormalized) probability distribution over the set of labels Y, where Y is defined by the task.
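The alignment-projection-context pipeline can be illustrated with a minimal numpy sketch of the scaled dot-product variant (the dimensions, the softmax projection, and the function names here are illustrative choices, not the paper's implementation):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # shift for numerical stability
    return e / e.sum()

def attend(K, q):
    """Scaled dot-product alignment a = K^T q / sqrt(m), projection to the
    attention distribution α, and context vector c = Σ_i α_i h_i."""
    m = K.shape[0]                 # hidden dimension of the h_i
    a = K.T @ q / np.sqrt(m)       # alignment weights, shape (n,)
    alpha = softmax(a)             # attention distribution on the simplex
    c = K @ alpha                  # context vector, shape (m,)
    return alpha, c

rng = np.random.default_rng(1)
K = rng.normal(size=(8, 5))        # n = 5 intermediate representations, m = 8
q = rng.normal(size=8)
alpha, c = attend(K, q)
```

The context vector c, a convex combination of the columns of K, is what the decoder actually consumes.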
Attention. We experiment with two methods of constructing an attention distribution: (1) additive attention, proposed by Bahdanau et al. (2015): A(K, q)_i = v^T tanh(W_1 K_i + W_2 q), and (2) the scaled dot-product alignment function, as in the Transformer network: A(K, q) = K^T q / √m, where v ∈ R^l and W_1, W_2 ∈ R^{l×m} are weight matrices. Note that the original (without attention) neural encoder-decoder architecture, as in Sutskever et al. (2014), can be recovered with alignment function A(·, ·) = [0, …, 0, 1], i.e., only the last of the n intermediate representations is given to the decoder.

Projection Functions. A projection function φ takes the output of the alignment function and maps it to a valid probability distribution: φ : R^n → Δ^{n−1}. The standard projection function is softmax:

    φ_soft(z)_i = exp(z_i) / Σ_{j=1}^n exp(z_j)    (1)

However, softmax leads to non-sparse solutions, as an entry φ_soft(z)_i can only be 0 if z_i = −∞. Alternatively, Martins and Astudillo (2016) introduce sparsemax, which can output sparse distributions:

    φ_sparse(z) = argmin_{p ∈ Δ^{n−1}} ||p − z||_2^2    (2)

In words, sparsemax directly maps z onto the probability simplex, which often leads to solutions on the boundary, i.e., where at least one entry of p is 0. The formulation of sparsemax in Eq. (2) does not give us an explicit medium for controlling the degree of sparsity. The α-entmax (Peters et al., 2019) and sparsegen (Laha et al., 2018) transformations fill this gap; we employ the latter:

    φ_sparseg(z) = argmin_{p ∈ Δ^{n−1}} ||p − g(z)||_2^2 − λ ||p||_2^2    (3)

where the degree of sparsity can be tuned via the hyperparameter λ ∈ (−∞, 1). Note that a larger λ encourages more sparsity in the minimizing solution.
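As a concrete illustration (a minimal numpy sketch, not the authors' implementation), sparsemax has a closed-form solution via sorting and thresholding (Martins and Astudillo, 2016), and sparsegen with the identity map g(z) = z reduces to sparsemax applied to z / (1 − λ):

```python
import numpy as np

def sparsemax(z):
    """Euclidean projection of z onto the probability simplex."""
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]          # sort scores in descending order
    k = np.arange(1, len(z) + 1)
    cssv = np.cumsum(z_sorted)
    support = 1 + k * z_sorted > cssv    # entries kept in the support
    k_z = k[support][-1]
    tau = (cssv[support][-1] - 1) / k_z  # threshold
    return np.maximum(z - tau, 0.0)

def sparsegen_lin(z, lam=0.0):
    """sparsegen with g(z) = z, which is equivalent to a rescaled sparsemax."""
    return sparsemax(np.asarray(z, dtype=float) / (1.0 - lam))
```

With lam = 0 this recovers sparsemax exactly; pushing lam toward 1 zeroes out progressively more entries, matching the role of λ in Eq. (3).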

Model Interpretability
Model interpretability and explainability have been framed in different ways (Gehrmann et al., 2019): as model understanding tasks, where (spurious) features learned by a model are identified, or as decision understanding tasks, where explanations for particular instances are produced. We consider the latter in this paper. Such tasks can be framed as generative, where models generate free-text explanations (Camburu et al., 2018; Kotonya and Toni, 2020; Atanasova et al., 2020b), or as post-hoc interpretability methods, where salient portions of the input are highlighted (Lipton, 2018; DeYoung et al., 2020; Atanasova et al., 2020a). As there does not exist a clearly superior choice for framing decision understanding for NLP tasks (Miller, 2019; Carton et al., 2020; Jacovi and Goldberg, 2020), we follow a substantial body of prior work in considering the post-hoc definition of interpretability based on local methods proposed by Lipton (2018). This definition is naturally operationalized through feature importance metrics and meta-models (Jacovi and Goldberg, 2020). Further, we acknowledge the specific requirement that an interpretable model obeys some set of structural constraints of the domain in which it is used, such as monotonicity or physical constraints (Rudin, 2019). For NLP tasks such as sentiment analysis or topic classification, such constraints may logically include the utilization of only a few key words in the input when making a decision; in this case, knowing the magnitude of the influence each input token has on a model's prediction through, e.g., feature importance metrics may suffice to verify that the model obeys such constraints. While this collective definition is limited (Doshi-Velez and Kim, 2017; Guidotti et al., 2018; Rudin, 2019), we posit that if attention cannot provide model interpretability at this level, then it likewise could not under more rigorous constraints.

Measures of Feature Importance
Gradient-Based. Gradient-based measures of feature importance (FI; Baehrens et al., 2010; Simonyan et al., 2014; Poerner et al., 2018) use the gradient of a function's output w.r.t. a feature to measure the importance of that feature. In the case of an attentional neural network for binary classification f(·), we can take the gradient of f w.r.t. the variable x and evaluate it at a specific input to gain a sense of how much influence each x_i had on the outcome ŷ = f(x). These measures are not restricted to the relationship between inputs x_i and the outcome f(x); they can also be adapted to measure effects from and to intermediate representations h_p. Formally, our measures are as follows:

    g_ŷ(x_i) = ||∇_{x_i} f(x)||_2 / Σ_{j=1}^n ||∇_{x_j} f(x)||_2    (4)

    g_{h_p}(x_i) = ||∇_{x_i} ||h_p||_2||_2 / Σ_{j=1}^n ||∇_{x_j} ||h_p||_2||_2    (5)

where g_ŷ(x_i) ∈ [0, 1] and g_{h_p}(x_i) ∈ [0, 1] represent the gradient-based FI of token x_i on ŷ and on intermediate representation h_p, respectively. Gradient-based methods are often used in explainability techniques, as they have exhibited higher correlation with human judgement than others (Atanasova et al., 2020a). Note that we take gradients w.r.t. the embedding of token x_i and that, in the latter metric, we measure the influence of x_i on the magnitude of h_p, a decision we discuss in App. A.
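The norm-then-normalize recipe of Eq. (4) can be sketched as follows. This is an illustration only: central finite differences on a hypothetical toy function f stand in for autograd on a real model, and the shapes are invented for the example:

```python
import numpy as np

def gradient_fi(f, X, eps=1e-5):
    """Per-token gradient-based FI: L2 norm of ∂f/∂x_i over the embedding
    dimension, normalized across tokens. X has shape (n, d); central finite
    differences stand in for autograd here."""
    n, d = X.shape
    norms = np.zeros(n)
    for i in range(n):
        g = np.zeros(d)
        for j in range(d):
            Xp, Xm = X.copy(), X.copy()
            Xp[i, j] += eps
            Xm[i, j] -= eps
            g[j] = (f(Xp) - f(Xm)) / (2 * eps)  # ∂f/∂X[i, j]
        norms[i] = np.linalg.norm(g)            # L2 norm per token
    return norms / norms.sum()                  # normalize to a distribution

# Toy "model": token 0 weighted twice as heavily as token 1, token 2 ignored.
f = lambda X: 2.0 * X[0].sum() + X[1].sum()
fi = gradient_fi(f, np.zeros((3, 4)))
```

For this linear toy function the resulting FI distribution is [2/3, 1/3, 0]: the ignored token receives exactly zero importance.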
Erasure-Based. As a secondary FI metric, we observe how model predictions change when a specific input token is removed (i.e., Leave-One-Out; LOO). For token x_i, this can be calculated as the (normalized) change in prediction |ŷ − ŷ_{−i}|, where ŷ_{−i} is the prediction of the model with input x_i removed. The formula can also be used for intermediate representations; we denote this as D_ŷ(h_i).
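A leave-one-out sketch (a toy example; the bag-of-words scorer predict below is hypothetical and stands in for a trained classifier):

```python
import numpy as np

def loo_importance(predict, tokens):
    """Erasure-based FI: |ŷ − ŷ_{−i}| for each token, normalized across tokens."""
    y_hat = predict(tokens)
    diffs = np.array([abs(y_hat - predict(tokens[:i] + tokens[i + 1:]))
                      for i in range(len(tokens))])
    total = diffs.sum()
    return diffs / total if total > 0 else diffs

# Toy scorer: fraction of tokens equal to the word "good".
predict = lambda toks: toks.count("good") / max(len(toks), 1)
fi = loo_importance(predict, ["good", "movie", "bad"])
```

Here removing "good" changes the toy prediction most, so it receives the largest share of the importance mass.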

Experiments
Setup. We run experiments across several model architectures, attention mechanisms, and datasets in order to understand the effects of induced attentional sparsity on model interpretability. We use three binary classification datasets: IMDb and SST (sentiment analysis) and 20News (topic classification). We use the dataset versions provided by Jain and Wallace (2019), exactly following their pre-processing steps. We show a subset of representative results here, with additional results in App. C. Further details, including model architecture descriptions, dataset statistics, and baseline accuracies, may be found in App. B.
Inputs and Intermediate Representations are not Interchangeable. We first explore how strongly related inputs are to their co-indexed intermediate representations. A strong relationship on its own may validate the use of sparse attention, as the ability to identify a subset of influential intermediate representations would then directly translate to a set of influential inputs. Previous works show that the "contribution" of a token x_i to its intermediate representation h_i is often quite low for various model architectures (Salehinejad et al., 2017; Ming et al., 2017; Brunner et al., 2020; Tutek and Snajder, 2020). In the context of attention, we find this property to be evinced by the adversarial experiments of Wiegreffe and Pinter (2019, §4) and Jain and Wallace (2019, §4), which we verify in App. C. They construct adversarial attention distributions by optimizing for divergence from a baseline model's attention distribution by: (1) adopting all of the baseline model's parameters and directly optimizing for divergence, and (2) training an entirely new model and optimizing for divergence as part of the training process. The former method leads to a large drop in performance (accuracy) while the latter does not. If we believe the model must encode the same information to achieve similar accuracy, this discrepancy implies that in the latter method, the model likely "redistributes" information across encoder outputs (i.e., intermediate representations h_p), which suggests token-level information is not tied to a particular h_p.
As further verification of the high degree of contextualization in attentional models, we report a novel quantification, offering insights into whether individual intermediate representations can be linked primarily to any single input, i.e., perhaps not the co-indexed input; we measure the normalized entropy of the gradient-based FI of inputs to intermediate representations, H(g_{h_p}(x)) ∈ [0, 1], to gain a sense of how dispersed the influence on an intermediate representation is across inputs. A value of 1 would indicate all inputs are equally influential; a value of 0 would indicate that solely a single input has influence on an intermediate representation. Results in Table 1 show consistently high entropy in the distribution of the influence of inputs x_i on an intermediate representation h_p across all datasets, model architectures, and projection functions, which suggests the relationship between intermediate representations and inputs is far from one-to-one in these tasks.

Sparse Attention ≠ Sparse Input Feature Importance. Our prior results demonstrated that, even when using sparse attention, we cannot identify a subset of influential inputs directly through intermediate representations; we explore whether a subset can still be identified through FI metrics. In the case where the normalized FI distribution highlights only a few key items, the distribution will, by definition, have low entropy. Thus, we explore whether sparse attention leads to lower-entropy input FI distributions in comparison to standard attention. We find no such trend; Fig. 2 shows that across all models and settings, the entropy of the FI distribution is quite high. Further, we see a consistent positive correlation between this entropy and the sparsity parameter of the sparsegen projection (Table 2), implying that the entropy of feature importance increases as we raise the degree of sparsity in α.
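The normalized entropy H(·) used above can be sketched as follows (a minimal numpy version; the normalization by log n follows the [0, 1] range stated in the text):

```python
import numpy as np

def normalized_entropy(p):
    """Shannon entropy of a feature importance distribution p,
    normalized by log(n) so the result lies in [0, 1]."""
    p = np.asarray(p, dtype=float)
    p = p / p.sum()          # ensure p is a valid distribution
    nz = p[p > 0]            # 0 * log(0) is taken to be 0
    return float(-(nz * np.log(nz)).sum() / np.log(len(p)))
```

A uniform FI distribution gives 1 (all inputs equally influential); a one-hot distribution gives 0 (a single influential input).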
Correlation between Attention and Feature Importance. Finally, we follow the experimental setup of Jain and Wallace (2019), who postulate that if the attention distribution indicates which inputs influence model behavior, then one may reasonably expect attention to correlate with FI measures of the input. While they find only a weak correlation, we explore how inducing sparsity in the attention distribution affects this result. Surprisingly, Fig. 3 shows a downward trend in this correlation as the sparsity parameter λ of the sparsegen projection function is increased. As argued by Wiegreffe and Pinter (2019), a lack of this correlation does not indicate that attention cannot be used as explanation; FI measures are not ground-truth indicators of critical inputs. However, the inverse relationship between input FI and attention is rather surprising. If anything, we may surmise that sparsity in α leads to less faithful explanations from α. From these results, we posit that promoting sparsity in the attention distribution may simply lead to the dispersion of information across intermediate representations, a behavior similar to that seen when constraining attention for divergence from another distribution, i.e., in the adversarial experiments of Wiegreffe and Pinter (2019) compared to those of Jain and Wallace (2019).
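For reference, the rank correlation used in this style of experiment (Jain and Wallace (2019) report Kendall's τ) can be sketched as follows; this O(n²) version is for illustration only:

```python
import numpy as np

def kendall_tau(a, b):
    """Kendall's τ-a: (concordant − discordant pairs) / total pairs."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    n = len(a)
    s = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            # +1 if the pair is ordered the same way in both lists, −1 otherwise
            s += np.sign(a[i] - a[j]) * np.sign(b[i] - b[j])
    return s / (n * (n - 1) / 2)
```

A value of 1 indicates that the attention distribution and the FI measure rank inputs identically; −1 indicates fully reversed rankings.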

Related Work
The use of attention as an indication of inputs' influence on model decisions may at first seem natural; yet a large body of work has recently challenged this practice. Perhaps the first to do so was Jain and Wallace (2019), which revealed both a lack of correlation between the attention distribution and well-established feature importance metrics and a lack of unique optimal attention weights. Serrano and Smith (2019) contemporaneously found similar results. Subsequently, other studies arrived at similar conclusions: for example, Grimsley et al. (2020) found evidence that causal explanations are not attainable from attention layers over text data; Pruthi et al. (2020) showed that attention masks can be trained to give deceptive explanations. We view this work as another such investigation, exploring attention's innate interpretability on a different axis. This work also fits into the context of a larger body of interpretability research in NLP, which has challenged the informal use of terms such as faithfulness, plausibility, and explainability (Lipton, 2018; Arrieta et al., 2020; Jacovi and Goldberg, 2021, inter alia) and tried to quantify the reliability of current definitions (Atanasova et al., 2020a). While we consider these works in our experimental design, e.g., in our choice of FI metrics, we recognize that further experiments are needed to verify our findings: for example, similar experiments could be performed using the DeYoung et al. (2020) benchmark for evaluation; other FI metrics, such as selective attention (Treviso and Martins, 2020), should additionally be considered.

Conclusion
Prior work has cited interpretability as a driving factor for promoting sparsity in attention distributions. We explore how induced sparsity affects the ability to use attention as a tool for explaining model decisions. In our experiments on text classification tasks, we see that while sparse attention distributions may allow us to pinpoint influential intermediate representations, we are unable to find any plausible mapping from sparse attention to a small, critical set of influential inputs. Rather, we find evidence that inducing sparsity may make it even less plausible to use the attention distribution to interpret model behavior. We conclude that we need further reason to believe sparse attention increases model interpretability, as our results do not support such claims.

A Feature Importance Metrics
Notably, both inputs and intermediate representations are not single variables. Intermediate representations are m-dimensional vectors, and inputs x are embedded as X^{(e)}, meaning each word x_i is represented by a d-dimensional vector. Therefore, the gradient of f w.r.t. individual inputs or intermediate representations will likewise be a d- (or m-) dimensional vector. To come up with a scalar estimate of feature importance, we take the L2-norm of the evaluated gradient (other norms, e.g., the L1-norm, would also be appropriate; we leave the exploration of these to future work). Subsequently, we normalize over all x_i (or h_i) to calculate the relative feature importance of individual x_i (or h_i). The discussed transformation is formalized in Eqs. (4) and (5). For intermediate representations, this computation measures the influence on the magnitude of h_p rather than on h_p itself. However, we also experimented with measuring the influence directly on each facet of h_p, taking the magnitude of this vector. We found empirically that the two measures returned nearly identical results, while measuring influence on magnitude was significantly more computationally efficient.

B Experimental Setup
We use the exact datasets provided by Jain and Wallace (2019) and base our experimental framework on theirs, which can be found at https://github.com/successar/AttentionExplanation.
For both comparison and reproducibility, we exactly follow their preprocessing steps, which are described in their paper. Source code, model statistics, and links to datasets can be found at the above link.
In our experiments, we use either a bidirectional LSTM encoder or a Transformer encoder with 2 layers and 1 attention head. All hidden dimensions are set to 128. The models and the training procedure are implemented using the PyTorch library (Paszke et al., 2019). For training, we use the Adam optimizer (Kingma and Ba, 2015) with the amsgrad option (Reddi et al., 2018) enabled. Some important hyperparameters are listed in Table 4; minor tuning was performed in order to reach performance comparable to Jain and Wallace (2019) and Wiegreffe and Pinter (2019). An important note regarding this table is that the listed learning rate and weight decay correspond to all model parameters except those specific to the attention mechanism; the latter we train without weight decay and with either the same or a 10x larger learning rate.

C.1 Adversarial Experiments
We construct adversarial attention distributions by optimizing for the divergence of the distribution from a baseline model's attention distribution using two methods: (1) transferring all model parameters of a pre-trained base model and optimizing for divergence (frozen), and (2) training an entirely new model and optimizing for divergence (unfrozen). We use Jensen-Shannon divergence (JSD) to measure the difference between the adversarial and baseline distributions. Table 5 shows that although we can attain high JSD under both methods, the former leads to a large drop in performance. If we believe the model must encode the same information to achieve similar accuracy, the difference in accuracies of the two methods implies that in the second (unfrozen) method, the model likely redistributes information across encoder outputs.
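The divergence measure can be sketched as follows (a minimal numpy version using the natural log; the log base is an assumption here, so treat the scale as illustrative):

```python
import numpy as np

def jsd(p, q):
    """Jensen-Shannon divergence between two attention distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)  # mixture distribution

    def kl(a, b):
        mask = a > 0  # 0 * log(0/b) contributes nothing
        return float((a[mask] * np.log(a[mask] / b[mask])).sum())

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

JSD is 0 for identical distributions and maximal (log 2 in this base) for distributions with disjoint support, which makes it a natural target when optimizing adversarial attention for divergence.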

C.2 Correlation between Attention Distribution and Inputs/Intermediate Representations
We provide the full results of our experiments on correlation of the input and intermediate representations with the attention distribution in Table 6.

Figure 1 :
Figure 1: Correlation between the attention distribution and gradient-based FI measures. We see a notably stronger correlation between attention and FI of intermediate representations than of inputs across all models.

Figure 2 :
Figure 2: Entropy of gradient-based g_ŷ(x) and LOO D_ŷ(x) FI distributions. Results are from models with the full spectrum of projection functions.

Figure 3 :
Figure 3: Correlation between the attention distribution and input FI measures as a function of the sparsity penalty λ used in the projection function φ_sparseg. The x-axis is log-scaled for λ < 0 since λ ∈ (−∞, 1). Results are from the IMDb dataset.

Table 2 :
Correlation between sparsegen parameter λ and entropy of gradient-based input FI H(g_ŷ(x)).

Table 3 :
Dataset statistics and baseline accuracy scores on test sets for the Transformer with dot-product attention (T) and the BiLSTM with additive attention (B). All datasets are in English.

Table 4 :
Hyperparameters used for training the models with the LSTM and Transformer encoders, respectively.