Measuring the Mixing of Contextual Information in the Transformer

The Transformer architecture aggregates input information through the self-attention mechanism, but there is no clear understanding of how this information is mixed across the entire model. Additionally, recent works have demonstrated that attention weights alone are not enough to describe the flow of information. In this paper, we consider the whole attention block –multi-head attention, residual connection, and layer normalization– and define a metric to measure token-to-token interactions within each layer. Then, we aggregate layer-wise interpretations to provide input attribution scores for model predictions. Experimentally, we show that our method, ALTI (Aggregation of Layer-wise Token-to-token Interactions), provides more faithful explanations and increased robustness than gradient-based methods.


Introduction
The Transformer (Vaswani et al., 2017) has become ubiquitous across multiple domains, becoming the architecture of choice for many NLP (Devlin et al., 2019; Brown et al., 2020) and computer vision (Dosovitskiy et al., 2021) tasks. The self-attention mechanism inside the Transformer is in charge of combining contextual information in its intermediate token representations. Attention weights offer a straightforward layer-wise interpretation, as they provide a distribution over input units, which is often presented as giving the relative importance of each input.
A prominent line of research has investigated the faithfulness of attention weights (Jain and Wallace, 2019; Serrano and Smith, 2019; Pruthi et al., 2020; Wiegreffe and Pinter, 2019; Madsen et al., 2021b), with contradictory conclusions. Some works have studied layer-wise attention patterns by analyzing standard attention (Kovaleva et al., 2019; Clark et al., 2019; Vig and Belinkov, 2019) and effective attention (Brunner et al., 2020; Sun and Marasović, 2021), but explaining the Transformer beyond attention weights needs further investigation (Lu et al., 2021). Kobayashi et al. (2020) extended the explainability of the model by also considering the magnitude of the vectors involved in the attention mechanism, and Kobayashi et al. (2021) went as far as incorporating the layer normalization and the skip connection in their analysis. While these works have helped better understand the layer-wise behavior of the Transformer, there is a mismatch between layer-wise attention distributions and global input attributions (Pascual et al., 2021), since intermediate layers only attend to a mix of input tokens. Brunner et al. (2020) quantified the aggregation of contextual information throughout the model with a gradient attribution method. Although they found that the self-attention mechanism greatly mixes the information of the model input, they were able to recover the token identity from hidden layers with high accuracy with a learned linear mapping. This phenomenon is partially explained by Kobayashi et al. (2021) and Lu et al. (2021), who have shown the relatively small impact of the multi-head attention, which loses influence with respect to the residual connection, consequently revealing a reduced entanglement of contextual information in BERT. Finally, Abnar and Zuidema (2020) proposed the attention rollout method, which measures the mixing of information by linearly combining attention matrices, a method that has been extended to Transformers in the visual domain (Chefer et al., 2021a,b). A drawback of this method is that it assumes an equal influence of the skip connection and the attention weights.

Table 1: Saliency maps of BERT generated by two common gradient methods and by our proposed method, ALTI, for a negative sentiment prediction example of the Yelp dataset.
In this work, we propose ALTI, an interpretability method that provides input token relevancies to the model predictions by measuring the aggregation of contextual information across layers. We use the attention block decomposition proposed by Kobayashi et al. (2021) and refine the measure of the contribution of each input token representation to the attention block output (layer-wise token-to-token interactions), based on the properties of the representation space and the limitations of previously proposed metrics. We then aggregate the layer-wise explanations and track the mixing of information in each token representation, yielding input attributions for the model predictions. Finally, in the Text Classification and Subject-Verb Agreement tasks, we show that ALTI scores higher than gradient-based methods and previous similar approaches in two common faithfulness metrics, while showing greater robustness. The code to reproduce the experiments is publicly available.

Background

Attention Block Decomposition
The attention block computations in each layer (highlighted parts in Figure 1) can be reformulated (Kobayashi et al., 2021) as a simple expression of the layer input representations. Given a sequence of token representations X = (x_1, ..., x_J) ∈ R^{d×J}, and a model with H heads and head dimension d_h = d/H, the attention block output of the i-th token, y_i, is computed by applying the layer normalization (LN) over the sum of the residual vector x_i and the output of the multi-head attention module (MHA), x̃_i:

y_i = LN(x_i + x̃_i)    (1)

Each head inside the MHA computes z_i^h ∈ R^{d_h}:

z_i^h = Σ_j A_{i,j}^h W_V^h x_j    (2)

with A_{i,j}^h referring to the attention weight where token i attends to token j, and W_V^h ∈ R^{d_h×d} to a learned weight matrix. x̃_i is calculated by concatenating each z_i^h and projecting the joint vector through the output matrix W_O:

x̃_i = W_O [z_i^1; ...; z_i^H] + b_O    (3)

This is equivalent to a sum over heads, where each W_O^h ∈ R^{d×d_h} is the slice of W_O acting on head h:

x̃_i = Σ_h W_O^h z_i^h + b_O    (4)

By swapping summations we can now rewrite Eq. 1 as:

y_i = LN(x_i + Σ_j Σ_h A_{i,j}^h W_O^h W_V^h x_j + b_O)    (5)

Given a vector u, LN(u) can be reformulated as (1/σ(u)) L u + β (see Appendix A), where L is a linear transformation. Thanks to the linearity of L, we can express y_i as:

y_i = Σ_j T_i(x_j) + (1/σ(x_i + x̃_i)) L b_O + β    (6)

where the transformed vectors T_i(x_j) are:

T_i(x_j) = (1/σ(x_i + x̃_i)) L (Σ_h A_{i,j}^h W_O^h W_V^h x_j + δ_{i,j} x_i)

with δ_{i,j} = 1 if i = j and 0 otherwise, accounting for the residual connection. Kobayashi et al. (2021) stated that the contribution c_{i,j} of each input vector x_j to the layer output y_i can be estimated by how much its transformed vector T_i(x_j) affects the result in Eq. 6. They propose using the Euclidean norm of the transformed vector as the metric of contribution:

c_{i,j} = ||T_i(x_j)||_2    (7)

Figure 1: Transformer layer with the modules considered in the analysis. We compute layer-wise token-to-token interactions by measuring the contributions of each input token representation x_j to the attention block output y_i.
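To make the decomposition concrete, the following NumPy sketch (our own code, with our own variable names, not the authors' implementation) builds a toy attention block with random weights and checks numerically that the output y_i equals the sum of the transformed vectors T_i(x_j) plus the bias terms of Eq. 6. For simplicity it omits the layer-norm epsilon and any attention masking.

```python
import numpy as np

rng = np.random.default_rng(0)
J, d, H = 4, 8, 2            # tokens, model dimension, heads
dh = d // H

X = rng.normal(size=(J, d))                       # token representations x_j
A = rng.random(size=(H, J, J))                    # attention weights A^h_{i,j}
A /= A.sum(-1, keepdims=True)                     # each row is a distribution
Wv = rng.normal(size=(H, dh, d)) * 0.1            # W_V^h
Wo = rng.normal(size=(H, d, dh)) * 0.1            # per-head slice W_O^h of W_O
bo = rng.normal(size=d) * 0.1                     # output bias b_O
gamma, beta = rng.normal(size=d), rng.normal(size=d)

def layer_norm(u):
    return gamma * (u - u.mean()) / u.std() + beta

# Direct computation: y_i = LN(x_i + MHA_i)  (Eq. 1)
mha = np.stack([sum(Wo[h] @ (A[h, i] @ (X @ Wv[h].T)) for h in range(H)) + bo
                for i in range(J)])
Y = np.stack([layer_norm(X[i] + mha[i]) for i in range(J)])

# Decomposition: L = diag(gamma)(I - 11^T/d), so LN(u) = L u / sigma(u) + beta
L = np.diag(gamma) @ (np.eye(d) - np.ones((d, d)) / d)
T = np.zeros((J, J, d))                           # transformed vectors T_i(x_j)
for i in range(J):
    sigma = (X[i] + mha[i]).std()
    for j in range(J):
        v = sum(A[h, i, j] * (Wo[h] @ Wv[h] @ X[j]) for h in range(H))
        if i == j:
            v = v + X[i]                          # residual enters T_i(x_i)
        T[i, j] = L @ v / sigma
    # Eq. 6: y_i = sum_j T_i(x_j) + L b_O / sigma + beta
    assert np.allclose(T[i].sum(0) + L @ bo / sigma + beta, Y[i])
```

The assertion passing for every position confirms that the attention block output is exactly a sum of per-input transformed vectors plus input-independent bias terms.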

Attention Rollout
Abnar and Zuidema (2020) proposed to measure the mixing of contextual information across the model by relying on attention weights, creating an "attention graph" where nodes represent tokens and hidden representations, and edges represent attention weights. Two nodes in different layers are connected through multiple paths. To account for the residual connection, the attention weights matrix gets augmented with the identity matrix: Â^l = 0.5 A^l + 0.5 I.
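As a small sketch (our own code, not the authors'), rolling out the residual-augmented attention matrices across layers reduces to a chain of matrix products; since each augmented matrix is row-stochastic, the result keeps a probability distribution over input tokens in every row:

```python
import numpy as np

def attention_rollout(attentions):
    """attentions: list of (J, J) head-averaged attention matrices, layers 1..L.
    Returns R^L, where R^l = A_hat^l @ R^{l-1} and A_hat^l = 0.5 A^l + 0.5 I."""
    J = attentions[0].shape[0]
    R = np.eye(J)
    for A in attentions:
        R = (0.5 * A + 0.5 * np.eye(J)) @ R
    return R

# toy example: two layers of random row-stochastic attention
rng = np.random.default_rng(1)
A = rng.random((2, 5, 5))
A /= A.sum(-1, keepdims=True)
R = attention_rollout([A[0], A[1]])
# rows of R remain probability distributions over the input tokens
assert np.allclose(R.sum(-1), 1.0)
```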
We can compute the amount of information flowing from one node to another in different layers by multiplying the edges in each path, and summing over the different paths. In the example of Figure 2, the amount of input information of [CLS] in its second layer representation x_1^2 can be obtained as Σ_k Â_{1,k}^2 Â_{k,1}^1. This is equivalent to the dot product between Â_{1,:}^2 and Â_{:,1}^1, which generalizes to a matrix multiplication when considering all tokens, giving the input relevance matrix at layer l, R^l ∈ R^{J×J}:

R^l = Â^l R^{l-1},  with R^1 = Â^1    (8)

Proposed Approach

The decomposition of the attention block, represented as a sum of vectors in Eq. 6, allows us to interpret token-to-token interactions within each layer. Kobayashi et al. (2021) proposed to measure the influence of each input token with the ℓ2 norm of the transformed vectors (Eq. 7). We present two reasons why this estimation may not be accurate:

1. A property of the contextual representations in Transformer-based models is that they are highly anisotropic (Ethayarajh, 2019), i.e. the expected cosine similarity of randomly sampled token representations tends to be close to 1 (solid lines in Figure 3). However, transformed representations exhibit reduced anisotropy, especially in the first layers, where they are almost isotropic (dashed lines in Figure 3), i.e. they are more randomly spread across the space. This reinforces the need to account for the vectors' orientation in space, as opposed to solely relying on their norm.

2. Token representations contain a small number of idiosyncratic dimensions with unusually large magnitudes, which can dominate ℓ2-based measurements.
We can analyze the expression in Eq. 6 as T_i(x_j) vectors contributing to the resultant sum y_i. We propose to measure how much each transformed vector contributes to the sum by means of its distance to the output vector y_i. We expect that the closer the vector is to y_i, the higher its contribution (Figure 4). In this way, we take into account where each transformed vector lies in the representation space (Reason 1). Due to its robustness to the aforementioned idiosyncratic dimensions (Reason 2), we use the ℓ1 norm, i.e. the Manhattan distance between the attention block output and the transformed vector:

d_{i,j} = ||y_i − T_i(x_j)||_1    (9)

The level of contribution of x_j to y_i, c_{i,j}, is proportional to the proximity of T_i(x_j) to y_i: the closer the transformed vector is to y_i, the larger its contribution. We measure proximity as the negative of the Manhattan distance, −d_{i,j}. Finally, we neglect the contributions of those vectors lying beyond the ℓ1 length of y_i:

c_{i,j} = max(0, ||y_i||_1 − d_{i,j})    (10)

Computing Eq. 9 and Eq. 10 for all y_i gives us the contributions matrix C ∈ R^{J×J} containing every token-to-token interaction within the layer. We also propose to consider contributions across the model as a "contribution graph", similar to the attention graph in Section 2.2 but using the obtained contributions instead of attention weights. We can then track the amount of contextual information from the input tokens in intermediate token representations, which we use as input attribution scores. By linearly combining the contribution matrices up to layer l (Figure 5 bottom) we get:

R^l = C^l R^{l−1},  with R^1 = C^1    (11)
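Eqs. 9-11 can be sketched as follows (our own code; we additionally row-normalize C so that each layer's contributions form a distribution before the rollout-style aggregation, which we assume is the intended convention):

```python
import numpy as np

def layer_contributions(T, y):
    """T: (J, J, d) transformed vectors T_i(x_j); y: (J, d) attention block outputs.
    Returns the contributions matrix C of Eqs. 9-10, row-normalized."""
    d1 = np.abs(y[:, None, :] - T).sum(-1)                       # d_{i,j}  (Eq. 9)
    c = np.maximum(0.0, np.abs(y).sum(-1, keepdims=True) - d1)   # clip     (Eq. 10)
    return c / c.sum(-1, keepdims=True)

def aggregate(contributions):
    """Rollout-style aggregation of per-layer contribution matrices (Eq. 11)."""
    R = contributions[0]
    for C in contributions[1:]:
        R = C @ R
    return R

# toy check: transformed vectors close to y_i receive most of the credit
rng = np.random.default_rng(2)
J, d = 4, 6
y = rng.normal(size=(J, d))
T = y[:, None, :] / J + 0.01 * rng.normal(size=(J, J, d))
C = layer_contributions(T, y)
assert np.allclose(C.sum(-1), 1.0)
```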

Experimental Setup
We perform our experiments on the Text Classification (TC) and Subject-Verb Agreement (SVA) tasks. The former evaluates how models classify an entire input sequence; the latter assesses the ability of a model to capture syntactic phenomena (Linzen et al., 2016; Goldberg, 2019). For the TC task, we use the Stanford Sentiment Treebank v2 (SST-2) (Socher et al., 2013), IMDB, and Yelp Polarity datasets.
In the TC task, we use fine-tuned models provided by TextAttack (Morris et al., 2020). For the robustness analysis in Section 5.2, we fine-tune 10 pre-trained BERT models on SST-2 with the hyperparameters recommended in Devlin et al. (2019). We compute attribution scores from the row R_{[CLS]}^L ∈ R^J (Figure 5 bottom) that corresponds to the final layer [CLS] embedding, considered a sentence representation for classification tasks. Regarding the SVA task, we split the Linzen et al. (2016) dataset into 60%/20%/20% for training, validation, and testing respectively, and fine-tune a pre-trained BERT model. We use the input relevances of R_{[MASK]}^L.

Faithfulness Metrics
An interpretation is considered to be faithful if it accurately reflects a model's decision-making process. A well-established method for measuring faithfulness is to delete parts of the input sentence x and observe the change in the predicted probability. Two common erasure-based metrics are comprehensiveness (comp.) and sufficiency (suff.) (DeYoung et al., 2020). Chan et al. (2022) have demonstrated that they have higher diagnosticity, i.e. they favor faithful interpretations over randomly generated ones, and lower time complexity than other well-known faithfulness metrics. Comprehensiveness and sufficiency are defined as follows.

Comprehensiveness. Measures the change in probability of the predicted class after removing important tokens:

comp = (1/|B|) Σ_{k∈B} [ p(ŷ | x) − p(ŷ | x \ r_{:k%}) ]    (12)

where r_{:k%} refers to the top-k% most important tokens obtained by an interpretability method. The higher the drop in probability, the more faithful the interpretation.

Sufficiency. Captures whether the important tokens are enough to retain the original prediction:

suff = (1/|B|) Σ_{k∈B} [ p(ŷ | x) − p(ŷ | r_{:k%}) ]    (13)

Lower values of sufficiency indicate a more faithful interpretation, since in that case the prediction doesn't change when considering only the important tokens. As in the original paper, for both metrics we use B = {0, 5, 10, 20, 50}.
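Both metrics reduce to averaged probability differences once the model probabilities under the perturbed inputs are available; a minimal sketch (our own code, with hypothetical probability values):

```python
def comprehensiveness(p_full, p_deleted):
    """p_full: probability of the predicted class on the full input.
    p_deleted: probabilities after removing the top-k% tokens, one per k in B."""
    return sum(p_full - p for p in p_deleted) / len(p_deleted)

def sufficiency(p_full, p_kept):
    """p_kept: probabilities when keeping only the top-k% tokens, one per k in B."""
    return sum(p_full - p for p in p_kept) / len(p_kept)

# a faithful interpretation: deleting the important tokens destroys the prediction...
assert abs(comprehensiveness(0.9, [0.9, 0.5, 0.3, 0.2, 0.1]) - 0.5) < 1e-9
# ...while keeping only them preserves it (lower sufficiency is better)
assert abs(sufficiency(0.9, [0.1, 0.8, 0.9, 0.9, 0.9]) - 0.18) < 1e-9
```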

Input Attribution Methods
Input attribution methods rank input tokens in accordance with how they impact model predictions. They can be divided into gradient-based methods, perturbation-based methods, and those relying on the attention mechanism. The gradient of the model's output with respect to the input embeddings is often used as a baseline of faithfulness interpretation (Jain and Wallace, 2019). Atanasova et al. (2020) and Zaman and Belinkov (2022) show that gradient-based methods perform better than other interpretability methods regarding different faithfulness metrics. Finally, perturbation-based methods (Zeiler and Fergus, 2014; Ribeiro et al., 2016) compute attributions by replacing the original sentence with a modification. Zaman and Belinkov (2022) show that erasure-based metrics, such as comprehensiveness and sufficiency, favor perturbation-based methods, which they attribute to noise from the out-of-distribution perturbations.
Gradient. Considering the model f taking as input a sequence of embeddings X^0 ∈ R^{d×J}, f can be approximated by the linear part of the Taylor expansion at a baseline point (Simonyan et al., 2014). The gradient gives a score per embedding dimension, often interpreted as how sensitive the model is to each input dimension when predicting a certain class. To get per-token saliency scores (Li et al., 2016), we obtain the gradient vector corresponding to the j-th token and aggregate it into a scalar using the ℓ2 norm (Grad ℓ2):

attr(x_j) = || ∇_{x_j^0} f(X^0) ||_2    (14)

Recently, Bastings et al. (2021) showcased (in BERT and SST-2) the high degree of faithfulness of the Grad ℓ2 method.
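A model-agnostic sketch of Grad ℓ2 (our own code, using finite differences in place of automatic differentiation; a real implementation would backpropagate through the network):

```python
import numpy as np

def numerical_grad(f, X, eps=1e-5):
    """Central-difference gradient of the scalar output f(X) w.r.t. embeddings X (J, d)."""
    g = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        Xp, Xm = X.copy(), X.copy()
        Xp[idx] += eps
        Xm[idx] -= eps
        g[idx] = (f(Xp) - f(Xm)) / (2 * eps)
    return g

def grad_l2(f, X):
    """One saliency score per token: the l2 norm of its gradient vector (Eq. 14)."""
    return np.linalg.norm(numerical_grad(f, X), axis=-1)

# toy "model": a linear scorer with weight vector w, so every token's gradient is w
w = np.array([3.0, 4.0])
f = lambda X: float((X @ w).sum())
X = np.arange(6, dtype=float).reshape(3, 2)
assert np.allclose(grad_l2(f, X), 5.0, atol=1e-4)   # ||w||_2 = 5 for every token
```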
Gradient × input. This method (Shrikumar et al., 2016) multiplies the gradient by the corresponding input embedding: each component of the gradient vector gets multiplied by the corresponding component of the embedding. Following Atanasova et al. (2020) and Zaman and Belinkov (2022), we aggregate the component scores into a single scalar by taking the ℓ2 norm (G×I ℓ2) as in Eq. 14, or by taking the mean (G×I µ):

attr(x_j) = mean( ∇_{x_j^0} f(X^0) ⊙ x_j^0 )    (15)

Integrated Gradients. Integrated gradients (Sundararajan et al., 2017) approximates the integral of the gradients of the model's output with respect to the inputs along the straight-line path from a baseline input B to the actual input. The attribution score for each embedding dimension is defined as:

IG(x_j^0) = (x_j^0 − b_j) ⊙ Σ_{k=1}^m (1/m) ∇_{x_j} f(B + (k/m)(X^0 − B))    (16)

where b_j is the j-th baseline embedding and m is the number of steps. As baseline, we use repeated [MASK] vectors for each word except for [CLS] and [SEP] (Sajjad et al., 2021), and 100 steps. We aggregate (IG ℓ2 and IG µ) the attribution scores of the embedding dimensions of Eq. 16 to obtain attr(x_j).
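Integrated gradients can be sketched with any gradient oracle. This toy (our own code, using a midpoint Riemann approximation and an analytic toy model rather than a Transformer) verifies the completeness property, i.e. the attributions sum to f(x) − f(baseline):

```python
import numpy as np

def integrated_gradients(grad_f, x, baseline, m=100):
    """Eq. 16 with a midpoint Riemann sum; grad_f returns the gradient of the scalar output."""
    alphas = (np.arange(m) + 0.5) / m
    grads = np.stack([grad_f(baseline + a * (x - baseline)) for a in alphas])
    return (x - baseline) * grads.mean(0)

# toy model f(x) = sum(x^2), with analytic gradient 2x and a zero baseline
x = np.array([1.0, -2.0, 3.0])
attr = integrated_gradients(lambda v: 2 * v, x, np.zeros_like(x))
assert np.allclose(attr, x ** 2)   # completeness: attributions sum to f(x) - f(0)
```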
Finally, we normalize the obtained attribution scores so that they sum to 1. We use the Captum library implementations (Kokhlikyan et al., 2020).
Attention. Attention-based methods that provide input attributions include Attention Rollout (Abnar and Zuidema, 2020), as described in Section 2.2. Concurrent to this work, Modarressi et al. (2022) propose Globenc, which combines the Attention Rollout aggregation technique with the layer-wise contributions of Kobayashi et al. (2021) (Eq. 7), with the addition of the layer normalization of the FFN module. In Section 5.4 we compare ALTI to Globenc.

Results
In this section, we present quantitative and qualitative results comparing ALTI with other input attribution methods.

Faithfulness Results
In Table 2 we show comprehensiveness and sufficiency results for the three models and four datasets. Across every configuration, our proposed ALTI method outperforms the other input attribution methods. Regarding comprehensiveness, datasets with short sentences like SST-2 and SVA (Figure 6 (c)) show small differences between methods. This is expected since these datasets are simpler, and therefore interpretations can more easily find the important tokens. However, for datasets containing longer inputs with multiple sentences, like IMDB and Yelp, ALTI clearly stands out. This can be observed in Figure 6 (a) and (b), where the probability drop in the model prediction is shown when removing one token at a time. We observe small differences in performance among gradient-based methods across datasets and models, with IG ℓ2 performing the best on average among them, agreeing with the observations of Atanasova et al. (2020). However, ALTI outperforms IG ℓ2 by 58% on average in comprehensiveness, and by 38% in sufficiency. Results of RoBERTa and DistilBERT on every dataset can be found in Appendix B.
Previous research concluded that faithfulness results for evaluating different interpretability methods are task and model dependent (Bastings et al., 2021; Madsen et al., 2021a). Interestingly, although results for the rest of the methods vary across models and tasks, we observe that ALTI repeatedly performs the best across different tasks and models.
In the qualitative examples in Tables 1 and 3 we can observe that gradient-based methods often miss the relevant tokens that drive the model's negative prediction. ALTI consistently assigns high relevance to spans of text that have a negative connotation, such as 'depressing' and 'don't plan on returning' in Table 3, or 'food not good' and 'stomach aches' in Table 1, as expected for a negative sentiment prediction. We observe that, as opposed to ALTI, gradient-based methods become less accurate with longer sequences. A very long example with a positive sentiment prediction can be found in Appendix C, Table 9, where ALTI accurately picks as important those tokens with positive meanings.

Robustness Analysis
We perform a study to investigate the robustness of different interpretability methods based on the implementation invariance property defined by Sundararajan et al. (2017). Given a set of models with the same architecture, trained with the same data, but differing only in their random weight initialization, it compares how different the input attribution scores are between the models, for the same interpretability method. If the predictions of the models are also identical, i.e. the models are functionally equivalent, we would expect the input attributions to be identical as well. Zafar et al. (2021) perform this test on two identical neural text classifiers (i, j), differing in their random weight initialization. Since the vast majority of the predictions are the same for both models, they consider them to be almost functionally identical models. Then, they measure the Jaccard similarity score between the top-25% tokens ranked by their importance as specified by an input attribution method, for models i and j. If the top-25% tokens by the two attributions coincide, Jaccard-25%(i, j) = 1; if the tokens don't overlap at all, Jaccard-25%(i, j) = 0.

Table 3: Saliency maps of BERT generated by two common gradient methods and by our proposed method, ALTI, for a negative sentiment prediction from the Yelp dataset.
We perform the robustness test with 10 pre-trained BERT models from the MultiBERTs (Sellam et al., 2022), which only differ in their random weight initialization. For each interpretability method we compute the Jaccard-25% score between all the different pairs of models. In Figure 7 we show the distribution of the obtained scores. We also compute Spearman's rank correlation coefficient (Figure 8), which evaluates how well the relationship between two ranked interpretations can be described using a monotonic function. We can observe that ALTI provides more homogeneous interpretations across identical models in terms of similarity and correlation, suggesting that it is a more robust interpretability method than gradient-based methods.
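The two agreement scores can be sketched as follows (our own code; note this simple Spearman implementation assumes no ties among the attribution scores):

```python
import numpy as np

def jaccard_topk(attr_a, attr_b, frac=0.25):
    """Jaccard similarity of the top-frac token index sets of two attribution vectors."""
    k = max(1, int(len(attr_a) * frac))
    top_a = set(np.argsort(attr_a)[-k:])
    top_b = set(np.argsort(attr_b)[-k:])
    return len(top_a & top_b) / len(top_a | top_b)

def spearman(attr_a, attr_b):
    """Spearman's rank correlation (no tie handling): Pearson correlation on the ranks."""
    ra = np.argsort(np.argsort(attr_a)) - (len(attr_a) - 1) / 2
    rb = np.argsort(np.argsort(attr_b)) - (len(attr_b) - 1) / 2
    return float((ra * rb).sum() / np.sqrt((ra ** 2).sum() * (rb ** 2).sum()))

a = np.array([0.1, 0.4, 0.2, 0.3])
assert jaccard_topk(a, a) == 1.0            # identical attributions fully agree
assert abs(spearman(a, -a) + 1.0) < 1e-9    # reversed ranking: perfect anticorrelation
```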

Ablation Study
We inspect the importance of the different components that make up ALTI.
Layer-wise token contributions. We compare the effect of our token contribution measurement in Eq. 9 and Eq. 10 against the previous approach by Kobayashi et al. (2021) (Eq. 7, the Norms approach), by aggregating each type of contribution with the Rollout method. To isolate the influence of the norm choice, we use the ℓ2 norm in Eq. 9 and Eq. 10 (ALTI ℓ2). Results in Table 4 show that our proposed layer-wise contribution measurement largely improves over the previous approach.
Norm choice in ALTI. We also evaluate the choice of norm in our proposed approach. In Table 5 we show faithfulness results considering ℓ1 and ℓ2. In almost every setting ℓ1 outperforms ℓ2. Remarkably, the advantage of ℓ1 is less noticeable on BERT, which we hypothesize is explained by the reduced anisotropy of its representations (Figure 3).

Addition of Layer Norm 2
Concurrent work (Modarressi et al., 2022) presents the Globenc method, which aggregates the contributions obtained by Kobayashi et al. (2021) in Eq. 7 with the Attention Rollout method. Moreover, they add the layer normalization (LN2) of the feed-forward module of the Transformer layer into their method.

We evaluate the faithfulness of the interpretations provided by Globenc in Table 2; although it improves over the Rollout baseline, it is far from the results obtained with ALTI. We also analyze the influence of the second layer normalization by including it in the ALTI method. The probability drop in SST-2 across 10 BERT seeds (Figure 9) shows that the influence of LN2 is negligible. We observe similar patterns across models and datasets.

Conclusions
In this paper, we have presented ALTI, an input attribution method that quantifies the mixing of information in the Transformer. We have demonstrated that, with accurate layer-wise token-to-token contribution measurements relying on ℓ1-based metrics, the interpretable decomposition of the attention block is a powerful tool when combined with the rollout method. Empirically, we show that ALTI outperforms every input attribution method we have experimented with in two common faithfulness metrics, while showing greater robustness. Overall, we believe this opens new possibilities for studying contextual information aggregation across the Transformer.

Limitations
ALTI measures the amount of contextual information in each layer representation of the Transformer. From the influence of each input token on the last layer representation we extract input attributions for the model prediction. However, our method does not consider the classifier on top of the Transformer. Therefore, our proposed method doesn't provide explanations for each of the output classes, as opposed to gradient-based methods. We also underline that faithfulness in this work is evaluated via the sufficiency and comprehensiveness metrics.

Ethical Considerations
ALTI provides explanations about input attributions in the Transformer. By itself, we are not aware of any ethical implications of the methodology, which does not take into account any subjective priors.

To prove its usefulness, we have used two different benchmarks, text classification and subject-verb agreement. To the best of our knowledge, these benchmarks have been used in the past without raising major ethical considerations. Therefore, we do not have any major issue to report in this section.
Table 9: Saliency maps of BERT generated by three common gradient methods and by our proposed method, ALTI, for a positive sentiment prediction example of the IMDB dataset.

Figure 2: Example of an attention graph. The relevance R_{[CLS]}^2 of the input token [CLS] to its second layer representation x_1^2 is obtained by summing all possible paths (coloured).

Figure 3: Average cosine similarity of attention block output representations (solid lines) and transformed representations (dashed lines) in 500 random samples of the SST-2 dataset.

Figure 4: The self-attention block (left) at each position i can be decomposed as a summation of transformed input vectors (middle). The closest vector (T_2(x_2)) contributes the most to y_2. We obtain a matrix of contributions C (right) reflecting layer-wise token-to-token interactions.

Figure 6: Probability drop in BERT predictions when removing important tokens, obtained by different interpretability methods. We show results on three datasets.

Figure 7: Jaccard-25% similarity score between the interpretations of each method across 10 BERT random seeds.

Figure 8: Spearman's rank correlation between the interpretations of each method across 10 BERT random seeds.

Figure 9: Probability drop in BERT predictions when removing important tokens; results show mean and SD across 10 BERT seeds on the SST-2 dataset.

Table 2: Faithfulness results of the different interpretability methods for BERT, RoBERTa and DistilBERT on four different datasets. ↑ means a higher number indicates better performance, while ↓ means the opposite.

Table 4: Faithfulness results of BERT, RoBERTa and DistilBERT comparing ALTI ℓ2 with the Norms approach.