Telling BERT’s Full Story: from Local Attention to Global Aggregation

We take a deep look into the behaviour of self-attention heads in the transformer architecture. In light of recent work discouraging the use of attention distributions for explaining a model's behaviour, we show that attention distributions can nevertheless provide insights into the local behaviour of attention heads. Accordingly, we propose a distinction between local patterns, revealed by attention, and global patterns, which refer back to the input, and analyze BERT from both angles. We use gradient attribution to analyze how the output of an attention head depends on the input tokens, effectively extending the local attention-based analysis to account for the mixing of information throughout the transformer layers. We find a significant mismatch between attention and attribution distributions, caused by the mixing of context inside the model. We quantify this discrepancy and observe that, interestingly, some patterns persist across all layers despite the mixing.


Introduction
The inception of the transformer architecture has sparked significant progress across a wide range of language understanding tasks. Variants of transformers currently dominate the popular GLUE (Wang et al., 2019b) and SuperGLUE (Wang et al., 2019a) benchmarks and have even achieved superhuman performance on multiple tasks. The main innovations behind the transformer architecture are the stacking of self-attention layers into a multi-layer self-attention architecture, as well as an unsupervised pre-training phase that primes the model to be fine-tuned on a wide range of language tasks. Transformers and other self-attention-based models have been successfully adopted in other areas such as computer vision (Parmar et al., 2018), music processing (Huang et al., 2019) and protein research (Rao et al., 2019). Their extraordinary empirical success has led researchers to investigate transformers in order to better understand the source of this success, but also in an attempt to explain model decisions.
Much of the research around interpretability and explainability is focused on analyzing the self-attention operation (Clark et al., 2019). In multi-layer self-attention, every input computes an attention distribution over itself and all other inputs to produce ever more complex feature representations. In the case of language, a word in a sentence attends to itself and to all other words in order to compute an updated contextual representation of itself. It is tempting to directly rely on attention distributions to explain the model's predictions. The rationale is that if the attention distribution aligns with human intuition, we can conclude that the model learned robust features and obtained a deep understanding of language, in contrast to simply overfitting on spurious patterns. For example, if a transformer classifies an online comment as hate speech, but we find that the model mostly attended to neutral or even positive words, we would conclude that the model did not actually understand the text and that the correct prediction was either due to chance or to the exploitation of an underlying statistical bias in the data (Niven and Kao, 2019).
However, recent studies (Brunner et al., 2020; Pruthi et al., 2019) question the ability of attention maps to provide a faithful explanation of the inner workings of transformer models. In particular, when the explanations refer to the model input, attention maps do not account for the mixing of information throughout the model. Since self-attention mixes information among all input tokens, the hidden layers attend over mixtures of tokens. Therefore, attention maps may be useful to investigate the local behavior of attention heads but not to draw conclusions about how input tokens relate to each other.
In this work we take a detailed look at the inner workings of BERT's attention heads, both by analyzing the self-attention distributions and by using gradient attribution to account for the mixing of tokens throughout the model. We first show that self-attention distributions correlate strongly with Hidden Token Attribution (HTA; Brunner et al., 2020) from hidden embedding to head output; this result validates HTA. We then present novel location-based attention patterns, revealing that BERT, despite its bidirectional language modeling objective, attends to past embeddings in earlier layers and to future ones in later layers. Next, we use HTA to extend the analysis to take the mixing of information into account, which allows us to draw conclusions about the behaviour of an attention head with respect to the original input word. The patterns that emerge are different from the local attention-based patterns, giving us deeper insight into the operation of the model and emphasizing that local attention-based explanations are very different from global attribution-based explanations. Finally, we contrast attention and HTA distributions for individual examples. Our results further highlight the discrepancy between local attention patterns and global attribution patterns.

Related Work
The strong performance of attention models (Graves, 2013; Bahdanau et al., 2015) in Natural Language Processing (NLP) arises from their ability to learn alignments between words. The transformer architecture (Vaswani et al., 2017) is a multi-layer multi-head self-attention architecture that is pre-trained in an unsupervised manner. The extraordinary performance of transformer models has accelerated progress in the field of NLP. Currently, there is a growing number of different transformer models that vary in size, pre-training objective and/or other architectural elements (Radford et al., 2018, 2019; Liu et al., 2019; Lan et al., 2020; Yang et al., 2019; Sanh et al., 2019; Kitaev et al., 2020; Raffel et al., 2019).
The success of transformers and the possibility of visualizing attention distributions (Vaswani et al., 2017) have motivated a line of research aiming to understand the inner workings of transformers and explain their decisions. Many of these studies have focused on BERT (Devlin et al., 2019a), a well-known transformer model, leading to a body of research grouped under the term BERTology (Rogers et al., 2020).
The aforementioned research builds on previous work on the interpretability of attention distributions in models other than transformers. In particular, Jain and Wallace (2019) examine the attention distributions of LSTM-based encoder-decoder models and show a weak to moderate correlation between attention and dot-product gradient attribution. Furthermore, they show that adversarial attention distributions that do not change the model's decision can be constructed. Along the same lines, Serrano and Smith (2019) find, by zeroing out attention weights, that gradient attribution is a better predictor of feature importance with respect to the model's output than attention weights. Wiegreffe and Pinter (2019) find that although adversarial attention distributions can be easily obtained, they perform worse on a simple diagnostic task. All of these works raise concerns about the ability of attention distributions to explain the decisions of a model.
Despite existing concerns surrounding the interpretability of attention distributions, very few works have studied how this problem affects transformers. Pruthi et al. (2019) show that, just as in other attention models, it is possible to manipulate self-attention in transformers in order to generate different attention masks that cause only a small drop in performance. Brunner et al. (2020) find that attention distributions are not unique when the sequence length is larger than the head dimension and show that this can lead to the discovery of spurious patterns. Furthermore, they show that although it is possible to map hidden tokens back to their corresponding input tokens, there is a very large degree of information mixing inside the model, which raises questions about straightforward interpretations of attention maps. Recently, Abnar and Zuidema (2020) proposed a method to quantify information flow inside transformers. This method tracks the mixing of information due to attention but omits the effect of feed-forward networks.
Our work addresses this important issue by distinguishing between local and global aggregation patterns, where the former can be explained by attention distributions and the latter by attribution. We analyze BERT from both angles and quantify the mismatch between these interpretations. We show that attention correlates well with attribution locally but not globally and therefore attention maps are inadequate to draw conclusions that refer to the input of the model.

Background on Transformers
The original transformer architecture (Vaswani et al., 2017) is a sequence-to-sequence model consisting of an encoder and a decoder, both of which follow a multi-layer multi-head self-attention structure. In contrast, most of the pre-trained transformer models that can be fine-tuned on supervised language understanding tasks consist only of an encoder. Each transformer layer consists of a self-attention block and a non-linear feed-forward block (MLP) with layer normalizations (Ba et al., 2016).
The input to a transformer layer is a sequence of embeddings $E^l = [e^l_0, \ldots, e^l_{d_s}] \in \mathbb{R}^{d_e \times d_s}$, where $l$ denotes the layer index, $d_e$ is the embedding dimension, and $d_s$ is the sequence length. We refer to the sequence of non-contextual input word embeddings as $E^0$, and to the hidden contextual embeddings as $E^l$ with $l > 0$. Note that $E^0$ refers to the word embeddings after position and segment embeddings have been added. A self-attention block consists of $n_h$ separate attention heads. The attention heads independently perform the self-attention operation, and the results are then concatenated and projected back into the embedding space by a linear layer. The output of the attention block is then fed into the MLP.
The self-attention operation itself is implemented by projecting each input token $e_i \in \mathbb{R}^{d_e}$ into a query vector $q_i \in \mathbb{R}^{d_q}$, a key vector $k_i \in \mathbb{R}^{d_q}$ and a value vector $v_i \in \mathbb{R}^{d_v}$. We present the self-attention operation from the perspective of a single token $e_i$ attending to all input tokens. For that, the key vectors are aggregated into the key matrix $K = [k_0, \ldots, k_{d_s}] \in \mathbb{R}^{d_q \times d_s}$ and the value vectors into the value matrix $V = [v_0, \ldots, v_{d_s}] \in \mathbb{R}^{d_v \times d_s}$. The attention distribution $a_i$ of token $e_i$ over all input tokens is then computed as

$$a_i = \mathrm{softmax}\!\left(\frac{q_i^\top K}{\sqrt{d_q}}\right).$$

The attention vector $a_i \in \mathbb{R}^{d_s}$ now contains an attention weight for each input token. $a_i$ is then multiplied with the value matrix $V$ to compute the output of the self-attention operation for a token $i$ and a head $h$ as

$$o_{h,i} = V a_i^\top.$$

The outputs of all heads $o_{0,i}, \ldots, o_{n_h,i} \in \mathbb{R}^{d_v}$ are then concatenated and fed through a linear layer to compute the output of the self-attention block for a single token. This linear layer can be thought of as an aggregation operation that projects the output of the independent heads back into embedding space. In practice, the attention distributions for all tokens are computed in parallel.
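To make the single-token view above concrete, the following is a minimal pure-Python sketch of the per-token self-attention operation. All dimensions and values are illustrative toy numbers, not taken from BERT.

```python
# Toy single-token self-attention: a_i = softmax(q_i^T K / sqrt(d_q)), o_i = V a_i.
import math

def softmax(xs):
    m = max(xs)                                   # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attend(q_i, K, V, d_q):
    """K and V are lists of per-token key/value vectors for the whole sequence."""
    scores = [sum(q * k for q, k in zip(q_i, k_j)) / math.sqrt(d_q) for k_j in K]
    a_i = softmax(scores)                         # attention distribution over tokens
    d_v = len(V[0])
    o_i = [sum(a * v[d] for a, v in zip(a_i, V)) for d in range(d_v)]
    return a_i, o_i

# Three tokens, d_q = d_v = 2 (toy values).
q = [1.0, 0.0]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
a, o = attend(q, K, V, d_q=2)
assert abs(sum(a) - 1.0) < 1e-9                   # a_i lies in the probability simplex
```

In the actual model this computation runs in parallel for all tokens and all $n_h$ heads; the sketch isolates one token in one head to mirror the presentation above.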

Extending Hidden Token Attribution
Hidden Token Attribution (Brunner et al., 2020) is a gradient-based attribution method that quantifies how much information from each input token is contained in a given hidden embedding. For each layer $l$, this method defines the relative contribution $c^l_{i,j}$ of an input token $e^0_i$ to a hidden embedding $e^l_j$ as:

$$c^l_{i,j} = \frac{\left\| \frac{\partial e^l_j}{\partial e^0_i} \, e^0_i \right\|_2}{\sum_k \left\| \frac{\partial e^l_j}{\partial e^0_k} \, e^0_k \right\|_2} \quad (1)$$

The contribution $c^l_{i,j}$ is normalized by the sum of the attribution values to all input tokens and hence ranges between 0 and 1.
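As a toy illustration of how such contributions can be computed, the sketch below applies a gradient-times-input attribution to a single linear "layer" that mixes three scalar token embeddings; for a linear map the gradient is available in closed form, so no autograd machinery is needed. The mixing matrix and embeddings are hypothetical, and the full HTA method of Brunner et al. (2020) operates on real gradients through the whole model.

```python
# Toy HTA on a linear mixing layer: e^1_j = sum_i M[j][i] * e^0_i.
# Here d e^1_j / d e^0_i = M[j][i], so the raw contribution of input token i
# to hidden token j is |M[j][i] * e^0_i|, which we normalize onto the simplex.
def hta_row(M_row, e0):
    raw = [abs(m * e) for m, e in zip(M_row, e0)]   # gradient-times-input magnitude
    total = sum(raw)
    return [r / total for r in raw]                  # normalized contributions

M = [[0.5, 0.3, 0.2],    # hypothetical mixing weights (rows sum to 1)
     [0.1, 0.8, 0.1],
     [0.4, 0.4, 0.2]]
e0 = [1.0, 1.0, 1.0]     # hypothetical input embeddings

contribs = [hta_row(row, e0) for row in M]
assert all(abs(sum(c) - 1.0) < 1e-9 for c in contribs)
```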
In this work, we apply Hidden Token Attribution to the individual attention heads of BERT. For a token $e^l_j$ at layer $l$, we back-propagate the gradients from the output $o^l_{h,j}$ of each attention head $h$ independently. This differs from the original method, which propagates the gradients from the layer output. In general, using Equation 1, we can compute the contribution between any two vectors in the model, as long as they are connected in the computation graph. We hence denote the contribution of any vector $x$ to another vector $y$ as $C(x, y)$.
In particular, we calculate two different contributions to the head output. Previous layer contribution: the contribution $C(e^{l-1}_i, o^l_{h,j})$ from the hidden embeddings at the input of the attention head to the output of the attention head. Input contribution: the contribution $C(e^0_i, o^l_{h,j})$ from the tokens at the input of the transformer model to the output of an attention head $h$ at layer $l$. Previous layer contribution allows us to study how attention heads operate locally and how HTA distributions compare to attention distributions. Input contribution enables us to extend the head attention patterns all the way back to the input, thereby controlling for the effect of information mixing.

Setup
For our experiments we use the non-finetuned, uncased BERT base model (Devlin et al., 2019b) as provided in the original repository.1 Despite the recent explosion of new transformer variants, BERT remains the most popular model for research into the interpretability of transformer models. The reason for this is that most of the newer models are architecturally similar to BERT, and therefore studies carried out on BERT are likely either to generalize to these models or to be repeatable with relatively little effort.
We perform our experiments on 1800 examples from the development set of the MNLI matched (MNLIm) dataset. Brunner et al. (2020) show that when the sequence length $d_s$ is larger than the head output dimension $d_v$, the attention distributions are not identifiable. Therefore, to guarantee that in our experiments we do not find spurious patterns that do not influence downstream parts of the model, we restrict the examples in our dataset to sequences of at most 64 tokens, which is the head dimension of BERT. Thus, the examples in our dataset have sequence lengths ranging between 6 and 64 tokens, with a median length of 34 tokens. In total, this subset contains 63,456 tokens.

HTA: Local Validation
The ability of attention distributions to provide explanations has been the target of a number of studies (Wiegreffe and Pinter, 2019; Serrano and Smith, 2019; Pruthi et al., 2019). In particular, Jain and Wallace (2019) show that attention distributions do not explain the model output and do not correlate well with attribution methods. However, if we are exclusively interested in how attention heads behave locally, i.e., without considering their impact on the model's decisions, it is sound to examine attention distributions. The reason for this is that self-attention is the only operation performed in attention heads, and hence attention distributions precisely represent the information flow within the heads. As a consequence, we can use attention distributions as a reference to validate whether HTA accurately quantifies how information mixes within transformers. To verify this, we compare attention distributions to previous layer contribution by computing the correlation between attention maps $a_{h,i}$ and the contribution $C(e^{l-1}_i, o^l_{h,j})$ for each head. A high correlation value would validate HTA as accurately representing the flow of information within transformers. To calculate the correlation, we first extract the attention maps for all the heads of BERT for each of the tokens in the examples of our dataset. Then, we pair each attention map with the corresponding contribution. Note that both attention maps and contributions are distributions that lie in the probability simplex, i.e., all the values are between 0 and 1 and they sum to 1. Next, we calculate Pearson's correlation coefficient for each attention-contribution pair and aggregate the results into one value per head by computing the mean of the correlation values. Figure 1 (Top) shows the mean correlation value per head. For all heads except two, Pearson's correlation coefficient is larger than 0.7.
Furthermore, 90% of the heads show a correlation between attention and Hidden Token Attribution of over 0.85. Similarly, we calculate Spearman's rank correlation coefficient for each head. The results, displayed in Figure 1 (Bottom), show that only four heads have a Spearman's correlation smaller than 0.9, and that 75% of the heads have a correlation coefficient larger than 0.95. Note that gradient attribution is in any case a local first-order approximation and thus introduces a small error that prevents a perfect correlation.
These high correlation values empirically demonstrate that HTA does indeed represent the flow of information within attention heads with respect to the head inputs. Therefore, to study the inner workings of transformers beyond attention heads one can rely on Hidden Token Attribution and apply it at different points of the model. Now that we have validated HTA, we can investigate the behavior of the heads in more detail: examining the local patterns revealed by attention, the global patterns revealed by HTA, and the discrepancies between both.
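The correlation procedure described above (pair each attention map with its contribution distribution, compute Pearson's r per pair, then average per head) can be sketched in pure Python. The two pairs below are toy stand-ins for real attention-contribution pairs of one head.

```python
# Per-head correlation between attention maps and contribution distributions.
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def mean_head_correlation(pairs):
    """pairs: list of (attention_map, contribution) distributions for one head."""
    return sum(pearson(a, c) for a, c in pairs) / len(pairs)

pairs = [([0.7, 0.2, 0.1], [0.6, 0.3, 0.1]),   # toy, strongly aligned pairs
         ([0.1, 0.8, 0.1], [0.2, 0.7, 0.1])]
r = mean_head_correlation(pairs)
assert 0.9 < r <= 1.0
```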

Local Head Analysis
In this section we take a closer look into the local behaviour of attention heads. Here, local means that we analyze how the intermediate tokens fed into the heads are processed, as opposed to how the model input propagates. To this end, we study attention distributions, but rather than studying each individual example, we aggregate the attention distributions, thus obtaining a general picture of how each head behaves. In particular, we study how much attention is paid to tokens in each relative position with respect to the attending token.
For each head, we extract the attention maps for each token. Then, we define the position of the attending token in the sentence as the origin ($x = 0$), thereby generating a histogram where the horizontal axis represents the position of the neighbours and the vertical axis the amount of attention paid to a token. We sum the histograms of all tokens and then normalize the result. To normalize, we divide the value of attention at each position by the number of times that a token occurs at that relative position; given that the median length of the examples is 34 tokens, this normalization ensures that distant positions are not penalized for having fewer occurrences. Figure 2 presents the histograms for the heads in layers 2, 5 and 10; the other layers can be found in Appendix A. From these histograms, a clear pattern is observable. In the first layers, heads tend to aggregate more information from past tokens than from future tokens. In fact, the attention of heads 2, 5 and 7 in layer 2 to future tokens is negligible. However, this trend quickly reverses with increasing depth, and in later layers the aggregation of future hidden embeddings dominates for most heads. To illustrate this, we calculate the center of mass of the attention histograms per layer, i.e., the attention-weighted mean of the relative positions. The resulting plot shows that BERT follows the same trend regardless of the language, i.e., the model first attends to past and then to future hidden tokens. This suggests that, despite its bidirectional training, BERT tends to handle language like humans do, from left to right. This is also in line with the sequential nature of language, i.e., the past context needs to be known to understand the future context.
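The histogram construction, count normalization and center-of-mass computation described above can be sketched as follows. The attention maps are toy values, and offsets are measured relative to the attending token.

```python
# Relative-position attention histogram for one head, with count normalization.
from collections import defaultdict

def relative_histogram(attention_maps):
    """attention_maps[i][j] = attention of token i to token j (one head)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for i, row in enumerate(attention_maps):
        for j, a in enumerate(row):
            sums[j - i] += a          # offset relative to the attending token
            counts[j - i] += 1        # how often this offset occurs
    return {off: sums[off] / counts[off] for off in sums}

def center_of_mass(hist):
    total = sum(hist.values())
    return sum(off * v for off, v in hist.items()) / total

maps = [[0.6, 0.3, 0.1],              # toy maps: each row sums to 1
        [0.5, 0.3, 0.2],
        [0.6, 0.3, 0.1]]
hist = relative_histogram(maps)
com = center_of_mass(hist)
assert com < 0                        # this toy head attends mostly to the past
```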

Global Head Analysis
Although attention maps are an effective tool to understand the local behavior of attention heads, drawing conclusions that refer to the input words can be misleading. Transformers are complex models that mix information from the entire input sequence at each layer. Recent work (Brunner et al., 2020; Pruthi et al., 2019) has raised concerns about the interpretability of attention maps as representative of global context aggregation. In this section, we look into the individual heads and study what we call global patterns, i.e., aggregation patterns that refer to the model's input.
To this end, we follow the same procedure as in the previous section, but generate input contribution $C(e^0_i, o^l_{h,j})$ histograms instead of attention histograms. In Figure 3 we show the histograms for layers 2, 5 and 10, i.e., the same layers as in Figure 2. The histograms for the whole model can be found in Appendix B. Furthermore, we calculate the centers of mass per layer and compare them to the attention centers of mass in Figure 5. The global pattern of context aggregation is much more uniform than the one shown by the attention maps, especially after layer 4. This is intuitive: given that in the first layers the heads attend mostly to the past context, on average, all the hidden tokens contain a larger amount of past context. Therefore, when in later layers the attention shifts to the future hidden tokens, the past context already contained in these tokens balances the contribution, resulting in a uniform pattern of global context aggregation. The difference between the patterns revealed by this global analysis and the local head analysis from the previous section shows a strong mismatch between attention distributions and global context aggregation in attention heads. In fact, local attention patterns can easily lead to spurious conclusions when used to interpret global context aggregation. Next, we study this difference quantitatively.

Local Attention vs. Global Attribution
To quantify the discrepancy between attention distributions and input contribution, i.e., between local and global patterns of context aggregation, we calculate the correlation between attention maps and input contribution $C(e^0_i, o^l_{h,j})$. We follow the same methodology as in Section 6 and report Pearson's and Spearman's correlation coefficients in Figure 6. In line with the mismatch between the attention and contribution histograms (Figures 2 and 3), we observe that the correlation between attention and input contribution quickly decreases in deeper layers. In particular, after only four layers Pearson's correlation coefficient for most heads is smaller than 0.5, and in the last four layers the median head correlation value is smaller than 0.25. Furthermore, Spearman's correlation follows a very similar trend, with the median head correlation value falling under 0.7 already at layer 3, and under 0.25 at the last layer.
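Spearman's coefficient used here is Pearson's r computed on rank-transformed distributions. A minimal sketch (ignoring tie handling) follows, with toy distributions mimicking a peaked local attention map against a near-uniform input contribution.

```python
# Spearman correlation as Pearson's r on ranks (tie averaging omitted).
import math

def ranks(xs):
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for rank, i in enumerate(order):
        r[i] = float(rank)
    return r

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def spearman(xs, ys):
    return pearson(ranks(xs), ranks(ys))

attention    = [0.05, 0.05, 0.8, 0.1]   # toy peaked local attention map
contribution = [0.3, 0.25, 0.2, 0.25]   # toy near-uniform global contribution
rho = spearman(attention, contribution)
assert -1.0 <= rho <= 1.0
```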
The results from this section point to the importance of information mixing: attention maps show how the heads behave locally, i.e., how they aggregate context, but not what context is in fact aggregated. Knowing how the heads behave locally can give us a better understanding of transformer models, which could be leveraged to further improve the performance of these models (Wu et al., 2020). However, attention maps are misleading when drawing conclusions about which input words are being aggregated into the contextual embeddings.

Specific examples
The histograms studied in the previous sections give us a high-level picture of what is happening inside the model. However, we averaged across examples with different sequence lengths and with different token types in different positions. To gain a more detailed understanding of the model's behaviour, we now look into specific input sequences randomly selected from our dataset.
Kovaleva et al. (2019) identify five recurring attention patterns in BERT's heads: vertical, diagonal, vertical-diagonal, block and heterogeneous. We observe the same five attention patterns. Nevertheless, to understand what input information these heads are actually aggregating, we need to look at the contribution from the input tokens.
In Figure 7, we compare the five patterns observed by Kovaleva et al. (2019) with the corresponding patterns revealed by Hidden Token Attribution with respect to the input, $C(e^0_i, o^l_{h,j})$. A comparison for all heads is available in Appendix C. Remarkably, heads with the vertical pattern pay most attention to the SEP and CLS tokens. Nevertheless, the input contribution reveals that SEP tokens are used by the model to store general context, and by extracting information from the SEP token at intermediate layers, the model is in fact aggregating global context. Hence, with respect to the input, heads with vertical, diagonal and vertical-diagonal patterns behave similarly to heterogeneous heads. However, tokens around the diagonal tend to contribute the most, given the prevalent aggregation of local context.
On the other hand, as shown in the first column of Figure 7, we observe that the block patterns prevail when we apply Hidden Token Attribution to the input. It is noteworthy that, while vertical and diagonal patterns fade away, the block pattern still remains visible. The fact that attending to tokens inside a block results in aggregation of context from within that block implies that up to that point, the context was mainly aggregated from within the blocks separated by SEP. We do not observe block patterns in the contribution maps for layers deeper than layer 4, which suggests that the first layers aggregate context within blocks and later layers aggregate context in a more global manner.

Conclusion
We provide justification for using HTA to study information flow within transformers. By studying the attention distributions of BERT we uncover an interesting pattern: in earlier layers, attention heads attend mostly to earlier tokens, whereas this trend quickly reverses with increasing depth. This is surprising since BERT is trained with a bidirectional language modeling objective, and it suggests that, like humans, BERT understands language from left to right.
A problem with local attention patterns is that they do not reveal how the attention heads process the information contained in the input tokens. We thus use Hidden Token Attribution to compute per-head attribution distributions over the input words. Our results show that the mismatch between attention and attribution distributions increases with depth. This confirms the importance of accounting for information mixing when analyzing attention heads with respect to the input tokens. Finally, we show how five different attention head patterns differ from their token attribution equivalents. Our method and results are complementary to those of Abnar and Zuidema (2020); we believe a combination of attention flow and HTA may provide new interesting insights.
In this work, we aim to draw a clear boundary between local and global aggregation in transformer models. This distinction is important when trying to interpret the behavior of these models, and we hope that it will help future studies in their analyses. Furthermore, our findings add new insights to the growing field of research on explaining transformers. This research, in turn, can help in guiding design decisions leading to further improvements of natural language processing architectures.