How Much Does Attention Actually Attend? Questioning the Importance of Attention in Pretrained Transformers

The attention mechanism is considered the backbone of the widely-used Transformer architecture. It contextualizes the input by computing input-specific attention matrices. We find that this mechanism, while powerful and elegant, is not as important as typically thought for pretrained language models. We introduce PAPA, a new probing method that replaces the input-dependent attention matrices with constant ones -- the average attention weights over multiple inputs. We use PAPA to analyze several established pretrained Transformers on six downstream tasks. We find that without any input-dependent attention, all models achieve competitive performance -- an average relative drop of only 8% from the probing baseline. Further, little or no performance drop is observed when replacing half of the input-dependent attention matrices with constant (input-independent) ones. Interestingly, we show that better-performing models lose more from applying our method than weaker models, suggesting that the utilization of the input-dependent attention mechanism might be a factor in their success. Our results motivate research on simpler alternatives to input-dependent attention, as well as on methods for better utilization of this mechanism in the Transformer architecture.


Introduction
Pretrained Transformer (Vaswani et al., 2017) models have enabled great progress in NLP in recent years (Devlin et al., 2019; Liu et al., 2019b; Raffel et al., 2020; Brown et al., 2020; Chowdhery et al., 2022). A common belief is that the backbone of the Transformer model, and of pretrained language models (PLMs) in particular, is the attention mechanism, which applies multiple attention heads in parallel, each generating an input-dependent attention weight matrix.
Interestingly, recent work found that attention patterns tend to focus on constant (input-independent) positions (Clark et al., 2019; Voita et al., 2019), while other works showed that it is possible to pretrain language models in which the attention matrices are replaced with constant matrices, without major loss in performance (Liu et al., 2021; Lee-Thorp et al., 2021; Hua et al., 2022). A natural question that follows is how much standard PLMs, pretrained with the attention mechanism, actually rely on this input-dependent property. This paper shows that they are less dependent on it than previously thought.
We present a new analysis method for PLMs: Probing Analysis for PLMs' Attention (PAPA). For each attention head h, PAPA replaces the attention matrix with a constant one: a simple average of the attention matrices for h computed on some unlabeled corpus. Replacing all attention matrices with such constant matrices results in an attention-free variant of the original PLM (see Fig. 1). We then compute, for some downstream tasks, the probing performance gap between an original model and its attention-free variant. This provides a tool to quantify the models' reliance on attention. Intuitively, a larger performance drop indicates that the model relies more on the input-dependent attention mechanism.
We use PAPA to study three established pretrained Transformers: BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019b), and DeBERTa (He et al., 2021), each with BASE- and LARGE-sized versions. We evaluate these models on six diverse benchmarks, spanning text classification and structured prediction tasks.
Our results suggest that attention is not as important to pretrained Transformers as previously thought. First, the performance of the attention-free variants is comparable to the original models: an average relative drop of only 8%. Second, replacing half of the attention matrices with constant ones has little effect on performance, and in some cases may even lead to performance improvements. Interestingly, our results hint that better models use their attention capability more than weaker ones; when comparing the effect of PAPA on different models, we find that the better the model's original performance is, the more it suffers from replacing the attention matrices with constant ones. This suggests a potential explanation for the source of the empirical superiority of some models over others: they make better use of the attention mechanism.
This work grants a better understanding of the attention mechanism in pretrained Transformers. It also motivates further research on simpler or more efficient Transformer models, either for pretraining (Lee-Thorp et al., 2021; Liu et al., 2021; Hua et al., 2022) or potentially as an adaptation of existing pretrained models (Peng et al., 2020a, 2022; Kasai et al., 2021). It also provides a potential path to improve the Transformer architecture: designing inductive bias mechanisms for better utilization of attention (Peng et al., 2020b; Wang et al., 2022).
Finally, our work may contribute to the "attention as explanation" debate (Jain and Wallace, 2019; Serrano and Smith, 2019; Wiegreffe and Pinter, 2019; Bibal et al., 2022). By showing that some PLMs can perform reasonably well with constant matrices, we suggest that explanations arising from the attention matrices might not be crucial for models' success.
We summarize our main contributions. (1) We present a novel probing method, PAPA, which quantifies the reliance of a given PLM on its attention mechanism by "disabling" that mechanism for this PLM. (2) We apply PAPA to six leading PLMs, and find that our manipulation leads to modest performance drops on average, which hints that attention might not be as important as thought. (3) We show that better-performing PLMs tend to suffer more from our manipulation, which suggests that the input-dependent attention is a factor in their success. (4) Finally, we release our code and experimental results.

Background: Attention in Transformers

Transformers consist of interleaving attention and feed-forward layers. In this work, we focus on Transformer encoder models, such as BERT, which are commonly used in many NLP applications.
The (multi-headed) self-attention module takes as input a matrix X ∈ R^{n×d} and produces a matrix X_out ∈ R^{n×d}, where n denotes the number of input tokens, each represented as a d-dimensional vector. Each attention layer consists of H heads, and each head h ∈ {1, . . ., H} has three learnable projection matrices W_Q^h, W_K^h, W_V^h ∈ R^{d×d′} (the queries, keys, and values, respectively).
The queries and the keys compute an n × n attention weight matrix A^h between all pairs of tokens as softmax-normalized scaled dot products:

A^h = softmax( (X W_Q^h)(X W_K^h)^T / √d′ )    (1)

where the softmax operation is taken row-wise. The value matrix V^h = X W_V^h is then left-multiplied by the attention matrix A^h to generate the attention head output.
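As a concrete illustration, Eq. (1) for a single head can be sketched in a few lines of NumPy (a toy sketch; function and variable names are ours, not from the released code):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_matrix(X, W_Q, W_K):
    """Input-dependent attention weights for one head, as in Eq. (1)."""
    Q, K = X @ W_Q, X @ W_K                   # (n, d') queries and keys
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # scaled dot products, (n, n)
    return softmax(scores, axis=-1)           # row-wise softmax over key positions
```

Each row of the resulting matrix is a probability distribution over the key positions, and the whole matrix changes with the input X, which is the input-dependence PAPA removes.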
Importantly, the attention matrix A^h is input-dependent, i.e., defined by the input X. This property is considered to be the backbone of the attention mechanism (Vaswani et al., 2017).
An intriguing question is the extent to which PLMs actually rely on the attention mechanism. In the following, we study this question by replacing the attention matrices of PLMs with constant matrices. We hypothesize that if models make heavy use of attention, we will see a large drop in performance when preventing the model from using it. As shown below, such a performance drop is often not observed.

The PAPA Method
We present PAPA, a probing method for quantifying the extent to which pretrained Transformer models use the attention mechanism. PAPA works by replacing the Transformer attention weights with constant matrices, computed by averaging the values of the attention matrices over unlabeled inputs (Sec. 3.1). PAPA also allows for replacing any subset (not just all) of the attention matrices. We propose a method for selecting which heads to replace (Sec. 3.2). The resulting model is then probed against different downstream tasks (Sec. 3.3). The performance difference between the original and the new models can be seen as an indication of how much the model uses its attention mechanism.

Generating Constant Matrices
To estimate how much a pretrained Transformer m uses the attention mechanism, we replace its attention matrices with a set of constant ones, one for each head. To do so, PAPA constructs, for a given head h, a constant matrix C^h by averaging the attention matrix A^h over a corpus of raw text. More specifically, given a corpus D = {e_1, . . ., e_|D|}, C^h is defined as:

C^h = (1/|D|) Σ_{i=1}^{|D|} A_i^h    (2)

where A_i^h is the input-dependent attention matrix that h constructs while processing e_i. We note that the average is taken entry-wise, and only over non-padded entries (padding tokens are ignored).
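The averaging step can be sketched as follows (an illustrative NumPy sketch with hypothetical names; the actual implementation operates on the model's attention tensors):

```python
import numpy as np

def constant_matrix(attention_matrices, masks):
    """Entry-wise average of per-example attention matrices A_i^h,
    computed only over non-padded entries (Sec. 3.1).

    attention_matrices: iterable of (n, n) arrays, one per corpus example
    masks: matching iterable of (n,) 0/1 arrays marking real (non-pad) tokens
    """
    total, counts = None, None
    for A, m in zip(attention_matrices, masks):
        valid = np.outer(m, m)            # entries where row and column are both real tokens
        if total is None:
            total, counts = np.zeros_like(A), np.zeros_like(A)
        total += A * valid
        counts += valid
    return total / np.maximum(counts, 1)  # avoid division by zero on all-pad entries
```

Note that because padded positions are masked out per example, each entry of C^h is averaged only over the examples where both of its positions held real tokens.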
We emphasize that the construction process of the C^h matrices requires no labels. In Sec. 5.2 we compare our method of constructing constant matrices from unlabeled data to other alternatives that either use no data at all, or use labeled data.

Replacing a Subset of the Heads
Different attention heads may have different levels of dependence on attention. We therefore study the effect of replacing a subset of the heads, and keeping the rest intact. To do so, we would like to estimate the reliance of each head on the input-dependent attention, which would allow replacing only the heads that are least input-dependent for the model.
To estimate this dependence, we introduce a new weighting parameter λ_h ∈ (0, 1), initialized as λ_h = 0.5, for each attention head h. λ_h is a learned weighting of the two matrices: the attention matrix A^h and the constant matrix C^h from (1) and (2), respectively. For each input e_i, a new matrix B^h is constructed as:

B^h = λ_h · A^h + (1 − λ_h) · C^h    (3)

We interpret a smaller λ_h as an indication that h depends less on the attention mechanism.
We then train the probing classifier (Sec. 3.3) along with the additional λ_h parameters. We use the learned λ_h values to decide which heads should be replaced with constant matrices, by replacing only the k% of attention heads with the smallest λ_h values, for some hyperparameter k. Importantly, this procedure is only used as a pre-processing step; our experiments are trained and evaluated without the λ_h parameters, where k% of each model's heads are replaced and the remaining heads are left unchanged.
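A minimal sketch of the interpolation in Eq. (3) and the λ-based selection (illustrative code; names are our own):

```python
import numpy as np

def interpolate(A, C, lam):
    """B^h = lambda_h * A^h + (1 - lambda_h) * C^h (Sec. 3.2)."""
    return lam * A + (1.0 - lam) * C

def heads_to_replace(lambdas, k):
    """Indices of the k% of heads with the smallest learned lambda values."""
    n_replace = int(round(len(lambdas) * k / 100.0))
    order = np.argsort(lambdas)               # ascending: least attention-dependent first
    return sorted(order[:n_replace].tolist())
```

In the actual procedure the λ values would be trained jointly with the probe; here they are just given as a list, one per head.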

Probing
Our goal is to evaluate how much attention a given PLM uses. Therefore, we want to avoid finetuning it for a specific downstream task, as this would change all of its weights and arguably answer a different question (e.g., how much attention does a task-finetuned PLM use). Instead, we use a probing approach (Liu et al., 2019a; Belinkov, 2022) by freezing the model and adding a classifier on top.
For text classification tasks, our classifier calculates for each layer a weighted (learned, non-attentive) representation of the different token representations. It then concatenates the weighted representations of the different layers, and applies a 2-layer MLP. For structured prediction tasks (e.g., NER and POS), where a representation for each token is needed, we concatenate for each token its representations across layers, and apply a 2-layer MLP.
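The probe's forward pass for classification tasks might look roughly as follows (a simplified NumPy sketch; in practice the token scores and MLP weights are learned, and the names here are ours):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def probe_features(layer_states, token_scores):
    """Per-layer weighted (non-attentive) pooling of token vectors,
    concatenated across layers.

    layer_states: list of (n, d) hidden-state arrays, one per layer
    token_scores: list of (n,) learned scores, one per layer
    """
    pooled = [softmax(s) @ H for H, s in zip(layer_states, token_scores)]
    return np.concatenate(pooled)             # (num_layers * d,)

def mlp_probe(x, W1, b1, W2, b2):
    """The 2-layer MLP classifier applied on top of the pooled features."""
    h = np.maximum(x @ W1 + b1, 0.0)          # hidden layer with ReLU
    return h @ W2 + b2                        # class logits
```

Because the pooling weights do not depend on pairwise token interactions, the probe itself adds no attention on top of the (possibly attention-free) frozen model.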
When PAPA is applied to some input, we replace the attention matrices A^h with the corresponding constant matrices C^h. We then compare the downstream performance of the original model m with the new model m′. The larger the performance gap between m and m′, the higher m's dependence on the attention mechanism.

Method Discussion
Contextualization with PAPA PAPA replaces the attention matrices with constant ones, which results in an attention-free model. Importantly, unlike a feed-forward network, the representations computed by the resulting model are still contextualized, i.e., the representation of each word depends on the representations of all other words. The key difference between the standard Transformer model and our attention-free model is that in the former the contextualization varies with the input, while in the latter it remains fixed for all inputs.
Potential Computational Gains The replacement of the attention matrix with a constant one motivates the search for efficient attention alternatives. Using constant matrices is indeed more efficient, reducing the attention head time complexity from 2n²d′ + 3nd′² to n²d′ + nd′², which shows potential for efficiency improvement.
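The complexity figures above can be checked with a small helper (assuming the standard decomposition of a head's cost: three n × d′ projections plus the two n × n matrix products, while a constant head skips the query/key projections and the score computation):

```python
def head_cost(n, d_head, constant=False):
    """Approximate multiply-add count for one attention head.

    Standard head: query/key/value projections (3 * n * d'^2) plus the
    score matrix Q K^T and the product A V (each n^2 * d').
    Constant head: only the value projection (n * d'^2) and C V (n^2 * d').
    """
    if constant:
        return n**2 * d_head + n * d_head**2
    return 2 * n**2 * d_head + 3 * n * d_head**2

# For n = 512 tokens and d' = 64, the constant head needs roughly half the work:
# head_cost(512, 64) / head_cost(512, 64, constant=True) ≈ 2.1
```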
Several works used various approaches for replacing the attention mechanism with constant matrices during the pretraining phase (Lee-Thorp et al., 2021; Liu et al., 2021; Hua et al., 2022), and indeed some of them showed substantial computational gains. Our work tackles a different question: how much do PLMs, which were trained with the attention mechanism, actually use it? Thus, unlike the approaches above, we choose to make minimal changes to the original models. Nonetheless, our results further motivate the search for efficient attention variants.

Experiments
We now turn to use PAPA to study the attention usage of various PLMs.

Experimental Setup
Our experiments are conducted over both text classification and structured prediction tasks, all in English. For the former we use four diverse tasks from the GLUE benchmark (Wang et al., 2019): MNLI (Williams et al., 2018), SST-2 (Socher et al., 2013), MRPC (Dolan and Brockett, 2005), and CoLA (Warstadt et al., 2019). For the latter we use named entity recognition (NER) and part-of-speech tagging (POS) from the CoNLL-2003 shared task (Tjong Kim Sang and De Meulder, 2003). We use the standard train/validation splits, and report validation results in all cases. We use three widely-used pretrained Transformer encoder models: BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019b), and DeBERTa (He et al., 2021). We use both BASE (12 layers, 12 heads in each layer) and LARGE (24 layers, 16 heads per layer) versions. For each model and each task, we generate the constant matrices with the given (unlabeled) training set of that task. In Sec. 5.3 we show that PAPA is not very sensitive to the specific training set being used.
All experiments are run with three different random seeds, and we report the average result (95% confidence intervals are shown). Pre-processing and additional experimental details are described in App. A and B, respectively.

Probing Results
The results of the BASE and LARGE models are presented in Fig. 2a and 2b, respectively. We measure the performance of each model on each task when using {1, 1/2, 1/8, 1/16, 0} of the model's input-dependent attention matrices and replacing the rest with constant ones.
We first consider the original, fully-attentive models, and find that performance decreases in the order of DeBERTa, RoBERTa, and BERT. This order is roughly maintained across tasks and model sizes, which conforms with previous results from finetuning these PLMs (He et al., 2021). This suggests that the model ranking under our probing method is consistent with the standard finetuning setup.
We note that the trends across tasks and models are similar; hence we discuss them all together in the following, noting specific exceptions where relevant.
Replacing all attention matrices with constant ones incurs a moderate performance drop As shown in Fig. 2, applying PAPA to all attention heads leads to an 8% relative performance drop on average, and never more than a 20% drop from the original model. This result suggests that pretrained models only moderately rely on the attention mechanism.
Half of the attention matrices can be replaced without loss in performance We note that in almost all cases, replacing half of the models' attention matrices leads to no major drop in performance. In fact, in some cases performance even improves compared to the original model (e.g., BERT BASE and DeBERTa LARGE ), suggesting that some of the models' heads have a slight preference for constant matrices. This result is consistent with some of the findings of recent hybrid models that use both constant and regular attention (Liu et al., 2021; Lee-Thorp et al., 2021) to build efficient models.
Performant models rely more on attention Fig. 3 shows, for each model, the relation between its original performance (averaged across tasks) and its average relative performance drop when replacing all attention heads. We observe a clear trend between the models' performance and their relative drop, which suggests that better-performing models use their attention mechanism more.

Further Analysis
We present an analysis of PAPA, to better understand its properties. We first discuss the patterns of the constant matrices produced by PAPA (Sec. 5.1). Next, we consider other alternatives for generating constant matrices (Sec. 5.2); we then examine whether the constant matrices are data-dependent (Sec. 5.3); we continue by exploring alternative methods for selecting which attention heads to replace (Sec. 5.4). Finally, we present MLM results, and discuss the challenges in interpreting them (Sec. 5.5). In all experiments below, we use RoBERTa BASE . RoBERTa LARGE experiments show very similar trends; see App. C.

Patterns of the Constant Matrices
We first explore the attention patterns captured by different heads by observing the constant matrices (C^h). We first notice a diagonal pattern, in which each token mostly attends to itself or to its neighboring words. This pattern is observed in about 90% of the constant matrices produced by PAPA. Second, about 40% of the heads put most of their weight mass on the [CLS] and/or [SEP] tokens (perhaps in combination with the diagonal pattern described above). Lastly, while for some heads the weight mass is concentrated in a single entry per row (corresponding to a single token), in most cases it is distributed over several entries (corresponding to several different tokens). These patterns are similar to those identified by Clark et al. (2019), and partly explain our findings: many of the attention heads mostly focus on fixed patterns that can also be captured by a constant matrix. Fig. 4 shows three representative attention heads that illustrate the patterns above.

Alternative Constant Matrices
PAPA replaces the attention matrices with constant ones. As described in Sec. 3.1, this procedure requires only an unlabeled corpus. In this section, we compare this choice with constant matrices that are constructed without any data (data-free matrices), and with those that require labeled data for construction (labeled matrices).
For the former we consider three types of matrices: (1) Identity matrix, in which each token 'attends' only to itself, essentially turning self-attention into a regular feed-forward layer (each token is processed separately); (2) Toeplitz matrix, a simple Toeplitz matrix (as suggested by Liu et al., 2021), where the weight mass is on the current token and decreases as the attended token gets further from the current one (the entries of the matrix are based on the harmonic series); (3) Zeros matrix, which essentially prunes the heads.
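The three data-free matrices can be sketched as follows (the exact harmonic-series Toeplitz construction used in the paper may differ in its normalization; this is one plausible variant):

```python
import numpy as np

def identity_matrix(n):
    """(1) Each token 'attends' only to itself."""
    return np.eye(n)

def harmonic_toeplitz(n):
    """(2) Weight mass on the current token, decaying with distance
    following the harmonic series; rows normalized to sum to 1."""
    idx = np.arange(n)
    T = 1.0 / (np.abs(idx[:, None] - idx[None, :]) + 1)   # 1, 1/2, 1/3, ...
    return T / T.sum(axis=1, keepdims=True)

def zeros_matrix(n):
    """(3) All-zeros matrix, equivalent to pruning the head."""
    return np.zeros((n, n))
```

All three are valid drop-in replacements for A^h since each row of (1) and (2) is a distribution over positions, while (3) simply zeroes out the head's contribution.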
We also consider two types of labeled matrices: (4) initialized as the Toeplitz matrices from (2); and (5) initialized as our average matrices. These matrices are updated during the training procedure of the probing classifier. Tab. 1 shows the performance of each resulting attention-free model on all downstream tasks. We observe that for all tasks, our average-based model outperforms all other data-free models by a notable margin. As for the labeled-matrices models, our model also outperforms the one initialized with Toeplitz matrices (4), and in most cases achieves similar results to the model initialized with average matrices (5). It should be noted that the original models (with regular attention) do not update their inner parameters in the probing training phase, which makes the comparison to the labeled-matrices models somewhat unfair. The above suggests that our choice of constant matrix replacement better estimates the performance of the attention-free PLMs.

Are the Constant Matrices
Data-Dependent?
PAPA constructs the constant matrix for a given head, C^h, as the average of the model's attention matrices over a given corpus D, which in our experiments is set to be the training set of the task at hand (labels are not used). Here we examine the importance of this experimental choice by generating C^h using a different dataset: the MNLI training set, which is out of distribution for the other tasks. Results are presented in Tab. 2. The performance across all tasks is remarkably similar whether the matrices are generated with the specific task training set or with MNLI, which suggests that the constant matrices might be somewhat data-independent.

Alternative Head Selection Methods
We compare our method for selecting which heads to replace (Sec. 3.2) with a few alternatives. The first two replace the heads in layer order: (1) we sort the heads from the model's first layer to the last, and (2) from the model's last layer to the first. In both cases we use the internal head ordering per layer for ordering within the layer. We then replace the first k% of the heads. We also add (3) a random baseline that randomly replaces k% of the heads, and (4) a 'Reversed' baseline which replaces the heads with the highest (rather than lowest) λ_h values (Sec. 3.2).
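The head-ordering baselines can be sketched as follows (illustrative code; `lambdas` is assumed to be a flat list of learned λ_h values in layer-major order, and the within-layer ordering of the "last-to-first" variant is simplified here):

```python
import numpy as np

def replacement_order(num_layers, heads_per_layer, lambdas, method, seed=0):
    """Order in which (layer, head) pairs are replaced, for each baseline."""
    heads = [(l, h) for l in range(num_layers) for h in range(heads_per_layer)]
    if method == "first_to_last":             # (1) first layer's heads first
        return heads
    if method == "last_to_first":             # (2) last layer's heads first
        return heads[::-1]
    if method == "random":                    # (3) random baseline
        rng = np.random.default_rng(seed)
        return [heads[i] for i in rng.permutation(len(heads))]
    if method == "learned":                   # ours: smallest lambda first
        return [heads[i] for i in np.argsort(lambdas)]
    if method == "reversed":                  # (4) largest lambda first
        return [heads[i] for i in np.argsort(lambdas)[::-1]]
    raise ValueError(f"unknown method: {method}")
```

Replacing the first k% of heads under a given ordering then reduces to taking a prefix of the returned list.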
Fig. 5 shows the MNLI performance of each method as a function of the fraction of heads replaced. We observe that our method, which is based on a learned estimation of attention importance, outperforms all other methods for every fraction of heads replaced. Moreover, the 'Reversed' method is the worst among the examined methods, which suggests that our method not only replaces the least attention-dependent heads first, but also replaces the most dependent ones last. Although our head replacement order outperforms the above methods, we note that it still overestimates the model's attention dependency, and better selection methods might show that even less attention is needed.

Effects on MLM Perplexity
So far we have shown that applying PAPA on downstream tasks incurs only a moderate accuracy drop. This section explores its impact on masked language modeling (MLM). We find that while our models suffer a larger performance drop on this task compared to the other tasks, this can be explained by their pretraining procedure.
Fig. 6a plots the negative log perplexity (higher is better) of all BASE models on the WikiText-103 (Merity et al., 2017) validation set. When replacing attention matrices using PAPA, MLM suffers a larger performance drop compared to the downstream tasks (Sec. 4.2). We hypothesize that this is because these pretrained Transformers are more specialized in MLM, the task they are pretrained on. As a result, they are less able to adapt to architectural changes in MLM than in downstream tasks. To test our hypothesis, we probe ELECTRA BASE (Clark et al., 2020) using PAPA. ELECTRA is an established pretrained Transformer trained with the replaced-token-detection objective instead of MLM, and has proven successful on a variety of downstream tasks. ELECTRA BASE 's probing performance on MLM supports our hypothesis. We first note that its original performance is much worse than that of the other models (-3.51 compared to around -2 for the MLM-based models), despite similar performance on downstream tasks (Fig. 6b), which hints that this model is much less adapted to MLM. Moreover, its drop when gradually removing heads is more modest (a 0.44 drop compared to 1-1.5 for the other models), and looks more similar to ELECTRA BASE 's probing performance on MNLI (Fig. 6b). Our results suggest a potential explanation for the fact that some pretrained Transformers suffer a larger performance drop on MLM than on downstream tasks: rather than MLM demanding higher attention use, it is likely that these models are pretrained with the MLM objective.

Related Work
Attention alternatives Various efforts have been made in search of a simple or efficient alternative to the attention mechanism. Some works focused on building a Transformer variant based on an efficient approximation of the attention mechanism (Kitaev et al., 2020; Wang et al., 2020; Peng et al., 2020a; Choromanski et al., 2021; Schlag et al., 2021; Qin et al., 2022). Another line of research, more closely related to our work, replaced the attention mechanism in Transformers with a constant (and efficient) one. For instance, FNet (Lee-Thorp et al., 2021) replaced the attention matrix with the Vandermonde matrix, while gMLP (Liu et al., 2021) and FLASH (Hua et al., 2022) replaced it with a learned matrix. These works showed that pretraining attention-free LMs can lead to competitive performance. Our work shows that PLMs trained with attention can achieve competitive performance even if they are denied access to this mechanism during transfer learning.

Figure 6: ELECTRA BASE compared with the other BASE models on MLM and MNLI. In Fig. 6a, ELECTRA BASE behaves similarly to its behavior on MNLI, but differently from the other models, which are MLM-based. In Fig. 6b, ELECTRA BASE behaves similarly to the other models. In both graphs the x-axis represents the fraction of input-dependent attention heads, and the y-axis is the score on the specific task (higher is better).

Pruning methods In this work we replaced the attention matrix with a constant one in order to measure the importance of the input-dependent ability. Works like Michel et al. (2019) and Li et al. (2021) pruned attention heads in order to measure their importance for the task examined. These works find that for some tasks, only a small number of unpruned attention heads is sufficient, and thus relate to the question of how much attention a PLM uses. In this work we argue that replacing attention matrices with constant ones provides a more accurate answer to this question compared to pruning these matrices, and propose PAPA, a method for constructing such constant matrices.

Conclusion
In this work, we found that PLMs are not as dependent on their attention mechanism as previously thought. To do so, we presented PAPA, a method for analyzing the attention usage in PLMs. We applied PAPA to several widely-used PLMs and six downstream tasks. Our results show that replacing all of the attention matrices with constant ones achieves performance competitive with the original model, and that half of the attention matrices can be replaced without any loss in performance. We also show a clear relation between a PLM's aggregate performance across tasks and its degradation when replacing all attention matrices with constant ones, which hints that performant models make better use of their attention.
Our results motivate further work on novel Transformer architectures with more efficient attention mechanisms, both for pretraining and for knowledge distillation of existing PLMs.They also motivate the development of Transformer variants that improve performance by making better use of the attention mechanism.

Limitations
This work provides an analysis of the attention mechanism in PLMs. Our PAPA method is based on probing rather than finetuning, even though finetuning is the more common way of using PLMs. We recognize that the attention mechanism in a finetuned PLM might act differently from that of the original model, but our main focus is investigating the PLM itself, rather than its finetuned version.
Our analysis method is built on replacing the attention matrices with constant ones (Sec. 3.1). We build these constant matrices by averaging the attention matrices over a given dataset. Because of this choice, our results reflect a lower bound on the results of the optimal attention-free model, and we acknowledge that there might be methods for constructing the constant matrices that would lead to even smaller gaps from the original model. A similar argument applies to our head selection method (Sec. 3.2). Importantly, better methods for these sub-tasks might further reduce the gap between the original models and the attention-free ones, which would only strengthen our argument.
Finally, we note that we used the PAPA method with six English tasks, and recognize that results might be different for other tasks and other languages.

A Pre-Processing
To make the replacement of the attention matrix with a constant one reasonable, we fix the position of the [SEP] token so that it is always the last token of the model's input sequence, i.e., it comes after the padding tokens rather than before them (where it would separate the last input token from the padding tokens).
For tasks with two sequences per example (e.g., MNLI), which are typically separated by an additional [SEP] token, we fix this token to always be the middle token of the sequence, followed by the second sentence. We recognize that this might lead to suboptimal usage of the input's sequence length: if one of the sentences is substantially longer than the other, and particularly if it is longer than half of the sequence length, it would be trimmed. In our experiments this happened in less than 0.2% of input samples, and only for a single task (MNLI), but we recognize that it might happen more frequently in other datasets.

B Hyperparameters
All of our code was implemented with the Transformers library (Wolf et al., 2020). Hyperparameters for the probing classifier on downstream tasks are shown in Tab. 3.

Figure 1: Illustration of the PAPA method, which measures how much PLMs use the attention mechanism. PAPA replaces the input-dependent attention matrices (left) with constant ones (right). We then measure the performance gap between the two. A moderate drop indicates minor reliance on the attention mechanism.

Figure 2: Probing results (y-axis) with a decreasing number of attention heads (x-axis). BASE models are shown in Fig. 2a, and LARGE models are shown in Fig. 2b. Higher is better in all cases.

Figure 3: The relation between each model's original performance (averaged across tasks) and its relative performance drop when all attention heads are replaced.

Figure 4: Constant matrices C^h generated by the PAPA method for representative heads (layer, head). These matrices are used in the attention-free variant of RoBERTa BASE on the SST-2 task.

Figure 5: Comparison between different head selection methods on MNLI. Our method outperforms all other alternatives. The x-axis represents the fraction of input-dependent attention heads.
Probing attention patterns Some investigations of how attention patterns in Transformers work use probing techniques. Clark et al. (2019), Ravishankar et al. (2021), and Htut et al. (2019) studied the attention behavior in BERT. Unlike the above works, which focus only on the attention patterns of the PLM, our work sheds light on the dependence of PLMs on their attention mechanism.

Table 1: Probing performance of RoBERTa BASE with different constant matrix types as replacements for the input-dependent attention matrix. Bold numbers indicate the best constant model for each task. Our approach, based on an average of multiple attention matrices, outperforms all other data-free matrix types across all tasks, and achieves similar results to the best labeled-data based model. In all tasks higher is better.

Table 2: Comparison of probing performance of RoBERTa BASE between two setups for constructing the averaged constant matrices C^h: Per-Task uses the task training set, while MNLI uses constant matrices generated with the MNLI dataset. The results are similar between the two setups, which indicates a low dependence of the constant matrices on the dataset used to construct them.