RealFormer: Transformer Likes Residual Attention

Transformer is the backbone of modern NLP models. In this paper, we propose RealFormer, a simple and generic technique to create Residual Attention Layer Transformer networks that significantly outperform the canonical Transformer and its variants (BERT, ETC, etc.) on a wide spectrum of tasks including Masked Language Modeling, GLUE, SQuAD, Neural Machine Translation, WikiHop, HotpotQA, Natural Questions, and OpenKP. We also observe empirically that RealFormer stabilizes training and leads to models with sparser attention. Source code and pre-trained checkpoints for RealFormer can be found at https://github.com/google-research/google-research/tree/master/realformer.


Introduction
Transformer (Vaswani et al., 2017) architectures are the backbone of numerous state-of-the-art NLP models such as BERT (Devlin et al., 2019), GPT (Radford et al., 2019), and Meena (Adiwardana et al., 2020), and have seen wide success across both academia and industry. Typically, a Transformer network consists of a stack of residual layers. The original design follows a "Post-LN" structure which adds Layer Norm (LN) as a "post-processing" step for each sub-layer, as shown in Figure 1 (a). It has been adopted by various state-of-the-art models including BERT, XLNet (Yang et al., 2019), RoBERTa (Liu et al., 2019), ALBERT (Lan et al., 2019), Transformer-XL (Dai et al., 2019), and ETC (Ainslie et al., 2020). Another notable design reorganizes the order of modules to create a "direct"/clean path that propagates the embeddings of input tokens through the whole network, as shown in Figure 1 (b). This design adds LN as a "pre-processing" step for each sub-layer, is often referred to as "Pre-LN", and is used by some well-known extra-large models such as GPT-2 (Radford et al., 2019) and Megatron (Shoeybi et al., 2019). In some respects, Post-LN and Pre-LN are analogous to ResNet v1 (He et al., 2016a) and ResNet v2 (He et al., 2016b) respectively in the Computer Vision literature. Although ResNet v2 is usually preferable to v1 for Computer Vision, this does not appear to be the case for the Pre-LN Transformer in the NLP literature. It is likely that the particularities of self-attention modules and Transformer architectures favor (at least slightly) different designs than traditional convolutional neural networks.
In this paper, we propose a simple and generic technique to show that it is beneficial to create a "direct" path to propagate raw attention scores through Transformer-based networks. Our technique is called Residual Attention Layer Transformer, or RealFormer for short. We also use RealFormer to denote the resulting Transformer networks whenever no confusion may arise. Without loss of generality, taking the standard Transformer encoder as an example, each RealFormer layer takes the raw attention scores of all attention heads from the previous layer and adds "residual scores" (computed in the same way as attention scores in regular Transformers) on top, as shown in Figure 1 (c). The sum of the two scores is then used to compute attention probabilities via softmax.
In other words, RealFormer can be seen as adding simple skip connections to a backbone Transformer. Since it does not add expensive multiplication ops, performance is expected to be comparable. Note that our technique can also be applied straightforwardly to different Transformer variations and even when decoders are involved.

[Figure 1: (a) Post-LN layer used by (e.g.) BERT; (b) Pre-LN layer used by (e.g.) GPT-2 that creates a "direct" path to propagate token embeddings; (c) our RealFormer layer that creates a "direct" path to propagate attention scores (by adding a simple skip edge on top of (a)). Here we show a Transformer encoder for demonstration purposes only; RealFormer can be applied straightforwardly to different Transformer variations (e.g., when decoders are involved).]
Specifically, our main contributions include:
• We present RealFormer, a simple, generic, and cheap technique to improve Transformer-based networks. It adds no parameters or hyper-parameters, and usually takes no more than a few lines of code changes to implement.
• We show that RealFormer can be used as a drop-in replacement of Transformer in BERT, outperforming both Post-LN and Pre-LN Transformers across a wide spectrum of model sizes for pre-training. In terms of fine-tuning, it achieves competitive downstream results even when pre-trained with only half the number of epochs of the baselines.
• We further demonstrate the genericity of RealFormer by using it as a drop-in replacement in two recent state-of-the-art Transformer variants: ADMIN (Liu et al., 2020) from the Neural Machine Translation (NMT) domain, and ETC (Ainslie et al., 2020), which extends Transformer to handle long and structured inputs. We show that RealFormer can improve these models significantly on various tasks and lead to new state-of-the-art results.
• Qualitatively, we observe that attention in RealFormer tends to be sparser and more correlated across layers compared to baselines, which we believe may have some regularization effects that could stabilize training and benefit fine-tuning.

Related Work
Vaswani et al. (2017) proposed the Transformer initially for NMT, and it has profoundly changed the NLP field ever since. Radford et al. (2018) demonstrated that generative pre-training of a Transformer-based language model (GPT) on a diverse corpus of unlabeled text can give large gains to downstream NLP tasks that suffer from scarce labeled data. Following this thread, Devlin et al. (2019) proposed to pre-train a bidirectional Transformer encoder (BERT) with a novel Masked Language Modeling objective. Since then, advances on many NLP tasks have been dominated by the paradigm of self-supervised general-purpose pre-training followed by task-specific fine-tuning. Following BERT, a large stream of work has explored better self-supervision objectives (e.g., Yang et al. (2019); Clark et al. (2020)), larger pre-training data and better hyper-parameters (e.g., Liu et al. (2019)), model parameter sharing (e.g., Lan et al. (2019)), and multi-task pre-training (e.g., Sun et al. (2020); Raffel et al. (2020)). These efforts typically employ a Post-LN Transformer at their core. In this paper we adopt BERT to test different Transformer architectures because it is widely used and representative of this body of work.
Another notable thread of work focuses on improving the efficiency/scalability of Transformer. Typically, these methods try to reduce the quadratic complexity of the self-attention mechanism with respect to sequence length via low-rank approximations, fixed strided attention patterns (e.g., Child et al. (2019)), learnable attention patterns (e.g., Kitaev et al. (2020); Roy et al. (2020)), memory-based global & local attention (e.g., Beltagy et al. (2020); Zaheer et al. (2020)), and so on. These methods are particularly useful when dealing with long documents that go beyond the capacity of standard Transformer models. We refer the reader to Tay et al. (2020) for a detailed survey. RealFormer is orthogonal to these methods: it focuses on improving various Transformer networks with a universal technique that can be applied to these models as well. In this paper, we will use RealFormer to improve a state-of-the-art model from this line of work, ETC, to demonstrate its universality. A third line of work has studied normalization and parameter initialization schemes for Transformers, though most evaluations focus only on NMT to the best of our knowledge. In this strand, Liu et al. (2020) recently proposed ADMIN, which achieved state-of-the-art results on multiple popular NMT benchmarks. In this paper, we will take ADMIN as an example to (1) evaluate RealFormer in settings involving decoders, and (2) show that it is possible to apply RealFormer on top of this line of work.

Standard Transformer
There is an encoder and a decoder in Transformer (Vaswani et al., 2017). Since they work in a similar way, here we only introduce the encoder and refer the reader to the original paper for complete details.
There are two sub-layers inside each layer of a Transformer encoder. The first sub-layer contains a Multi-Head Attention module that computes output embeddings for a set of queries ($Q$) by aggregating the embeddings ($V$) of a set of keys ($K$):

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O, \qquad \mathrm{head}_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V),$$

where $W_i^Q$, $W_i^K$, and $W_i^V$ are matrices that linearly project queries, keys, and values into the "attention space" of the $i$-th head, and $W^O$ is a matrix that linearly transforms the concatenation of the outputs of all heads.
The attention function is typically implemented with a Scaled Dot-Product Attention module (Vaswani et al., 2017) which computes a weighted sum of the values:

$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{Q K^T}{\sqrt{d_k}}\right) V, \qquad (1)$$

where $\frac{Q K^T}{\sqrt{d_k}}$ contains the raw attention scores for each (query, key) pair. These scores are normalized via the Softmax function for each query and then act as weights for the corresponding vectors in $V$.
The second sub-layer contains a fully-connected Feed-Forward Network (FFN) module with one hidden layer:

$$\mathrm{FFN}(x) = \sigma(x W_1 + b_1)\, W_2 + b_2,$$

where $\sigma$ is an activation function, usually ReLU or GELU (e.g., Devlin et al. (2019)). FFN is applied to each position in the sequence separately and identically. Finally, Layer Norm (LN) modules are inserted into the above two sub-layers to stabilize training.
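To make the above concrete, here is a minimal NumPy sketch of these two sub-layers. This is our own illustrative code, not the official implementation; function and variable names (e.g., scaled_dot_product_attention, Wq, Wk, Wv, Wo) are ours, and residual connections and Layer Norm are omitted for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Softmax(Q K^T / sqrt(d_k)) V for one head; Q, K, V are 2-D arrays."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # raw attention scores, shape (from_len, to_len)
    return softmax(scores) @ V               # weighted sum of the values

def multi_head_attention(Q, K, V, Wq, Wk, Wv, Wo):
    """Project into each head's attention space, attend, concatenate, and mix with Wo."""
    heads = [scaled_dot_product_attention(Q @ wq, K @ wk, V @ wv)
             for wq, wk, wv in zip(Wq, Wk, Wv)]
    return np.concatenate(heads, axis=-1) @ Wo

def ffn(x, W1, b1, W2, b2):
    """Position-wise feed-forward network with a ReLU activation."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2
```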
As shown in Figure 1, there are two canonical designs of the Transformer network which only differ in the ways they organize the modules. Post-LN is the original architecture proposed by Vaswani et al. (2017) which normalizes the outputs at the end of each sub-layer. In contrast, Pre-LN normalizes sub-layer inputs instead and creates a direct path (without LN in the way) to propagate embeddings of the tokens in the sequence.

Residual Attention Layer Transformer
RealFormer uses a Post-LN style Transformer as backbone and adds skip edges to connect Multi-Head Attention modules in adjacent layers, as shown in Figure 1 (c).
More formally, it adds $Prev$, the pre-softmax attention scores from the previous layer with shape (#heads, from_seq_len, to_seq_len) (the batch dimension is omitted for ease of discussion), as one additional input to the Multi-Head Attention module in the current layer:

$$\mathrm{ResidualMultiHead}(Q, K, V, Prev) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O, \qquad \mathrm{head}_i = \mathrm{ResidualAttention}(Q W_i^Q, K W_i^K, V W_i^V, Prev_i),$$

where $Prev_i$ is the slice of $Prev$ with shape (from_seq_len, to_seq_len) corresponding to $\mathrm{head}_i$. ResidualAttention adds "residual scores" on top of $Prev_i$ and then computes the weighted sum as usual:

$$\mathrm{ResidualAttention}(Q, K, V, Prev) = \mathrm{Softmax}\!\left(\frac{Q K^T}{\sqrt{d_k}} + Prev\right) V.$$

Finally, the new attention scores $\frac{Q K^T}{\sqrt{d_k}} + Prev$ are passed on to the next layer.
Implementing RealFormer takes no more than adding a few lines of code to the backbone Transformer. Note that the RealFormer technique can be straightforwardly applied to Transformer variations, even when there is more than one type of attention module in the network. For example, machine translation involves encoder self-attention, encoder-decoder attention, and decoder self-attention modules. In such cases, RealFormer simply adds skip edges to create multiple direct paths, one for each type of attention module.
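As a minimal sketch of how the residual attention edge might be implemented (our own illustrative NumPy code, not the released implementation; names such as residual_attention_head and prev_scores are assumptions):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def residual_attention_head(Q, K, V, prev_scores=None):
    """One RealFormer head: add this layer's raw scores to the scores coming
    from the previous layer, softmax the sum, and also return the summed
    scores so the next layer can do the same."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # this layer's "residual scores"
    if prev_scores is not None:          # skip edge over pre-softmax attention scores
        scores = scores + prev_scores
    probs = softmax(scores)              # attention probabilities
    return probs @ V, scores             # (output embeddings, scores passed to next layer)

# Stacking layers then amounts to threading `scores` through the stack:
#   out, scores = residual_attention_head(Q1, K1, V1)          # layer 1
#   out, scores = residual_attention_head(Q2, K2, V2, scores)  # layer 2, and so on.
```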
Discussion. Adding skip edges is equivalent to taking a softmax over the running sum of the attention scores (to get attention probabilities). This might be sub-optimal for very deep networks because the sum grows linearly with depth. Empirically, we find it helpful to use a running mean instead in such cases, which can be viewed as adding a temperature (i.e., the number of traversed layers) to the softmax function in Eq. 1 of each RealFormer layer.
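As a worked illustration in our own notation (not taken verbatim from the paper): let $S_\ell$ denote the raw attention scores produced by layer $\ell$ for a given head. At layer $L$, the two variants compute

$$\text{running sum:}\quad \mathrm{Softmax}\Big(\sum_{\ell=1}^{L} S_\ell\Big), \qquad \text{running mean:}\quad \mathrm{Softmax}\Big(\frac{1}{L}\sum_{\ell=1}^{L} S_\ell\Big),$$

so the running mean is the same accumulated score matrix fed through a softmax with temperature $L$, which keeps the magnitude of the logits from growing linearly with depth.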

Experiments
To demonstrate that RealFormer is general-purpose, we conduct comprehensive empirical studies on a variety of tasks including (masked) language modeling, machine translation, and long document modeling, based on corresponding state-of-the-art models: BERT, ADMIN, and ETC. To evaluate its robustness, we only do minimal (if any) hyper-parameter tuning for RealFormer and initialize all parameters in the same way as the backbone Transformers. More aggressive hyper-parameter tuning or better initialization might further improve RealFormer, though we leave this for future work. Details of our experiments are included in the Appendix.

BERT
BERT (Devlin et al., 2019) has been the standard way of transferring knowledge from large unlabeled text corpora by pre-training a bidirectional Transformer encoder. Numerous downstream NLP tasks suffering from scarcity of supervised data have benefited considerably by fine-tuning a pretrained BERT model. This drives us to adopt BERT as the main evaluation setup for RealFormer.

Experiment setup.
Our experiments are based on the official BERT repository. We follow the standard pre-training setup (dataset: Wikipedia + BookCorpus, vocab: uncased 30K, max sequence length: 512, dropout: 10%, learning rate: 1e-4, learning rate schedule: warm up and then linearly decay to 0, weight decay: 0.01, optimizer: AdamW, objective: Masked Language Modeling + Next Sentence Prediction, etc.) to compare three Transformer models: Post-LN, Pre-LN, and RealFormer. We experiment with Transformer architectures with a wide spectrum of sizes as detailed in Table 1. For simplicity, all models are pre-trained for 1M steps with a mini-batch size of 512 (except that xLarge uses 256 to avoid TPU OOM). Note that we use a larger mini-batch size than Devlin et al. (2019), i.e., doubling the amount of pre-training epochs, to show more complete behavior of different models.
We use exactly the same setup for all three Transformer architectures, except that for the Pre-LN Transformer we follow the initialization strategy suggested by Radford et al. (2019).

Pre-training Results
To evaluate pre-trained models, we report Masked Language Modeling (MLM) accuracy on a randomly held-out development set. As shown in Table 2, RealFormer outperforms the two baseline Transformers considerably, with the gap increasing with model size. Our hypothesis is that larger models are inherently harder to train (e.g., we observe that BERT with Post-LN is unstable and sometimes even diverges for xLarge) and RealFormer can help regularize the model and stabilize training.
We also report the pre-training curves in Figure 2. One interesting finding is that the Pre-LN Transformer seems to favor the combination of extra large models and a small number of steps, though it is consistently outperformed by the other two in "regular-sized" settings or given enough pre-training budget.

Downstream Results
To evaluate downstream performance, we fine-tune the above pre-trained BERT-Large models on both sentence-level (i.e., GLUE) and token-level (i.e., SQuAD) NLP tasks.
GLUE. General Language Understanding Evaluation (GLUE) is a canonical benchmark proposed by Wang et al. (2019a) for evaluating models across a diverse set of NLU tasks. Following the fine-tuning recipe in Devlin et al. (2019), we use a mini-batch size of 32 for all models on all tasks. For each (task, model) pair, we select the number of fine-tuning epochs from {2, 3, 4} and the learning rate from {6e-6, 8e-6, 1e-5, 2e-5, 3e-5, 4e-5, 5e-5}. For each setup, we run the experiment five times and report the best median performance and the corresponding standard deviation on the development set.
Results are tabulated in Table 3. We exclude the problematic WNLI task following Devlin et al. (2019). For each task, we report the metric(s) suggested by Wang et al. (2019a). RealFormer achieves the best overall performance and outperforms both baselines on most tasks, attesting to its strength on sentence-level tasks.
SQuAD. The Stanford Question Answering Dataset (SQuAD v1.1) is a reading comprehension dataset consisting of 100K crowd-sourced question-answer pairs, where the answer to each question is a span of text in the corresponding passage; SQuAD v2.0 additionally includes questions that cannot be answered from the passage. We fine-tune all three Transformer models on these two datasets without using any additional data such as TriviaQA (Joshi et al., 2017). For both v1.1 and v2.0, we select the mini-batch size from {32, 48}, the number of fine-tuning epochs from {2, 3, 4}, and the learning rate from {2e-5, 3e-5, 4e-5, 5e-5}. For each setup, we run the experiment five times and report the best median performance and the corresponding standard deviation on the development set. As we can see from Table 4, RealFormer outperforms the two baselines considerably, attesting to its strength on token-level tasks.

Research Questions
How well does RealFormer perform with half the pre-training budget? Although RealFormer outperforms both Post-LN and Pre-LN considerably when pre-training for 1M steps, we are also interested in its potential when the pre-training budget is more limited. For this purpose, we experiment with BERT-Large models. In particular, we take the 500K-step checkpoint of the pre-trained RealFormer in Table 2 and fine-tune it on GLUE and SQuAD using exactly the same procedure as described above. Comparison results against the strongest baseline, the Post-LN Transformer pre-trained for 500K (checkpoint) and 1M steps respectively, are collected in Table 5. We can see that RealFormer with merely half the amount of pre-training can beat Post-LN (1M) on GLUE by a significant margin, and almost match its performance on SQuAD.
Does a larger learning rate help? As suggested by some recent work (e.g., Xiong et al. (2020)), the Pre-LN Transformer may benefit from using larger learning rates. To this end, we follow the pre-training procedure detailed earlier and switch to a larger learning rate, 2e-4, to pre-train BERT-Large with the three Transformer models. Development set MLM accuracy over training steps can be found in Figure 3. We find that both Pre-LN and RealFormer can reap some benefits from larger learning rates, with RealFormer seeming to benefit slightly more in this case.

Is attention sparser in RealFormer?
We conduct one empirical study to observe the qualitative differences between RealFormer and Post-/Pre-LN Transformers. We randomly sample 8,192 examples from the held-out development set and visualize the distribution of attention probabilities of each token in these examples across heads in all layers. In particular, for each (token, layer, head) triplet, we compute the entropy of the attention probabilities as the "sparsity measure" of attention. Intuitively, as entropy gets lower, the attention weight distribution becomes more skewed and therefore attention is sparser. In a similar fashion to Ramsauer et al. (2020), we use violin plots to show the entropy distributions of the pre-trained BERT-Base model with RealFormer from Table 2 (see Figure 4). Plots for the two baseline Transformers in Table 2 are included in Appendix A.4. Each row is a layer in BERT-Base and each column is an attention head.
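Concretely, the sparsity measure can be computed as in the following sketch (our own code under assumed array shapes; attn_probs is not a variable from the released implementation):

```python
import numpy as np

def attention_entropy(attn_probs, eps=1e-12):
    """attn_probs: (num_layers, num_heads, from_len, to_len), each row summing to 1.
    Returns the entropy per (layer, head, query token); lower entropy = sparser attention."""
    p = np.clip(attn_probs, eps, 1.0)          # avoid log(0)
    return -(p * np.log(p)).sum(axis=-1)       # shape (num_layers, num_heads, from_len)
```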
We find that attention tends to get sparser in later (upper) layers for all three Transformers. However, RealFormer differs from the two baselines in the following ways:
• RealFormer has significantly sparser attention in the top layers (layers 9-11);
• RealFormer tends to have lower variance across all layers, which means that attention density is less input-dependent.
We hypothesize that the above two properties might be a sign of stability and benefit fine-tuning.

Do attention heads in layer L resemble those in layer L − 1? Since RealFormer uses a residual attention scheme, it is interesting to show to what extent an attention head "relies on" the corresponding head in the previous layer. To this end, we take each of the three pre-trained BERT-Base models in Table 2 and compute the Jensen-Shannon Divergence (JSD) between attention probabilities in each pair of vertically adjacent heads, i.e., $\mathrm{JSD}(\mathrm{head}_i^L, \mathrm{head}_i^{L-1})$, for $1 \le L < 12$ and $0 \le i < 12$.
Appendix A.5 shows detailed JSD distributions for Post-LN and RealFormer based on 8,192 held-out examples. We observe that RealFormer tends to have significantly lower JSD values (i.e., more "similar" attention across layers), especially for heads in middle layers. This might mean that RealFormer has some regularization advantages and provides one hypothesis for why it tends to outperform Post-LN more for larger models. Note that $\mathrm{head}_i^L$ can still be useful even if it has exactly the same attention probabilities as $\mathrm{head}_i^{L-1}$, because of the FFN sub-layer and potential differences in the value matrices (i.e., $V$ in Eq. 1).
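For reference, a minimal sketch of this measurement (our own code and naming; the layout of the attn array is an assumption, not from the released code):

```python
import numpy as np

def jensen_shannon_divergence(p, q, eps=1e-12):
    """JSD between two attention distributions; rows of p and q each sum to 1."""
    p, q = np.clip(p, eps, 1.0), np.clip(q, eps, 1.0)
    m = 0.5 * (p + q)
    kl = lambda a, b: (a * np.log(a / b)).sum(axis=-1)   # KL divergence per row
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Example: attn has shape (num_layers, num_heads, from_len, to_len);
# compare head i of layer L against the same head one layer below.
# per_token_jsd = jensen_shannon_divergence(attn[L, i], attn[L - 1, i])
```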
Is residual attention really necessary? One may wonder whether increasing the dropout rate can already regularize large models well enough to make residual attention redundant. To this end, we experiment with different dropout rates for pre-training BERT-Large with the different Transformers (following the procedures in Section 4.1.1). Results are collected in Table 6, from which we can see that (1) RealFormer outperforms the two baselines across all dropout settings, and (2) simply increasing the dropout rate cannot regularize Transformer models as well as residual attention appears to.

ADMIN
To evaluate the genericity of RealFormer, here we try it on top of ADMIN (Liu et al., 2020), a state-of-the-art NMT model that uses neither additional data nor data augmentation. ADMIN adopts Post-LN as the backbone, which we simply replace with RealFormer. In particular, we add three types of skip edges, for encoder self-attention, encoder-decoder attention, and decoder self-attention respectively, to the Post-LN Transformer. Empirically, RealFormer with a running mean of attention scores tends to outperform the running sum in our experiments, so here we report the former exclusively for brevity.
We use two popular NMT benchmarks, WMT'14 En-De and WMT'14 En-Fr, and follow Liu et al. (2020) for all training setups on both benchmarks except that in all cases (1) we select the peak learning rate from {5e-4, 1e-3, 1.2e-3} and use a linear learning rate decay schedule instead of inverse sqrt (with inverse sqrt decay, we find that RealFormer tends to favor larger peak learning rates than what Liu et al. (2020) use, and we have also seen improvements in most cases); (2) we train RealFormer for only 50 epochs (in contrast, ADMIN trains 100 epochs on En-De and 50 epochs on En-Fr); and (3) we average over the last 25 checkpoints (while ADMIN uses the last 10). Averaging more checkpoints is helpful for us (especially for large models), presumably because the last few are not "diverse" enough as the learning rate decays to 0.
Our experiments are performed on NVIDIA A100 GPUs, based on the official ADMIN repository. We follow Liu et al. (2020) to configure the number of GPUs used for different setups.
BLEU scores on test sets are collected in Table 7. For fair comparison, we also run ADMIN using the setups above and report results in the same table. Following Liu et al. (2020), all networks (including both encoders and decoders) share the same width configuration (hidden size 512, intermediate size 2048, 8 heads) and only vary in depth. RealFormer outperforms all baselines across all depths considerably, with a new state-of-the-art BLEU score (43.97) on En-Fr among models not using additional data or data augmentation, to the best of our knowledge. One interesting observation is that RealFormer does not always lead to larger improvement gaps for larger models, which might be due to the checkpoint averaging mechanism (which potentially regularizes large models reasonably well).


ETC
Extended Transformer Construction (ETC) is a recent sparse attention mechanism proposed by Ainslie et al. (2020) and Zaheer et al. (2020) to handle long context. It has achieved state-of-the-art results on four natural language benchmarks requiring long and/or structured inputs. Here we evaluate RealFormer on top of ETC models on these benchmarks: WikiHop (Welbl et al., 2018), HotpotQA (Yang et al., 2018), Natural Questions (Kwiatkowski et al., 2019), and OpenKP (Xiong et al., 2019). They vary significantly in terms of dataset size, context length, and structure of text inputs. Please refer to Ainslie et al. (2020) for more details.
Our experiments are based on the official ETC repository. We take the ETC-Large model (24 layers, 1024 hidden size, 16 heads), add residual attention edges (i.e., using the running sum), and follow all the pre-training and fine-tuning recipes as well as hardware setups detailed in Ainslie et al. (2020). For each fine-tuning setup, we run the experiment five times and report the best median performance and the corresponding standard deviation on the development set in Table 8.

Fine-tuning. We use 8 TPU v2 cores (i.e., 4 chips) to fine-tune each model. The best hyper-parameter configurations for BERT-Large with RealFormer on GLUE and SQuAD are collected in Table 10. We include RealFormer pre-trained for both 1M and 500K steps, corresponding to the results in Tables 3, 4, and 5.

A.3 Training Details: ETC
All our experiments are conducted on TPU v3 cores based on the official ETC repository in TensorFlow: https://github.com/google-research/google-research/tree/master/etcmodel.

Pre-training.
As is the case with ETC-Large (Ainslie et al., 2020), we find that pre-training ETC-Large with RealFormer can also benefit sig-