Do Transformer Modifications Transfer Across Implementations and Applications?

The research community has proposed copious modifications to the Transformer architecture since it was introduced over three years ago, relatively few of which have seen widespread adoption. In this paper, we comprehensively evaluate many of these modifications in a shared experimental setting that covers most of the common uses of the Transformer in natural language processing. Surprisingly, we find that most modifications do not meaningfully improve performance. Furthermore, most of the Transformer variants we found beneficial were either developed in the same codebase that we used or are relatively minor changes. We conjecture that performance improvements may strongly depend on implementation details and correspondingly make some recommendations for improving the generality of experimental results.


Introduction
Much of the empirical success of deep learning can be attributed to advances in methods for building and training neural networks. These advances include improved optimizers (Sutskever et al., 2013; Hinton et al., 2012; Kingma and Ba, 2014; Shazeer and Stern, 2018a), regularization schemes (Srivastava et al., 2014; Zhang et al., 2017; Neelakantan et al., 2015), and model architectures (He et al., 2016; Hochreiter and Schmidhuber, 1997; Vaswani et al., 2017). An aspiration underlying much of this work is that an improvement to a particular machine learning pipeline will yield equal-or-better performance on any task that the pipeline is applicable to. For example, residual connections in convolutional networks (He et al., 2016) are designed to ideally improve performance on any task where these models are applicable (image classification, semantic segmentation, etc.). In practice, when proposing a new improvement, it is impossible to test it on every applicable downstream task, so researchers must select a few representative tasks to evaluate it on. However, the proposals that are ultimately adopted by the research community and practitioners tend to be those that reliably improve performance across a wide variety of tasks "in the wild".
The Transformer architecture (Vaswani et al., 2017) is an example of a seminal improvement in the field of deep learning. Currently, the Transformer is the de facto architecture of choice for processing sequential data and is starting to be applied to vision applications (e.g. Dosovitskiy et al. (2020)). Since being introduced three years ago, many modifications to the Transformer architecture have been proposed. However, the most widely-used applications of the Transformer architecture incorporate few of these modifications. Instead, the standard practice is to use a slightly-modified version of the originally-proposed Transformer. One possible explanation for this is that the originally-proposed Transformer architecture was near-perfect, and there wasn't much that could be done to improve it. This is in contrast to, for example, convolutional neural networks, which have continually evolved over the past few decades (e.g. the replacement of pooling with striding (Springenberg et al., 2014), fully-connected layers with convolutional layers (Lin et al., 2013), the addition of normalization (Ioffe and Szegedy, 2015) and residual connections (He et al., 2016), etc.). Another possible explanation is that the modifications proposed to the Transformer do not "generalize" across applications, i.e. the modifications only help in the limited experimental settings considered when the modification was proposed, and/or rely on specific details that are not common across implementations of the Transformer.
The main goal of this paper is to try to determine why most modifications proposed to the Transformer have not seen widespread adoption. To answer this question, we reimplemented and evaluated a wide variety of Transformer variants on a suite of tasks that Transformers are commonly applied to. Our main finding is that many Transformer modifications do not result in improved performance in our experimental setting. Moreover, those variants that did yield better performance tended to be those that were quite small changes and/or were developed in the codebase where we carried out our evaluation. This suggests to us the possibility that Transformer modifications exhibit a surprising lack of generalization across different implementations and tasks.

Modifications
In this section, we enumerate all of the architectural modifications we consider. For a description of the Transformer architecture, refer to appendix D.
Due to space constraints, we are seldom able to thoroughly define each specific modification. Moreover, we limit our study to the encoder-decoder architecture. Please refer to the original source for each modification for additional details.

Normalization
We explored "RMS (root-mean-square) norm" (Zhang and Sennrich, 2019) as an alternative to layer normalization as well as the Rezero (Bachlechner et al., 2020) initialization scheme, including combining Rezero with Layer Norm and RMS Norm. We also explored the Fixup (Zhang et al., 2019) initialization scheme which tries to solve the vanishing/exploding gradient problem by rescaling the initializations.

Depth
We explored the trade-off between the width of the feedforward subblocks (d_ff) and the depth (L). To ensure a fair comparison, we scale d_ff and the number of heads (H) to keep the total number of parameters constant when changing the depth.
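As a rough illustration of how d_ff can be rescaled to hold the parameter count fixed when the depth changes, the sketch below counts only the dominant per-layer weight matrices (attention projections and the feedforward network) and holds the number of heads fixed; the paper also rescales H, and embeddings, biases, and normalization parameters are ignored, so this is an approximation rather than the exact procedure.

```python
def layer_params(d_model, d_ff, heads, d_kv):
    attention = 4 * d_model * heads * d_kv   # Q, K, V, and output projections
    feedforward = 2 * d_model * d_ff         # two dense layers
    return attention + feedforward

def matched_d_ff(base_layers, base_d_ff, new_layers, d_model=768, heads=12, d_kv=64):
    # Solve for d_ff so that new_layers * layer_params(...) matches the baseline budget.
    base_total = base_layers * layer_params(d_model, base_d_ff, heads, d_kv)
    attention = 4 * d_model * heads * d_kv
    return (base_total / new_layers - attention) / (2 * d_model)

# Doubling the depth from 12 to 24 layers leaves a much smaller affordable d_ff
# when heads are held fixed (prints 768.0 for the baseline configuration above).
print(matched_d_ff(base_layers=12, base_d_ff=3072, new_layers=24))
```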

Embeddings
The Transformer model includes multiple weight matrices of shape d_model × d_vocab: one at the input of the encoder, one at the input of the decoder, and one at the output of the decoder. Chung et al. (2021) showed the benefits of untying the embeddings for encoder-only models. We extend the analysis and explore various ways of sharing these parameters: tying only the encoder input and decoder input embeddings, tying only the decoder input and output embeddings, and untying all the embeddings.
In addition, we explored factorizing the embedding matrix into two smaller matrices (Lan et al., 2019). In other words, the d_model × d_vocab embedding matrix is factored into matrices of shape d_model × d_inner and d_inner × d_vocab. We tried both untied and tied decoder embeddings while the encoder and decoder input embeddings are shared.
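A minimal sketch of this factorization, assuming an inner dimension of 128 (the value used in appendix B); the vocabulary size and variable names are illustrative.

```python
import numpy as np

d_vocab, d_model, d_inner = 32_000, 768, 128

# Full embedding matrix: d_vocab x d_model parameters.
full = np.random.randn(d_vocab, d_model)

# Factorized alternative: two much smaller matrices (Lan et al., 2019).
first = np.random.randn(d_vocab, d_inner)
second = np.random.randn(d_inner, d_model)

token_ids = np.array([17, 42, 7])
embedded = first[token_ids] @ second         # (3, d_model)

print(full.size, first.size + second.size)   # ~24.6M vs ~4.2M parameters
```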
The last technique we explored for the embeddings is the "Adaptive input embeddings" of Baevski and Auli (2019). Vocabulary items are clustered based on their frequencies, and clusters containing more frequent items are assigned a larger embedding dimension. The embedding vectors are then projected to the same dimension and concatenated.
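A simplified sketch of adaptive input embeddings, using the three frequency-ordered cluster sizes from appendix B; the per-cluster embedding dimensions and weight names are illustrative, and details of Baevski and Auli (2019) such as tying with an adaptive softmax are omitted.

```python
import numpy as np

d_model = 768
# Vocabulary sorted by frequency and split into clusters; more frequent clusters
# get wider embeddings (the widths here are illustrative).
cluster_sizes = [2500, 6000, 23628]
cluster_dims = [768, 384, 192]

tables = [np.random.randn(n, d) for n, d in zip(cluster_sizes, cluster_dims)]
projections = [np.random.randn(d, d_model) for d in cluster_dims]
offsets = np.cumsum([0] + cluster_sizes[:-1])

def embed(token_id):
    # Find the cluster the token falls into, look it up, and project to d_model.
    for table, proj, offset, size in zip(tables, projections, offsets, cluster_sizes):
        if token_id < offset + size:
            return table[token_id - offset] @ proj
    raise ValueError("token id out of range")

sequence = [3, 2600, 10000]
embedded = np.stack([embed(t) for t in sequence])   # (3, d_model)
```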

Parameter sharing
We also explored sharing the parameters of the Transformer layers, inspired by the "ALBERT" model of Lan et al. (2020). Each subblock (e.g., self-attention) has a single set of weights that is shared across all L layers. Following Lan et al. (2020), we also factorized the embeddings (denoted as "Factorized embeddings") in addition to the parameter sharing. Note that these models have untied softmax and vocabulary embeddings in the decoder; we also tried tying them (denoted as "Shared embeddings"). Finally, we experimented with applying the parameter sharing to the encoder and decoder separately.

Softmax
Our work considers variations to the softmax computation that produces the final probability distribution from the last-layer representations. Adaptive softmax (Joulin et al., 2017) uses the natural imbalance in word distributions (Zipf, 1949) to form clusters in a hierarchical model, which reduces computation time. In the original implementation, each cluster is permitted to have a different capacity, and the size of the representations for rare words is reduced via a projection matrix. We consider the original variant as well as a version that ablates the projection operation. Mixture of Softmaxes (MoS) (Yang et al., 2017) improves the expressiveness of a single softmax operation by instead computing a linear combination over several softmaxes, each weighted by learned coefficients.
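A minimal sketch of Mixture of Softmaxes for a single position, assuming K mixture components and a tanh context projection as in the common formulation of Yang et al. (2017); all weight names are illustrative.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mixture_of_softmaxes(h, W_ctx, W_prior, E):
    # h: (d_model,) final-layer representation for one position.
    # W_ctx: (K, d_model, d_model) per-component context projections.
    # W_prior: (K, d_model) produces the mixture weights.
    # E: (d_vocab, d_model) output embedding matrix.
    prior = softmax(W_prior @ h)                 # (K,) mixture weights
    contexts = np.tanh(W_ctx @ h)                # (K, d_model)
    per_component = softmax(contexts @ E.T)      # (K, d_vocab) one softmax per component
    return prior @ per_component                 # (d_vocab,) final distribution

K, d_model, d_vocab = 15, 8, 20
probs = mixture_of_softmaxes(
    np.random.randn(d_model),
    np.random.randn(K, d_model, d_model),
    np.random.randn(K, d_model),
    np.random.randn(d_vocab, d_model),
)
print(probs.sum())  # ~1.0
```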

Architectures
Transparent Attention One attention variant we experiment with is transparent attention (Bapna et al., 2018), which creates weighted residual connections along the encoder depth to facilitate gradient flow. In appendix A, we experiment with additional attention variants.
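A sketch of these weighted residual connections, assuming one learned scalar per (encoder layer, decoder layer) pair that is softmax-normalized over encoder depth; this simplification omits other details of Bapna et al. (2018).

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

L_enc, L_dec, T, d_model = 6, 6, 10, 16

# Outputs of every encoder layer (including the embeddings as "layer 0").
encoder_states = np.random.randn(L_enc + 1, T, d_model)

# One learnable scalar per (encoder layer, decoder layer) pair.
W = np.random.randn(L_enc + 1, L_dec)

# Each decoder layer attends to its own weighted combination of all encoder
# layers, rather than only the top one, which shortens gradient paths.
memories = []
for j in range(L_dec):
    weights = softmax(W[:, j])                          # (L_enc + 1,)
    memory = np.tensordot(weights, encoder_states, 1)   # (T, d_model)
    memories.append(memory)
```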

Evolved Transformer
The Evolved Transformer (So et al., 2019) was designed via evolution-based architecture search (Real et al., 2019) in which the initial population was seeded with the original Transformer. The search space generalizes the one used for NASNet (Zoph et al., 2018) but is extended so that it can represent the Transformer.

Synthesizer variants
We explore the factorized, dense, and random Synthesizer variants from Tay et al. (2020), where self-attention is replaced with "synthetic attention" patterns. We use "plus" to denote variants in which dot-product attention is additively combined with the synthetic attention and "plus alpha" to denote variants in which a scalar α is used to interpolate between synthetic and dot-product attention.
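A sketch of the random Synthesizer variant for a single head, where a directly learned attention matrix R replaces the query-key dot product, together with a "plus alpha"-style interpolation. The exact point at which Tay et al. (2020) combine the two attention patterns (logits vs. weights) is our reading and should be treated as illustrative, as are the variable names.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

T, d_model, d_k = 10, 16, 16
x = np.random.randn(T, d_model)
W_q, W_k, W_v = (np.random.randn(d_model, d_k) for _ in range(3))

# Random Synthesizer: the attention pattern is a learned T x T matrix that does
# not depend on the input content (initialized randomly here).
R = np.random.randn(T, T)
values = x @ W_v
synthetic_out = softmax(R) @ values

# "Plus alpha": interpolate the synthetic and dot-product attention logits with
# a scalar alpha (a learned parameter in the paper, a fixed number here).
alpha = 0.5
dot_logits = (x @ W_q) @ (x @ W_k).T / np.sqrt(d_k)
mixed_out = softmax(alpha * R + (1 - alpha) * dot_logits) @ values
```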
Funnel Transformer The Funnel Transformer progressively reduces the sequence length in order to efficiently encode the input sequence (Dai et al., 2020). We only applied this reduction to the encoder.

Product Key Memory Similar to mixture-of-experts models, product key memory networks (Lample et al., 2019) process inputs adaptively by selecting a sparse set of values. In contrast to experts, the sparse computation is performed not via learned routing but via an efficient k-nearest-neighbor weighted sum.
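A simplified sketch of the memory lookup in a product key memory layer: the query is scored against all keys, the top-k entries are kept, and their values are combined with softmax weights. The actual method factorizes keys into a Cartesian product of two half-keys so that the top-k search is sub-linear; that optimization is omitted here for clarity, and the sizes follow the settings in appendix B.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

d_model, n_slots, k = 16, 512, 32
keys = np.random.randn(n_slots, d_model)     # memory keys
values = np.random.randn(n_slots, d_model)   # memory values

def memory_lookup(query):
    scores = keys @ query                     # (n_slots,) similarity to every key
    top = np.argpartition(scores, -k)[-k:]    # indices of the k best-matching keys
    weights = softmax(scores[top])            # sparse softmax over the selected slots
    return weights @ values[top]              # (d_model,) weighted sum of values

out = memory_lookup(np.random.randn(d_model))
```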
Universal Transformer Similar to block sharing, the Universal Transformer (Dehghani et al., 2018) applies the same Transformer "block" over and over again to the input sequence. However, instead of applying it a fixed number of times, it recurrently refines the representation for each token until a halting mechanism (based on Adaptive Computation Time (Graves, 2016)) is triggered.
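A heavily simplified sketch of this recurrent refinement with a halting mechanism in the spirit of Adaptive Computation Time: the same (stand-in) block is applied repeatedly, and each token stops being updated once its accumulated halting probability crosses a threshold. The actual mechanism (Graves, 2016; Dehghani et al., 2018) also forms a weighted combination of intermediate states and includes a remainder term, both omitted here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

T, d_model, max_steps, threshold = 10, 16, 24, 0.5

x = np.random.randn(T, d_model)
W_block = np.random.randn(d_model, d_model)   # stands in for a full Transformer block
w_halt = np.random.randn(d_model)

state = x.copy()
halted = np.zeros(T, dtype=bool)
cumulative = np.zeros(T)

for _ in range(max_steps):
    # Per-token halting probability computed from the current state.
    p = sigmoid(state @ w_halt)
    cumulative = np.where(halted, cumulative, cumulative + p)
    halted = halted | (cumulative >= threshold)
    # Apply the shared block only to tokens that have not halted yet.
    update = np.tanh(state @ W_block)
    state = np.where(halted[:, None], state, update)
    if halted.all():
        break
```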

Experiments
In order to study the impact of each of the modifications described in section 2, we conduct a systematic study by comparing a baseline model to each modification while holding the task, hyperparameters, optimizer, and either the parameter count or FLOP budget (total floating point operations) constant. We use the original Transformer model as our baseline with two modifications: First, we apply layer normalization before the self-attention and feedforward blocks instead of after. This small change has been unanimously adopted by all current Transformer implementations because it leads to more effective training (Baevski and Auli, 2019; Xiong et al., 2020). Second, we use relative attention with shared biases (as used in Raffel et al. (2019)) instead of sinusoidal positional embeddings, which makes it easier to train the model. Our baseline model is a standard encoder-decoder with 12 layers in both the encoder and decoder. The feedforward network in each layer consists of a dense layer with dimension d_ff = 3072. All attention mechanisms have 12 heads, and the "key" and "value" matrices have a dimension of d_kv = 64. All other sublayers have a dimension of d_model = 768, resulting in 223 million parameters. We refer to this model as the "vanilla Transformer".
We consider two experimental settings for evaluating the performance of each modification: transfer learning based on T5 (Raffel et al., 2019) and supervised machine translation on the WMT'14 English-German translation task.
For transfer learning, we copy the methodology used by the T5 model, proposed in Raffel et al. (2019). For full details of this experimental setup, please refer to Raffel et al. (2019). We pre-train encoder-decoder models in a self-supervised manner using the "span corruption" masked language modeling objective (Taylor, 1953; Fedus et al., 2018; Devlin et al., 2018) on the C4 dataset. We run experiments on version 2.3.1 of the C4 dataset available in TensorFlow Datasets. We pre-train each architecture variant for 524,288 steps with batches of 65,536 tokens. As in T5, we use Adafactor (Shazeer and Stern, 2018b) for optimization and an inverse square root learning rate schedule during pre-training. We use a maximum sequence length of 512 for both the inputs and targets during pre-training. To evaluate the performance of pre-trained models, we compute the perplexity on a held-out portion of the C4 dataset for each pre-trained model, with the expectation that improvements in perplexity will correlate with performance on fine-tuned tasks. To capture the inter-run variance of these models, we run each model 5 times for 65,536 steps (1/8th of the total pre-training steps). We report the mean and standard deviation of the loss (log perplexity) on held-out data across these five runs and also report the final loss at the end of pre-training (524,288 steps). We do not use any regularization during pre-training.
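For concreteness, a sketch of an inverse square root schedule of the kind used during pre-training; the formulation (constant learning rate during warm-up, then 1/sqrt(step) decay) and the 10,000-step warm-up follow T5's published default and are assumptions here, not necessarily the exact schedule used for every variant.

```python
import math

def inverse_sqrt_lr(step, warmup_steps=10_000):
    # Constant during warm-up, then decays as 1 / sqrt(step).
    return 1.0 / math.sqrt(max(step, warmup_steps))

# Learning rate at a few points over the 524,288 pre-training steps.
for step in (1, 10_000, 100_000, 524_288):
    print(step, inverse_sqrt_lr(step))
```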
In the transfer learning setting, after pre-training we fine-tune each model on three different tasks: the SuperGLUE (Wang et al., 2019) natural language understanding meta-benchmark, the XSum (Narayan et al., 2018) abstractive summarization dataset, and the closed-book variant (Roberts et al., 2020) of the WebQuestions (Berant et al., 2013) question-answering task. With these tasks, we hope to capture a broad variety of NLP problems, including language understanding and classification, language generation, and knowledge internalization. For SuperGLUE and XSum, each model is fine-tuned for 262,144 steps. Since the WebQuestions dataset is much smaller, we fine-tune the model for only 30,000 steps. We use a constant learning rate of 0.0005 with a linear warm-up over 20,000 steps. Similar to pre-training, each batch contains 65,536 tokens. We save a checkpoint every 2,500 steps (1,000 steps for WebQuestions) and report results on the model checkpoint corresponding to the highest validation performance. We use a dropout of 0.1 during fine-tuning for all the tasks. All results are reported on the validation split of each dataset. For SuperGLUE, we report the average score across all tasks in the benchmark. We report ROUGE-2 (Lin, 2004) for XSum and accuracy for WebQuestions.
For supervised training on the WMT'14 English to German translation task (Bojar et al., 2014), we use the same model and batch size as in the transfer learning setting. We train for a total of 150,000 steps. We use the same data splits as Vaswani et al. (2017) and report the BLEU score of the highest-scoring checkpoint on the validation set. We use a vocabulary of 37,000 tokens learned by Byte Pair Encoding (Sennrich et al., 2016) for supervised training, as opposed to 32,000 tokens (created using SentencePiece (Kudo and Richardson, 2018)) for the transfer learning experiments.
To compare the efficiency of the models, we also report the total number of parameters, the total number of floating point operations, and the measured steps per second in the pre-training experiments. Reporting these metrics helps illustrate the trade-off between quality and efficiency. For each architectural modification, we attempt to keep either the parameter count or the total number of operations approximately the same as the baseline model to allow a fair comparison.
All hyperparameters are held constant for each architectural variant across pre-training and fine-tuning. However, we found that certain architectural variants (Rezero and Fixup) performed significantly worse than the baseline model (i.e., achieved significantly lower negative log perplexity) with the Adafactor optimizer. Therefore, we use the Adam optimizer (Kingma and Ba, 2014) for these variants. For pre-training, we use an inverse square root learning rate schedule with a linear warm-up of 4,000 steps. For fine-tuning, we use a constant learning rate of 5e-5 with a linear warm-up of 20,000 steps. We provide details of certain modifications in appendix B.
All experiments are run using the T5 library on "slices" of Cloud TPU Pods. All model variants are implemented in the Mesh TensorFlow library (Shazeer et al., 2018).

Results
The results for all model variants are shown in table 1. The vanilla Transformer achieves a SuperGLUE average of 70.97 and a BLEU score of 26.62 on WMT'14 EnDe. This is comparable with the scores achieved by the equivalently-sized T5-Base model (Raffel et al., 2019) and the similarly-sized Transformer-Big from Vaswani et al. (2017), which confirms that our baseline is reasonable. As mentioned earlier, each variant has approximately the same number of parameters or total operations as the vanilla Transformer, with the following exceptions: For the Universal Transformer, the total number of operations is approximately 4× that of the baseline model. Since the Universal Transformer model is already significantly smaller than the baseline model, it would not be fair to shrink the model even further to match the number of operations. Product key memories (Lample et al., 2019) should only slightly increase FLOPs over the vanilla Transformer, but the reported total number of operations is artificially high due to an inefficient implementation in Mesh TensorFlow.
We find that several activation functions improve performance over the ReLU activation. Specifically, SwiGLU and GeGLU improve performance in pre-training, fine-tuning, and supervised training without sacrificing any efficiency in terms of speed. Replacing layer normalization with RMS normalization yields quality improvements while also improving training speed. Our experiments with varying the depth of the model indicate that deeper models tend to outperform shallower ones with a fixed parameter count. However, these deeper models are also more compute-intensive and therefore slower than their shallower counterparts. Sharing parameters across layers tends to hurt performance. Interestingly, untying the encoder/decoder embeddings improves performance with only a modest increase in parameter count. Using a mixture of softmaxes does improve performance but is almost 40% slower than the vanilla Transformer.
Among the different architectures, we find that two of the synthesizer variants are beneficial. Switch Transformer, mixture of experts, and product key memories all improve performance with significantly more parameters than the baseline model. However, these implementations only use a subset of the parameters during each step, so they are roughly equivalent to the vanilla Transformer in total number of operations. Surprisingly, all the other architecture variants generally performed poorly.
Overall, we found that most of the beneficial modifications conferred improvements across pre-training, fine-tuning, and supervised training, though a few variants (e.g. transparent attention, Synthesizer-random, Fixup) harmed performance for transfer learning but not for WMT'14 EnDe. The modifications that led to significant improvements tended to fall into one of three buckets: relatively minor changes (i.e., activation functions, normalization, and untying embedding matrices); those that increase parameter count (i.e., Switch Transformer, product key memory) or are slower (i.e., mixture of softmaxes, deeper models); or those that were originally invented in the Mesh TensorFlow codebase that we use for our experiments (i.e., mixture of experts, Switch Transformer, Synthesizer). To further ensure the correctness of the various architecture modifications, we reached out to the authors of 12 techniques to review our implementations and provide feedback, and received responses from 6 of them. All of the authors who responded confirmed that our re-implementations were correct.

Impact of hyperparameter tuning
It is a well-established fact in deep learning that hyperparameters (and even random seeds (Dodge et al., 2020)) can have a huge impact on model quality. In our experiments, we intentionally kept hyperparameters fixed in order to measure whether a given modification improves performance regardless of hyperparameter settings. Given that this may be an overly idealistic constraint, we present a case study of trying to improve one of the model variants by tuning its hyperparameters. We selected the Universal Transformer (UT) (Dehghani et al., 2018) because it was claimed to achieve better results than the vanilla Transformer, and the UT has a relatively large number of hyperparameters that we can adjust. Using our standard hyperparameters, we obtain a loss of 2.40 after training for 65,536 steps. Bearing in mind that our vanilla Transformer obtains a loss of 2.182 after the same amount of training, our goal was to at least achieve comparable performance using the UT.
To this end, we swept over 25 model configurations, varying the number of recurrent steps and the gating/transition functions in the UT. We also varied non-model-specific hyperparameters, including the learning rate schedule and d_model. Of these 25 configurations, only 2 managed to outperform the initial result. The only settings that worked were the result of reducing the number of recurrent steps (from 16 to 2) and slightly increasing the model size. In the end, we managed to achieve an improvement of 2.40 → 2.265 (or 6% relative). While this is significant, many other hyperparameter settings failed to produce good results, and we were ultimately unable to match the performance of the vanilla Transformer. This exercise illustrates the challenge of tuning these models.

Correlation of perplexity and task performance
In order to understand the relationship between pre-training performance and fine-tuned task quality, we investigate the correlation between perplexity and quality on each task. As shown in fig. 1, quality on all three tasks appears to be correlated with pre-training perplexity, though the correlation is surprisingly weak given past results suggesting a stronger relationship (Adiwardana et al., 2020). Interestingly, the performance on SuperGLUE (Spearman's ρ = 0.87) and XSum (Spearman's ρ = 0.80) is highly correlated with the pre-training perplexity, whereas the performance on WebQuestions (Spearman's ρ = 0.69) has a somewhat lower correlation. This may indicate that classification and generation tasks benefit more from improvements in perplexity than knowledge-intensive tasks like question answering.
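The coefficients above are rank correlations; a sketch of how such a Spearman coefficient can be computed from paired (pre-training loss, task score) measurements using SciPy, on made-up numbers purely for illustration.

```python
from scipy.stats import spearmanr

# Hypothetical (pre-training loss, SuperGLUE score) pairs for a few variants.
pretrain_loss = [2.18, 2.21, 2.25, 2.40, 2.19]
superglue = [71.7, 70.9, 69.5, 66.2, 71.1]

rho, p_value = spearmanr(pretrain_loss, superglue)
print(f"Spearman's rho = {rho:.2f} (p = {p_value:.3f})")
```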

Conjectures and Recommendations
As discussed above, we were surprised to find that so few of the architectural modifications produced improvements in the settings we considered. This largely contrasts with the experiments included in the original papers that proposed each modification. We broadly grouped the modifications that actually did improve performance as either 1) being relatively simple (e.g. a change in activation function), 2) being developed in the same codebase where we ran experiments (e.g. the Synthesizer variants (Tay et al., 2020)), or 3) incurring an increase in parameter count or FLOPs (e.g. the Switch Transformer). Other modifications that don't fit into one of these categories generally didn't improve performance.
There are various possible explanations as to why our results bore out the way they did:

1. The Mesh TensorFlow codebase and implementation are so different from standard practice that most architectural modifications do not work. We believe this is unlikely, given that the Mesh TensorFlow Transformer implementation was created by one of the co-authors of the original Transformer paper and has been used to attain state-of-the-art results.

2. The tasks we consider are non-standard or do not match the set of tasks used to vet the modifications in the first place. However, the Transformer model is used for a variety of NLP problems including classification and generation tasks, and our experiments (transfer learning on SuperGLUE, XSum, and WebQuestions, and supervised training on WMT'14 EnDe) cover the majority of use-cases.

3. Not tuning hyperparameters handicapped other methods. While per-modification tuning might improve results (as verified in section 3.2), we argue that truly useful improvements to the Transformer should be reasonably hyperparameter-agnostic. Further, if hyperparameter sensitivity were the issue, it would be likely that at least a few of the compared methods "got lucky" with the hyperparameters, but very few modifications produced a boost.

4. We implemented many of the modifications incorrectly. To rule out this possibility, we corresponded with many of the creators of the modifications we considered, who confirmed the correctness of our implementations in all cases.

5. Modifications to the Transformer architecture often do not transfer across implementations and applications.
Following the above rationale, we believe the final option is a plausible explanation for our results. This possibility is supported by the fact that few of the modifications we consider in this paper have seen widespread adoption; if they transferred easily across implementations and applications, they would likely have been more widely adopted.
Given this sober take, we conclude our paper with some suggestions as to how to ensure the robustness of improvements for future architectural modifications. First, when proposing a new modification, try it out in multiple completely disparate codebases. Given the proliferation of Transformer implementations, this should be straightforward. Second, apply it to a wide variety of downstream applications, including transfer learning, supervised learning, and language modeling, and possibly also to domains beyond NLP (e.g., computer vision (Dosovitskiy et al., 2020)). Third, when evaluating performance in different implementations and on different tasks, keep hyperparameters fixed as much as possible, or at least attempt to measure the robustness of the modifications to changes in hyperparameters. Finally, best-practice reporting of results should include the mean and standard deviation across multiple trials, or at least avoid cherry-picking the best run (Dodge et al., 2020; Henderson et al., 2018). With these guidelines in mind, we hope future work on architectural modifications to the Transformer will be more likely to see widespread adoption and improve the performance of this powerful architecture.

A Experiments with positional embeddings
We also conducted a study of architectural variants using learned positional embeddings (Vaswani et al., 2017) in the baseline model instead of relative attention. Besides this change, the experimental setup remains the same (as described in section 3). The weighted Transformer architecture doesn't reliably converge using positional embeddings, so we do not report results using this architecture.
In addition to the modifications described in section 2, we also experiment with variations in attention. Sinusoidal positional embeddings (Vaswani et al., 2017) were proposed in the original Transformer to inject information about the order of the sequence into what is otherwise a set operation. Relative attention (Shaw et al., 2018) replaces the absolute position embeddings with embeddings based on the relative distance between tokens (clipped to a maximum distance hyperparameter k). The Mesh TensorFlow codebase (Shazeer et al., 2018) introduces two changes to relative attention: a bias is added to the self-attention logits (eq. 8) before multiplication with the values, and this bias may optionally be shared across self-attention layers.
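A sketch of relative attention with a shared learned bias for a single head: the relative distance between positions is clipped to a maximum distance and indexes a learned bias vector that is added to the attention logits before the softmax. The T5 implementation uses logarithmic distance buckets rather than the plain clipping shown here, so this is a simplified illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

T, d_model, d_k, max_distance = 8, 16, 16, 4
x = np.random.randn(T, d_model)
W_q, W_k, W_v = (np.random.randn(d_model, d_k) for _ in range(3))

# One learned bias per clipped relative distance in [-max_distance, max_distance].
rel_bias = np.random.randn(2 * max_distance + 1)

positions = np.arange(T)
rel = np.clip(positions[None, :] - positions[:, None], -max_distance, max_distance)
bias = rel_bias[rel + max_distance]                   # (T, T) bias added to the logits

logits = (x @ W_q) @ (x @ W_k).T / np.sqrt(d_k) + bias
out = softmax(logits) @ (x @ W_v)
```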
The results from this study are shown in table 2. As with relative attention, the only modifications that result in improvements are those that are relatively minor (e.g. activation functions and normalization), those that are less efficient in terms of parameter count or FLOPs (e.g. the Switch Transformer), or those that were invented in the same codebase that we used (e.g. Synthesizer). Architectures with relative attention outperform those with positional embeddings by a significant margin. Interestingly, certain architectures (Mixture of Softmaxes, tied decoder input and output embeddings) that outperformed the vanilla Transformer with relative attention perform worse than the vanilla Transformer in this setup. Also, the absolute fine-tuned performance is worse for almost all the models compared with their relative attention counterparts.

B Implementation details for modifications
For factorized embedding, we use an inner dimension of 128 for models with and without block sharing of parameters.
In adaptive input embedding experiments, we use three clusters of size 2500, 6000, and 23,628. For experiments with adaptive softmax, we split the third cluster into two clusters of 23,500 and 128. Since we used a larger vocabulary (see section 3) for supervised training on WMT'14, we use the same number of clusters with the same relative cluster sizes.
We experimented with 10 and 15 softmaxes for the mixture of softmax models. In the paper, we only report results for the model with 15 softmaxes since it performs better.
For Lightweight and Dynamic convolutions, we use a one-dimensional kernel of width 9. The kernel depth depends on whether the convolution is depthwise; for a vanilla convolution, the depth is d_model. For the Universal Transformer, we use 24 recurrent steps and a halting threshold of 0.5. We use 32 experts in the Mixture of Experts experiments.
In PKM experiments, we use k_nn = 32, 128 keys, and 512 memory slots. In our experiments, we introduce a product key memory network before the last layer in the decoder.
In the Funnel Transformer experiments, we use mean pooling with 3 blocks in the encoder. The input sequence is pooled after every 4 layers in the Funnel Transformer. In the weighted Transformer, we freeze the weights of the branched attention module for the last 20,000 steps of pre-training.

C Reproducing the original Transformer experiments
Vaswani et al. (2017) reported a BLEU score of 25.8 (Table 3 of their paper) when evaluating on the dev set without checkpoint averaging. We ran a replication experiment with the same Transformer-Base architecture and achieved 25.52. With this, we believe that our Transformer codebase closely replicates the original one. Additionally, the baseline Transformer model in our paper is comparable to the Transformer-Big model from Vaswani et al. (2017). The Transformer-Big model achieves a BLEU score of 26.4 (Table 3 of their paper) on the validation set of the WMT EnDe translation task. Our baseline model achieves a BLEU score of 26.62 on the same validation set, which is marginally better than the result reported in the original paper.

D Transformer Background
In this section, we give a brief description of the original Transformer architecture. We primarily include this description so that we can refer back to specific components as we introduce different modifications. For a more in-depth description of the Transformer architecture, refer to the original paper (Vaswani et al., 2017) or follow-up tutorials. In this work, we solely experiment with "encoder-decoder" Transformers, which ingest an input sequence of tokens and produce an output sequence conditioned on the input. We denote the tokens of the input sequence as x[1], . . . , x[T]. Each token is mapped to a learned embedding, to which a "position embedding" p[t] ∈ R^{d_model} is added. In the original Transformer, this position embedding is computed sinusoidally as

p[t]_{2i} = sin(t / 10000^{2i/d_model}),  p[t]_{2i+1} = cos(t / 10000^{2i/d_model}).

In general, we will use h_{e,l} and h_{d,l} to denote the output of the lth layer block of the encoder and decoder, respectively. For simplicity, we refer to the embeddings as if they are the output of a "zeroth" layer block.
Each layer block in the encoder comprises a multi-headed self-attention mechanism (Cheng et al., 2016) followed by a position-wise dense/nonlinearity/dense feedforward network. Both of these "subblocks" include a residual connection (He et al., 2016) and layer normalization (Ba et al., 2016). Layer normalization is defined as an operation over a sequence h[1], . . . , h[T] as

LayerNorm(h[t]) = γ ⊙ (h[t] − μ[t]) / σ[t] + β,  (2)

where μ[t] and σ[t] are the mean and standard deviation of the entries of h[t], ⊙ indicates elementwise multiplication, and γ, β ∈ R^{d_model} are learned parameters that are unique to each instance of layer normalization.
Head h in the multi-headed self-attention of layer l produces, at timestep t, queries, keys, and values

q_{e,l,h}[t] = h_{e,l−1}[t] Q_{e,l,h},  k_{e,l,h}[t] = h_{e,l−1}[t] K_{e,l,h},  v_{e,l,h}[t] = h_{e,l−1}[t] V_{e,l,h},

where Q_{e,l,h} ∈ R^{d_model×d_k}, K_{e,l,h} ∈ R^{d_model×d_k}, and V_{e,l,h} ∈ R^{d_model×d_v} are the "query", "key", and "value" projection matrices, respectively. The attention output of head h is a softmax-weighted sum of the values,

a_{e,l,h}[t] = Σ_{s=1}^{T} softmax_s( q_{e,l,h}[t] · k_{e,l,h}[s] / √d_k ) v_{e,l,h}[s].  (8)

The self-attention outputs a_{e,l,h} for all H heads are then concatenated and projected against the matrix O_{e,l} ∈ R^{H d_v × d_model}, along with a residual connection and layer normalization, as follows:

s_{e,l}[t] = LayerNorm( h_{e,l−1}[t] + [a_{e,l,1}[t]; . . . ; a_{e,l,H}[t]] O_{e,l} ).  (9)

The output of the multi-headed self-attention mechanism is then passed through a feedforward network that operates on each sequence element independently. Specifically, the feedforward network consists of a projection, a ReLU nonlinearity, and another projection as follows:

f_{e,l}[t] = max(0, s_{e,l}[t] W_{e,l,1} + b_{e,l,1}) W_{e,l,2} + b_{e,l,2},  (10)

where W_{e,l,1} ∈ R^{d_model×d_ff}, b_{e,l,1} ∈ R^{d_ff}, W_{e,l,2} ∈ R^{d_ff×d_model}, and b_{e,l,2} ∈ R^{d_model}. The output of the feedforward network is then combined with the subblock's input via a residual connection and layer normalization:

h_{e,l}[t] = LayerNorm( s_{e,l}[t] + f_{e,l}[t] ).

Overall, the decoder is structured similarly to the encoder, with the following changes (throughout, the d subscript denotes activations and parameters of the decoder): First, the self-attention mechanisms are "causal", which prevents the decoder from looking at future items of the target sequence when it is fed in during training. This is achieved by constructing an "attention mask" M ∈ R^{U×U} that zeros out attention entries that are non-permissible; specifically, the attention logits in eq. (8) are masked so that timestep t can only attend to timesteps s ≤ t. Second, the layer blocks in the decoder contain an encoder-decoder attention mechanism after the self-attention mechanism and before the feedforward network. Encoder-decoder attention is computed in the same way as eq. (8), except that the queries are computed from the decoder states and the keys and values are computed from the final encoder output h_{e,L}. The activations from each head a_{d,l,h} are then fed into the residual/layer norm block (eq. (9)) and the feedforward network (eq. (10)) as usual.
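To tie the equations above together, a compact single-layer encoder sketch in NumPy with the post-layer-norm placement of eqs. (9) and (10); multi-head concatenation is reduced to a single head to keep it short, and all weights are random stand-ins rather than trained parameters.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(h, gamma, beta, eps=1e-6):
    mu = h.mean(axis=-1, keepdims=True)
    sigma = h.std(axis=-1, keepdims=True)
    return gamma * (h - mu) / (sigma + eps) + beta

T, d_model, d_k, d_v, d_ff = 6, 16, 16, 16, 64
h = np.random.randn(T, d_model)                       # layer input h_{e,l-1}
gamma, beta = np.ones(d_model), np.zeros(d_model)

# Self-attention subblock (single head, so O is d_v x d_model), eqs. (8)-(9).
Q, K, V = (np.random.randn(d_model, d) for d in (d_k, d_k, d_v))
O = np.random.randn(d_v, d_model)
a = softmax((h @ Q) @ (h @ K).T / np.sqrt(d_k)) @ (h @ V)
s = layer_norm(h + a @ O, gamma, beta)

# Feedforward subblock, eq. (10), with its own residual connection and layer norm.
W1, b1 = np.random.randn(d_model, d_ff), np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d_model), np.zeros(d_model)
f = np.maximum(0, s @ W1 + b1) @ W2 + b2
h_out = layer_norm(s + f, gamma, beta)
```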
At the output of the final layer of the decoder, each entry in the sequence of activations h_{d,L} is projected via an output logit matrix G ∈ R^{d_model×d_vocab}.

Table 2: Pre-training and fine-tuning results for all architecture variants with learned positional embeddings. The early loss represents the mean and standard deviation of perplexity at 65,536 steps. The final perplexity is reported at the end of pre-training (524,288 steps). SGLUE refers to SuperGLUE and WebQ refers to the WebQuestions dataset. We report the average score, ROUGE-2, and accuracy for SuperGLUE, XSum, and WebQuestions, respectively, on the validation sets. Scores that outperform the vanilla Transformer are highlighted in boldface.