B2T Connection: Serving Stability and Performance in Deep Transformers

From the perspective of the layer normalization (LN) positions, the architectures of Transformers can be categorized into two types: Post-LN and Pre-LN. Recent Transformers tend to be Pre-LN because, in Post-LN with deep Transformers (e.g., those with ten or more layers), the training is often unstable, resulting in useless models. However, Post-LN has consistently achieved better performance than Pre-LN in relatively shallow Transformers (e.g., those with six or fewer layers). This study first investigates the reason for these discrepant observations empirically and theoretically, and makes the following discoveries: (1) the LN in Post-LN is the main source of the vanishing gradient problem that leads to unstable training, whereas Pre-LN prevents it; and (2) Post-LN tends to preserve larger gradient norms in higher layers during back-propagation, which may lead to effective training. Exploiting these new findings, we propose a method that provides both high stability and effective training through a simple modification of Post-LN. We conduct experiments on a wide range of text generation tasks. The experimental results demonstrate that our method outperforms Pre-LN and enables stable training in both shallow and deep layer settings. Our code is publicly available at https://github.com/takase/b2t_connection.


Introduction
To prevent the vanishing (or exploding) gradient problem in the training of a deep neural network (DNN), various techniques, such as batch normalization (Ioffe and Szegedy, 2015) and residual connections (Srivastava et al., 2015; He et al., 2016a), have been proposed and are widely used in almost all recent DNNs. Transformer (Vaswani et al., 2017) employs layer normalization (Ba et al., 2016) for this purpose. Transformer is currently the most successful model architecture in DNNs. It was first developed for sequence-to-sequence tasks, such as machine translation (Vaswani et al., 2017), summarization (Takase and Okazaki, 2019), and automatic speech recognition (ASR) (Wang et al., 2020), and is currently used in speech, vision, and many other information processing research fields.
As reported in the batch normalization literature (He et al., 2016b), the position of the normalization layers primarily affects both the stability and resultant performance of a trained model. In Transformers, some previous studies have investigated the impact of the layer normalization positions (Wang et al., 2019; Xiong et al., 2020). There are currently two major layer normalization positions in Transformers: Pre-Layer Normalization (Pre-LN) and Post-Layer Normalization (Post-LN). Pre-LN applies the layer normalization to an input for each sub-layer, and Post-LN places the layer normalization after each residual connection. The original Transformer (Vaswani et al., 2017) employs Post-LN. However, many recent studies have suggested using Pre-LN (Wang et al., 2019; Baevski and Auli, 2019; Brown et al., 2020) because the training of deep Transformers (e.g., those with ten or more layers) using Post-LN is often unstable, resulting in useless models. Figure 1 shows loss curves for an actual example: the training of 18-layered Transformer encoder-decoders (18L-18L) on a widely used WMT English-to-German machine translation dataset. These figures clearly show that the Post-LN Transformer encoder-decoders fail to train the model. In contrast, Liu et al. (2020) reported that Post-LN consistently achieved better performance than Pre-LN on a machine translation task when they used 6-layered (relatively shallow, 6L-6L) Transformers.
This paper focuses specifically on such discrepancies between Pre-LN and Post-LN in configurations with various numbers of layers. We investigate the sources of the instability of training in deep configurations and the superior performance in shallow configurations for Post-LN, compared with those for Pre-LN, to understand the essentials of the differences between Pre-LN and Post-LN. We discover that the layer normalization in Post-LN is the main source of the vanishing gradient problem that leads to unstable training, whereas Pre-LN prevents it, as shown in Figure 1. In particular, we clarify that the layer normalization is a significant factor of the vanishing gradient problem by comparing the input/output vector norms of gradient flows for each layer normalization during back-propagation. These analyses bring us a novel idea that can satisfy higher stability by skipping over layer normalizations and provide better performance than Pre-LN regardless of the layer sizes. Consequently, we propose a method that is based on Post-LN Transformers but has additional residual connections to enable stable training.
We conduct experiments on a wide range of text generation tasks, namely machine translation, summarization, language modeling, and ASR. The experimental results lead to the following three new major findings: 1. Post-LN Transformers achieve better performance than Pre-LN Transformers on text generation tasks (not only machine translation (Liu et al., 2020) but also other tasks). 2. Our proposed method enables the stable training of deep Transformers (e.g., those with ten or more layers) in the Post-LN configuration. 3. Our proposed method preserves the performance advantage of Post-LN Transformers, which therefore outperform Pre-LN Transformers.

Post-LN and Pre-LN Transformers
We briefly describe Post-LN and Pre-LN Transformers in this section. The original Transformer (Vaswani et al., 2017) uses Post-LN, in which layer normalizations are located after each residual connection. Let x be an input of a sub-layer, and F(·) be a sub-layer of a Transformer, such as a feed-forward network or multi-head attention. Post-LN is defined as follows:

PostLN(x) = LN(x + F(x)), (1)

where LN(·) is the layer normalization function.
In contrast, Pre-LN places the layer normalization before an input of each sub-layer:

PreLN(x) = x + F(LN(x)). (2)
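In code, the two placements differ only in where LN(·) is applied. The following is a minimal NumPy sketch of ours (not the fairseq implementation): the learnable gain and bias of LN are omitted, and the toy sub-layer F is an illustrative stand-in for attention or the FFN.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize the last axis to zero mean and unit variance
    # (learnable gain and bias omitted for brevity).
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def post_ln(x, sublayer):
    # Equation (1): LN sits after the residual connection.
    return layer_norm(x + sublayer(x))

def pre_ln(x, sublayer):
    # Equation (2): LN sits on the sub-layer input;
    # the residual path carries x through untouched.
    return x + sublayer(layer_norm(x))

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(8, 8))
F = lambda v: np.tanh(v @ W)   # toy stand-in for attention / FFN

x = rng.normal(size=(4, 8))    # (positions, model dimension)
y_post = post_ln(x, F)
y_pre = pre_ln(x, F)
```

Note that y_post is always normalized (zero mean, roughly unit variance per position), whereas y_pre keeps the un-normalized residual stream; this difference is what the gradient analyses in Section 3 turn on.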

Gradients of Transformer Layers
As described in Liu et al. (2020), the vanishing gradient problem often occurs in Post-LN Transformers. Figure 3 shows the gradient norms of each layer for the (a) encoder-side and (b) decoder-side at the beginning of training, when 18L-18L Transformer encoder-decoders are trained on a widely used machine translation dataset (the WMT English-to-German dataset). Focus on the decoder-side of Post-LN. This figure shows that shallower layers have smaller gradient norms. In other words, the vanishing gradient occurs in the decoder-side of Post-LN because its gradient norms decay exponentially as they are back-propagated to shallower layers. This result is consistent with the previous study (Liu et al., 2020). We consider that this vanishing gradient causes the difficulty of stacking many layers with the Post-LN setting, as shown in Figure 1.
To investigate the vanishing gradient empirically in more detail, we measure the gradient norms of parts (1)-(5) of Figure 2 (a). Figure 4 shows the gradient norms of each part in the 18th layer. This figure shows that the gradient norms decrease drastically from (4) to (3) and from (2) to (1). These parts correspond to layer normalizations, as shown in Figure 4. This suggests that layer normalizations in Post-LN Transformers are probably the cause of the vanishing gradient problem.
To investigate the difference between the gradient flows of Post-LN and those of Pre-LN theoretically, we calculate the derivatives of Equations (1) and (2), as follows:

d PostLN(x)/dx = ∂LN(x + F(x))/∂(x + F(x)) · (I + ∂F(x)/∂x), (3)

d PreLN(x)/dx = I + ∂F(LN(x))/∂LN(x) · ∂LN(x)/∂x, (4)

where I is the identity matrix. As Equation (3) shows, the derivative of Post-LN is equal to the product of two derivatives: one is the layer normalization, and the other consists of the residual connection and sub-layer F. In contrast, in Pre-LN, the derivative of the residual connection is isolated from the term related to the derivative of the layer normalization. The difference between these equations implies that the residual connection in Pre-LN prevents the vanishing gradient because it retains the gradients of upper layers even if the derivative of the layer normalization decreases gradients drastically.
As described above, Pre-LN avoids the unstable training of Post-LN, in which the vanishing gradient problem occurs. Although Pre-LN is more stable in training, Post-LN can achieve better performance if training succeeds (see Section 6). In this section, we explore the reason for this difference in performance. Focus on Pre-LN in Figure 3. In contrast to Post-LN, in Pre-LN, a deeper (higher) layer has a smaller gradient norm. Thus, the parameters of higher layers are not required to change dramatically from their initial values. This implies that higher layers in Pre-LN are not sufficiently effective.
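The structural difference between Equations (3) and (4) can also be checked numerically. The following sketch is our own toy construction (not the paper's code); F is an arbitrary smooth stand-in for a sub-layer, and finite differences verify that the Pre-LN Jacobian contains a free identity term while the Post-LN Jacobian multiplies everything by the LN Jacobian.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # LN over a single d-dimensional vector (gain/bias omitted).
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def num_jacobian(f, x, eps=1e-6):
    # Central-difference estimate of the Jacobian of f at x.
    d = x.size
    J = np.zeros((d, d))
    for i in range(d):
        e = np.zeros(d)
        e[i] = eps
        J[:, i] = (f(x + e) - f(x - e)) / (2 * eps)
    return J

rng = np.random.default_rng(0)
d = 8
W = rng.normal(scale=0.1, size=(d, d))
F = lambda v: np.tanh(v @ W)               # toy stand-in for a sub-layer
x = rng.normal(size=d)

post_ln = lambda v: layer_norm(v + F(v))   # Equation (1)
pre_ln = lambda v: v + F(layer_norm(v))    # Equation (2)

# Equation (4): the Pre-LN Jacobian is I plus the Jacobian of the
# normalized sub-layer path, so the identity term survives intact.
J_pre = num_jacobian(pre_ln, x)
assert np.allclose(J_pre - num_jacobian(lambda v: F(layer_norm(v)), x),
                   np.eye(d), atol=1e-4)

# Equation (3): the Post-LN Jacobian is the LN Jacobian times
# (I + dF/dx); the identity is trapped inside the product.
J_post = num_jacobian(post_ln, x)
J_ln = num_jacobian(layer_norm, x + F(x))
assert np.allclose(J_post, J_ln @ (np.eye(d) + num_jacobian(F, x)), atol=1e-4)
```

If the LN Jacobian shrinks gradients, the Post-LN product shrinks with it at every layer, whereas the isolated I in Pre-LN passes upper-layer gradients through unscaled.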
To investigate the effectiveness of higher layers, we focus on the transformations performed by each layer. Figure 5 shows the average cosine similarities between the outputs of each pair of layers for 6L-6L Transformer encoder-decoders trained on the WMT dataset when several sequences are input. This figure indicates that the lower-left similarities of Pre-LN are higher than those of Post-LN. This result means that the outputs of shallow layers are similar to the output of the final layer in Pre-LN, but not in Post-LN. Consequently, higher layers in Pre-LN are less effective than those in Post-LN if training succeeds.
We consider that the residual connection in Pre-LN causes this phenomenon. As Equation (2) shows, in Pre-LN, an input x skips over the sub-layer F(·) by the residual connection. Thus, the input x is directly connected to the final layer output. This property makes the training stable, as described in Section 3, but causes high similarities between the outputs of the various layers. Therefore, we consider that Pre-LN underperforms Post-LN because the residual connection in Pre-LN reduces the effectiveness of its higher layers. In contrast, in Post-LN, larger gradient norms in higher layers (as shown in Figure 3) make higher layers more effective (as shown in Figure 5), but it is necessary to prevent the vanishing gradient problem in shallow layers when we stack many layers.

Modification for Stable Training in Post-LN: Bottom-to-Top Connection
This section introduces a modification that makes the training of Post-LN more stable while preserving its high performance. This modification comprises an additional residual connection that mitigates the vanishing gradient in Post-LN, thereby enabling many layers to be stacked.
As discussed in the previous sections, we need a term that retains gradients in the derivatives, as in Equation (4), to prevent the vanishing gradient.
To satisfy this requirement, we propose a residual connection that skips over all layer normalizations except the final one in each layer. Our introduced connection ties an input of a layer to the result of the feed-forward network (FFN), as illustrated by the red arrows in Figure 2 (c). We call this connection the Bottom-to-Top (B2T) connection, which is formalized in the following equation:

LN(x_ffn + FFN(x_ffn) + x_inp), (5)

where x_inp is an input of a layer, FFN(·) is an FFN, and x_ffn is an input of the FFN. In short, x_inp skips the layer normalizations after the self-attention and encoder-decoder cross-attention. Because the derivative of x_inp is isolated from the terms related to the derivatives of the layer normalizations just behind the attention sub-layers, it retains gradients, as in Pre-LN. For example, in an encoder-side, x_ffn is as follows:

x_ffn = LN(x_inp + SelfAttn(x_inp)), (6)

where SelfAttn(·) is a self-attention network. Thus, Equation (5) can be written as follows:

LN(LN(x_inp + SelfAttn(x_inp)) + FFN(LN(x_inp + SelfAttn(x_inp))) + x_inp). (7)

The derivative of this equation is the following equation:

∂LN(z)/∂z · (I + ∂(x_ffn + FFN(x_ffn))/∂x_inp), where z = x_ffn + FFN(x_ffn) + x_inp. (8)

Because this derivative contains I, which is unrelated to the derivatives of internal layer normalizations, our B2T connection (i.e., x_inp) helps to propagate gradients. For a decoder-side, we can prove this property in the same manner. Figure 3 (b) indicates that the B2T connection mitigates the vanishing gradient of 18L-18L encoder-decoders. Moreover, we locate the B2T connection before the final layer normalization in each layer to avoid a direct connection to the final layer output, based on the discussion in Section 4. Thus, the B2T connection preserves the property of Post-LN with respect to the transformations performed by each layer, as illustrated in Figure 5 (c).
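As a minimal NumPy sketch of Equations (5) and (6) for an encoder layer (ours, not the fairseq implementation; toy_attn and toy_ffn are illustrative stand-ins, and LN's learnable parameters are omitted):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize the last axis (learnable gain/bias omitted).
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def b2t_encoder_layer(x_inp, self_attn, ffn):
    # Equation (6): the usual Post-LN attention block.
    x_ffn = layer_norm(x_inp + self_attn(x_inp))
    # Equation (5): x_inp is added once more before the final LN, so the
    # B2T path skips the attention-side layer normalization; only the
    # final LN separates the layer input from the layer output.
    return layer_norm(x_ffn + ffn(x_ffn) + x_inp)

rng = np.random.default_rng(0)
d = 8
Wa = rng.normal(scale=0.1, size=(d, d))
Wf = rng.normal(scale=0.1, size=(d, d))
toy_attn = lambda v: np.tanh(v @ Wa)  # stand-in for self-attention
toy_ffn = lambda v: np.tanh(v @ Wf)   # stand-in for the FFN

x = rng.normal(size=(4, d))           # (positions, model dimension)
y = b2t_encoder_layer(x, toy_attn, toy_ffn)
```

Because the layer output is still produced by a final LN, the block keeps the Post-LN property that every layer's output is normalized, while the extra x_inp term carries gradients past the internal layer normalizations.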

Experiments
Through experiments, we indicate the following three findings.
• Post-LN Transformers achieve better performance than Pre-LN Transformers if their training succeeds.

• B2T connection enables the training of deep Transformers with the Post-LN configuration.
• Our modification preserves the performance advantage of Post-LN Transformers, which therefore outperform Pre-LN Transformers.
We describe the essential experimental configurations in this section. Appendix A presents more details, such as the hyper-parameters and computational budgets.

Dataset
The machine translation task has been widely used to investigate the performance of Transformer-based methods since the original Transformer (Vaswani et al., 2017; Ott et al., 2018; Wang et al., 2019; Xiong et al., 2020; Liu et al., 2020). We adopted the widely used WMT English-to-German training dataset (Vaswani et al., 2017; Ott et al., 2018), which contains 4.5M sentence pairs. We applied the byte-pair-encoding (BPE) algorithm (Sennrich et al., 2016) to construct a vocabulary set in the same manner as previous studies. We set the number of BPE merge operations to 32K and shared the vocabulary between the source and target languages. We used newstest2010-2016 to investigate the performance, following Takase and Kiyono (2021).

Methods
We compare Post-LN, Pre-LN, and Post-LN with our B2T connection (B2T connection) Transformers. We used fairseq (Ott et al., 2019) as an implementation of Transformers. We stacked 6 and 18 layers for the encoders and decoders (6L-6L and 18L-18L) as the widely used configuration and the deep configuration, respectively. We used the Transformer (base) setting for the dimension sizes of internal layers. In addition to the above methods, we evaluate the following five methods, which are recent approaches that enable the training of deep Transformers. We used the same hyper-parameters for all methods except T-Fixup. For T-Fixup, we used the hyper-parameters reported in Huang et al. (2020) to prevent divergence.
DLCL: To make Transformers deep, Wang et al. (2019) proposed the dynamic linear combination of layers (DLCL), which uses the weighted sum of the lower layers as an input of a layer. In contrast to our B2T connection, which is an additional connection within each layer, DLCL uses a connection among layers. We apply DLCL to Post-LN Transformers. We used the official implementation.
Admin: Liu et al. (2020) proposed adaptive model initialization (Admin), which uses additional parameters to stabilize the training of Post-LN Transformers. This method requires the variances of internal layers to initialize the additional parameters. Thus, this method first processes several forward steps for the initialization, and then conducts the actual training. In a nutshell, this method incurs additional computational costs. We used the official implementation.
T-Fixup: Huang et al. (2020) proposed an initialization scheme for Transformers, T-Fixup, to perform stable training without the learning rate warm-up and layer normalizations. Because this method can remove the cause of the vanishing gradient, we can stack many layers. We used the official implementation.
RealFormer: To improve the performance of Transformers, He et al. (2021) proposed RealFormer, which introduces additional connections into attention sub-layers. Although their motivation is not addressing the vanishing gradient problem, their method is similar to ours with respect to the use of additional connections.
DeepNet: Wang et al. (2022) proposed DeepNorm, which uses a weight that corresponds to the number of layers in a residual connection before layer normalizations to stabilize Post-LN-based Transformers. They also provided the combination of an initialization scheme and DeepNorm as DeepNet.

Results
We measured case-sensitive detokenized BLEU scores with SacreBLEU (Post, 2018). Table 1 shows the BLEU scores of each method on newstest2010-2016 and their averages. Since the BLEU score measures precision-based n-gram overlap between the model output and correct examples, a higher score represents better performance.
The official T-Fixup implementation is available at https://github.com/layer6ai-labs/T-Fixup. The BLEU scores calculated by SacreBLEU are often lower than those calculated by the procedure of Vaswani et al. (2017), as reported in Ott et al. (2018). In fact, Pre-LN and B2T connection in the 18L-18L configuration achieved scores of 28.94 and 29.91, respectively, on newstest2014 when we used the same procedure as Vaswani et al. (2017). However, we used SacreBLEU to ensure the compatibility of results, as described in Post (2018). The signature of SacreBLEU is BLEU+nrefs:1+case:mixed+eff:no+tok:13a+smooth:exp+version:2.0.0.
The upper part of Table 1 shows results in the 6L-6L configuration. This part indicates that Post-LN achieved better scores than Pre-LN on all test sets. In addition, B2T connection outperformed Pre-LN on all test sets. Thus, these methods are superior to Pre-LN when the total number of layers is small.
The lower part of Table 1 shows results in the 18L-18L configuration. This part shows that the training of Post-LN failed, and thus we cannot successfully stack 18L-18L layers in the vanilla Post-LN. With the B2T connection, training succeeded and the model outperformed Pre-LN in the 18L-18L configuration. Figure 6 shows the negative log-likelihood (NLL) values of all methods when we regard newstest2013 as validation data. This figure indicates that the NLLs of Pre-LN are worse than those of the other methods. These results demonstrate that our modification enabled the stacking of many layers without the performance degradation observed with Pre-LN.
In the comparison with the recent methods, B2T connection outperformed them with respect to the averaged BLEU score. This result implies that our modification is superior to the recent methods. To make our findings more reliable, we also conduct a comparison with the recent methods on the summarization task.
Table 2 shows results in a much deeper configuration: 100L-100L. This table also indicates that B2T connection stabilized the training and outperformed Pre-LN. Appendix C describes the details of this 100L-100L configuration and shows a comparison with the latest method, DeepNet (Wang et al., 2022).

Dataset
The abstractive summarization task is one of the most famous sequence-to-sequence problems in NLP. In this study, we conduct the experiment on the headline generation task, which is the task of generating a headline from a given sentence (Rush et al., 2015). We used headline-sentence pairs extracted from Annotated English Gigaword (Napoles et al., 2012) by Rush et al. (2015). This dataset contains 3.8M headline-sentence pairs as the training set and 1951 pairs as the test set. In addition, we used 13M additional headline-sentence pairs extracted from REALNEWS (Zellers et al., 2019) and NewsCrawl (Barrault et al., 2019) for training deep Transformers, following Takase and Kiyono (2021). We applied BPE (Sennrich et al., 2016) to construct a vocabulary set. As in the machine translation experiments, we set the number of BPE merge operations to 32K and shared the vocabulary between the encoder and decoder sides.
It would be premature to conclude that our modification is more effective than those methods from the results of experiments on the machine translation task alone. We set the numbers of layers of the encoders and decoders to 6L-6L and 18L-18L as the base and deep configurations, respectively.

Results
Table 3 shows the ROUGE-1, 2, and L scores achieved by each method on the test set. Since these scores are computed by n-gram overlap between the generated and correct headlines, a higher score represents better performance.
In the 6L-6L configuration, Post-LN achieved better performance than Pre-LN. Thus, Post-LN outperformed Pre-LN on the headline generation task if training succeeded. Moreover, B2T connection achieved scores comparable to those of Post-LN.
In the 18L-18L configuration, the training of Post-LN failed. In contrast, the training of B2T connection succeeded, and this method outperformed Pre-LN. Thus, our modification is more suitable than Pre-LN for training deep Transformers to perform the headline generation task.
B2T connection outperformed the recent methods in the 6L-6L configuration and achieved the best ROUGE scores in the 18L-18L configuration. According to the results on both the machine translation and headline generation tasks, B2T connection achieved performance that was better than, or comparable to, that of previous methods. It is worth emphasizing that, in addition to the performance, our modification does not incur additional computational costs, such as those incurred by DLCL and Admin.

Language Model
In addition to encoder-decoders, we investigate the effect of our B2T connection when used on the decoder side only, i.e., in a neural language model. Because recent pre-trained models, such as the GPT series, are language models trained on a large amount of training data, the experimental results in this section give insight into pre-trained models.

Dataset
We used WikiText-103 (Merity et al., 2017), which consists of a large number of tokens.The training, validation, and test sets contain 103M, 0.2M, and 0.2M tokens, respectively.The vocabulary set contains 0.3M words.

Methods
We used a Transformer with adaptive input representations (Baevski and Auli, 2019), which is implemented in fairseq, as the base architecture in this experiment. For the base configuration, we stacked 6 layers, in the same manner as in the machine translation and summarization experiments. For the deep configuration, we used 16 layers, following Baevski and Auli (2019). For the dimensions of internal layers, we used the same values as those used by Baevski and Auli (2019). We compare Post-LN, Pre-LN, and B2T connection.

Results
Table 4 shows the perplexities of each method on the validation and test sets of WikiText-103. Since the perplexity is computed based on the negative log-likelihood, a smaller value corresponds to better performance. The upper part of this table indicates that, with 6 layers, Post-LN and our B2T connection outperformed Pre-LN. When we stacked 16 layers, the training of Post-LN failed, but B2T connection achieved better performance than Pre-LN. These results are consistent with the results on the machine translation and summarization tasks. Thus, our modification enables the training of deep Transformers for language modeling, and it is more effective than Transformers with Pre-LN.

Automatic Speech Recognition
In addition to experiments on natural language processing tasks, we conduct an experiment on another modality, ASR.

Dataset
We used LibriSpeech (Panayotov et al., 2015), which is the standard English ASR benchmark dataset. The dataset contains 1,000 hours of English speech extracted from audiobooks. We used the standard splits of LibriSpeech: we used all available training data for training and the two configurations ('clean' and 'other') of the development and test sets for evaluation. We applied the same pre-processing as that used by Wang et al. (2020). We constructed a vocabulary set for the decoder-side with SentencePiece (Kudo and Richardson, 2018) by setting the vocabulary size to 10,000. To obtain speech features, we used torchaudio.

Methods
We used the Transformer-based speech-to-text model described in Wang et al. (2020) as the base architecture in this experiment. This model contains a convolutional layer to construct an embedding for the encoder-side, but the other parts are identical to the Transformers used on the machine translation and summarization tasks. We used the same dimensions as those of T-Md, described in Wang et al. (2020). We set the numbers of layers to 6L-6L and 12L-6L as the base and deep configurations, respectively, because Wang et al. (2020) stacked many layers on the encoder-side only. We compare Post-LN, Pre-LN, and B2T connection.

Results
Table 5 shows the word error rates (WERs) of each method on each set. A smaller WER corresponds to better performance. The upper part of this table indicates that Post-LN and B2T connection outperformed Pre-LN on all sets in the 6L-6L configuration. The lower part of the table shows that B2T connection succeeded in training and achieved performance that was better than (or comparable to) that of Pre-LN in the 12L-6L configuration. These results are consistent with those of the other experiments in this study.

Related Work
Layer normalization (Ba et al., 2016) is a useful technique for training neural networks but its mechanism has been unclear (Xu et al., 2019).The Transformer, which is the standard architecture for various tasks, also contains layer normalizations.
To construct deep Transformers that achieve better performance, recent studies have focused on the behavior of layer normalizations. Wang et al. (2019) indicated the difficulty of training deep Transformers with Post-LN due to the vanishing gradient problem, and demonstrated that Pre-LN enables the stacking of many layers through machine translation experiments. In addition, they proposed a method to connect all layers to increase the effectiveness of deep Transformers. Bapna et al. (2018) and Dou et al. (2018) also proposed such connection methods to stack many layers. He et al. (2021) introduced additional connections into attention sub-layers to improve the performance. Xiong et al. (2020) explored the relation between the warm-up strategy and layer normalizations in Transformers. Through theoretical and empirical analyses, they indicated that Post-LN requires the warm-up strategy to stabilize the training. Liu et al. (2020) analyzed the training dynamics of Post-LN and Pre-LN Transformers. They then proposed Admin, which consists of additional weight parameters to control the variances of outputs from each sub-layer. In contrast, we indicated that we can stabilize the training of Post-LN Transformers by adding only a residual connection that skips over the layer normalizations that cause the vanishing gradient.
Some studies have proposed initialization methods to make the training of deep neural networks stable (Zhang et al., 2019a,b; Huang et al., 2020). Zhang et al. (2019a) proposed the depth-scaled initialization to prevent the vanishing gradient problem in Transformers. Zhang et al. (2019b) proposed the fixed-update initialization to remove normalizations in neural networks. Inspired by these studies, Huang et al. (2020) proposed T-Fixup, which enables both the warm-up and layer normalizations to be removed from Transformers. In addition to the initialization scheme, Wang et al. (2022) introduced weights into residual connections before layer normalizations, following Liu et al. (2020).

Conclusion
In this study, we addressed the stability of training Post-LN Transformers. Through theoretical and empirical analyses, we indicated that layer normalizations cause the unstable training when many layers are stacked. In addition, we investigated the reason for the different performance of Pre-LN and Post-LN through the transformations performed by each layer. We introduced the B2T connection to prevent the vanishing gradient while preserving the advantage of Post-LN. We conducted experiments on various tasks. The experimental results led to the following three findings: (1) Post-LN achieved better performance than Pre-LN if its training succeeded; (2) our modification enabled the training of deep Transformers (e.g., those with ten or more layers); and (3) our modification preserved the benefit of Post-LN, and therefore outperformed Pre-LN.

Limitations
In this paper, we indicated that the vanishing gradient problem, caused by layer normalizations, makes the training of deep Post-LN Transformers unstable. We proposed the B2T connection to mitigate this vanishing gradient problem. However, the proposed B2T connection does not perfectly prevent the vanishing gradient, as shown in Figure 3. Therefore, the vanishing gradient might harm the training in extremely deep Transformers even if our B2T connection is used.
In addition, this study depends on empirical observations. In particular, we provided little theoretical justification of the reason for Post-LN outperforming Pre-LN when training succeeds. However, as discussed in Appendix C, a method with a theoretical justification can still collapse in some situations. Because the behavior of deep Transformers in various situations is not fully understood, we believe that it is important to provide empirical findings for our research field to progress.
Although Appendix C includes a comparison between our B2T connection and the latest method, DeepNet (Wang et al., 2022), we could not investigate the behavior of all methods in the 100L-100L configuration because of our limited computational budget. However, we are confident that we conducted sufficient experiments to verify our contributions.

Ethics Statement
The proposed method helps to construct deep Transformers. As discussed in Strubell et al. (2019) and Schwartz et al. (2019), such deep neural networks consume substantial amounts of energy. In fact, as discussed in Appendix A.2, we spent a large amount of computational resources on our experiments. Therefore, we also need to explore methods of improving energy efficiency while maintaining the good performance achieved by stacking many layers.
With respect to ethical considerations, the datasets used in our experiments are publicly available. LibriSpeech (Panayotov et al., 2015) is derived from audiobooks. The other datasets are mainly constructed from newswire texts and Wikipedia. Thus, to our understanding, the datasets we used do not contain any personally identifiable information or offensive content.

A Details of Experimental Settings
A.1 Hyper-parameters
As described in Section 6, our hyper-parameters follow those used in previous studies. Table 6 shows the hyper-parameters used for each experiment. For fair comparisons, we used the same hyper-parameters for all methods except T-Fixup. For T-Fixup, we used the hyper-parameters reported in Huang et al. (2020) to prevent divergence.

A.2 Computational Resources
We used NVIDIA Tesla P100 GPUs for most of our experiments. Table 7 shows the number of GPUs and the computational time used to construct one model in our experiments. For the 100L-100L configuration, described in Section 6.1, we used 24 Tesla V100 GPUs and spent approximately 120 hours training one model.

B Supplementary of Gradient Norms of Each Location
For the gradient norms of each part in a layer, we check the 1st and 9th decoders in addition to the 18th decoder for the 18L-18L Post-LN Transformer encoder-decoder, as shown in Figure 4. Figure 7 shows the gradient norms of each part. This figure shows that the gradient norms decrease drastically through layer normalizations, in the same manner as they do in the 18th decoder (Figure 4). Therefore, the vanishing gradient problem in Post-LN Transformers is probably caused by layer normalizations.

C Details of the 100L-100L Configuration

C.1 Regularizations during the Training
As reported in Section 6.1, we constructed 100L-100L Transformers with the widely used WMT English-to-German dataset. In the preliminary experiments, we found that regularization is the key to preventing overfitting and achieving high performance in this situation. Figure 8 shows the NLL values of Pre-LN and B2T connection on validation data in the 36L-36L configuration when we used the same hyper-parameters as those used in the 6L-6L and 18L-18L configurations. As this figure shows, the NLL values began to increase from the middle of training, and thus overfitting occurred. In addition, the use of the same hyper-parameters as 6L-6L and 18L-18L makes it difficult to improve the performance of deeper configurations. Figure 9 shows the best NLL values on validation data when we varied the number of layers: 6L-6L, 12L-12L, 18L-18L, 36L-36L, and 50L-50L. This figure indicates that adding more layers to the 18L-18L configuration did not improve the performance.
To prevent overfitting during the training of 100L-100L Transformers, we increased the dropout rate from 0.3 to 0.5. In addition, we used word dropout, as described in Takase and Kiyono (2021). We set the word dropout rate to 0.1 for the encoder and decoder. We multiplied the initial parameter values, except those for embeddings, by 0.1. We set the gradient clipping to 0.1. Finally, we decreased the number of updates from 50K to 25K. These regularization techniques prevented overfitting and achieved better performance than 18L-18L, as described in Section 6.1.
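Word dropout can take several forms. The following is a rough sketch of one common formulation, our own illustration rather than the exact procedure of Takase and Kiyono (2021) (their variant may act on embeddings instead): each input token id is independently replaced with a placeholder id at the given rate.

```python
import numpy as np

def word_dropout(token_ids, rate, unk_id, rng):
    # Replace each input token with a placeholder id (e.g., <unk>)
    # independently with probability `rate`; applied at training time only.
    token_ids = np.asarray(token_ids)
    mask = rng.random(token_ids.shape) < rate
    return np.where(mask, unk_id, token_ids)

rng = np.random.default_rng(0)
tokens = np.arange(20)                 # a toy input sequence of token ids
dropped = word_dropout(tokens, rate=0.1, unk_id=-1, rng=rng)
```

Each position is either kept unchanged or replaced by the placeholder; with rate 0.1, about one token in ten is perturbed on average, which regularizes the model's reliance on any individual input token.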

C.2 Comparison with DeepNet
As described in Section 7, various studies have attempted to stabilize the training of deep Transformers. Each study indicated the effectiveness of its proposed method empirically, and some have provided theoretical justifications. However, Wang et al. (2022) demonstrated that the training of previous methods, except DeepNet, failed in a much deeper configuration than normally used, i.e., 100L-100L. Then, can we conclude that DeepNet is a silver bullet for deep Transformers? It is difficult to reach this conclusion because the training of DeepNet also fails in some configurations. For example, when we train deep Transformers, we might decrease the batch size because the trainable parameters occupy most of the GPU memory. When we tried this, the NLL value of DeepNet on validation data diverged, as shown in Figure 10. In other words, the training of DeepNet failed. In contrast, the training of our B2T connection succeeded in this situation. This result implies that there are problems in the training of deep Transformers that have not been solved in previous studies. Therefore, we believe that we should continue to add empirical findings about new techniques, including the B2T connection, to those of previous studies.

D B2T Connection without Layer Normalization
In addition to the B2T connection, we also consider a further modification to prevent the vanishing gradient. Because layer normalizations cause the vanishing gradient problem, as described in Section 3, removing layer normalizations may provide stable gradients during back-propagation. However,

Figure 2: (a) and (b) illustrate Post-LN and Pre-LN Transformer architectures, respectively.

Figure 4: Gradient norms of each location in the 18th decoder for the 18-layered Post-LN Transformer encoder-decoder on WMT English-to-German translation training data.

Figure 5: Cosine similarities between the outputs of each pair of layers.

Figure 7: Gradient norms of each part in the (a) 1st decoder and (b) 9th decoder of the 18L-18L Post-LN Transformer encoder-decoder on WMT English-to-German translation training data.

Figure 8: Negative Log-Likelihood (NLL) values of Pre-LN and our proposed B2T connection on validation data (newstest2013) in the 36L-36L configuration. We used the same hyper-parameters as those used in 6L-6L and 18L-18L.

Figure 9: The best Negative Log-Likelihood (NLL) values on validation data (newstest2013) when the total number of layers is varied. The total number of layers is divided equally between the encoder and decoder.

Table 1: BLEU scores of each method on WMT newstest2010-2016 and their averages.

Table 4: Perplexities on WikiText-103 (Merity et al., 2017). With 6 layers, Post-LN and our B2T connection outperformed Pre-LN; when we stacked 16 layers, the training of Post-LN failed, but B2T connection achieved better performance than Pre-LN. These results are consistent with results on the machine translation and summarization tasks.

Table 5: Word error rates of each method on LibriSpeech.

Table 6: Hyper-parameters used in our experiments.

Table 7: The number of GPUs and computational time used to construct one model in our experiments.