Cascaded Head-colliding Attention

Transformers have advanced the field of natural language processing (NLP) on a variety of important tasks. The cornerstone of the Transformer architecture is the multi-head attention (MHA) mechanism, which models pairwise interactions between the elements of the sequence. Despite its massive success, the current framework ignores interactions among different heads, with the result that many heads are redundant in practice, greatly wasting the capacity of the model. To improve parameter efficiency, we re-formulate MHA as a latent variable model from a probabilistic perspective. We present cascaded head-colliding attention (CODA), which explicitly models the interactions between attention heads through a hierarchical variational distribution. We conduct extensive experiments and demonstrate that CODA outperforms the transformer baseline, by 0.6 perplexity on Wikitext-103 in language modeling and by 0.6 BLEU on WMT14 EN-DE in machine translation, owing to its improved parameter efficiency.


Introduction
Transformers (Vaswani et al., 2017) have advanced the field of natural language processing (NLP) on a variety of important tasks, including language modeling (Dai et al., 2019), language understanding (Devlin et al., 2019; Yang et al., 2019b), and machine translation (Vaswani et al., 2017; Dehghani et al., 2019). They have also found their place in computer vision (Dosovitskiy et al., 2020) and in intelligent agents, where sequence modeling plays a key role as well. The cornerstone of the transformer architecture is the multi-head attention (MHA) mechanism, which models pairwise interactions between the elements of the sequence.
An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key. A multi-head attention (MHA) mechanism extends the idea through performing multiple separately parameterized attention functions acting in parallel to contextualize the input representations. Their outputs are then gathered by an affine transformation, allowing the model to jointly attend to information from different representation subspaces at different positions.
Despite its massive success, the current framework ignores the interactions among different heads, leading to the problem that many of the heads are redundant in practice (i.e., attending to the same regions of the sequence), which underutilizes the capacity of the model (Voita et al., 2019; Michel et al., 2019a). At the same time, recent research (Tang et al., 2018; Clark et al., 2019; Voita et al., 2019; Wu et al., 2020, inter alia) demonstrates that heads in MHA have the potential to capture distinct information from input sequences, ranging from syntactic and semantic features to alignment information between source and target sentence pairs. These observations suggest that multiple heads should be encouraged to extract complementary information. Therefore, it is highly appealing to take into account the interactions among different attention heads from the perspective of parameter efficiency and the expressiveness of the model.
In this work, we introduce head-colliding attention ( §3). We formulate MHA as a probabilistic model, where each attention head is represented by a latent variable and all of them collide into the observed sequence data (Figure 1a). In this probabilistic graphical model structure, attention heads work as individual factors to explain the data. Although the factors are independent of each other a priori, they interact with each other automatically when conditioning on observations, thanks to the explaining-away effect (Pearl, 1989; Wellman and Henrion, 1993).
The head-colliding attention mechanism introduces new computational challenges in training the model. We will discuss how we tackle these using variational methods (Blei et al., 2017). We propose cascaded head-colliding attention (CODA, Figure 1b). As our main model, CODA adopts a hierarchical variational distribution (Ranganath et al., 2016) to allow both rich head interactions and effective computations ( §4).
We validate our method in language modeling and machine translation experiments ( §5). CODA outperforms the vanilla MHA transformer on both tasks, on Wikitext-103 by 0.6 perplexity and on WMT14 EN-DE by 0.6 BLEU. Further analysis shows that CODA learns to encourage diversity in different heads ( Figure 2) and to promote parameter efficiency when increasing the number of heads ( §5.3).

Background
Multi-head attention (MHA) mechanism plays an important role in modern transformer architecture (Vaswani et al., 2017). It extends the classical attention mechanism by running multiple attention function heads in parallel.
An MHA module is composed of h identical blocks (usually referred to as attention heads). Each head generates a hidden state H_i based on the input query, key, and value matrices, denoted Q, K, and V respectively. The hidden states from different heads are then aggregated as the output of the MHA module:

MHA(Q, K, V) = concat(H_1, . . . , H_h) W_O.

In the i-th head, the input matrices Q, K, and V are first linearly projected into different subspace representations Q_i, K_i, and V_i, based on different learnable parameters. After that, we compute the inner product over all projected queries and keys as the attention logits z_i, which are then passed through a row-wise softmax 2 to obtain the head attention weights a_i:

a_i = softmax(z_i) = softmax(Q_i K_i^T).     (1)

2 We omit the scaling factor for simplicity.
The final output of a single attention block is the weighted sum of V_i:

H_i = a_i V_i.

As we can see, the core of MHA is to calculate a_i in each head. We thus refer to a_i as the i-th attention head.
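For concreteness, the per-head computation of equation 1 and the head aggregation can be sketched as follows. This is a minimal PyTorch sketch under the simplified notation above (the scaling factor is omitted, as in footnote 2); the function name, parameter names, and shapes are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o):
    """Vanilla MHA in the notation above; the 1/sqrt(d_k) scaling
    factor is omitted, as in equation 1 (footnote 2)."""
    heads = []
    for W_qi, W_ki, W_vi in zip(W_q, W_k, W_v):
        # project into the i-th subspace with per-head parameters
        Q_i, K_i, V_i = Q @ W_qi, K @ W_ki, V @ W_vi
        z_i = Q_i @ K_i.transpose(-2, -1)   # attention logits z_i
        a_i = F.softmax(z_i, dim=-1)        # row-wise softmax -> head a_i
        heads.append(a_i @ V_i)             # H_i: weighted sum of values
    # aggregate all heads with an affine transformation
    return torch.cat(heads, dim=-1) @ W_o

# hypothetical shapes: 5 positions, model dim 8, h = 2 heads of dim 4
Q, K, V = torch.randn(5, 8), torch.randn(5, 8), torch.randn(5, 8)
W_q = [torch.randn(8, 4) for _ in range(2)]
W_k = [torch.randn(8, 4) for _ in range(2)]
W_v = [torch.randn(8, 4) for _ in range(2)]
out = multi_head_attention(Q, K, V, W_q, W_k, W_v, torch.randn(8, 8))
```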
In sequence prediction tasks, the model takes as input a source sequence of length m and outputs a target sequence of length n in an autoregressive manner. It predicts each token Y within the target sequence through a categorical distribution p vanilla (Y|X), where X includes the source sequence as well as a previously generated prefix. With respect to an MHA block a 1 , . . . , a h , the model predicts target tokens Y by first feeding these heads into a complex non-linear transformation 3 denoted by φ(·), and then passing it through a softmax function over the entire vocabulary. Therefore, the output probability can be written as p vanilla (Y|X) = f (a 1 , . . . , a h ), where f (a 1 , . . . , a h ) := softmax(φ(a 1 , . . . , a h )).

Head-colliding Attention
In this section, we introduce head-colliding attention. Specifically, we formulate MHA as a probabilistic model, where each attention head is represented by a latent variable. The name reflects a "collider" in the context of probabilistic graphical models (Figure 1a). We will first explain how head-colliding attention permits the modeling of interactions among different heads and then discuss how vanilla MHA can be viewed as a marginalized version of head-colliding attention, which ignores any head interactions.
Considering a single MHA block, we cast each attention head a_i as a latent variable. The probability of the target Y conditioning on the input X can be obtained by marginalizing over all heads A (we denote A := {a_1, . . . , a_h}):

p(Y|X) = E_{p(A|X)} [p(Y|A, X)],

where p(A|X) is the joint prior distribution. The corresponding directed graphical model is shown in Figure 1a, where the links from different heads collide on the observation variable Y.

[Figure 1: (a) Left: PGM diagram of head-colliding attention. Although each head variable is independent a priori, the heads interact with each other after observing the target Y, which is referred to as the explaining-away effect. (b) Right: PGM diagram of a 3-layer cascaded head-colliding attention (CODA). a^l_i denotes the i-th attention head at transformer layer l. Note that all dependencies from X are omitted in these diagrams for simplicity.]

A crucial property of this graphical model is the "explaining-away" effect (Pearl, 1989; Wellman and Henrion, 1993) among the attention heads A when observing the output Y. In other words, if a head a_i attends to a part of the input which accords well with the observation, it immediately discourages other heads from attending to the same part of the input and instead encourages them to look into complementary information. 4 This mechanism effectively reduces head redundancy and in turn improves parameter efficiency.
Vanilla vs. head-colliding attention We now take a closer look at the vanilla MHA ( §2). Recall that in vanilla MHA, all attention heads are deterministic. From the perspective of latent variable models, this is computationally equivalent to taking expectations of the latent head variables. The output probability distribution p_vanilla(Y|X) can then be expressed as:

p_vanilla(Y|X) = f(E[a_1], . . . , E[a_h]).     (2)

This means we are only interested in the individual expectations when using the attention heads in vanilla MHA for predictions. On the contrary, in head-colliding attention the distribution of Y is defined as:

p(Y|X) = E[f(a_1, . . . , a_h)].

Note the inherent difference of when the expectation is taken in vanilla and head-colliding attention.
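A toy numeric check illustrates why the order of expectation matters: with a non-linear f, pushing the expectation inside changes the resulting distribution. The values of W, a1, and a2 below are made up for illustration and are not the paper's model.

```python
import torch
import torch.nn.functional as F

# f: a simple non-linear "prediction" function of a single head a,
# standing in for the complex transformation f(.) in the text
W = 2.0 * torch.eye(3)
f = lambda a: F.softmax(W @ a, dim=-1)

# a latent head that takes one of two attention patterns, each with
# probability 1/2
a1 = torch.tensor([0.90, 0.05, 0.05])
a2 = torch.tensor([0.05, 0.05, 0.90])

vanilla = f((a1 + a2) / 2)        # f(E[a]): expectation first, as in vanilla MHA
colliding = (f(a1) + f(a2)) / 2   # E[f(a)]: expectation last, as in head-colliding attention
```

Both results are valid distributions, but they differ whenever f is non-linear, which is exactly the gap discussed next.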
Since f (·) is a complex non-linear function ( §2), these two formulations are not equivalent in general, and the gap between the two distributions can be large. Concretely, vanilla MHA ignores any possible interactions among different heads. As indicated in equation 2, it marginalizes out every single head before observing the targets: one head will not learn what the other heads are attending to, despite the fact that Y is observed. This is why vanilla MHA is prone to redundancy, as many previous studies (Voita et al., 2019; Michel et al., 2019a, inter alia) have discovered. Head-colliding attention, on the other hand, permits rich head interactions due to the expressive non-linear function f (·) inside the expectation over the different latent variables a_1, . . . , a_h. However, the complexity of head interactions also leads to intractability in training the model, which we discuss in the next section.

Training Head-colliding Attention
We train the model by performing maximum likelihood estimation. Here, the log marginal likelihood can be expressed as:

L = log p(Y|X) = log E_{p(A|X)} [p(Y|A, X)].

Unfortunately, this is intractable in general because it requires marginalizing over all possible configurations of attention heads. The standard technique is to use variational inference, which optimizes the log marginal likelihood by maximizing its evidence lower bound (ELBO) (Blei et al., 2017):

L ≥ E_{q(A|X)} [log p(Y|A, X)] − KL(q(A|X) || p(A|X)),     (3)

where q(A|X) is the variational distribution 5 over the latent variables A. The gap between L and this bound is KL(q(A|X) || p(A|X, Y)), where p(A|X, Y) is the intractable posterior distribution of all heads given the observation Y and the input X, which encodes the rich head interactions we desire, as discussed in §3. Therefore, an ideal variational distribution q(A|X) should be close to the true posterior p(A|X, Y). In this case, the samples would accurately reflect the head interactions and the variational distribution would yield a tighter bound on L, facilitating training.
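The bound in equation 3 is typically estimated with Monte Carlo samples from q. A minimal sketch follows; the callables are hypothetical stand-ins for the model components, not the paper's code, and the KL term is estimated from the same samples rather than computed in closed form.

```python
import torch

def elbo_estimate(log_lik_fn, q_sample_fn, q_log_prob_fn, p_log_prob_fn,
                  n_samples=8):
    """Monte Carlo estimate of E_q[log p(Y|A,X)] - KL(q(A|X) || p(A|X)),
    with the KL term also estimated from the drawn samples."""
    total = 0.0
    for _ in range(n_samples):
        A = q_sample_fn()  # A ~ q(A|X), assumed reparameterizable
        total = total + log_lik_fn(A) - (q_log_prob_fn(A) - p_log_prob_fn(A))
    return total / n_samples

# sanity check: when q == p the KL term cancels, leaving E_q[log p(Y|A,X)]
q = torch.distributions.Normal(0.0, 1.0)
est = elbo_estimate(lambda A: -1.23, q.sample, q.log_prob, q.log_prob,
                    n_samples=4)
```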
A straightforward choice of q(A|X) is the mean-field approximation (Kingma and Welling, 2013):

q(A|X) = ∏_{i=1}^{h} q(a_i|X).

However, it has similar drawbacks as the vanilla MHA. 6 The mean-field approximation assumes the independence of different heads and hence the interactions are greatly limited.
Alternatively, one could parameterize q(A|X) using an auto-regressive model. 7 Although this is much more expressive, its sequential nature severely slows down training, making it infeasible in practice.
Cascaded Head-colliding attention Our solution to this problem is to employ hierarchical structures for head-colliding attention, where interactions among heads could be effectively incorporated into the model (Sønderby et al., 2016;Ranganath et al., 2016).
Conveniently, the hierarchical nature of the transformer architecture offers an effective way of constructing such proposal distributions. Given a transformer with L layers, we denote the sets of all attention heads at layers l − 1 and l as A^{l−1} and A^l, respectively. Following the bottom-up computation of the transformer, the distribution of A^l must rely on the instantiated values of A^{l−1}. In this sense, A^{l−1} can be seen as the common variables that govern A^l (Figure 1b). Formally, we have:

q(A^1, . . . , A^L | X) = ∏_{l=1}^{L} q(A^l | X, A^{l−1}).

Although each attention head a^l_i ∈ A^l at the l-th layer is conditionally independent of the others given A^{l−1}, they become dependent when we marginalize A^{l−1} out. In particular, the marginal distribution of each A^l becomes:

q(A^l | X) = E_{q(A^{l−1}|X)} [ ∏_{i=1}^{h} q(a^l_i | X, A^{l−1}) ].

This corresponds to an infinite mixture of the mean-field distributions q(A^l |X, A^{l−1}) and is able to capture rich head interactions (Ranganath et al., 2016). Our main model adopts this cascaded proposal distribution (Figure 1b), and we therefore name it cascaded head-colliding attention (CODA).
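The cascaded proposal can be sampled ancestrally, layer by layer. The following is a schematic sketch: `mus` stands for hypothetical per-layer base logits and `fuse_fns` for the mixing functions specified in the next part of this section; both names and shapes are our own illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def sample_cascade(mus, fuse_fns):
    """Ancestral sampling through the cascaded proposal: layer l's head
    logits are drawn conditioned on the sampled logits of layer l-1."""
    heads, z_prev = [], None
    for mu, fuse in zip(mus, fuse_fns):
        mean = mu if z_prev is None else mu + fuse(z_prev)
        z = mean + torch.randn_like(mean)   # z^l ~ N(mean, I)
        heads.append(F.softmax(z, dim=-1))  # A^l given A^(l-1)
        z_prev = z
    return heads

# two layers of a single 4x6 head; identity "fusion" just for illustration
mus = [torch.zeros(4, 6), torch.zeros(4, 6)]
heads = sample_cascade(mus, [lambda z: z, lambda z: z])
```

Even though each layer's draw is conditionally independent given the layer below, marginalizing out `z_prev` couples the heads, which is the mixture behavior described above.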
The only problem left now is how to specify the conditional distribution q(A^l |X, A^{l−1}) for all l = 1, 2, . . . , L. We first impose the basic constraints on head values as in vanilla MHA, that is, all head values must range within a simplex Δ^{n−1}:

a^l_{i,:k} ∈ Δ^{n−1}, i.e., 1^T a^l_{i,:k} = 1 and a^l_{i,:k} ≥ 0.

Here a^l_{i,:k} is the k-th column of the i-th attention head at layer l and 1 denotes the vector of all 1's. For efficient training and inference, we adopt Gaussian-logistic distributions (Blei and Lafferty, 2006; Cohen et al., 2008), which not only satisfy the constraints above but also benefit from the effective reparameterization trick (Kingma and Welling, 2013; Rezende et al., 2014; Titsias and Lázaro-Gredilla, 2014).
In particular, recall that in vanilla MHA, a_i = softmax(z_i) = softmax(Q_i K_i^T) (equation 1). We also denote the attention logits at the l-th layer as Z^l := {z^l_1, . . . , z^l_h}. For head i at layer l, we first sample from a multivariate Gaussian distribution q(z^l_{i,j:} | z^{l−1}_{i,j:}) 8 and pass the samples through a row-wise softmax function to yield the head values:

z^l_{i,j:} ~ N(μ^l_{i,j:}, Σ),    a^l_{i,j:} = softmax(z^l_{i,j:}),

where z^l_{i,j:} and a^l_{i,j:} represent the j-th row of the i-th attention logits and attention head at layer l, respectively.
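Per head, the Gaussian-logistic draw with the reparameterization trick keeps the sample differentiable with respect to the mean. A minimal sketch, where `mu` is a stand-in for the mean of one head's logits and the identity covariance follows the text:

```python
import torch
import torch.nn.functional as F

def sample_head(mu):
    """z ~ N(mu, I) via z = mu + eps (reparameterization trick), then a
    row-wise softmax so each row of the head lies on the simplex."""
    eps = torch.randn_like(mu)          # noise sampled outside the graph
    return F.softmax(mu + eps, dim=-1)

mu = torch.randn(4, 6, requires_grad=True)  # hypothetical 4 queries x 6 keys
a = sample_head(mu)
loss = (a * torch.arange(6.0)).sum()        # toy loss on the sampled head
loss.backward()                             # gradients flow back to mu
```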
To explicitly model hierarchical structures among attention heads, we propose to add a direct connection between attention heads at adjacent layers (Figure 1b). Such connections offer direct access to the attention information in the previous layer. Specifically, for each head i at layer l, we set the mean μ^l_i as the sum of two parts:

μ^l_i = Q_i K_i^T + σ_i(Z^{l−1}),     (4)

where σ_i(·) is a two-layer multilayer perceptron (MLP) that fuses information from the different heads in Z^{l−1} (see the cascading connections in Figure 1b for an illustration). We set the covariance matrix Σ to the identity matrix for all attention logits. We give the prior the same form as the variational posterior and share parameters between q(A^1, . . . , A^L |X) and p(A^1, . . . , A^L |X) in our objective (equation 3). With this parameter sharing, the KL term in equation 3 is cancelled out, since the two distributions are identical. 9 This choice works well in practice: it not only allows CODA to use almost the same number of parameters as the vanilla transformer, but also eliminates the need to invoke advanced training techniques for amortized variational inference. 10 More details can be found in Appendix A.

Experiments
We conduct experiments on language modeling and machine translation tasks.

Setup
Datasets First, we conduct experiments on token-level language modeling using the large-scale benchmark Wikitext-103 (Merity et al., 2016), which consists of articles from Wikipedia with around 103M/218K/246K tokens in the training/validation/test splits, respectively. The vocabulary size is 267,744. For machine translation, we consider two standard datasets: • WMT14 EN-DE (Bojar et al., 2014), which contains about 4.5M/3K/3K sentence pairs in the training/validation/test splits, respectively. We follow previous work (Peng et al., 2020) to preprocess the dataset, and obtain a shared vocabulary between the source and target languages of around 32K byte pair encoding (BPE, Sennrich et al. (2016)) types. • IWSLT14 DE-EN, a smaller-scale translation benchmark; the configuration we use is described in Appendix A.1.
Implementation details We implement our model with PyTorch (Paszke et al., 2019) and the FairSeq toolkit (Ott et al., 2019). In particular, our model is based on the vanilla transformer architecture (Vaswani et al., 2017). For CODA, we replace all vanilla MHA blocks with cascaded head-colliding attention, for both self attention and cross attention (if any). In language modeling, we use adaptive input embeddings (Baevski and Auli, 2019) and set the context size to 512 and 480 for training and testing respectively, due to constraints of computational resources. In machine translation, we set the beam size to 5 and adopt the hyperparameters from Peng et al. (2020) for IWSLT14 DE-EN. For WMT14 EN-DE we set the beam size to 4 and the length penalty to 0.6, and average the last 10 checkpoints for testing, following Vaswani et al. (2017). Further implementation details can be found in Appendix A.

Main results
The results of language modeling on the Wikitext-103 dataset are reported in Table 1. As we can see from the table, CODA barely introduces any additional parameters. However, by taking head interactions into account, CODA significantly outperforms TRANSFORMER by over 0.6 perplexity. For reference, we also report the best setting (denoted TRANSFORMER†) of Baevski and Auli (2019), which uses a much larger context size (3072/2560 vs. 512/480 for training/testing); CODA still outperforms it by a substantial margin of 0.3 perplexity. This indicates that encouraging head interactions can improve parameter efficiency.
To examine whether CODA promotes head interactions and reduces head redundancy, we qualitatively visualize the attention heads in both CODA and TRANSFORMER via heatmaps. Concretely, we compute the Jensen-Shannon divergence (JSD) between each pair of attention heads at the same layer.
[Table 1 caption: TRANSFORMER follows Baevski and Auli (2019) with the same context size as CODA (512/480 for training/testing), while TRANSFORMER† is the same model with the best setting in their paper, which uses a much larger context size (3072/2560); the result for TRANSFORMER† is as reported in Baevski and Auli (2019).]

In particular, to facilitate comparison, we assume head values define categorical distributions in both the TRANSFORMER and CODA models. That is, an attention head a_i induces n categorical distributions, one for each query position. The j-th distribution indicates how the j-th target position attends to all m source positions and is denoted by p(x|a_{i,j:}). For two heads i and i′, we first compute their average distribution:

m := (p(x|a_{i,j:}) + p(x|a_{i′,j:})) / 2.

The JSD value between the i-th and i′-th attention heads is then computed by summing over all n induced distributions:

JSD(i, i′) = Σ_{j=1}^{n} 1/2 (KL(p(x|a_{i,j:}) || m) + KL(p(x|a_{i′,j:}) || m)).

We average the computed JSDs over all validation samples. Note that a larger JSD value (darker color) indicates that two heads behave more differently (i.e., less redundancy between them), and vice versa. As shown in Figure 2, the JSD heatmaps of CODA are clearly darker than those of TRANSFORMER. This suggests that CODA permits richer head interactions, which fosters communication between different heads and encourages them to become complementary. Consequently, our model effectively reduces head redundancy in MHA and improves parameter efficiency.
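The JSD computation above can be sketched as follows (a re-implementation for illustration, not the paper's evaluation code; the small `eps` guards the logarithms against zero entries):

```python
import torch

def head_jsd(a_i, a_j, eps=1e-12):
    """Jensen-Shannon divergence between two attention heads, summed over
    the n query positions as in the expression above. a_i and a_j are
    (n, m) row-stochastic matrices."""
    m = 0.5 * (a_i + a_j)
    kl = lambda p, q: (p * ((p + eps).log() - (q + eps).log())).sum(dim=-1)
    return (0.5 * (kl(a_i, m) + kl(a_j, m))).sum()

# identical heads give JSD 0; disjoint one-hot heads give the maximum n*log 2
a = torch.eye(3)
b = torch.roll(a, 1, dims=0)
```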
The results on the IWSLT14 DE-EN and WMT14 EN-DE datasets are shown in Table 2, which demonstrates that CODA is more parameter efficient than the vanilla transformer due to the proposed cascaded head-colliding attention. As in the language modeling experiments, we also visualize head behaviors to measure attention head interactions (see Figure 5 and Figure 6 in Appendix B), where we observe similar phenomena on the translation tasks. Specifically, different heads in CODA are often complementary to each other and focus on quite different regions of the sequence, rather than becoming redundant or even identical, as observed in TRANSFORMER models.

Analysis: the effect of the number of attention heads
While one might hope that increasing the number of heads in MHA leads to a free ride to better performance, in practice this is often not the case, as vanilla MHA suffers from the problem of parameter redundancy. Following Vaswani et al. (2017), we vary the number of attention heads (4, 8, 16, 32) but keep the amount of computation constant. Our results on IWSLT14 DE-EN are shown in Table 3. We observe that the translation quality of the baseline transformer (which uses vanilla MHA as its main building block) decreases almost linearly as the number of attention heads increases (Figure 3), which agrees with previous studies (Vaswani et al., 2017; Voita et al., 2019; Michel et al., 2019b). Intuitively, since the total number of parameters in the model remains unchanged, more heads means fewer parameters allocated to each head, which limits the representational power of every single attention head. Due to the independence assumption between the heads, many of them tend to focus on similar regions of the sequence, leading to a great waste of modeling capacity.
In the case of CODA, we observe better BLEU scores in response to the increasing head number. The rich interactions in CODA encourage different heads to cover broader regions of the input sequence, which in turn offers more useful information for training. The perplexity (PPL) reflects similar trends. The coordination between different heads in CODA greatly improves the model's parameter efficiency.

[Figure 3: Left: BLEU scores on the test set for base transformers and CODA under different numbers of attention heads (higher is better); Right: perplexity on the validation set for base transformers and CODA under different numbers of attention heads (lower is better).]

Ablation analysis
In this section, we present an ablation study to investigate the effects of different components in CODA. Concretely, we compare four models on the IWSLT14 DE-EN machine translation task: (i) the full CODA model, (ii) a variant of CODA ablating the cascaded structure ( §4), (iii) a variant of CODA without head-colliding attention ( §3), and (iv) the baseline TRANSFORMER model.
In more detail, for model (ii) we remove the second term in equation 4, which turns off the direct cascading structure while still leaving a proper hierarchical latent variable model. 11 In model (iii), attention heads are deterministic (instead of being latent variables) as in vanilla transformers, but cascading connections are incorporated. We note its close connection to the recently proposed REALFORMER (He et al., 2020), a TRANSFORMER model that adds a residual connection between attention logits at adjacent layers. Since in model (iii) all attention heads are deterministic, it is unnecessary to fuse different heads (see §4). In this case, we simply implement model (iii) as a REALFORMER (and refer to it as such hereafter) to demonstrate the effect of cascading-like structures more clearly. 12 We report BLEU scores for translation quality and the Jensen-Shannon divergence (JSD) averaged over all head pairs of all MHA blocks for a quantitative evaluation of head interactions. As demonstrated in Table 4 and Figure 4, even without cascading connections for explicit hierarchical structures, head-colliding attention has the ability (albeit limited) to induce reasonable correlations among different heads, as reflected in the average JSD. This is due to the explaining-away effects and the native hierarchical structure of transformers, as discussed in §3. In CODA, because individual heads have access to the other heads from a probabilistic perspective, they are more prone to offering complementary information to each other to jointly explain the observed data. This effect is further enhanced when cascading connections are added to the model.
In contrast, if we simply incorporate such cascading connections into a vanilla TRANSFORMER model, we find that they do not significantly encourage head interactions and only improve the baseline marginally. In this case, the performance improvement might be mainly due to residual connections, which are often considered effective in facilitating training (He et al., 2016). Interestingly, we note a positive correlation between average JSD and BLEU, suggesting that encouraging complementary attention heads may help improve translation quality.

11 Note that the first term Q_i K_i^T in equation 4 also depends on the instantiated value of z^{l−1}_{i,j:}, which induces an implicit hierarchical dependency between the attention at adjacent layers.

12 The main difference between the residual connections in REALFORMER and the cascading connections in CODA is that the former directly performs a head-wise addition of the previous layer's attention logits; in contrast, our cascading connection makes use of an MLP σ(·) to mix different attention heads, which enhances head interactions in CODA.

Related Work
The attention mechanism was first applied to recurrent networks by Bahdanau et al. (2014). It was then extended to multi-head attention (MHA) and became the key component of transformer architectures (Vaswani et al., 2017).
To study the utility of multiple attention heads, Voita et al. (2019) focused on identifying individual contributions of each attention head. Michel et al. (2019a) conducted extensive experiments to demonstrate that pruning out most heads after training does not lead to a drop in performance during inference. You et al. (2020) further revealed that replacing learnable attention heads with samples from fixed Gaussian distributions can achieve almost the same performance as original models. Additionally, Behnke and Heafield (2020) proposed to iteratively prune attention heads during training based on the lottery ticket hypothesis. These works indicate that there is a lot of head redundancy in the MHA transformer architectures.
Instead of pruning unnecessary parameters and down-sizing transformer models, other works propose to improve parameter efficiency in transformers. For instance, Li et al. (2018) introduced a regularization term to explicitly promote diversity among different heads. Yang et al. (2019a) proposed to use convolutional kernels to capture correlations not only among local windows of sequences, but also among different heads. Another line of work considered each head as a sample from the same distribution and presented a sampling algorithm that avoids samples collapsing into local modes, hence explicitly encouraging repulsiveness in MHA. Besides, MAE (Peng et al., 2020) converted a vanilla MHA to a mixture-of-experts model, where each expert component activates only a subset of attention heads. With learned probabilities, different experts can be specialized on different inputs. Different from these works, CODA does not explicitly promote head diversity nor specialize different heads. Instead, we focus on studying head interactions from a probabilistic perspective, which reveals the close connection between vanilla MHA and CODA.
Another research line relating to our work is to incorporate latent variables into attention modules. Xu et al. (2015) investigated the connection between vanilla deterministic single-head attention and its stochastic counterpart. Deng et al. (2018) explored this further and proposed to use variational inference techniques for training the model. They considered both cases of discrete and continuous latent variables. Bayesian attention modules (Fan et al., 2020) introduced continuous latent distributions for attention that are amenable to reparameterization tricks. Our work is different from them in that we mainly investigate the MHA mechanism and aim to improve parameter-efficiency by recovering potential interactions among different heads, which are ignored in vanilla MHA.
Concurrently, He et al. (2020) proposed to add residual connections between attention scores at adjacent layers, similar to our cascading connections. Nevertheless, our motivation for using the cascaded structure is quite different: we aim to construct direct hierarchical dependencies for latent variable models, while He et al. (2020) are mainly motivated to improve transformer architectures and obtain performance gains.

Conclusion and Future Work
We present CODA by re-formulating multi-head attention (MHA) as a latent variable model from a probabilistic perspective. CODA explicitly models the interactions among attention heads through a hierarchical variational distribution. We conduct extensive experiments and demonstrate that CODA outperforms the transformer baseline in language modeling and machine translation. The analysis shows that CODA learns to encourage diversity in different heads and to promote parameter efficiency when increasing the number of heads.
Within this framework, we will be able to impose explicit constraints or regularization on different attention heads in a principled way (e.g., informative priors that promote diversity). Besides, we can also consider more expressive (data-driven) variational distributions. We leave these for future work. Our code is publicly available at https://github.com/LZhengisme/CODA.

A Implementation details
The σ network consists of a 2-layer MLP with LeakyReLU activations and a residual link from the input. It is a rather small network: the additionally introduced parameters are negligible compared to the model size, accounting for only 0.01-0.02% of the total. Recall that the number of attention heads is denoted by h, the source and target lengths are m and n respectively, and the batch size is denoted by b. The hidden size of the MLP is set to α·h, where we select α from {2, 4, 8} based on the validation set. Since we represent the attention scores (or logits) z as a multi-dimensional tensor of shape (b, h, n, m), we first transpose it to shape (b, m, n, h) and feed it into the σ network. The network then outputs h values, so that each component σ_i computes the fused information from all of the previous layer's attention heads. By adding its output to the current layer's attention logits, we effectively construct a direct cascading connection for our hierarchical proposal. Note that the σ network is shared neither among different heads nor among different layers.
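The description above can be put into code roughly as follows. This is a sketch under the stated shapes; the class name `CascadeSigma` and the choice α = 4 are our own naming and assumption.

```python
import torch
import torch.nn as nn

class CascadeSigma(nn.Module):
    """Sketch of the sigma network: a 2-layer MLP with LeakyReLU and a
    residual link from the input, mixing the h attention heads of the
    previous layer's logits."""
    def __init__(self, h, alpha=4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(h, alpha * h),
            nn.LeakyReLU(),
            nn.Linear(alpha * h, h),
        )

    def forward(self, z_prev):
        # z_prev: previous-layer attention logits of shape (b, h, n, m)
        x = z_prev.permute(0, 3, 2, 1)   # -> (b, m, n, h): heads last
        x = self.mlp(x) + x              # fuse heads, residual from input
        return x.permute(0, 3, 2, 1)     # back to (b, h, n, m)

sigma = CascadeSigma(h=8)
z = torch.randn(2, 8, 5, 7)              # hypothetical (b, h, n, m)
out = sigma(z)
```

The output is added to the current layer's attention logits to form the cascading connection.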

A.1 Machine translation
For WMT14 EN-DE, the transformer-base architecture of Vaswani et al. (2017) is used, where both the encoder and the decoder consist of 6 layers with hidden size 512. For the MHA blocks at each layer, the number of attention heads is set to 8 with the dimension of the hidden-layer representations being 512; for the feed-forward networks, the hidden size is set to 2048. The dropout rate is set to 0.1. For training, we follow the same setup as Vaswani et al. (2017): label smoothing with rate 0.1, the Adam optimizer (Kingma and Ba, 2014), an inverse square root learning rate schedule, and 4000 warm-up steps. For IWSLT14 DE-EN, we follow the hyperparameter configuration in the Fairseq package. 13 In detail, it mostly follows the same architecture and training setup as above, except that it uses a smaller feed-forward network with hidden dimension 1024, a larger dropout rate of 0.3, and fewer attention heads (4).
For both datasets, we apply compound-split post-processing to facilitate comparison. Additionally, we use activation dropout with rate 0.1 for all models on both datasets, as we find it helps our model converge better.

13 https://github.com/pytorch/fairseq/tree/master/examples/translation

A.2 Language modeling
For Wikitext-103, we base our model on Baevski and Auli (2019) with the same hyperparameter configuration and training setup. The model consists of 16 transformer layers and uses adaptive input representations, 8 heads for each MHA block, a dropout rate of 0.3, a hidden dimension of 1024, and a hidden size of 4096 for the feed-forward networks. For training, Nesterov's accelerated gradient (NAG) method (Sutskever et al., 2013) is used with gradient norm clipping and a cosine learning rate schedule.