Recurrent Attention for Neural Machine Translation

Recent research questions the importance of the dot-product self-attention in Transformer models and shows that most attention heads learn simple positional patterns. In this paper, we push further along this research line and propose a novel substitute for self-attention: Recurrent AtteNtion (RAN). RAN directly learns attention weights without any token-to-token interaction and further improves their capacity through layer-to-layer interaction. In an extensive set of experiments on 10 machine translation tasks, we find that RAN models are competitive and outperform their Transformer counterparts in certain scenarios, with fewer parameters and lower inference latency. In particular, applying RAN to the decoder of the Transformer brings consistent improvements of about +0.5 BLEU on 6 translation tasks and +1.0 BLEU on the Turkish-English translation task. In addition, we conduct extensive analysis of the attention weights of RAN to confirm their reasonableness. Our RAN is a promising alternative for building more effective and efficient NMT models.


Introduction
Transformer models have achieved remarkable success in Neural Machine Translation (NMT) (Vaswani et al., 2017; Freitag and Firat, 2020; Fan et al., 2020). One of the most crucial components of the Transformer is the dot-product multi-head self-attention, which is essential for learning relationships between words as well as complex structural representations. However, many studies have shown that the pairwise self-attention is over-parameterized and leads to costly inference (Sanh et al., 2019; Correia et al., 2019; Xiao et al., 2019). Based on these observations, various improved networks have been proposed, either pruning negligible heads (Voita et al., 2019; Michel et al., 2019) or replacing self-attention with more efficient alternatives (Xu et al., 2019; Wu et al., 2019; Kitaev et al., 2020; Beltagy et al., 2020).

* Corresponding author.
More recently, several studies take this direction to an extreme by replacing dot-product self-attention with fixed or trainable global position-based attentions (Tay et al., 2020; You et al., 2020; Raganato et al., 2020). For example, You et al. (2020) roughly modeled attention weights as hard-coded Gaussian distributions, based on the observation that most heads focus their attention on a local neighborhood.
Another representative method is Random Synthesizer, proposed by Tay et al. (2020). Different from You et al. (2020), they simply treat the attention weights of all heads in each layer as trainable parameters. At inference time, the attention weights are directly retrieved based on the index of the query, without any dot-product operation. However, these variants are not ideal alternatives to self-attention due to their unsatisfactory performance.
In this paper, we go further along this research line and show that self-attention is empirically replaceable. We propose a novel attention mechanism: Recurrent AtteNtion (RAN). Specifically, RAN starts with an unnormalized Initial Attention Matrix for each head, which is randomly initialized and trained together with the other model parameters. Then we introduce a Recurrent Transition Module, which takes the Initial Attention Matrices as input and refines them through layer-wise interaction between adjacent layers. The motivation for the Recurrent Transition Module is the observation that attention weights show regular patterns and are correlated across layers (Xiao et al., 2019; He et al., 2020). Our RAN not only discards the expensive pairwise dot-product of self-attention but also exploits the correlation between attention weights of different layers, achieving a more efficient and compact NMT model. 1

To verify the effectiveness of RAN, we conduct experiments on a wide range of translation tasks involving 10 language pairs. Compared with a vanilla Transformer, our RAN shows competitive or better performance with lower latency and fewer parameters. Extensive analysis of the learned RAN weights shows that the learned attention patterns are reasonable and explainable, which accounts for the improvement.

Our Method
In this section, we first give a brief introduction to self-attention, referring readers to the original paper (Vaswani et al., 2017) for details. Then, we introduce the proposed recurrent attention mechanism in detail. Figure 1 depicts the scaled dot-product self-attention, detailing only the computation of the k-th head in the l-th encoder layer. Given a sequence of token representations of length n, the self-attention model first converts the representations into three matrices Q_l^k ∈ R^{n×d_k}, K_l^k ∈ R^{n×d_k} and V_l^k ∈ R^{n×d_k}, representing queries, keys, and values, respectively, where d_k is the dimensionality of the vectors in the k-th head. Then, the attention matrix is calculated via the dot product of queries and keys followed by rescaling:

A_l^k = Q_l^k (K_l^k)^T / √(d_k)    (1)

where A_l^k is an n × n matrix. Finally, a softmax operation is applied to this unnormalized attention matrix, and the result is used to compute a weighted sum of values:

H_l^k = softmax(A_l^k) V_l^k    (2)

where H_l^k is the new contextual representation of the l-th layer. This procedure is implemented with the multi-head mechanism by projecting the input into different subspaces, which requires extra splitting and concatenation operations. The output is fed into a position-wise feed-forward network to obtain the final representations of this layer.
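As a reference point, the per-head computation of Eqs. (1) and (2) can be sketched in NumPy; the function and variable names here are illustrative, not taken from the paper's released code:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_head(X, W_q, W_k, W_v):
    """One head of scaled dot-product self-attention.

    X: (n, d_model) token representations; W_*: (d_model, d_k) projections.
    Returns H: (n, d_k), the contextual representations of this head.
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = Q.shape[-1]
    A = Q @ K.T / np.sqrt(d_k)      # Eq. (1): unnormalized n x n attention matrix
    return softmax(A) @ V           # Eq. (2): weighted sum of values
```

Note that the n × n matrix A must be recomputed for every input sequence, which is the cost RAN removes.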
While flexible, the pair-wise calculation has been shown to contain redundant information and can be replaced by simpler positional attention patterns. In the next section, we propose an extreme version of multi-head self-attention that removes the dot-product entirely.

1 We release source code at https://github.com/lemon0830/RAN.

Figure 1: An overview of standard self-attention.

RAN: Recurrent Attention
We propose Recurrent AtteNtion (RAN) as an alternative to multi-head self-attention. RAN consists of a set of global Initial Attention Matrices and a Recurrent Transition Module. The original self-attention derives keys and queries from the same token representations and computes attention weights on the fly following Eq. (1), repeating this in every layer. Our RAN instead learns a set of global attention matrices A_0 = {A_0^1, ..., A_0^k, ..., A_0^h}, A_0^k ∈ R^{n×n}, where h denotes the total number of heads. We denote A_0 as the Initial Attention Matrices, which are initialized together with the other parameters. Then, we propose a simple but effective Recurrent Transition Module. This module takes A_0 as input and recursively updates the attention matrices layer by layer. During training, we jointly optimize A_0, the Recurrent Transition Module, and the other components. During inference, the attention matrices are completely agnostic to the input representations and can be retrieved directly without recomputation.
For ease of understanding, we illustrate the computation of the k-th head in the l-th encoder layer in Figure 2. Instead of producing attention weights by the dot product of Q_l^k and K_l^k, we generate the attention matrix A_l^k by applying the recurrent transition module Rec(·) to the attention matrix A_{l-1}^k from the previous layer. After obtaining A_l^k, we compute the weighted sum of values by Eq. (2).
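This per-head computation can be sketched as follows (NumPy; the concrete form of Rec(·) is a placeholder here, and all sizes are toy assumptions, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_k = 6, 8                       # toy sequence length and head dimension

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Initial Attention Matrix A_0^k for one head: a trained parameter in the
# real model, randomly initialized here purely for illustration.
A_prev = rng.normal(size=(n, n))

def rec(A):
    """Stand-in for the Recurrent Transition Module Rec(.)
    (detailed in the next section); an arbitrary placeholder refinement."""
    return A + np.tanh(A)

V = rng.normal(size=(n, d_k))       # values are still derived from the tokens

A_l = rec(A_prev)                   # no Q/K dot-product: attention is input-agnostic
H_l = softmax(A_l) @ V              # reuse Eq. (2) with the refined matrix
```

The key point is that `A_l` depends only on trained parameters, so at inference it can be precomputed once rather than recomputed per input.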
We introduce RAN to the encoder self-attention, the decoder self-attention, and both of them in our experiments, respectively. We do not consider the cross-attention between the encoder and decoder because of the poor performance of fixed positional attention patterns in that role shown in previous work (You et al., 2020; Tay et al., 2020). When applying RAN to the decoder self-attention, the modeling process is identical to that of the encoder, except that only the lower triangular part of each Initial Attention Matrix is used, due to the causal language modeling objective.
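Restricting a decoder matrix to its lower triangle can be sketched as below (toy values; we assume, as is standard, that masked entries are set to −inf so they vanish under the subsequent softmax):

```python
import numpy as np

n = 4
A0 = np.arange(1.0, n * n + 1).reshape(n, n)   # toy Initial Attention Matrix

# Keep only the lower triangle so position t cannot attend to future tokens.
mask = np.tril(np.ones((n, n), dtype=bool))
A0_causal = np.where(mask, A0, -np.inf)
```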

Recurrent Transition Module
This section details the composition of the Recurrent Transition Module. The transition module can be implemented in various ways, such as a position-wise feed-forward network (FFN) (Vaswani et al., 2017), a GRU (Cho et al., 2014), or an LSTM (Hochreiter and Schmidhuber, 1997). In this paper, we simply use a single feed-forward network with tanh as its activation function, followed by layer normalization and a residual connection:

A_l^k = Rec(A_{l-1}^k) = LayerNorm(A_{l-1}^k + tanh(A_{l-1}^k W + b))

Notably, we share the parameters of the transition module across all heads and all layers. Our RAN has no interaction between queries and keys and is thus more efficient than the dot-product self-attention. In contrast to fixed attention patterns (You et al., 2020; Raganato et al., 2020), the learnable Initial Attention Matrices and the Recurrent Transition Module make RAN flexible enough to learn different attention distributions for different translation tasks. Compared to Random Synthesizer (Tay et al., 2020), our RAN is more likely to learn better context representations thanks to the recurrent transition. In terms of parameters, RAN only needs h attention matrices and one linear layer, whereas Synthesizer needs h × L attention matrices; RAN is therefore superior in reducing the overall parameter count.
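Under the (assumed) reading that the shared linear layer acts on each row of the n × n attention matrix, the module can be sketched as:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Layer normalization over the last axis (no learned gain/bias here)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

class RecurrentTransition:
    """FFN(tanh) + residual + layer norm; one instance is shared across
    all heads and all layers. Applying W row-wise is our assumption
    about the unstated axis."""
    def __init__(self, n, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.02, size=(n, n))
        self.b = np.zeros(n)

    def __call__(self, A):
        # A_l^k = LayerNorm(A_{l-1}^k + tanh(A_{l-1}^k W + b))
        return layer_norm(A + np.tanh(A @ self.W + self.b))
```

Because the same `RecurrentTransition` instance is reused everywhere, it adds only one n × n weight matrix and one bias to the model, regardless of the number of heads or layers.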

Experiment
In this section, we evaluate RAN on WMT and NIST translation tasks covering 10 language pairs altogether. We apply RAN to the encoder (RAN-E), the decoder (RAN-D), or both (RAN-ALL), respectively. As baselines, we compare against the standard Transformer (TransF for short) (Vaswani et al., 2017) and the two most closely related works: Hard-coded Transformer (HC-SA) (You et al., 2020) and Random Synthesizer (Syn-R) (Tay et al., 2020).

Settings
Our corpora come from three sources, and the scales of the bilingual corpora range from 210K to 36M sentence pairs: • WMT2014 (En⇔De, En⇒Fr). We use the English-German and English-French corpora, which comprise 4.5 million and 36 million sentence pairs, respectively. We choose newstest2013 as the validation set and newstest2014 as the test set.
• NIST12 (Zh⇒En). We use part of the bitext of NIST OpenMT12 2 as the training set, which consists of 1.9M sentence pairs. The validation set is MT02, and the test sets are MT03, MT04, MT05, MT06, and MT08. We report the average score over all test sets. In terms of data preprocessing, for Chinese we segment all sentences with the word segmentation toolkit THULAC. 3 For the other languages, we run the official WMT script for tokenization. All sentences of more than 256 words are removed, and the rest are encoded using byte-pair encoding. 4 We use a joint vocabulary of 40K tokens for the En-De and En-Fr language pairs, a joint vocabulary of 32K tokens for the others, and a separate vocabulary of 32K tokens for Zh-En.
We use the standard BASE configuration of the Transformer, which consists of a 6-layer encoder and a 6-layer decoder. By default, we set d_k = d_v = 512 and use 2,048 hidden units in the FFN sub-layers. The residual dropout is 0.1. For RANs, we set the attention dropout to 0.2 to avoid over-fitting, except on En-Fr. For HC-SA, we follow the setting of You et al. (2020), replacing the encoder self-attention with distributions centered around i − 1 and i + 1 and the decoder self-attention with distributions centered around i − 1 and i, with the standard deviation set to 1.0. All models are trained for 150K steps, except WMT14 (250K steps) and En-Tr (20K steps). Training is performed on 8 × V100 GPUs for all language pairs except En-Tr and Zh-En, which use 2. When decoding, we use a beam width of 4, a length penalty of 0.6 for the WMT tasks, and a length penalty of 1.0 for the Zh-En task. We report the case-sensitive BLEU (Papineni et al., 2002) with multi-bleu.perl 5 and the detokenized BLEU score with SacreBLEU 6 (Post, 2018) of the best checkpoint on the validation set.

3 https://github.com/thunlp/THULAC-Python
4 https://github.com/rsennrich/subword-nmt

Main Result
First, we use RAN to replace the self-attention of the encoder or the decoder, respectively. Table 1 shows the overall results on the 10 language pairs. Our RAN models consistently yield competitive or even better results than TransF on all datasets. Concretely, RAN-E, RAN-D, and RAN-ALL achieve 0.13/0.16, 0.48/0.44, and 0.16/0.22 more average BLEU/SacreBLEU, respectively. Although different languages have different linguistic and syntactic structures, RAN can learn reasonable global attention patterns over the whole training corpus.

Figure 3: Translation speed (tokens/sec) for varying batch sizes and beam sizes. We set the beam size to 4 when investigating the effect of batch size, and the batch size to 100 when exploring the influence of beam size.
It is interesting that RAN-D performs best, significantly outperforming TransF on most of the language pairs. The biggest performance gain comes from the low-resource translation task Tr⇒En, where RAN-D outperforms TransF by 0.97/1.0 BLEU/SacreBLEU points. We conjecture that position-based attention without token-wise interaction is easier to learn, and that our RAN is able to capture more generalized attention patterns. By contrast, the dot-product self-attention is forced to learn semantic relationships between tokens and may fall into sub-optimal local minima, especially when the training data is small. This observation is consistent with that of Raganato et al. (2020). In brief, the improvement indicates that NMT systems can benefit from simplified decoders when training data is insufficient. Besides, although both RAN-E and RAN-D are effective, we find that their gains are not cumulative.
Next, we compare RAN with the two related methods. To be fair, we only compare RAN-ALL against them, since both encoder and decoder self-attention are replaced in those two papers. From the table, we can see that the two methods significantly degrade performance relative to TransF, while our model bridges the performance gap between the Transformer and models without dot-product self-attention, demonstrating the effectiveness of RAN.

Decoding Speedups
We plot the decoding speed as a function of batch size and beam size in Figure 3. Each experiment is conducted in the same hardware environment, and the numbers are averages of 3 individual runs. To maximize the speedup, we consider the RAN-ALL setting, where both encoder and decoder are accelerated. We can see that RAN-ALL speeds up decoding by up to 23.6% with a batch size of 100. Across beam sizes, RAN-ALL shows a consistent speedup of about 1.2×. Note that previous studies simplifying attention mechanisms (You et al., 2020; Wu et al., 2019; Michel et al., 2019) also report efficiency improvements of similar magnitude.

Analysis
In order to better understand RAN, we conduct comprehensive empirical studies on its behavior on the WMT14 En⇒De test set.

Distribution of RAN Weights
In this experiment, we investigate the differences between the attention distributions learned by the different models. To this end, we first follow Tang et al. (2019) and measure the concentration of an attention distribution with attention entropy (Ghader and Monz, 2017):

E(x_t) = −Σ_i A(x_t, x_i) log A(x_t, x_i)

where x_i denotes the i-th token and A(x_t, x_i) represents the attention distribution at timestep t. Then, we average the attention entropy over all timesteps, and then over all heads in each layer. Figure 4 displays the entropy of the attention distributions. As for the encoder, the attention distribution of TransF has the lowest entropy; it first becomes dispersed and then concentrated again across layers. The attention entropy of Syn-R is clearly higher, and its attention distribution is close to uniform. In contrast, the attention distribution of RAN-ALL is uniform in the first layer and becomes increasingly concentrated, indicating that the RAN encoder extracts more local information in the higher layers. The observations for TransF and Syn-R hold in the decoder, while RAN-ALL shows clearly low entropy in all decoder layers. Moreover, we show each model's attention histograms at layers 1, 3, and 6 in Figure 5. In the encoder, the weights of Syn-R and RAN-ALL tend to be dispersed. In the decoder, the weights of TransF and RAN-ALL stay near 0 and have smaller variance, while Syn-R's weights remain dispersed.
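The entropy measure can be sketched as follows (NumPy; we assume the rows of the attention tensor are already softmax-normalized):

```python
import numpy as np

def attention_entropy(A):
    """Mean attention entropy of one layer.

    A: (heads, n, n) attention weights, each row a distribution over
    positions. Returns the entropy averaged over timesteps and heads.
    """
    eps = 1e-12                                   # guard against log(0)
    ent = -np.sum(A * np.log(A + eps), axis=-1)   # entropy per (head, timestep)
    return float(ent.mean())
```

A uniform distribution over n positions yields the maximum entropy log n, while a one-hot (fully concentrated) distribution yields an entropy near 0.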

Visualization of RAN Weights
Since the learned attention weight matrices of RAN are independent of the input tokens, we can easily visualize the attention patterns of RAN over positions. 7

Figure 6: Attention patterns of RAN (self-attention of encoder; self-attention of decoder).

In Figure 6, we find that in the encoder, RAN focuses its attention on a local neighborhood around each position. Specifically, in the last layer of the encoder, the weights become more concentrated, potentially because the hidden representations are already contextualized. Interestingly, besides attending to a local window around the current position, the decoder weights are mostly concentrated on the first token of the target sequence. This may reveal a mechanism of decoder self-attention whereby the RAN decoder relies on a global sentence representation aggregated at the start token.

Analysis of Attention Weights across Layers
To explore the similarity of the attention weights under the different attention mechanisms, we display the Jensen-Shannon divergence (Lin, 1991) between the attention of each pair of layers in Figure 7. The conclusions are as follows. First, the attention similarity in TransF is not salient, but the attention distributions of adjacent layers are similar to some extent. Second, no noticeable patterns are found in Syn-R. Third, for RAN-ALL, the attention similarity is high, especially in the decoder (the JS divergence ranges from 0.08 to 0.2), and is most pronounced between adjacent layers.
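For reference, the JS divergence between two attention distributions can be computed as below (natural-log version, bounded by log 2; the epsilon smoothing is our implementation detail):

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """Kullback-Leibler divergence KL(p || q) for discrete distributions."""
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by log(2)."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```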

RAN vs. Positional Embedding
Positional embeddings are very important to the Transformer, making the model aware of word order. Our RAN learns input-agnostic global attentions, which implicitly encode positional information. To verify this point, we compare several variants of RAN with positional embeddings removed on the En⇒De translation task, as shown in Table 2. Removing the encoder positional embeddings leads to a catastrophic performance degradation of over 14 points. This gap can be recovered by replacing multi-head attention with RAN. TransF is only marginally affected by removing the decoder positional embeddings. After applying RAN to the decoder, we obtain even better performance than TransF. This demonstrates that our RAN indeed captures positional information.

Ablation Study
To analyze the impact of the different components of RAN, we investigate two variants: (1) RAN-ALL fixed, where we fix the Initial Attention Matrices at their random initialization and do not train them; (2) RAN-ALL w/o LN&RES, where we remove the layer normalization and residual connection in the Recurrent Transition Module. The results on the En⇒De and En⇒Fr translation tasks are listed in Table 3. Surprisingly, we find that fixing the Initial Attention Matrices does not lead to significant performance degradation (−0.1 ∼ −0.2 BLEU). This shows we can further reduce the parameter count by fixing the Initial Attention Matrices. Moreover, removing the layer normalization and residual connection leads to a performance drop, which illustrates their effectiveness.

Effects on Sentence Length
We divide the WMT14 En⇒De test set into seven bins by source and target sentence length, respectively, and plot the BLEU of each model for each bin in Figure 8. We observe that RANs yield better performance on short and medium-length sentences, but are less effective on long sentences. Specifically, RAN-E performs worse than TransF on long source sentences. The improvement of RAN-D mainly comes from better translation of sentences shorter than 50 tokens, together with promising performance on long sentences.

Application to Other Generation Tasks
To investigate the generalization of RAN, we conduct additional experiments on abstractive summarization using the CNN/Dailymail dataset and on dialogue generation using the PersonaChat dataset. 8 On the summarization task, all models are trained for 300K steps on 2 GPUs with a batch size of 128 sentences. For dialogue generation, we segment all dialogues with the BERT tokenizer 9 and train a SMALL Transformer for 20K steps. We use NLG_Eval 10 for evaluation and report the results in Table 4. RANs achieve competitive results compared to TransF, which demonstrates that RAN generalizes to other generation tasks.

Related Work
The introduction of attention mechanisms into NMT can be traced back to Bahdanau et al. (2015) and Luong et al. (2015), where they were used to learn soft word alignments between language pairs. Due to the significant improvements in translation quality, attention models have become a critical component of NMT models. More recently, Vaswani et al. (2017) proposed the Transformer, which achieved state-of-the-art results and soon became the most popular NMT architecture. The Self-Attention Network (SAN), which plays an important role in the Transformer, has been investigated and analyzed by a number of recent studies (Sanh et al., 2019; Voita et al., 2019; Michel et al., 2019). These studies have shown that Transformer models are over-parameterized and that self-attention learns redundant information that can be pruned in various ways.
These observations motivate many attempts to improve SAN, including 1) improving its computational efficiency and 2) completely replacing it with fixed or learnable global attention patterns. For the former, several studies bias attention distributions towards more local areas (Yang et al., 2018; Xu et al., 2019; Cui et al., 2019) or replace SAN with convolutional modules (Wu et al., 2019), which are more in line with linguistic expectations. Xiao et al. (2019) share attention weights in adjacent layers, enabling efficient re-use of hidden states in a vertical manner.
On the other hand, given that most attention heads learn simple, often positional patterns, many researchers turn to substituting instance-wise self-attention with global position-based attention patterns. Concretely, prior work uses average attention models in the decoder of the Transformer. You et al. (2020) model the attention distributions as hard-coded Gaussians, and Raganato et al. (2020) replace all but one attention head of each encoder layer with purely position-based attention patterns. More recently, Tay et al. (2020) propose Random Synthesizer, in which the attention matrices are trainable parameters that are randomly initialized and trained together with the other model parameters.
Overall, our work belongs to the second line of approaches and is most closely related to You et al. (2020) and Tay et al. (2020). Unlike You et al. (2020), who apply hard-coded Gaussian attention focused on local windows, RAN can learn more flexible attention distributions. Tay et al. (2020) allocate a separate learnable attention matrix to every head in each layer; such a large number of individual matrices is hard to train and does not reduce the overall parameter count at all. In contrast, our RAN uses a recurrent mechanism to refine the learnable attention matrices layer by layer, improving model capacity while saving parameters and modeling the relationships between the attention of adjacent layers.

Conclusion
In this paper, we considered a simpler Transformer architecture for NMT without the costly dot-product self-attention. To this end, we proposed a novel recurrent attention mechanism (RAN), which takes the Initial Attention Matrices as a whole and updates them layer by layer with a Recurrent Transition Module. Experiments on 10 representative translation tasks show the effectiveness of RAN. In the future, we will explore the application of RAN to cross-attention.