When Attention Meets Fast Recurrence: Training Language Models with Reduced Compute

Large language models have become increasingly difficult to train because of the growing computation time and cost. In this work, we present SRU++, a highly-efficient architecture that combines fast recurrence and attention for sequence modeling. SRU++ exhibits strong modeling capacity and training efficiency. On standard language modeling tasks such as Enwik8, Wiki-103 and Billion Word datasets, our model obtains better bits-per-character and perplexity while using 3x-10x less training cost compared to top-performing Transformer models. For instance, our model achieves a state-of-the-art result on the Enwik8 dataset using 1.6 days of training on an 8-GPU machine. We further demonstrate that SRU++ requires minimal attention for near state-of-the-art performance. Our results suggest jointly leveraging fast recurrence with little attention as a promising direction for accelerating model training and inference.


Introduction
Many recent advances in language modeling have come from leveraging ever larger datasets and model architectures. As a result, the associated computation cost for developing such models has grown enormously, requiring hundreds of GPU hours or days per experiment, and raising concerns about the environmental sustainability of current research (Schwartz et al., 2020). It has therefore become imperative to build computationally efficient models that retain top modeling power while reducing computational costs.
The Transformer architecture (Vaswani et al., 2017) was proposed to accelerate model training and has become the predominant architecture in NLP. Specifically, it is built entirely upon self-attention and avoids the use of recurrence to enable strong parallelization. While this change has led to many empirical successes and improved computational efficiency, we are interested in revisiting the architectural question: Is attention all we need for modeling?
¹ Our code, experimental setup and models are available at https://github.com/asappresearch/sru.
Figure 1: Bits-per-character on ENWIK8 dev set vs. GPU hours used for training. SRU++ obtains better BPC by using 1/8 of the resources. We compare with Transformer-XL as it is one of the strongest models on the datasets tested. Models are trained with single precision and comparable training settings.
The attention mechanism permits learning dependencies between any parts of the input, making it an extremely powerful neural component in many machine learning applications (Bahdanau et al., 2015; Lin et al., 2017). We hypothesize that this advantage can still be complemented with other computation that is directly designed for sequential modeling. Indeed, several recent works have studied and confirmed the same hypothesis by leveraging recurrence in conjunction with attention. For example, Merity (2019) demonstrates that single-headed attention LSTMs can produce results competitive to Transformer models in language modeling. Other works have incorporated RNNs into the Transformer and obtained better results in machine translation (Lei et al., 2018; Hao et al., 2019) and language understanding benchmarks (Huang et al., 2020). These results highlight one possibility: we could build more efficient models by combining attention and fast recurrent networks (Zhang and Sennrich, 2019).
In this work, we validate this idea and present a self-attentive recurrent unit that achieves strong computational efficiency. Our work builds upon the SRU (Lei et al., 2018), a highly parallelizable RNN implementation that has been shown to be effective in language and speech applications (Park et al., 2018; Kim et al., 2019; Hsu et al., 2020; Shangguan et al., 2019). We incorporate attention into the SRU by simply replacing the linear transformation of input with a self-attention component. The proposed architecture, called SRU++, enjoys enhanced modeling capacity and remains equally parallelizable. Figure 1 compares its performance with the Transformer-XL model (Dai et al., 2019) on the ENWIK8 dataset. SRU++ achieves better results while using a fraction of the training resources needed by the baseline.
We evaluate SRU++ on standard language modeling benchmarks including the ENWIK8, WIKI-103 and BILLION WORD datasets. SRU++ consistently outperforms various Transformer models on these datasets, delivering better or on par results while using 3x-10x less computation. Our model does not use positional encoding, multi-head attention or other techniques useful to Transformer models. Furthermore, we demonstrate that a couple of attention layers are sufficient for SRU++ to obtain near state-of-the-art performance. These changes not only highlight the effectiveness of recurrence but also enable strong computation reduction in training and inference. Finally, we also showcase the effectiveness of SRU++ on the IWSLT'14 De→En translation task, and open source our implementation in PyTorch to facilitate future research.

Background: SRU
We first describe the Simple Recurrent Unit (SRU) in this section. A single layer of SRU involves the following computation:
$$
\begin{aligned}
f[t] &= \sigma\left(W x[t] + v \odot c[t-1] + b\right) \\
r[t] &= \sigma\left(W' x[t] + v' \odot c[t-1] + b'\right) \\
c[t] &= f[t] \odot c[t-1] + (1 - f[t]) \odot \left(W'' x[t]\right) \\
h[t] &= r[t] \odot c[t] + (1 - r[t]) \odot x[t]
\end{aligned}
$$
where ⊙ is the element-wise multiplication, W, W' and W'' are parameter matrices and v, v', b and b' are parameter vectors to be learnt during training. The SRU architecture consists of a light recurrence component which successively computes the hidden states c[t] by reading the input vector x[t] for each step t. The computation resembles other gated recurrent networks such as LSTM (Hochreiter and Schmidhuber, 1997) and GRU (Cho et al., 2014). Specifically, the state vector c[t] is a weighted average between the previous state c[t-1] and a linear transformation of the input W''x[t]. The weighted aggregation is controlled by a forget gate f[t] which is a sigmoid function over the current input and hidden state. Once the internal state c[t] is produced, SRU uses a highway network to introduce a skip connection and compute the final output state h[t]. Similarly, the information flow in the highway network is controlled by a reset gate r[t].
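To make the recurrence concrete, here is a minimal NumPy sketch of one SRU layer processed step by step. It is illustrative only and is not the optimized CUDA implementation; the function and argument names (`sru_layer`, `W_f`, `W_r`, `W_c`, and so on) are our own labels for the forget-gate, reset-gate and candidate parameters, and the zero initial state is an assumption.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sru_layer(X, W_f, W_r, W_c, v_f, v_r, b_f, b_r, c0=None):
    """Run one SRU layer over a sequence, one time step at a time.

    X: (L, d) inputs; W_f, W_r, W_c: (d, d) matrices; v_*, b_*: (d,) vectors.
    Returns the output states H of shape (L, d) and the final state c of shape (d,).
    """
    L, d = X.shape
    c = np.zeros(d) if c0 is None else c0  # assumed zero initial state
    H = np.empty((L, d))
    for t in range(L):
        x = X[t]
        f = sigmoid(W_f @ x + v_f * c + b_f)  # forget gate, uses c[t-1]
        r = sigmoid(W_r @ x + v_r * c + b_r)  # reset gate, uses c[t-1]
        c = f * c + (1.0 - f) * (W_c @ x)     # light recurrence: weighted average
        H[t] = r * c + (1.0 - r) * x          # highway (skip) connection
    return H, c
```

Note that the only operations inside the loop that involve the previous state are element-wise, which is what makes the recurrence cheap once the matrix products are known.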
Two important code-level optimizations are performed to enhance the parallelism and speed of SRU. First, given the input sequence, SRU combines the three matrix multiplications across all time steps into a single multiplication. This significantly improves the computation intensity (e.g. GPU utilization). Specifically, the batched multiplication is a linear projection of the input tensor X ∈ R^{L×d}:
$$
U^\top = \begin{pmatrix} W \\ W' \\ W'' \end{pmatrix} X^\top \tag{1}
$$
where U ∈ R^{L×3×d} is the output tensor, L is the sequence length and d is the hidden state size. The second optimization performs all element-wise operations in an efficient way. This involves
$$
\begin{aligned}
f[t] &= \sigma\left(U[t, 0] + v \odot c[t-1] + b\right) \\
r[t] &= \sigma\left(U[t, 1] + v' \odot c[t-1] + b'\right) \\
c[t] &= f[t] \odot c[t-1] + (1 - f[t]) \odot U[t, 2] \\
h[t] &= r[t] \odot c[t] + (1 - r[t]) \odot x[t]
\end{aligned}
$$
Similar to other built-in operations such as attention and cuDNN LSTM (Appleyard et al., 2016), SRU implements all these operations as a single CUDA kernel to accelerate computation. Note that each dimension of the hidden vectors is independent once U is computed. The computation can run in parallel across each hidden dimension (and each input sequence given a mini-batch of multiple sequences).
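The first optimization can be checked numerically. The sketch below stacks three projection matrices into one weight of shape (3d, d), applies it to the whole sequence at once, and reshapes the result so that `U[t, i]` holds the i-th projection of the input at step t. The helper name `batched_projection` and the argument names are hypothetical, chosen only for this illustration.

```python
import numpy as np

def batched_projection(X, W0, W1, W2):
    """Compute all three input projections for every time step in one matmul.

    X: (L, d) input sequence; W0, W1, W2: (d, d) projection matrices.
    Returns U of shape (L, 3, d) with U[t, i] equal to W_i @ X[t].
    """
    L, d = X.shape
    W_stacked = np.concatenate([W0, W1, W2], axis=0)  # (3d, d)
    U_T = W_stacked @ X.T                             # (3d, L), single multiplication
    return U_T.T.reshape(L, 3, d)
```

A single large multiplication of this kind typically keeps the accelerator busier than 3L small matrix-vector products, which is the point made in the text.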

SRU++
The key modification of SRU++ is to incorporate more expressive non-linear operations into the recurrent network. Note that the computation of U (Equation 1) is a linear transformation of the input sequence X. We can replace this linear transformation with a self-attention operation to enhance modeling capacity. Specifically, given the input sequence represented as a matrix X ∈ R^{L×d}, the attention component computes the query, key and value representations using the following multiplications,
$$
Q = W^q X^\top, \qquad K = W^k Q, \qquad V = W^v Q
$$
where W^q ∈ R^{d'×d} and W^k, W^v ∈ R^{d'×d'} are parameter matrices, and d' is the attention dimension that is typically much smaller than d. Note that the keys K and values V are computed using Q instead of X such that the weight matrices W^k and W^v are significantly smaller. We also tested another variant in which we first project X' = W X^T into the lower dimension d', and then apply three independent d'-by-d' matrix multiplications over X' to obtain the query, key and value representations. This variant achieves similar results.
Next, we compute a weighted average output A ∈ R^{d'×L} using the scaled dot-product attention introduced in Vaswani et al. (2017),
$$
A^\top = \mathrm{softmax}\left(\frac{Q^\top K}{\sqrt{d'}}\right) V^\top
$$
The final output U required by the elementwise recurrence is obtained by another linear projection,
$$
U^\top = W^o \left(Q + \alpha \cdot A\right)
$$
where α ∈ R is a learned scalar and W^o ∈ R^{3d×d'} is a parameter matrix. Q + α · A is a residual connection which improves gradient propagation and stabilizes training. We initialize α to zero and as a result,
$$
U^\top = W^o Q = \left(W^o W^q\right) X^\top
$$
initially falls back to a linear transformation of the input X, skipping the attention transformation. Intuitively, skipping attention encourages leveraging recurrence to capture sequential patterns during the early stage of training. As |α| grows, the attention mechanism can learn long-range dependencies for the model. In addition, W^o W^q can be interpreted as applying a matrix factorization trick with a small inner dimension d' < d, reducing the total number of parameters. Figure 2 (a)-(c) compares the differences of SRU, SRU with this factorization trick (but without attention), and SRU++ proposed in this section.
The last modification is adding layer normalization (Ba et al., 2016) to each SRU++ layer. In our implementation, we apply normalization after the attention operation and before the matrix multiplication with W^o,
$$
U^\top = W^o \, \mathrm{layernorm}\left(Q + \alpha \cdot A\right)
$$
This implementation is post-layer normalization in which the normalization is added after the residual connection. Alternatively, pre-layer normalization (Xiong et al., 2020) only applies to the nonlinear transformation. While pre-normalization tends to be less sensitive to different learning rates, we use post-normalization for better results following the observations in Liu et al. (2020b). We analyze the effectiveness of layer normalization in Appendix A.2.
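The attention-based projection described in this section can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the released implementation: layer normalization is omitted, a causal mask is added because the model is used for language modeling (the mask is our assumption and is not spelled out in this section), and the names `srupp_projection`, `Wq`, `Wk`, `Wv`, `Wo` are chosen for this example.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def srupp_projection(X, Wq, Wk, Wv, Wo, alpha, causal=True):
    """Compute U for the element-wise recurrence using single-head attention.

    X: (L, d); Wq: (d', d); Wk, Wv: (d', d'); Wo: (3d, d'); alpha: scalar.
    Returns U of shape (L, 3, d).
    """
    L, d = X.shape
    d_attn = Wq.shape[0]
    Q = Wq @ X.T                              # (d', L)
    K = Wk @ Q                                # keys are computed from Q, not X
    V = Wv @ Q                                # values are computed from Q, not X
    scores = (Q.T @ K) / np.sqrt(d_attn)      # (L, L), row i holds scores for query i
    if causal:
        # Assumed for language modeling: position i cannot see positions j > i.
        future = np.triu(np.ones((L, L), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    A = (softmax(scores, axis=-1) @ V.T).T    # (d', L) weighted average of values
    U_T = Wo @ (Q + alpha * A)                # residual connection, then projection
    return U_T.T.reshape(L, 3, d)
```

With `alpha = 0` the function reduces to the factorized linear map `(Wo @ Wq) @ X.T`, which matches the initialization behavior described above.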


Experimental setup
Datasets We evaluate our model on four standard NLP benchmarks.
• ENWIK8 (Hutter, 2006) is a character-level language modeling dataset consisting of 100M tokens taken from Wikipedia. The vocabulary size of this dataset is about 200. We use the standard 90M/5M/5M splits as the training, dev and test sets, and report bits-per-character (BPC) as the evaluation metric.
• WIKI-103 is a word-level language modeling dataset. The training data contains 100M tokens extracted from Wikipedia articles. Following prior work, we use a vocabulary of 260K tokens, and adaptive embedding and softmax layers (Grave et al., 2017; Baevski and Auli, 2019).
• BILLION WORD (Chelba et al., 2013) is one of the largest language modeling datasets, containing 768M tokens for training. Unlike WIKI-103, in which sentences in the same article are treated as consecutive inputs to model long context, the sentences in BILLION WORD are randomly shuffled. Following Baevski and Auli (2019), we use a vocabulary of 800K tokens, adaptive embedding and softmax layers.
• IWSLT'14 De→En is a low-resource machine translation dataset consisting of 170K translation pairs. We use it to showcase that SRU++ can be applied to other tasks such as translation. We follow the same setup as Lin et al. (2020) and other previous work. The dataset uses a shared vocabulary of 14K BPE tokens.
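For readers less familiar with the evaluation metrics used on these datasets, the short sketch below shows the standard conversions from an average negative log-likelihood (in nats per token) to bits-per-character and to perplexity. The function names are ours, and the assumption is that the loss has already been averaged over characters (for BPC) or words (for perplexity).

```python
import math

def bits_per_character(mean_nll_nats):
    """Convert an average per-character loss in nats to bits-per-character."""
    return mean_nll_nats / math.log(2)

def perplexity(mean_nll_nats):
    """Convert an average per-token loss in nats to perplexity."""
    return math.exp(mean_nll_nats)
```

For example, an average character-level loss of ln 2 nats corresponds to exactly 1 BPC.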
Models All our language models are constructed with a word embedding layer, multiple layers of SRU++ and an output linear layer followed by a softmax operation. We use single-head attention in each layer and 10 SRU++ layers for all our models. We use the same dropout probability for all layers and tune this value according to the model size and the results on the dev set. By default, we set the ratio of hidden dimension to attention dimension to d : d' = 4 : 1. We report additional analysis and tune this ratio for best results in Section 5 and Appendix A. For simplicity, SRU++ does not use recent techniques that are shown useful to Transformer such as multi-head attention, compressed memory (Rae et al., 2020), relative position (Shaw et al., 2018; Press et al., 2021), nearest-neighbor interpolation (Khandelwal et al., 2020) and attention variants to handle very long context (Sukhbaatar et al., 2019a). We compare with previous Transformer models that incorporate one or several of these techniques. However, we do not compare with results that use additional data or dynamic evaluation (Graves, 2013; Krause et al., 2018), for a fair comparison between all models.
Optimization We use RAdam (Liu et al., 2020a) with the default β values as our optimizer. RAdam is a variant of the Adam optimizer (Kingma and Ba, 2014) that is reported to be less sensitive to the choice of learning rate and warmup steps while achieving similar results at the end. We use a fixed weight decay of 0.1 and an initial learning rate of 0.0003 in our experiments. These values are selected based on the ENWIK8 dev set and used for other tasks. See Appendix A.3 for more details. We use a cosine learning rate schedule following Dai et al. (2019). We do not change the initial learning rate unless otherwise specified. See Appendix B for the detailed training configuration of each model.
Figure 3: Dev BPC vs. total GPU hours used on ENWIK8 for each model. Using automatic mixed precision (amp) and only one attention sub-layer achieves 16x reduction. To compute the dev BPC, the maximum attention length is the same as the unroll size M during training.
Each training batch contains B sequences (i.e. the batch size) and M consecutive tokens for each sequence (i.e. the unroll size), which gives an effective size of B × M tokens per batch. Following standard practice, the previous training batch is provided as additional context for attention, which results in a maximum attention length of 2 × M. For the ENWIK8 and WIKI-103 datasets, the training data is partitioned into B chunks by concatenating articles and ignoring the boundaries between articles. For the BILLION WORD dataset, we follow Dai et al. (2019) and concatenate sentences to create the training batches. Sentences are randomly shuffled and separated by a special token <s> indicating sentence boundaries.
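The batching scheme above can be sketched as follows. This is a simplified illustration with a hypothetical helper name (`make_batches`): it splits one long token stream into B parallel chunks, walks over them M tokens at a time, and hands the previous batch back as extra attention context. Next-token targets and the handling of leftover tokens are omitted, and trailing tokens that do not fill a chunk are simply dropped, which is an assumption of this sketch.

```python
def make_batches(tokens, B, M):
    """Yield (context, current) pairs of batches from a single token stream.

    tokens: list of token ids; B: batch size; M: unroll size.
    `current` is a list of B sequences with up to M tokens each;
    `context` is the previous `current` (None for the first batch).
    """
    chunk_len = len(tokens) // B                 # leftover tokens are dropped
    chunks = [tokens[i * chunk_len:(i + 1) * chunk_len] for i in range(B)]
    context = None
    for start in range(0, chunk_len, M):
        current = [chunk[start:start + M] for chunk in chunks]
        yield context, current
        context = current                        # reused as attention memory
```

Each position can then attend to at most M context tokens plus M current tokens, which gives the maximum attention length of 2 × M mentioned in the text.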

Results
Does recurrence improve upon an attention-only model? We first conduct a comparison with the Transformer-XL model (Dai et al., 2019) [...] attention context length to 2048 for testing, similarly to the Transformer-XL baseline.
Merity (2019) demonstrated that using a single attention layer with LSTM retains most of the modeling capacity compared to using multiple attention layers. We conduct a similar analysis to understand how much attention is needed in SRU++. To do so, we only enable attention every k layers. The layers without attention become the variant with dimension projection illustrated in Figure 2 (b). Note that k = 1 gives the default SRU++ model with attention in every layer, and k = 10 means only the last layer has attention in a 10-layer model. Table 2 presents the results by varying k. Our base model is the same 10-layer SRU++ model in Table 1. We see that using 50% less attention (k = 2) results in almost no increase in test BPC. Moreover, using only a single attention module (k = 10) leads to a marginal loss of 0.01 BPC but reduces the training time by 40%. Our results still outperform the Transformer-XL model and the single-headed attention LSTM (Merity, 2019) by a large margin of 0.03 BPC. Figure 3 showcases the training efficiency of our model. SRU++ is
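As a small illustration of the "attention every k layers" setting discussed above, the helper below lists which layers keep attention. The text fixes two cases (k = 1 enables every layer, k = 10 enables only the last of 10 layers); the rule used here for intermediate values, keeping layers whose 1-based index is a multiple of k, is our assumption and the function name is hypothetical.

```python
def attention_layers(num_layers, k):
    """Return the 1-indexed layers that use attention when it is enabled every k layers.

    Assumed rule: layer i keeps attention when i is a multiple of k, so the
    last layer always has attention whenever k divides num_layers.
    """
    return [i for i in range(1, num_layers + 1) if i % k == 0]
```

Under this rule, `attention_layers(10, 2)` keeps half of the layers and `attention_layers(10, 10)` keeps only the top layer.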
Where to use attention? Next, we analyze if the location of attention in SRU++ makes a non-trivial difference. Figure 4 (top) compares the results by enabling attention in only one of the SRU++ layers. Applying attention in the first (bottom) layer achieves a significantly worse result. We believe this is due to the lack of positional information for attention, since SRU++ does not use positional encoding. Enabling attention in subsequent layers gives much better and comparable results because recurrence can encode positional information. Moreover, SRU++ consistently achieves worse results when the attention is moved to a lower layer closer to the input embedding. We also enable a second attention layer while fixing the first one in the 10th layer. The corresponding results are shown in Figure 4 (bottom). Similarly, SRU++ achieves worse results if the attention is added to one of the lower layers. In contrast, results are comparable once the attention is placed in a high-enough layer. These observations suggest that the model should first learn local features before attention can play its most effective role in capturing long-range dependencies. More analyses can be found in Appendix A.
Does the ratio d : d' matter? Transformer models by default use a FFN dimension that is 4 times larger than the attention dimension (Vaswani et al., 2017). We analyze the ratio of recurrence dimension d to attention dimension d' for SRU++. A small value of d' can reduce the amount of computation and the number of parameters used in attention layers but may limit the modeling capacity. Table 4 compares the results of using different d : d' ratios given a similar amount of model parameters. We fix the model size to around 108M and use 10 SRU++ layers. Changing this ratio from 4 to a higher value gives better results. The best dev result is obtained with a ratio of 8.
Given this observation, we report SRU++ results using a default ratio of 4 as well as a ratio of 8 in the subsequent result sections. This ensures we conduct a comparison that uses a setup similar to the default of Transformer models, but also showcases the stronger results SRU++ can achieve.
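To see how the attention dimension affects the parameter budget, the following sketch counts the weight-matrix parameters of the input projection in one layer, using the shapes given in the SRU++ section (a d'-by-d query matrix, two d'-by-d' key and value matrices, and a 3d-by-d' output matrix) against the 3d² parameters of the original SRU projection. Gate vectors, biases, the scalar α and layer normalization parameters are ignored, and the hidden size d = 2048 in the usage note is a hypothetical value chosen only for illustration.

```python
def sru_projection_params(d):
    """Weight parameters of the original SRU input projection (three d x d matrices)."""
    return 3 * d * d

def srupp_projection_params(d, d_attn):
    """Weight parameters of the SRU++ attention projection in one layer."""
    query = d_attn * d              # W^q: d' x d
    key_value = 2 * d_attn * d_attn  # W^k and W^v: d' x d' each
    output = 3 * d * d_attn         # W^o: 3d x d'
    return query + key_value + output
```

For instance, with d = 2048 a ratio of 4 (d' = 512) gives 4,718,592 projection weights per layer and a ratio of 8 (d' = 256) gives 2,228,224, compared with 12,582,912 for the unfactorized SRU projection. This is why a larger ratio allows a wider recurrence at the same total model size.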
ENWIK8 Table 3 [...]
Inference speed Table 7 compares the inference speed of SRU++ with other top-performing models on the WIKI-103 test set. We use a single V100 GPU for inference. Our large model runs at least 4.5x faster than all baseline models except Shortformer (Press et al., 2021). In addition, our model achieves a perplexity 0.9-1.1 lower than Shortformer and runs 50% faster when using 2 attention layers (k = 5).
IWSLT Does SRU++ work well for other tasks? We study this question by evaluating SRU++ on the IWSLT'14 De→En translation task.
We use the open-sourced training and evaluation code of Lin et al. (2020). The base model is an 8-layer Transformer model containing 20M parameters. We train SRU++ models using 6 layers and d = 1024, resulting in a similar number of parameters. We use the original settings such as learning rate and batch size, except that we use the RAdam optimizer for consistency and increase the number of training epochs to 50. Both architectures achieve much higher BLEU scores given more training epochs.³ Table 8 presents the test results. Without additional hyperparameter tuning, SRU++ achieves a BLEU score 0.4 higher and requires less training time compared to the Transformer model tuned in Lin et al. (2020).
³ Lin et al. (2020) report a test BLEU of 35.2. We obtain 35.9 for the same Transformer model by training longer.
Why does SRU++ reduce training cost in our experiments? Several factors contribute to the computation reduction observed in our experiments. First, combining attention and recurrence gives stronger modeling capacity. As shown in our experiments, SRU++ often achieves comparable results using fewer layers and/or fewer parameters. The required computation is much lower for shallower and smaller models. We also observe higher training efficiency, requiring fewer training steps and smaller training batches compared to several Transformer models. For example, SRU++ uses a maximum effective batch size of 98K tokens and 800K training steps on the BILLION WORD dataset, while the Transformer model in comparison (Baevski and Auli, 2019) uses 128K tokens and nearly 1000K steps. The reduced batch size and number of gradient updates cut down the training cost.
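A back-of-the-envelope calculation makes the batch-size and step-count comparison concrete. The sketch below multiplies tokens per batch by training steps to get the total number of tokens processed. It treats "K" as exactly 1,000 and the Transformer's "near 1000K" steps as 1,000,000, both of which are simplifying assumptions, and tokens processed is only a rough proxy for cost because the per-token cost of the two models differs.

```python
def total_tokens(tokens_per_batch, steps):
    """Total number of training tokens processed over a full run."""
    return tokens_per_batch * steps

# Figures quoted in the text for the BILLION WORD dataset (rounded, see assumptions above).
srupp_tokens = total_tokens(98_000, 800_000)
transformer_tokens = total_tokens(128_000, 1_000_000)
token_ratio = srupp_tokens / transformer_tokens

print(f"SRU++:       {srupp_tokens:.3e} tokens")
print(f"Transformer: {transformer_tokens:.3e} tokens")
print(f"ratio:       {token_ratio:.4f}")
```

Under these assumptions SRU++ processes roughly 61% as many tokens during training, before accounting for any per-step speed difference.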
Finally, model implementation is an important factor for computation saving. Our implementation is highly efficient for two reasons. First, the fast recurrence operation of SRU is a reusable module that is already optimized for speed (Lei et al., 2018). Second, since recurrence encodes positional information, we can use simple single-head attention and remove positional encoding.
In contrast, advanced attention and positional encoding mechanisms can generate non-trivial computation overhead. To see this, we measure the running time of SRU++ and Transformer-XL using the PyTorch Profiler. Figure 5 (a) shows the average model forward time of a single batch. SRU++ runs 4-5x faster compared to the Transformer-XL implementation. Figure 5 (b) breaks down the computation and highlights the most time-consuming operations in both models. The matrix multiplications are among the most expensive operations for both models. Surprisingly, many operations in the relative attention of Transformer-XL are computationally expensive. For example, the relative attention requires shifting the attention scores and adding up different attention score matrices. Both require a lot of time but they are not needed in non-relative attention. In addition, the last column shows the running time of tensor transpose operators needed by batch matrix-matrix multiplications in attention. Again, the relative attention uses an order of magnitude more time compared to the simple single-head attention used in our model implementation.⁴

Related Work
Accelerating common architectures for NLP has become an increasingly important research topic recently (Tay et al., 2020; Sun et al., 2020). Our work is closely related to two lines of research under this topic.
First, previous works have tackled the speed problem of recurrent neural networks (RNNs) and have proposed various fast RNN implementations (Diamos et al., 2016; Campos et al., 2018; Zhang and Sennrich, 2019). Notably, the Quasi-RNN and SRU (Lei et al., 2018) have introduced highly parallelizable recurrence and combined it with convolutions or highway networks respectively. The resulting architectures achieve parallelism equivalent to that of convolutional and attention models. This advancement removes the need to avoid recurrence for the sake of training efficiency, a design choice made by the Transformer architecture. Our model builds on top of SRU.

Second, several recent works have argued that using attention alone is not the best architecture in terms of model expressiveness. For example, Dong et al. (2021) demonstrate theoretically and empirically that using pure attention results in performance degeneration. Gulati et al. (2020) have combined convolution and attention and obtained new state-of-the-art results for speech recognition. Moreover, RNNs have been incorporated into Transformer architectures, resulting in improved results in machine translation and language understanding tasks (Lei et al., 2018; Huang et al., 2020). Our work is built upon a similar hypothesis that recurrence and attention are complementary for sequence modeling. We demonstrate that jointly leveraging fast recurrence and attention not only achieves state-of-the-art modeling results but also obtains significant computation reduction.
Orthogonal to our work, many recent works improve the efficiency of Transformer models by accelerating attention computation (Zaheer et al., 2020; Katharopoulos et al., 2020; Vyas et al., 2020; Peng et al., 2021). Examples include Longformer (Beltagy et al., 2020), Reformer (Kitaev et al., 2020), Linformer and Routing Transformer. In contrast, our work optimizes computational efficiency using recurrence combined with minimal attention, and our model can incorporate these attention variants for additional speed improvement.

Conclusion
We present a highly-efficient architecture combining fast recurrence and attention, and evaluate its effectiveness on various language modeling datasets. We demonstrate that fast RNNs with little attention not only achieve top results but also reduce training cost significantly. Our work takes a different approach from accelerating attention, and therefore provides an orthogonal direction for advancing state-of-the-art model architectures. As future work, we believe the model can be improved using stronger attention or recurrent implementations, and better normalization or optimization techniques.

A Additional results
A.1 Detailed analysis of attention
Table 10 presents a more comprehensive analysis of attention in SRU++ models. First, we change the number of attention layers and their locations in the model. As shown in the top block of Table 10, using attention in 50% of the layers leads to no (or negligible) loss in model performance. This is consistent with the results in Table 2 using a smaller model. Enabling attention in higher layers performs slightly better than evenly distributing attention from the bottom to top layers. We also experiment with using more than one attention head in each attention layer, as shown in the middle block of the table. Unlike Transformer models, however, we do not observe a significant improvement using multiple heads. We hypothesize that the recurrence states can already carry different features or information that are present in different input positions, making redundant heads unnecessary.
Finally, changing the ratio d : d' from 4 to 8 gives similar improvements regardless of using 2 attention layers or 10 attention layers. This suggests that the amount of attention and the hidden size ratio can be tuned independently for best model performance.

A.2 The effectiveness of layer normalization
In our experiments, we have always used layer normalization to stabilize training. However, we also found layer normalization to achieve worse generalization for larger models that are more prone to over-fitting. Figure 6 showcases our empirical observation on the ENWIK8 dataset. Using layer normalization achieves more rapid training progress and lower training loss, but results in higher dev loss in the case of training a 108M model. This generalization gap remains even if we tune the dropout rate carefully. In addition, although using layer normalization in the smaller model with 41M parameters gives slightly better dev results, we still observe a larger generalization gap (indicated by the difference between training loss and dev loss) compared to the run without layer normalization. Similar over-fitting patterns are observed on the WIKI-103 dataset, and also in previous work (Xu et al., 2019).
On the other hand, turning off layer normalization can achieve better generalization but makes training sensitive to learning rate and parameter initialization. For example, we have to use a smaller learning rate of 0.00025 or lower to avoid sudden gradient explosion during training. These results suggest possible future work on improving the normalization method (Shen et al., 2020; Brock et al., 2021).

A.3 Tuning weight decay and learning rate
We find that tuning the weight decay and learning rate is critical to the success of training SRU++ and achieving the best results. Table 9 provides a sensitivity analysis by testing different learning rates and weight decay values. Increasing the weight decay consistently gives better results for all learning rates tested. Tuning the learning rate is also needed to reach the best result. The non-trivial effect of weight decay seems to be unique for SRU++.
On the other hand, the performance of SRU++ remains robust once the appropriate weight decay and learning rate are set. As shown in previous results and analyses, SRU++ achieves strong and relatively stable results across various hidden sizes, numbers of attention layers and datasets. In particular, using the same weight decay value generalizes well for all datasets (including language modeling and translation tasks) and model configurations tested.
Learning rate    Weight decay 0.10    Weight decay 0.01    Weight decay 0.00
3 × 10^-4        1.014                -                    -
2 × 10^-4        1.022                1.035                1.047
1.5 × 10^-4      1.030                1.038                1.040
Table 9: Dev BPC of SRU++ given a learning rate ∈ {1.5, 2, 3} × 10^-4 and a weight decay ∈ {0.1, 0.01, 0}. '-' means the training run diverged or got gradient explosion.

B Training details
Language modeling
We use the RAdam optimizer⁵ with the default hyperparameters β1 = 0.9 and β2 = 0.999 for all our experiments. We use a cosine learning rate schedule with only 1 cycle for simplicity. For faster training, we also leverage the native automatic mixed precision (AMP) training and distributed data parallel (DDP) of PyTorch in all experiments, except those in Table 1 and Figure 1 for a fair comparison with the Transformer-XL implementation.
Table 11 shows the detailed training configuration of SRU++ models on the ENWIK8 dataset. Most training options are kept the same for all models. We tune the dropout probability more carefully as we found training is more prone to over-fitting and under-fitting for this dataset. The large model is trained with 2x batch size. As a result, we increase the learning rate proportionally by a factor of √2 (Hoffer et al., 2017), which results in a rounded learning rate of 0.0004.
Table 12 presents the detailed training configuration on the WIKI-103 dataset. Similarly, we use d = 3072 and d = 4096 for the base and large model respectively, with a hidden size ratio d : d' = 4 : 1. Following Baevski and Auli (2019), we use an adaptive word embedding layer and an adaptive softmax layer for our models, and we tie the weight matrices of the two layers. We keep the total number of parameters comparable when we use a different hidden size ratio d : d' = 8 : 1.
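The learning rate adjustment for the large ENWIK8 model, and a one-cycle cosine schedule, can be written out as a short sketch. The square-root scaling follows the text (0.0003 × √2 ≈ 0.00042, rounded to 0.0004). The exact form of the cosine schedule is not given in this section, so the version below, which decays from the initial rate to zero with no warmup, is an assumption, and both function names are hypothetical.

```python
import math

def scale_lr(base_lr, batch_multiplier):
    """Square-root learning rate scaling when the batch size grows by `batch_multiplier`."""
    return base_lr * math.sqrt(batch_multiplier)

def cosine_lr(step, total_steps, initial_lr):
    """One-cycle cosine decay from `initial_lr` to 0 (assumed form, no warmup)."""
    progress = step / total_steps
    return 0.5 * initial_lr * (1.0 + math.cos(math.pi * progress))

large_model_lr = round(scale_lr(0.0003, 2), 4)  # 2x batch size, rounded to 4 decimals
```

Here `large_model_lr` evaluates to 0.0004, matching the rounded value reported for the large model.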

Machine translation
We use the open-sourced code from Lin et al. (2020) for the IWSLT'14 De→En translation task. The Transformer model tuned by the original work uses 8 layers for both the encoder and decoder and a total of 20M parameters. Most of the training configuration remains the same as the original work⁶, except for a couple of changes. First, we use the RAdam optimizer and the same β values for consistency with the language modeling task. We use the same weight decay value of 0.1 for SRU++. The Transformer model uses a weight decay of 0 that is tuned based on dev set performance. Second, we increase the number of training epochs to 50 (or equivalently 64K training steps) since all models achieve better BLEU scores by training longer. This ensures we compare models when they reach their maximum performance.
Our SRU++ model uses a hidden size d = 1024, an attention size d' = 256 and 6 layers for the encoder and decoder, resulting in a similar number of parameters as the Transformer model in comparison. Let X_src be the output representation of the SRU++ encoder. Each SRU++ decoder layer makes use of X_src by simply treating it as extra attention context. That is, the query, key and value representations are computed by concatenating the input of the current layer X_tgt with X_src,
$$
Q = W^q \left[X_{src}, X_{tgt}\right]^\top, \qquad K = W^k Q, \qquad V = W^v Q
$$
The resulting representations Q_tgt, K and V are used for the rest of the attention computation. The attention mask is set such that each target token can only attend to all source tokens and preceding target tokens.
⁶ https://github.com/asappresearch/imitkd/blob/master/configs/iwslt/teacher.yaml
Figure 6: Understanding the empirical effect of layer normalization. We show the training and dev loss of SRU++ models using 41M parameters and 108M parameters on the ENWIK8 dataset. The model with layer normalization fits the training data better, but achieves worse generalization.
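The decoder attention mask described in the machine translation setup can be illustrated with a small helper. The sketch assumes that keys are ordered as all source positions followed by all target positions, and that "preceding target tokens" includes the current position, as in a standard causal mask. Both are assumptions of this example rather than details confirmed by the text, and the function name is hypothetical.

```python
def decoder_attention_mask(n_src, n_tgt):
    """Build a boolean mask of shape (n_tgt, n_src + n_tgt).

    mask[i][j] is True when target position i may attend to key position j.
    Assumed key order: source tokens first, then target tokens.
    """
    mask = []
    for i in range(n_tgt):
        source_part = [True] * n_src                    # every source token is visible
        target_part = [j <= i for j in range(n_tgt)]    # causal over target tokens
        mask.append(source_part + target_part)
    return mask
```

With 2 source tokens and 3 target tokens, the first target position sees both source tokens and itself, while the last target position sees every key.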