FastSeq: Make Sequence Generation Faster

Transformer-based models have made tremendous impacts in natural language generation. However, the inference speed is a bottleneck due to the large model size and the intensive computing involved in the auto-regressive decoding process. We develop the FastSeq framework to accelerate sequence generation without accuracy loss. The proposed optimization techniques include an attention cache optimization, an efficient algorithm for detecting repeated n-grams, and an asynchronous generation pipeline with parallel I/O. These optimizations are general enough to be applicable to Transformer-based models (e.g., T5, GPT2, and UniLM). Our benchmark results on a set of widely used and diverse models demonstrate a 4x to 9x inference speed gain. Additionally, FastSeq is easy to use with a simple one-line code change. The source code is available at https://github.com/microsoft/fastseq.


Introduction
Transformer-based model architectures have made tremendous impact in multiple domains. However, due to the large model size and the intensive computing involved in the decoding process, the inference speed is still a bottleneck for applications with long sequences (Wu et al., 2016; Tay et al., 2020). A variety of model architectural innovations have been proposed to increase generation speed from different perspectives. One trend is to change the model architecture, for example via model distillation (Shleifer and Rush, 2020) and sparse attention (Beltagy et al., 2020). Although these techniques can alleviate the performance issue, there may still be a tradeoff between model accuracy and speed. On the other hand, efficient infrastructures have been developed to accelerate the inference speed, e.g., TensorRT (Vanholder, 2016) and FasterTransformers.
In this paper, we present the FastSeq framework to make sequence generation faster. FastSeq can accelerate sequence generation by 4x to 9x with a simple one-line code change for models in FairSeq (Ott et al., 2019) and HuggingFace-Transformers (Wolf et al., 2020). The design principle of FastSeq is to improve the inference speed without losing model accuracy or usability.
Our optimization approaches include an attention cache optimization, an efficient algorithm for detecting repeated n-grams, and an asynchronous generation pipeline with parallel I/O. These optimizations are general enough for a wide range of Transformer-based model (Vaswani et al., 2017) architectures, including the encoder-decoder architecture (e.g., T5 (Raffel et al., 2020), BART (Lewis et al., 2020), ProphetNet (Qi et al., 2020)), the decoder-only architecture (e.g., GPT2 (Radford et al., 2019)), and the encoder-only architecture (e.g., UniLM (Dong et al., 2019)). FastSeq is also designed to be flexible for extension to support other models and frameworks. Our technologies are partially adopted by FairSeq. A demo video can be found at https://www.youtube.com/watch?v=jrdsEUxhSEE.

Preliminary Analysis
For models of similar size, sequence generation is much slower than classification, regression, or language score computation. Why is generation so time-consuming? Before analyzing the reasons, let's first recap the generation algorithms.

Generation Algorithms
The encoder-decoder structure is used in the most competitive models for sequence-to-sequence generation. The encoder takes an input sequence of symbol representations $(x_1, ..., x_n)$ and outputs a sequence of continuous representations $z = (z_1, ..., z_n)$. The decoder then generates an output sequence $(y_1, ..., y_t)$ one element at a time. At each step, the model is auto-regressive: it consumes the previously generated symbols and computes probability scores to select the next element. Greedy search and beam search are two popular algorithms for this selection. The difference between them is that, at each step, greedy search selects only the single candidate with the maximum score, while beam search selects the top k candidates as beams. As beam search maintains multiple beams during generation, it usually produces a better result than greedy search.
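To make the difference concrete, here is a minimal sketch of the two selection rules, assuming log-probability scores over the vocabulary (names are illustrative):

```python
import torch

def next_candidates(scores: torch.Tensor, beam_size: int):
    # scores: (num_beams, vocab) log-probabilities for the next token.
    vocab = scores.size(1)
    # Greedy search: a single hypothesis keeps its single best token.
    greedy_token = scores[0].argmax()
    # Beam search: keep the top-k candidates over all (beam, token) pairs.
    top_scores, flat_idx = scores.view(-1).topk(beam_size)
    beam_idx, token_idx = flat_idx // vocab, flat_idx % vocab
    return greedy_token, (top_scores, beam_idx, token_idx)
```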
To avoid repeated computation in the attention layer, the key ($K$) and value ($V$) from the previous and current steps are usually cached for computing the next token. Equation (1) describes how self-attention with the cache mechanism is implemented at step $t$.
$$
\begin{aligned}
Q_t &= y_{t-1} \cdot W_q \\
K_t &= \mathrm{concat}(\mathrm{Cache}_{K_{t-1}},\; y_{t-1} \cdot W_k) \\
V_t &= \mathrm{concat}(\mathrm{Cache}_{V_{t-1}},\; y_{t-1} \cdot W_v) \\
attn_t &= \mathrm{softmax}\!\left(\frac{Q_t K_t^T}{\sqrt{d_k}}\right) \cdot V_t
\end{aligned} \tag{1}
$$

where $B$ is the batch size; $M$ is the beam size; $D$ is the embedding dimension; $Q_t$, $K_t$, $V_t$ represent the query, key, and value respectively, with $Q_t$ of shape $\mathbb{R}^{(B \times M) \times 1 \times D}$ and $K_t$, $V_t$ of shape $\mathbb{R}^{(B \times M) \times t \times D}$; $W_q$, $W_k$, $W_v$ are the weights for the query, key, and value, of shape $\mathbb{R}^{D \times D}$; $attn_t$ is of shape $\mathbb{R}^{(B \times M) \times 1 \times D}$. To simplify the equations, we do not consider multiple heads here, but they can be adjusted to the multi-head style.

Figure 1a shows the profiling results of running the official BART model implemented by FairSeq v0.0.9 on the CNN/DailyMail dataset with default parameters (batch size 32, beam size 4, and no-repeat n-gram size 3). It indicates that maintaining the cache, blocking n-gram repeats, and post-process each individually take longer than the decoding itself; together, these non-computation parts cost more than 80% of the generation time. We analyze these time-consuming components below.
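For concreteness, a minimal PyTorch sketch of the cached decoding step in Equation (1), assuming single-head attention (function and argument names are illustrative):

```python
import torch

def attention_step_with_cache(y_prev, cache_k, cache_v, w_q, w_k, w_v, d_k):
    # y_prev:           (B*M, 1, D)   embedding of the previously generated token
    # cache_k, cache_v: (B*M, t-1, D) keys/values cached from earlier steps
    # w_q, w_k, w_v:    (D, D)        projection weights
    q_t = y_prev @ w_q                                  # (B*M, 1, D)
    k_t = torch.cat([cache_k, y_prev @ w_k], dim=1)     # (B*M, t, D)
    v_t = torch.cat([cache_v, y_prev @ w_v], dim=1)     # (B*M, t, D)
    probs = torch.softmax(q_t @ k_t.transpose(1, 2) / d_k ** 0.5, dim=-1)
    attn_t = probs @ v_t                                # (B*M, 1, D)
    return attn_t, k_t, v_t  # k_t, v_t become the cache for the next step
```

Note how the concat and the per-step cache reordering (not shown) operate on tensors whose leading dimension is B*M: this is exactly the cost that grows with the beam size.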

Bottlenecks in Generation
Cache Maintenance: Along with better generation results, beam search introduces significant additional computation and memory cost. As Equation (1) indicates, the sizes of $X_t$, $Q_t$, $K_t$, $V_t$, and $attn_t$ in beam search are $M$ times larger than those in greedy search. This results in more memory consumption, larger matrix operations (e.g., concat), and more expensive cache maintenance (e.g., reordering the top-k beams and the cached key and value at each step). Moreover, the batch size is constrained by the large occupied memory, which results in low GPU utilization.
Block N-Gram Repeats: Blocking n-gram repeats is a widely used operation to prevent an n-gram from appearing more than once in natural language generation (Paulus et al., 2018; Klein et al., 2017). It prohibits the repetitive generation of n-grams by setting their probability scores to zero. However, conventional implementations often scan the text sequentially and move data between GPU and CPU frequently, and the time complexity is quadratic in the sequence length. When processing long sequences, this operation becomes another bottleneck, as the sketch below illustrates.
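A sketch of the conventional per-hypothesis scan (illustrative, not FairSeq's exact code): at each step, the current (n-1)-token context is compared against every earlier position, which is quadratic over the whole generation.

```python
def block_ngram_repeats_naive(tokens, probs, n):
    # tokens: list of generated token ids for one hypothesis (typically on CPU)
    # probs:  vocab-sized probability tensor for the current step
    context = tokens[-(n - 1):]
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n - 1] == context:
            probs[tokens[i + n - 1]] = 0.0  # ban the token completing the repeat
    return probs
```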
Post-process: This step handles detokenization and writing the final results. Its performance is largely restricted by two factors: frequent exchanges of small data between GPU and CPU, and the detokenization efficiency. In addition, in a synchronized pipeline, post-process blocks the generation of the next batch of samples, although there is no required dependency between these two components.

Design
In order to address the above bottlenecks, optimizations need to be done at multiple levels, including operations, models, and pipelines, which touch essentially every component of a sequence generation framework. This is a non-trivial burden for researchers and practitioners. We therefore develop the FastSeq library to remove these barriers and speed up end-to-end inference in sequence generation. FastSeq is designed with the following features: (i) it speeds up the inference of sequence models without any accuracy loss; (ii) it offers easy-to-use Python APIs compatible with FairSeq and HuggingFace-Transformers; (iii) it is flexible enough to be extended to support new models and frameworks.
FastSeq is written in PyTorch (Paszke et al., 2019) and composed of (1) the ops module, which provides efficient implementations of kernels (e.g., block n-gram repeats); (2) the optimizer module, which optimizes model implementations at run time: more efficient implementations are automatically patched in to replace the ones in existing NLP toolkits (e.g., FairSeq and HuggingFace-Transformers) or deep learning libraries (e.g., PyTorch); (3) the models module, which defines model architectures (e.g., ProphetNet, UniLM); notably, the models in FairSeq and HuggingFace-Transformers are natively supported as well, and only a one-line code change is needed to make them work with FastSeq; (4) the command line interfaces (CLIs) module, which runs inference via commands with an asynchronous pipeline, including preprocess (e.g., tokenization), generation, and post-process (e.g., detokenization). These CLIs are compatible with FairSeq and HuggingFace-Transformers as well, so users can use the same parameters to run their end-to-end inference.
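As an illustration of the run-time patching performed by the optimizer module, a minimal sketch (the patching code is illustrative, not FastSeq's implementation; fairseq.sequence_generator.SequenceGenerator is FairSeq's real generator class):

```python
import importlib

def apply_patches():
    # Swap the toolkit's generator class for an optimized subclass at run
    # time, so user code importing it afterwards picks up the fast path.
    mod = importlib.import_module("fairseq.sequence_generator")

    class OptimizedSequenceGenerator(mod.SequenceGenerator):
        pass  # e.g., de-duplicated cache, GPU-based n-gram blocking

    mod.SequenceGenerator = OptimizedSequenceGenerator
```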
FastSeq is designed to be easy to use. Existing model usages (e.g., model content and parameter settings) in FairSeq and HuggingFace-Transformers do not need to be changed. Example code is shown below:

• Python API
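A minimal sketch adapted from the repository's FairSeq-based example (the model choice and sampling parameters are illustrative); the only FastSeq-specific line is the import, which must come before the toolkit imports so that its run-time patches apply:

```python
import fastseq  # the one-line change: patches FairSeq/Transformers on import
import torch

# Unchanged FairSeq usage below.
bart = torch.hub.load('pytorch/fairseq', 'bart.large.cnn')
bart.cuda().eval().half()
outputs = bart.sample(
    ['FastSeq makes sequence generation faster.'],
    beam=4, lenpen=2.0, max_len_b=140, min_len=55, no_repeat_ngram_size=3)
print(outputs)
```

• Command Line Interface

The CLIs mirror the toolkit commands; for example, fastseq-generate-for-fairseq accepts the same parameters as fairseq-generate.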

Optimizations
To address the bottlenecks discovered in Section 2.2, we develop the following optimizations.

Attention Cache Optimization
This section introduces how the cache for the key and value in self-attention and encoder-decoder attention can be optimized to further speed up inference. We describe the cache deduplication below; a more comprehensive analysis and a new attention method with faster speed can be found in our work EL-Attention (Yan et al., 2021).

Cache Optimization in Self-Attention
For decoder-only or encoder-only Transformer models (e.g., GPT2, UniLM), $X$ is the prefix of the generated hypothesis. In conventional implementations, $X$ is replicated along the beam dimension, and the corresponding part of the key ($K$) and value ($V$) is the same for each beam. This means that the cached key and value for $X$, of size $B \times M \times N \times D$ (where $N$ is the length of $X$, $B$ is the batch size, $M$ is the beam size, and $D$ is the embedding dimension), store $M$ identical copies along the beam dimension. To optimize the cache in self-attention, we can split the cached key and value in Equation (1) into two parts: $\mathrm{Cache}_K$ and $\mathrm{Cache}_V$ for the prefix, and $\mathrm{Cache}_{K_t}$ and $\mathrm{Cache}_{V_t}$ for the sequence generated up to step $t$. With this split, the size of $\mathrm{Cache}_K$ and $\mathrm{Cache}_V$ can be reduced by $M$ times, from $B \times M \times N \times D$ to $B \times 1 \times N \times D$. This also decreases the cache reorder complexity by a factor of $M$. However, the split results in incompatible shapes between $\mathrm{Cache}_K$ and $\mathrm{Cache}_{K_t}$, and between $\mathrm{Cache}_V$ and $\mathrm{Cache}_{V_t}$. Instead of reshaping these cached keys and values, einsum is utilized to compute $attn_t$. This way, the expensive concat operations on large tensors can be avoided.
With the above changes, the matrix operations are conducted on much smaller tensors, so the peak memory is smaller, the operations run faster, and a larger batch size can be leveraged. For example, at step $t$, the concat operations in Equation (1) only touch the much smaller $\mathrm{Cache}_{K_t}$ and $\mathrm{Cache}_{V_t}$ rather than the full key and value, so they run much quicker than before due to less GPU memory allocation, copy, and deallocation. The peak memory during concat is largely reduced as well. Meanwhile, this implementation saves data movement when reordering the beams, since only $\mathrm{Cache}_{K_{t-1}}$ and $\mathrm{Cache}_{V_{t-1}}$ need to be reordered; $\mathrm{Cache}_K$ and $\mathrm{Cache}_V$ are de-duplicated along the beam dimension and do not need to be reordered at all. A sketch of the resulting computation is shown below.
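A minimal PyTorch sketch of the split-cache computation (mirroring Equation (3) in Appendix A), assuming single-head attention; names and the exact shape layout are illustrative:

```python
import torch

def self_attention_step_dedup(q_t, cache_k, cache_v, cache_k_t, cache_v_t, d_k):
    # q_t:                  (B*M, 1, D)  current-step query
    # cache_k, cache_v:     (B, 1, N, D) de-duplicated prefix cache (shared by beams)
    # cache_k_t, cache_v_t: (B*M, t, D)  per-beam cache for generated tokens
    B = cache_k.size(0)
    M = q_t.size(0) // B
    q = q_t.reshape(B, M, 1, -1)
    # Scores against the shared prefix, without expanding it M times.
    w0 = torch.einsum("bmqd,bxnd->bmqn", q, cache_k).reshape(B * M, 1, -1)
    # Scores against the per-beam generated tokens.
    w1 = torch.bmm(q_t, cache_k_t.transpose(1, 2))
    probs = torch.softmax(torch.cat([w0, w1], dim=-1) / d_k ** 0.5, dim=-1)
    p0, p1 = probs.split([cache_k.size(2), cache_k_t.size(1)], dim=-1)
    # Weighted sums: einsum against the shared prefix, bmm against the rest.
    a0 = torch.einsum("bmqn,bxnd->bmqd", p0.reshape(B, M, 1, -1), cache_v)
    a1 = torch.bmm(p1, cache_v_t)
    return a0.reshape(B * M, 1, -1) + a1  # (B*M, 1, D)
```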

Cache Optimization in Encoder-Decoder Attention
The cached key and value in the encoder-decoder attention also contain duplication. The reason is that the key and value in the encoder-decoder attention are calculated from the final output hidden state ($S$) of the encoder, so the elements of the cached key and value along the beam dimension are the same. Therefore, the size of $\mathrm{Cache}_K$ and $\mathrm{Cache}_V$ can be reduced by $M$ times, from $B \times M \times N \times D$ to $B \times 1 \times N \times D$. The optimization benefits mentioned in Section 4.1.1 can then be achieved here as well, including peak memory reduction and a larger batch size. Additionally, the cached key and value do not need to be frequently reordered, since the elements along the beam dimension are exactly the same. Notably, the above proposed optimizations are general and can be applied to a variety of models with different architectures if they share the following features: 1) attention-based architectures, including self-attention or encoder-decoder attention; 2) auto-regressive decoding based on beam search.

Algorithm 1: GPU version of the no-repeat-ngram algorithm, with arguments: n-gram length n, previously generated tokens tokens, and the current-step token probability distribution probs.
The detailed implementations of the optimized self-attention and encoder-decoder attention are provided in the Appendix.

GPU-based Block N-Gram Repeats Algorithm
As observed in Figure 1a, the cost of the block n-gram repeats operation is as high as 25% of the generation time. To reduce this cost, a new GPU-based kernel (see Algorithm 1) is developed to leverage the power of parallel computation, achieving the following benefits: 1) it avoids data movement between GPU and CPU, alleviating the throughput bottleneck of the PCIe bus; 2) it scans n-grams in parallel: instead of sequentially scanning tokens to detect repeated n-grams, it uses as many threads as there are n-grams generated up to step t, and each sample in a batch is processed in parallel by multiple thread blocks; 3) it uses GPU shared memory for faster memory access.
Since each token needs to be read multiple times (equal to the token length of the n-gram), tokens are stored in shared memory instead of global memory for faster access. Jia et al. (2018) report that the shared memory bandwidth of the Volta V100 is 16x its global memory bandwidth. Although there are multiple ways to organize CUDA thread blocks, our approach assigns each n-gram to a thread and each thread block to a sequence stream. In this way, block n-gram repeats is parallelized along both the horizontal and vertical dimensions of a batch.
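The actual implementation is a CUDA kernel; as an illustration of the same parallel idea, here is a vectorized PyTorch sketch that checks all historical n-grams of a batch at once (illustrative, not the FastSeq kernel):

```python
import torch

def block_ngram_repeats(tokens: torch.Tensor, probs: torch.Tensor, n: int):
    # tokens: (batch, cur_len) previously generated token ids, kept on the GPU
    # probs:  (batch, vocab)   current-step probabilities, modified in place
    batch, cur_len = tokens.shape
    if cur_len < n:
        return probs
    ngrams = tokens.unfold(1, n, 1)                      # (batch, num_ngrams, n)
    context = tokens[:, cur_len - n + 1:].unsqueeze(1)   # (batch, 1, n-1)
    # Compare every n-gram's first n-1 tokens with the context, in parallel.
    match = (ngrams[:, :, :n - 1] == context).all(dim=2)
    rows, cols = torch.nonzero(match, as_tuple=True)
    # Zero the probability of each token that would complete a repeat.
    probs[rows, ngrams[rows, cols, n - 1]] = 0.0
    return probs
```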

Asynchronous Pipeline with Parallel I/O
As shown in Figure 1a, post-process takes significant time (6.8s) in the generation process. It is under-optimized in many existing seq2seq frameworks. One reason is that post-process is not part of the training process, so much effort is spent on optimizing the training pipeline and the model structure rather than the generation speed. Another reason is that, even in work focusing on generation speed, such as model distillation, the reported speed metric only covers the computation time and excludes post-process. For example, FairSeq does not consider the post-process time when it measures speed. These biases result in a big, overlooked speed-up opportunity.
To improve the efficiency of the pipeline, we develop an asynchronous pipeline with parallel I/O. Similar to the pre-fetch technique, which loads the next batch of data to the GPU while running inference on the current batch, we post-process the current batch in a background thread while running generation on the next batch.
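A minimal sketch of the idea (function names are illustrative, not FastSeq's API): generation of batch i+1 overlaps with detokenization and writing of batch i.

```python
from concurrent.futures import ThreadPoolExecutor

def generate_with_async_postprocess(batches, generate, detokenize, write):
    # A background thread post-processes the previous batch while the GPU
    # generates the next one, so post-process no longer blocks generation.
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = None
        for batch in batches:
            hypos = generate(batch)  # GPU-bound generation
            if pending is not None:
                pending.result()     # keep output order; one batch in flight
            pending = pool.submit(lambda h=hypos: write(detokenize(h)))
        if pending is not None:
            pending.result()
```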

Evaluation
In the benchmarks, FairSeq and HuggingFace-Transformers are used as the baselines to evaluate performance. The selected models cover different kinds of architectures, including encoder-decoder models (e.g., BART, DistilBART, T5, ProphetNet), decoder-only models (e.g., GPT2), and encoder-only models (e.g., UniLM). The CNN/DailyMail dataset (Hermann et al., 2015) and WMT'16 (Bojar et al., 2016) are used as the benchmark datasets. The benchmark experiments are split into two groups: 1) HuggingFace-Transformers with/without FastSeq; 2) FairSeq with/without FastSeq. If both FairSeq and HuggingFace-Transformers have implemented a model, we choose the faster result as the baseline.

End-to-end Performance
The end-to-end benchmarks (including model loading, preprocess, model inference, and post-process) have been conducted to evaluate the performance. For each model, we use the same configuration except for the batch size: we search for the largest usable batch size for each framework by doubling it per search run, as sketched below. Each experiment is executed 10 times, and the average running time is taken as the final result. Speed is measured in samples per second.
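A sketch of the batch size search (names are illustrative; the failure signal is assumed to be a runtime error such as CUDA out of memory):

```python
def find_max_batch_size(run_benchmark, start: int = 1):
    # Double the batch size until inference fails, then report the largest
    # size that succeeded.
    size, best = start, None
    while True:
        try:
            run_benchmark(batch_size=size)
        except RuntimeError:  # e.g., CUDA out of memory
            return best
        best, size = size, size * 2
```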
With the optimizations of FastSeq, the end-to-end performance yields a roughly 4x to 9x speedup; see Table 1 for more details. In the baseline, on the summarization dataset CNN/DailyMail, the speed of all models (e.g., BART, DistilBART, ProphetNet, GPT2, UniLM) is between 1.7 and 3.4 samples per second; enabling FastSeq boosts the speed several-fold, as reported in Table 1. In the following sections, we present analyses of the three optimizations used in FastSeq.

Analysis of the Cache Optimization
To evaluate the effect of the cache optimizations introduced in Section 4.1, Table 2 compares the results of using no cache, the conventional cache, and the proposed optimized cache. Although the computational complexity is the same for both cache-based approaches, the proposed cache optimization reduces GPU memory usage by 3.5 times. The smaller cache speeds up concat operations, reduces data movement during beam reordering, and allows a larger batch size. Together, these advantages increase the generation throughput from 5.6 to 18.4 samples/s.

Analysis of Block N-Gram Repeats
To demonstrate the effectiveness of the GPU kernel described in Section 4.2, the new method is compared with two other methods in Table 3: 1) the one implemented by FairSeq (called baseline); 2) a revised CPU-based kernel, which improves on the baseline by moving data from GPU to CPU once before computing, to avoid multiple data transfers (called CPU kernel). The time difference between the baseline and the CPU kernel (4477.1 ms vs. 584.9 ms) indicates that the data transfer optimization alone yields about an 8x speedup. Furthermore, the proposed GPU kernel, which avoids data transfer entirely and uses parallel computation, has about a 75x speed gain over the CPU kernel. As shown in Figure 1b, the computing time after optimization becomes quite small, dropping from about 25% to 1% of the overall time.

Analysis of Asynchronous Pipeline with Parallel I/O
Table 2 also compares the performance of the synchronized, single-process pipeline implemented by FairSeq with the proposed asynchronous pipeline with parallel I/O in FastSeq. The throughput increases from 2.4 samples/s to 3.6 samples/s (around 1.5x). The speedup comes from better resource scheduling: the asynchronous pipeline allows post-process to run in the background while the model inference is running, and detokenization is multi-threaded. As shown in Figure 1b, the post-process time is reduced from about 38% to 1% of the overall time.

Analysis of Generation Quality
None of the optimizations in FastSeq affects the model generation quality. As discussed in Section 4, the logic for detecting repeated n-grams is the same for the CPU-based and GPU-based kernels, and the asynchronous pipeline with parallel I/O only optimizes I/O efficiency, so these two optimizations do not change the model outputs in any fashion. The attention cache optimization does not affect model outputs in theory either. In practice, however, when using mixed precision (e.g., float16) for inference, there may be a few trivial differences in the outputs due to numerical stability issues on the GPU. Similar differences can be observed when changing the batch size during float16 inference. With float32, the generated results are exactly the same, which indicates that the minor differences are not caused by the proposed cache optimization itself. In FastSeq, unit tests have been developed to ensure that the inference outputs are the same with and without FastSeq when using float32, as sketched below.
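A sketch of such an equivalence check (function names are illustrative, not FastSeq's actual test suite):

```python
import torch

def assert_outputs_unchanged(generate_baseline, generate_fastseq, batches):
    # With float32 inference, outputs with and without FastSeq should be
    # token-for-token identical.
    for batch in batches:
        ref, out = generate_baseline(batch), generate_fastseq(batch)
        assert len(ref) == len(out)
        assert all(torch.equal(r, o) for r, o in zip(ref, out))
```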
We also compare the output quality on the CNN/DailyMail dataset (Table 4). The ROUGE-1/2/L scores are identical or nearly identical with and without FastSeq (e.g., 44.17/21.17/41.28 for ProphetNet-large in both settings), demonstrating that FastSeq does not impact model quality.

Related Work
A variety of efforts have been made to improve the efficiency of Transformer models. From the perspective of model architecture, there are efforts on reducing the attention matrix size by chunking input sequences into blocks (Beltagy et al., 2020) or using strided convolution over the keys and queries to compress memory (Liu et al., 2018). Another kind of approach focuses on reducing model size and memory consumption via weight quantization (Zafrir et al., 2019), weight sharing (Dehghani et al., 2019), and weight pruning (Michel et al., 2019). Knowledge distillation is another popular approach (Hinton et al., 2015). On the infrastructure side, a number of innovations have been developed to speed up the serving of Transformer models. Fused chains of basic operators in the attention layers have been widely adopted in many frameworks (e.g., ONNX Runtime, DeepSpeed). It is also performance-critical to optimize data layout and movement among connected operations (Ivanov et al., 2020). For the situation of varied input lengths, TurboTransformers (Fang et al., 2021) better serves online models by using a dynamic batch scheduler and more efficient memory allocation and deallocation algorithms. FasterTransformers deeply optimizes the kernels of the encoder, decoder, and beam search to better utilize the compute power of Tensor Cores.

Conclusion
In this work, we present FastSeq, which provides general solutions for speeding up sequence generation without accuracy loss. The proposed optimizations include an attention cache optimization, a GPU-based n-gram blocking algorithm, and an asynchronous generation pipeline with parallel I/O.

A Cache Optimization in Self-Attention

First, we can split the cached key and value into two parts: $\mathrm{Cache}_K$ and $\mathrm{Cache}_V$ for the prefix, and $\mathrm{Cache}_{K_t}$ and $\mathrm{Cache}_{V_t}$ for the generated sequence at step $t$, as below:

$$K_t^{[B \times M, t, D]} = \mathrm{concat}(\mathrm{Cache}_{K_{t-1}}^{[B \times M, t-1, D]},\; y_{t-1} \cdot W_k^{[D,D]})$$
$$V_t^{[B \times M, t, D]} = \mathrm{concat}(\mathrm{Cache}_{V_{t-1}}^{[B \times M, t-1, D]},\; y_{t-1} \cdot W_v^{[D,D]}) \tag{2}$$

The above split operation results in incompatible shapes between $\mathrm{Cache}_K$ and $\mathrm{Cache}_{K_t}$, and between $\mathrm{Cache}_V$ and $\mathrm{Cache}_{V_t}$. Instead of reorganizing these cached keys and values, Equation (3) is leveraged to compute $attn_t$. This way, the expensive concat operations on large tensors can be avoided.

$$
\begin{aligned}
attn\_w_0^{[B \times M, 1, N]} &= \mathrm{einsum}(Q_t, \mathrm{Cache}_K) \\
attn\_w_1^{[B \times M, 1, t]} &= Q_t \cdot K_t^T \\
attn\_w^{[B \times M, 1, N+t]} &= \mathrm{concat}(attn\_w_0, attn\_w_1) \\
attn\_prob^{[B \times M, 1, N+t]} &= \mathrm{softmax}\!\left(\frac{attn\_w}{\sqrt{d_k}}\right) \\
attn\_prob_0^{[B \times M, 1, N]},\; attn\_prob_1^{[B \times M, 1, t]} &= \mathrm{split}(attn\_prob) \\
attn_{t0}^{[B \times M, 1, D]} &= \mathrm{einsum}(attn\_prob_0, \mathrm{Cache}_V) \\
attn_{t1}^{[B \times M, 1, D]} &= attn\_prob_1 \cdot V_t \\
attn_t^{[B \times M, 1, D]} &= attn_{t0} + attn_{t1}
\end{aligned} \tag{3}
$$

B Cache Optimization in Encoder-Decoder Attention
The first step is to remove the duplication in $\mathrm{Cache}_K$ and $\mathrm{Cache}_V$. For the incompatible shapes between $Q_t$ and $\mathrm{Cache}_K$, einsum is leveraged to avoid the reshape.
$$\mathrm{Cache}_K^{[B,1,N,D]} = S^{[B,1,N,D]} \cdot W_k$$
$$\mathrm{Cache}_V^{[B,1,N,D]} = S \cdot W_v$$
$$attn\_w^{[B \times M,1,N]} = \mathrm{einsum}(Q_t, \mathrm{Cache}_K)$$
$$attn\_prob_t^{[B \times M,1,N]} = \mathrm{softmax}\!\left(\frac{attn\_w}{\sqrt{d_k}}\right)$$
$$attn_t^{[B \times M,1,D]} = \mathrm{einsum}(attn\_prob_t, \mathrm{Cache}_V)$$

As such, the size of $\mathrm{Cache}_K$ and $\mathrm{Cache}_V$ can be reduced by $M$ times, from $B \times M \times N \times D$ to $B \times 1 \times N \times D$. Then the optimization benefits in self-attention can be achieved here as well, including peak memory reduction and a larger batch size. Additionally, the cached key and value do not need to be reordered, since the elements at the beam dimension are exactly the same.
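A minimal PyTorch sketch of the above (single-head; names are illustrative): the encoder states are projected once per sample, and einsum bridges the beam-expanded query with the de-duplicated cache.

```python
import torch

def enc_dec_attention_step(q_t, enc_state, w_k, w_v, d_k):
    # q_t:       (B*M, 1, D) decoder query at step t
    # enc_state: (B, N, D)   encoder output S, stored once per sample
    B, N, D = enc_state.shape
    M = q_t.size(0) // B
    cache_k = enc_state @ w_k   # (B, N, D); computed once, never reordered
    cache_v = enc_state @ w_v
    q = q_t.reshape(B, M, 1, D)
    w = torch.einsum("bmqd,bnd->bmqn", q, cache_k) / d_k ** 0.5
    probs = torch.softmax(w, dim=-1)
    attn_t = torch.einsum("bmqn,bnd->bmqd", probs, cache_v)
    return attn_t.reshape(B * M, 1, D)
```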