Easy and Efficient Transformer: Scalable Inference Solution For Large NLP Model

Recently, large-scale transformer-based models have proven effective across various tasks in many domains. Nevertheless, applying them in industrial production requires tedious and heavy engineering work to reduce inference costs. To fill this gap, we introduce a scalable inference solution, the Easy and Efficient Transformer (EET), which includes a series of transformer inference optimizations at the algorithm and implementation levels. First, we design highly optimized kernels for long inputs and large hidden sizes. Second, we propose a flexible CUDA memory manager to reduce the memory footprint when deploying a large model. Compared with the state-of-the-art transformer inference library (Faster Transformer v4.0), EET achieves an average speedup of 1.40-4.20x on the transformer decoder layer with an A100 GPU.


Introduction
Large-scale pre-training has been shown to be an effective way to obtain powerful and flexible language models. A simple transformer model backed by sufficient parameter size, data size, and computational budget can outperform complex algorithms. Kaplan et al. (Kaplan et al., 2020) have exhaustively studied this phenomenon and proposed that the loss of a language model scales as a power-law with model size, dataset size, and the amount of compute used for training. In recent years, organizations like Nvidia and OpenAI have published many large-scale pre-trained models such as Megatron (Shoeybi et al., 2020), GPT-3 (Brown et al., 2020), CLIP (Radford et al., 2021), and DALL·E (Ramesh et al., 2021), which have achieved excellent results on a series of tasks. Based on these powerful models, many innovative applications and startups have been incubated, showing great economic and social value.

Although these large-scale pre-trained models are very effective, they are also very expensive at inference time. For example, a GPT-2 model with 0.3B parameters takes about 10 seconds to perform an inference step on one RTX 2080ti without any optimization. To reduce the cost of deploying pre-trained models, many methods have been proposed to reduce the computational overhead, including knowledge distillation (Hinton et al., 2015), model pruning (Voita et al., 2019), and quantization (Shen et al., 2019). Besides these methods, efforts have also been made to improve the utilization of hardware when performing model inference. TensorRT and Faster Transformer, developed by Nvidia, and LightSeq (Wang et al., 2021), developed by ByteDance, include many optimization techniques such as kernel fusion, GEMM optimization, and low-precision kernels. These works greatly improve the inference efficiency of the transformer. However, several issues remain: they are not efficient at handling the context, some of them cannot support hidden sizes beyond 1024, and applying them in real applications is non-trivial.
In this paper, we present a novel transformer inference acceleration library based on PyTorch (Paszke et al., 2019), the Easy and Efficient Transformer (EET). In EET, we implement a set of optimization techniques that distinguish it from existing libraries. Firstly, we propose to use pre-padding for sequence padding. We argue that this brings twofold benefits: pre-padding keeps the relative positions between the input context and the decoded tokens correct, consistent with the pre-training setting, and it also simplifies the implementation of dynamic batching at inference time. Secondly, we implement novel CUDA kernels that fuse the attention masks into the computation, avoiding explicit masking operations on the attention weights. Finally, we extend all kernels to support larger hidden sizes and longer context lengths.
Deploying a PyTorch model with TensorRT or Faster Transformer is complicated. We simplify the deployment workflow with EET. EET provides Python-level APIs, so there is no need to convert models to ONNX or other formats. It also integrates a Python web server that supports dynamic batching, further reducing the complexity of deployment. For generative inference, the key idea is to use an incremental_state dictionary inside the decoder to cache the keys and values produced at previous steps; this cache then participates in the masked self-attention of subsequent steps, avoiding duplicated computation.
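As an illustration of this caching pattern, here is a minimal PyTorch sketch of a single attention step with such a cache; the function and the layer_id key are illustrative and do not reflect EET's actual implementation.

import torch

def cached_self_attention(layer_id, query, key, value, incremental_state):
    # incremental_state maps a layer id to the keys/values of all previous steps.
    if layer_id in incremental_state:
        prev_key, prev_value = incremental_state[layer_id]
        key = torch.cat([prev_key, key], dim=1)        # (batch, steps, dim)
        value = torch.cat([prev_value, value], dim=1)
    incremental_state[layer_id] = (key, value)         # reused at the next step

    # Attention over the full cached history; only the new query is recomputed.
    scores = torch.matmul(query, key.transpose(1, 2)) / key.size(-1) ** 0.5
    probs = torch.softmax(scores, dim=-1)
    return torch.matmul(probs, value)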

Pre-padding decoding
In standard incremental decoding, every token in the context is processed one by one and treated as if it were a predicted token, which means the number of decoding steps equals the number of time steps no matter how much context is provided. However, the context in a batch can be processed concurrently, similar to teacher forcing in the training phase. A simple idea is to feed the whole context into the network at once together with a sequence mask, saving the keys and values of every context token for the subsequent incremental decoding. The number of decoding steps is then reduced to roughly the total number of time steps minus the context length, since the whole context is consumed in a single forward pass. However, this is not sufficient when the sequences in a batch have different context lengths, and an extra padding step has to be applied. Padding on the right side shifts the relative positions of the generated tokens and would have to be handled carefully; we instead preserve the relative positions by padding on the left side. After that, we pack all the sequences in the context into a batch. See figure 2.
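As a minimal illustration of left-side padding (a sketch only, with pad_id as an assumed placeholder token, not EET's packing code):

import torch

def pre_pad_batch(contexts, pad_id=0):
    # Left-pad every context so the last real token is aligned on the right,
    # preserving the relative positions between context and generated tokens.
    max_len = max(len(c) for c in contexts)
    batch = torch.full((len(contexts), max_len), pad_id, dtype=torch.long)
    for i, c in enumerate(contexts):
        batch[i, max_len - len(c):] = torch.tensor(c, dtype=torch.long)
    return batch  # padded tokens sit on the left, real tokens on the right

# Example: three prompts of different lengths packed into one batch.
print(pre_pad_batch([[5, 6, 7], [8, 9], [10]]))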
We have implemented pre-padding decoding by modifying Fairseq and compared it with the original incremental decoding. See figure 3. A clear gain is obtained from the parallel processing of the context.
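Putting the pieces together, the decoding flow can be sketched in Python as follows, assuming a generic model(input_ids, incremental_state) interface; the signature is an assumption for illustration and matches neither Fairseq nor EET exactly.

import torch

def pre_padding_generate(model, padded_context, num_new_tokens):
    # One parallel forward pass over the whole left-padded context fills the
    # key/value cache and yields the first generated token ...
    incremental_state = {}
    logits = model(padded_context, incremental_state)
    next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
    generated = [next_token]

    # ... and every further token then costs a single incremental step.
    for _ in range(num_new_tokens - 1):
        logits = model(next_token, incremental_state)
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated.append(next_token)
    return torch.cat(generated, dim=1)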

High Performance Kernels
Faster Transformer (FT) achieves state-of-the-art inference performance for transformer-based models on the NVIDIA GPU platform. We design our custom kernels based on the implementation of FT and make further optimizations, based on three observations:
1. Padded tokens are useless for the final prediction results. Removing the computation on padded tokens in multi-head attention, instead of performing wasted calculations under a padding mask, is beneficial for inference.
2. The sequence mask is important for implementing masked multi-head attention when the sequences in a context are inferred in parallel, but frequently constructing the mask as the context changes is also very time-consuming.
3. Pre-padding decoding can be implemented on top of FT, performing parallel context inference and using dynamic batching to improve the throughput of an online service.
We first design kernels that remove the sequence mask, which otherwise requires costly construction and accesses to device memory. Then we fuse the padding mask into the kernels to avoid the invalid computation on padded tokens. Removing these two explicit masks not only improves performance but also makes it easier to realize our customized decoding method, and we implement pre-padding decoding to take full advantage of the parallelism within the context. Finally, we extend the kernels to support model sizes and sequence lengths beyond 1024.

Mask Fusion
Like kernel fusion, mask fusion integrates the mask into the kernel itself, so the kernel no longer takes an explicit mask argument. When we design the kernels, each CUDA thread or block is mapped to one token's data in the multi-head attention computation. Sequence-mask fusion is achieved by feeding a -inf value into the softmax whenever the position of the key token represented by the thread index is larger than the position of the query token being processed. When a block sequentially processes tokens, the computation on padded tokens can be skipped by starting or ending the loop at the padding offset, depending on whether the padding is on the left or the right side. We give pseudocode here; the C++ code can be found in our open-source framework (EET). This padding-mask fusion can be applied to the softmax of GPT-2, both in the context forward pass and in the incremental decoding forward pass, as well as to the Bert encoder.

prompt_len = min(seq_len_in_context)
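To make the fused-mask logic concrete, the sketch below expresses the same conditions in Python without any explicit mask tensor; this is purely illustrative (the production implementation is a C++/CUDA kernel in EET), and names such as pad_lens and query_pos are assumptions.

import math
import torch

def fused_mask_softmax(scores, pad_lens, query_pos):
    # scores: (batch, key_len) attention logits of the query token at query_pos.
    # pad_lens: number of left-padding tokens in each sequence of the batch.
    batch, key_len = scores.shape
    for b in range(batch):
        for k in range(key_len):
            # Sequence mask fused as a position test: future key positions get -inf.
            # Padding mask fused as an offset: left-pad positions are likewise
            # given -inf here, whereas the real kernel simply skips them.
            if k > query_pos or k < pad_lens[b]:
                scores[b, k] = -math.inf
    return torch.softmax(scores, dim=-1)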
Pre-padding decoding is an approach to make effective use of the context, and mask fusion allows us to implement customized decoding without constructing diverse masks. Combining these two components, we realize pre-padding decoding in EET.

Thread Block Folding
In the study of large models, most state-of-the-art transformer-based models have a model size greater than 1024. However, since a CUDA block supports at most 1024 threads, the kernels in FT v3.1 limit the model size and sequence length to 1024, making them unusable in many cases. To break this limit, FT v4.0 directly specifies the number of threads in a block based on the model dimension, choosing among 128, 256, 384, or 512, and then assigns multiple blocks to the data of one model dimension. For example, if the model dimension is 1280 and the block size is set to 128 threads, 10 blocks are created for the data of one model dimension. In some cases this straightforward method causes serious performance degradation because there are not enough threads in one block, especially when the block size is set to 128 or 256.
We propose a Thread-Block-Folding approach to upgrade all kernels in FT so that they support larger model sizes and longer sequence lengths. This approach allows us to extend any kernel to any model size and any sequence length with minimal changes and without degrading performance. Taking model-size expansion as an example, the core idea of Thread-Block-Folding is to keep the task assignment over the grid unchanged in the (batch, seq_len) dimensions and to redesign how multiple blocks cover the (head_num, size_per_head) dimensions. See figure 4. When the head_num * size_per_head tasks cannot be held by one block, they are folded in half and assigned to an additional thread block. If two blocks are still not enough, we fold again until all the tasks fit into the blocks. A fold coefficient is introduced to characterize the number of folds as a function of the model size.
To give an example, assuming the model size is 1280, we fold it in half and create another block, so the 1280 elements are assigned to 2 blocks with 640 threads in each block; the fold_coeff is set to 2.
We use NT to denote the number of threads in a block: NT = model_size / fold_coeff, where fold_coeff is the smallest power of two such that NT does not exceed the 1024-thread limit. We can then deduce that NT never falls below 512 when model_size is larger than 1024, which is beneficial for GPU task scheduling and makes the most of thread parallelism.
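The following small Python sketch captures our reading of this rule; the actual kernel-launch logic lives in EET's C++/CUDA sources.

def thread_block_folding(model_size, max_threads=1024):
    # Fold the model dimension in half until one block's share fits under the
    # 1024-thread CUDA limit; fold_coeff is the resulting power of two.
    fold_coeff = 1
    while model_size / fold_coeff > max_threads:
        fold_coeff *= 2
    nt = model_size // fold_coeff  # threads per block (NT); > 512 when model_size > 1024
    return fold_coeff, nt

# Example from the text: model size 1280 -> fold_coeff 2, 640 threads per block.
print(thread_block_folding(1280))   # (2, 640)
print(thread_block_folding(4096))   # (4, 1024)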

Easy and Efficient Transformer
Easy and Efficient Transformer (EET) is a high-performance PyTorch inference plugin, designed especially for large transformer-based models and long-sequence scenarios. We develop EET along two axes: efficiency and usability.
• Efficiency. As mentioned above, the pre-padding decoding mechanism and the custom CUDA kernels are designed for efficiency, accompanied by sophisticated GPU memory pre-allocation and management.
• Usability. First, APIs at both the op level and the model level allow users to construct their own models or flexibly replace parts of an algorithm. Second, we provide a smart web pipeline built directly on top of the original engineering project.

Op-level and model-level APIs
EET provides two levels of APIs, operator level and model level. See table 1. Users can assemble their own models with the provided APIs if needed, and adjust them through the API parameters, which allows for a more flexible model definition. What is more, EET can be integrated into Fairseq and Transformers simply by overriding the specified files, improving performance without any code change and bypassing the model-conversion process.
Table 1: Op-level and model-level APIs.

Smart service deployment pipeline
EET builds the acceleration on top of the original PyTorch project, so a Python web framework can be used to serve the model directly. Service-Streamer (ShannonAI, 2021) is selected because it can collect incoming requests into batches. EET supports dynamic batching and variable-length inputs to match this feature and achieve high throughput. See figure 5.

Experiments
We tested the inference performance of GPT-2 using EET on A100 and 2080ti, comparing it with Fairseq and Faster Transformer. We designed two sets of experiments: 1) the first set tested the speedup of EET over Fairseq and Faster Transformer at different input lengths; 2) the second set tested the speedup of EET at different hidden sizes. For the first set of experiments, since Faster Transformer v3.1 does not support model sizes larger than 1024, the hidden size is set to 1024.

Inference performance at different input lengths
The configuration for testing performance as the input length varies is shown in table 2.
Table 2: Configuration for performance with prompt. Batch size: 4; hidden size: 1024; maximum context length: 1024; precision: fp16.

The speedup increases as the prompt ratio increases, whether the baseline is Fairseq or FT. This is because the GPU tolerates parallelism well: pre-padding decoding is an efficient way to increase token parallelism, which is eventually converted into thread parallelism and instruction parallelism. As the prompt ratio increases, the accelerated portion grows and the effect gets better and better.
Since FT already includes CUDA kernel optimizations, there is also a significant difference between the speedup over FT and the speedup over Fairseq. The gain brought by kernel optimization is about 3 to 4 times, while the total gain is about 4 to 50 times.

Performance with model dimension
The basic configuration is shown in table 3.

Table 3: Basic configuration for performance with model dimension. Batch size: 4; prompt ratio: 50%; sequence length: 1024; precision: fp16.

Figure 8 and figure 9 show the results on A100 and 2080ti, respectively. We can see the same trend in both graphs: the larger the model dimension, the smaller the speedup. This is because all optimization techniques ultimately boil down to increasing hardware utilization. As the model dimension grows, the hardware is already better utilized and the remaining room for optimization shrinks, which is consistent with the trend shown in the figures.

Bert on A100 and 2080ti
We also evaluate EET with Bert-base on NVIDIA A100 and 2080ti. Different configurations were selected, as figure 10 and figure 11 show. From the figures we can see that the speedup decreases as the batch size and the sequence length increase, for the same reason as the performance variation with model dimension on GPT-2: whether we increase the batch size or the sequence length, the utilization of computational resources grows, leaving a smaller space for optimization. The speedup on Bert ranges from about 1.5 to 5 times. We also compare the performance of EET and FT on Bert. Since the main optimization point is removing the attention computation on padded tokens, we compare them with padding accounting for 50% of the total sequence length, and obtain a slight speedup of 1 to 1.27 times over FT.

Conclusion
We propose a comprehensive set of optimizations for transformer-based models that exploit both algorithmic features and GPU hardware features, and we develop the EET framework as a PyTorch plugin focused on the inference of large transformer-based models and long sequences. As a result, we obtain a 1.5 to 15 times improvement on the GPT-2 model and a 1.0 to 1.27 times improvement on the Bert model compared to Faster Transformer.