Sequence Parallelism: Long Sequence Training from System Perspective

Transformer achieves promising results on various tasks. However, self-attention suffers from quadratic memory requirements with respect to the sequence length. Existing work focuses on reducing time and space complexity from an algorithm perspective. In this work, we propose sequence parallelism, a memory-efficient parallelism that addresses this issue from a system perspective instead. Our approach is compatible with most existing parallelisms (e.g., data, pipeline, and tensor parallelism), which means that sequence parallelism makes 4D parallelism possible. More importantly, we no longer require a single device to hold the whole sequence. Moreover, when combined with efficient attention of linear complexity, sequence parallelism in theory enables us to train Transformer with infinitely long sequences. Specifically, we split the input sequence into multiple chunks and feed each chunk into its corresponding device (i.e., GPU). To compute the attention output, we integrate ring-style communication with the self-attention calculation and propose Ring Self-Attention (RSA). Experiments show that sequence parallelism performs well when scaling with batch size and sequence length. Compared with tensor parallelism, our approach achieved 13.7× and 3.0× maximum batch size and sequence length, respectively, when scaling up to 64 NVIDIA P100 GPUs. With efficient attention, sequence parallelism can handle sequences with over 114K tokens, which is over 27× longer than existing efficient attention works that hold the whole sequence on a single device.


Introduction
Attention-based models (e.g., Transformer) have achieved impressive performance on various natural language processing tasks (e.g., Q&A [Qu et al., 2019], relation extraction [Xue et al., 2020b; Xue et al., 2020a]). Recently, Transformer has also achieved promising results on computer vision tasks [Dosovitskiy et al., 2020; Zhang et al., 2021] and even on bioinformatics tasks [Wang et al., 2021]. These Transformer-based models learn powerful context-aware representations by applying self-attention to all pairs of tokens from the input sequence. This mechanism captures long-term dependencies at the token level for sequence modeling. However, self-attention suffers from quadratic memory requirements with respect to sequence length. Existing systems require the whole sequence to be held on one GPU, which limits the length of the input sequence. Unfortunately, long sequences are common in real-world applications. For instance, when we train Transformer for medical image classification, each image is much larger than usual (e.g., 512×512×512 vs 256×256×3). Each medical image therefore contains many more tokens, that is, each input sequence is much longer than usual. In this case, it is challenging to hold the whole sequence within a single GPU.
In this paper, we designed and implemented sequence parallelism (SP), a novel parallelism aiming at breaking the limitation that we need to store the whole sequence in one GPU. The proposed system can train transformer-based models with longer sequences and a larger batch size. We first split the input sequence into multiple chunks along the sequence dimension and feed each sub-sequence chunk to one corresponding GPU. Each GPU thus only holds a part of the full sequence. To apply self-attention to the tokens from different chunks, the main challenge is to compute attention scores and outputs across GPUs efficiently. To tackle this problem, we proposed Ring Self-Attention (RSA), which circulates key and value embeddings across GPUs in a ring manner. In this case, each device is just required to keep the attention embeddings corresponding to its own sub-sequence. As a result, our sequence parallelism is memory-efficient, especially for long input sequences.
To model long sequences, existing works mainly focus on sparse attention (e.g., [Zaheer et al., 2020]) with linear instead of quadratic space complexity. In this paper, we aim to solve the long sequence modeling problem from the distributed system perspective. In contrast to sparse attention, we design and implement a system, rather than a deep learning algorithm, to train attention-based models with longer sequences. Existing pipeline parallelism (PP) [Huang et al., 2018] and tensor parallelism (TP) [Shoeybi et al., 2019] are designed to cope with a larger model size instead of longer sequences, although they can still process longer sequences to some extent. The challenge, however, is that these existing parallelism methods keep the whole sequence on a single device, which limits the maximum length of the input sequence. In contrast, our approach splits the whole sequence across multiple devices, making it possible to fit longer input data.
In summary, our main contributions are threefold: (1) Our system breaks the length limitation of Transformer model training. SP splits long sequences into multiple chunks and feeds them into different devices. It is memory-efficient because each device only keeps the attention embeddings corresponding to its own sub-sequences. Theoretically, with linear space complexity attention, SP can help us train the attention model with infinitely long sequences. (2) To the best of our knowledge, our work is the first to use a distributed system to handle long sequence training for attention-based models. Our implementation is fully based on PyTorch and is compatible with data parallelism (DP), PP, and TP without any extra compiler or library. This makes it possible to integrate SP with DP, PP and TP into 4D parallelism, and paves the way to train large-scale models with long sequences. (3) Our system achieves 3.0× the maximum sequence length of the SoTA (i.e., tensor parallelism) when scaling up to 64 NVIDIA P100 GPUs. On shorter sequences, our system is still more memory-efficient, achieving 13.7× the maximum batch size.

Background
Self-attention We first briefly review the self-attention mechanism in Transformer. For an input sentence $X = \{x_1, \ldots, x_N\}$ with $N$ tokens, we encode every token $x$ into three attention embeddings (i.e., query $q$, key $k$, value $v$). To model the dependency among tokens, self-attention computes the attention scores for each token $x_i$ against all other tokens in $X$ by multiplying $q_i$ with $k$ of all tokens. The attention scores are then multiplied with $v$ and summed up to give the attention output. Please see Appendix for the details.
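The review above can be made concrete with a minimal NumPy sketch of single-head self-attention. It assumes the standard scaled dot-product form with a softmax over the scores; the projection matrices `Wq`, `Wk`, `Wv`, the toy sizes and the random inputs are illustrative assumptions, not values from this paper.

```python
import numpy as np


def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def self_attention(X, Wq, Wk, Wv):
    # Encode every token into query, key and value embeddings.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # Scores of every token against all tokens: an (N, N) matrix,
    # which is the source of the quadratic memory requirement.
    scores = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    # Weighted sum of the value embeddings.
    return scores @ V, scores


rng = np.random.default_rng(0)
n_tokens, d_model, d_k = 6, 8, 4  # hypothetical toy sizes
X = rng.standard_normal((n_tokens, d_model))
Wq, Wk, Wv = (rng.standard_normal((d_model, d_k)) for _ in range(3))
out, scores = self_attention(X, Wq, Wk, Wv)
```

The `scores` array has one row and one column per token, so doubling the sequence length quadruples its size.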
Pipeline parallelism Huge deep neural networks [Fedus et al., 2021; Raffel et al., 2020] have shown their effectiveness on various tasks. However, it is challenging to hold the whole model on one single device due to memory limitations. To overcome this, [Huang et al., 2018] proposed pipeline parallelism, a form of model parallelism that splits the model layers into different partitions on separate accelerators. As shown in Figure 1a, the data is split along the batch dimension into micro-batches, and each device processes one micro-batch received from the previous device at a time. When the computation is pipelined across micro-batches, pipelining schemes need to ensure that inputs use consistent weight versions for both forward and backward computation, so that weight updates are correct and the model converges [Narayanan et al., 2021].
Tensor parallelism Different from PP, which splits models by layer, tensor parallelism (i.e., Megatron) [Shoeybi et al., 2019] introduces tensor splitting, where individual layers of the model are partitioned over multiple devices. Similar to our SP, TP is also designed for Transformer-based models. Each Transformer layer includes a self-attention block and a two-layer multi-layer perceptron (MLP) block. The MLP block can be formalized as:
$$Z = \mathrm{GeLU}(XA), \quad Y = ZB \qquad (1)$$
where GeLU is a non-linear activation function, $X$ is the input data, and $Z$ and $Y$ are the outputs. Tensor parallelism splits the weight matrices $A$ and $B$ along columns and rows, respectively. Then, the first and second GEMM in the MLP block above can be written as:
$$[Z_1, Z_2] = [\mathrm{GeLU}(XA_1), \mathrm{GeLU}(XA_2)], \quad Y = [Z_1, Z_2] \begin{bmatrix} B_1 \\ B_2 \end{bmatrix} = Z_1 B_1 + Z_2 B_2 \qquad (2)$$
At the second GEMM, the partial products computed from $Z_1$ and $Z_2$ need to undergo an all-reduce operation to give the final output before the dropout layer in the Transformer layer.
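As an illustration of the column/row split described above, the following NumPy sketch simulates two devices on one host and checks that summing the partial outputs (the role played by the all-reduce) reproduces the unsplit MLP. It assumes the MLP computes GeLU(XA) followed by multiplication with B; the tanh approximation of GeLU, the 4H intermediate width and the toy sizes are assumptions made for the example.

```python
import numpy as np


def gelu(x):
    # tanh approximation of GeLU (an assumption; any elementwise
    # activation would satisfy the same splitting argument).
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


rng = np.random.default_rng(0)
tokens, H, n_dev = 5, 8, 2  # hypothetical toy sizes
X = rng.standard_normal((tokens, H))
A = rng.standard_normal((H, 4 * H))
B = rng.standard_normal((4 * H, H))

# Reference: the unsplit MLP block.
Y_ref = gelu(X @ A) @ B

# Tensor parallelism: A is split by columns, B by rows.
A_parts = np.split(A, n_dev, axis=1)
B_parts = np.split(B, n_dev, axis=0)

# Each simulated device computes its partial output independently.
partials = [gelu(X @ A_i) @ B_i for A_i, B_i in zip(A_parts, B_parts)]

# The all-reduce is modelled here as a plain sum of the partial outputs.
Y_tp = sum(partials)
```

The column split works because GeLU is applied elementwise, so no communication is needed between the two GEMMs; only the final sum requires it.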
Similarly, Megatron splits the tensors in the self-attention layer as well. For multi-head attention, attention heads are split by column and allocated equally to the devices. The linear layer after the self-attention computation is split by row. An all-reduce operation is needed at the linear layer output to aggregate attention output from all devices. Please refer to Megatron [Shoeybi et al., 2019] for more details about TP.

Sequence parallelism
We propose SP for training Transformer with longer sequences. The overview of SP is shown in Figure 1c. Input sequences are split into multiple chunks and the sub-sequences are fed to their corresponding devices. All devices hold the same trainable parameters but different sub-sequence input chunks. We introduce and analyze SP in detail below. We use the following notation in this section: (1) B: batch size; (2) L: sequence length; (3) H: hidden size of linear layers; (4) A: attention head size; (5) Z: number of attention heads; (6) N: number of GPUs.
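A minimal sketch of the input split, using the notation above (B, L, H, N) with hypothetical toy values. The helper name `split_sequence` is an illustrative assumption and not part of the described system.

```python
import numpy as np


def split_sequence(x, n_devices):
    # x has shape (B, L, H); the split is along the sequence dimension.
    if x.shape[1] % n_devices != 0:
        raise ValueError("sequence length must be divisible by the SP size")
    return np.split(x, n_devices, axis=1)


B, L, H, N = 2, 16, 8, 4  # hypothetical toy sizes
x = np.arange(B * L * H, dtype=np.float32).reshape(B, L, H)
chunks = split_sequence(x, N)  # one (B, L/N, H) chunk per device
```

Each simulated device would receive one entry of `chunks`, so its activation memory scales with L/N rather than L.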

Ring Self-Attention
The main challenge in distributing sub-sequences to multiple devices is calculating attention scores across devices. Therefore, we propose Ring Self-Attention (RSA) to compute attention output in a distributed setting. There are two steps in RSA to obtain the final output. Please note that we only consider bidirectional self-attention here to introduce RSA succinctly. We treat all heads equally, so it can be extended to multi-head attention directly.
We are given query embeddings $\{q^1_1, q^1_2, \ldots, q^N_L\}$, key embeddings $\{k^1_1, k^1_2, \ldots, k^N_L\}$ and value embeddings $\{v^1_1, v^1_2, \ldots, v^N_L\}$, where $q^n_s$ represents the query embedding of the $s$-th token in the sequence, which is on the $n$-th device. We define all key embeddings on the $n$-th device as $K^n$. In RSA, the $n$-th device holds the corresponding query embeddings $Q^n$, key embeddings $K^n$ and value embeddings $V^n$. The embeddings on the $n$-th device correspond to the $n$-th chunk, whose sub-sequence length is $L/N$. Our goal is to obtain $\mathrm{Attention}^n(Q^n, K, V)$, which is the self-attention layer output on the $n$-th device. To this end, as shown in Figure 2a, we first transmit the key embeddings among devices in a circular fashion to calculate the attention scores $QK^T$. Such communication needs to be conducted $N - 1$ times to make sure the query embeddings of each sub-sequence are multiplied with all the key embeddings. More specifically, each device computes the partial attention scores between its local query embeddings $Q^n$ and the key chunk it currently holds, and then passes that key chunk on along the ring; collecting the partial scores over all chunks gives $S^n \in \mathbb{R}^{L/N \times L}$.
In the second step, we compute the output $O^n$. Since computing $O^n$ requires $S^n$ and all value embeddings, as described in Figure 2b, we transmit the value embeddings instead of the key embeddings in a similar way. For $O^n$, we calculate $S^n V$ by:
$$O^n = S^n V = \sum_{i=1}^{N} S^n_i V_i$$
where $V_i = V^i$ is the value chunk held by the $i$-th device and $S^n_i$ is the $i$-th block of $S^n$ after column splitting, which means $S^n_i \in \mathbb{R}^{L/N \times L/N}$ whereas $S^n \in \mathbb{R}^{L/N \times L}$.
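The two RSA stages can be sketched as a single-process NumPy simulation in which Python lists stand in for devices and a list rotation stands in for the ring communication. This is a sketch under stated assumptions, not the PyTorch implementation: the scaling by the square root of the key dimension and the softmax between the two stages follow the standard attention formula, and all function names and sizes are hypothetical.

```python
import numpy as np


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def ring_pass(held, owner):
    # Device n receives the chunk previously held by device n - 1.
    n_dev = len(held)
    held = [held[(n - 1) % n_dev] for n in range(n_dev)]
    owner = [owner[(n - 1) % n_dev] for n in range(n_dev)]
    return held, owner


def ring_self_attention(Q_chunks, K_chunks, V_chunks):
    n_dev = len(Q_chunks)
    c, d_k = Q_chunks[0].shape
    d_v = V_chunks[0].shape[-1]

    # Stage 1: circulate key chunks to fill S^n, one column block per step.
    S = [np.zeros((c, c * n_dev)) for _ in range(n_dev)]
    held, owner = list(K_chunks), list(range(n_dev))
    for step in range(n_dev):
        for n in range(n_dev):
            i = owner[n]  # index of the chunk currently on device n
            S[n][:, i * c:(i + 1) * c] = Q_chunks[n] @ held[n].T / np.sqrt(d_k)
        if step < n_dev - 1:  # N - 1 transfers in total
            held, owner = ring_pass(held, owner)
    S = [softmax(s) for s in S]

    # Stage 2: circulate value chunks and accumulate O^n = sum_i S^n_i V_i.
    O = [np.zeros((c, d_v)) for _ in range(n_dev)]
    held, owner = list(V_chunks), list(range(n_dev))
    for step in range(n_dev):
        for n in range(n_dev):
            i = owner[n]
            O[n] += S[n][:, i * c:(i + 1) * c] @ held[n]
        if step < n_dev - 1:
            held, owner = ring_pass(held, owner)
    return O


rng = np.random.default_rng(0)
L, d, N = 16, 4, 4  # hypothetical toy sizes
Q, K, V = (rng.standard_normal((L, d)) for _ in range(3))
O_ring = np.concatenate(
    ring_self_attention(np.split(Q, N), np.split(K, N), np.split(V, N))
)
O_full = softmax(Q @ K.T / np.sqrt(d)) @ V  # single-device reference
```

In this simulation each device only ever holds one key or value chunk besides its own queries and scores, and the concatenated ring output matches the single-device reference.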

Modelling
We analyzed and compared our SP with TP in both theoretical modeling and experiments, although TP is not a direct baseline. To the best of our knowledge, SP is the first system designed to break the sequence length limitation, so there is no direct baseline. Therefore, since SP is a distributed training system designed for attention-based models, we compare it with a SoTA model parallelism. TP [Narayanan et al., 2021] is compatible with DP and PP, and so is our SP. We expect our system to outperform TP both with and without PP. In the future, we will integrate SP with DP, PP and TP into 4D parallelism.
Here, we mainly focus on memory usage and communication cost. According to the architecture of Transformer, the comparison is divided into two parts: the MLP block and the attention block. In this part, we consider multi-head attention instead of self-attention for a fair and accurate comparison. We assume the optimizer is Adam, as used in Megatron. As shown in Table 1, for the MLP block, TP stores the matrices after row- or column-style splitting, computed over the whole sequence. Our SP stores the unsplit matrices, but only for a single sub-sequence on each GPU. Requiring the per-GPU memory of SP to be smaller than that of TP, we find that, in the MLP block, SP is more memory-efficient when BL > 32H.
As for communication, an all-reduce operation is needed in both the forward pass and backward pass in the MLP block of Megatron due to tensor splitting. As our SP does not split the linear layer weights, no additional communication is required.

Multi-head attention block
We compared the memory usage of multi-head attention block in Table 2. TP splits the attention heads here, but our SP still splits the length dimension of the sequence data. By comparing the memory usages of multi-head attention block of the two parallelisms, we can find SP is more memory-efficient if BL > 16AZ. As for communication, TP needs an all-reduce operation in both the forward pass and backward pass when calculating the attention output. In our RSA, to facilitate tensor exchange between devices, our communication is equivalent to 2 all-reduce operations in the forward pass and 4 all-reduce operations in the backward pass. The extra communication cost of RSA can be offset by the lack of communication cost in the MLP block.
In both MLP block and multi-head attention block, SP is more memory-efficient when we train Transformer with a longer sequence and a larger batch size. We analyze the communication overhead of SP in Appendix.
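The two conditions stated above, BL > 32H for the MLP block and BL > 16AZ for the multi-head attention block, can be evaluated for a concrete configuration. The sketch below plugs in the standard BERT Base shape (H = 768, Z = 12 heads of size A = 64) at sequence length 512; the helper names are hypothetical.

```python
def sp_is_more_memory_efficient(B, L, H, A, Z):
    # Direct evaluation of the two inequalities from the analysis.
    return {"mlp": B * L > 32 * H, "attention": B * L > 16 * A * Z}


def smallest_favourable_batch(L, H, A, Z):
    # Smallest integer batch size satisfying both inequalities at length L.
    mlp = (32 * H) // L + 1
    attention = (16 * A * Z) // L + 1
    return max(mlp, attention)


bert_base = dict(H=768, A=64, Z=12)
at_batch_64 = sp_is_more_memory_efficient(B=64, L=512, **bert_base)
at_batch_16 = sp_is_more_memory_efficient(B=16, L=512, **bert_base)
threshold = smallest_favourable_batch(L=512, **bert_base)
```

For this shape, 32H = 24576 and 16AZ = 12288, so at L = 512 the MLP condition is the binding one and both conditions hold once the batch size exceeds 48.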

Experimental setup
We conducted our experiments on the Piz Daint supercomputer provided by the Swiss National Supercomputing Center (CSCS). The Piz Daint supercomputer provides one P100 GPU (16GB GPU RAM) for each compute node, and the compute nodes are connected by a high-bandwidth network. We chose two bidirectional language models, namely BERT Base and BERT Large, to evaluate our SP. We also verified the convergence performance of SP in Appendix. Since we use the original model and only change the system, the accuracy should be the same; the slight differences come from randomness. Due to the page limit, we use BERT Base as the model to validate our system except for weak scaling; the results for BERT Large can be found in Appendix.

Maximum batch size
Since our SP is memory-efficient enough to handle larger batch sizes, we first investigated the maximum batch size we can reach with SP. In this section, for a comprehensive comparison, we scaled with both TP and SP and added PP to evaluate the performance of the BERT Base model. We used tokens per second as the metric for throughput. To this end, we trained the model for 150 iterations in total and calculated the mean number of tokens processed per second within the last 100 iterations.
[Figure caption: Scaling with sequence/tensor parallelism]
Scaling with sequence/tensor parallelism We fixed all hyper-parameters except the batch size and the TP or SP size. We trained the model with a sequence length of 512 and without PP. The TP size in Megatron is limited by the number of attention heads and the hidden size, because both hyperparameters must be divisible by the TP size. Of the two, the number of attention heads is the smaller, so it is the one that limits TP. Thus, the TP size is at most 12 for the BERT Base model in Megatron. In contrast, for our SP, only the sequence length must be divisible by the SP size, so we can scale SP to a larger size, since the sequence length is a much larger hyperparameter than the number of attention heads.
Our SP outperforms TP in terms of memory consumption. Figure 3a shows that our system on 64 GPUs can achieve a 13.7× larger batch size than Megatron on 12 GPUs. Even if we combine DP and TP to scale Megatron up to 64 GPUs, our system would still support a larger batch size. In Figure 3b, we can observe that SP achieved comparable throughput at the same parallel size, and that our system can extend to a larger parallel size to achieve better performance.
Scaling with pipeline parallelism To verify the compatibility with PP, we fixed the TP and SP size as 4 and scaled the pipeline parallel size. We can observe in Figure 4a that SP still outperforms TP on the maximum batch size. SP also achieved higher throughput when using more pipeline stages, as shown in Figure 4b. This is because Megatron incurs extra communication costs between pipeline stages. Megatron holds the activation for the full sequence on each device. Thus, when sending the activation between pipeline stages, it needs to split the activation, transmit the partial activation to the next device, and gather the partial activations back. This incurs less communication overhead than transmitting the whole activation between pipeline stages. However, it still brings more communication cost than ours, as no splitting or all-gather operation is needed for our sub-sequence intermediate activation. Therefore, our SP achieved better throughput when scaling along the pipeline parallel size.
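The divisibility constraints described above can be checked with a few lines of Python. The sketch enumerates the parallel sizes that divide every given dimension; it uses the standard BERT Base shape (12 heads, hidden size 768) and the sequence length of 512 from this experiment, and the helper name is hypothetical.

```python
from math import gcd


def valid_parallel_sizes(*dims):
    # A parallel size is valid if it divides every listed dimension,
    # i.e. if it divides their greatest common divisor.
    g = 0
    for d in dims:
        g = gcd(g, d)
    return [p for p in range(1, g + 1) if g % p == 0]


# TP: both the number of heads and the hidden size must be divisible.
tp_sizes = valid_parallel_sizes(12, 768)
# SP: only the sequence length must be divisible.
sp_sizes = valid_parallel_sizes(512)
```

Under these assumptions the largest valid TP size is 12, whereas SP sizes such as 64 remain valid because they only need to divide the sequence length.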

Maximum sequence length
Sequence parallelism is designed for training Transformer-based models with longer input sequences, so we investigated the maximum sequence length it can handle. Similarly, we still compared with TP with and without PP. We fixed the batch size as 64 for BERT Base, and no PP was used.
We show the maximum sequence length of the BERT Base model in Figure 5. If we scale up to 64 GPUs, we can achieve around 3× maximum sequence length on BERT Base. Another observation is that splitting along the number of attention heads limits the input sequence length of tensor parallelism in Megatron, whereas our sequence parallelism can scale easily by splitting a sequence into multiple chunks. When using the same 16 GPUs, our sequence parallelism can still achieve a 1.4 times larger sequence length than tensor parallelism. The gap is expected to widen if we use 32GB GPUs instead of 16GB GPUs. Also, in Appendix, we investigate the maximum sequence length our system can handle when we use a smaller batch size. Our RSA focuses on full self-attention in this paper. According to Table 2, when we use sparse attention with linear memory usage, our SP is theoretically expected to handle infinitely long sequences, because three terms of the memory usage include L/N. We leave this as future work.
Table 3: Weak scaling results. P is the tensor or sequence parallel size. B and S are global batch size and sequence length, respectively. M and T denote max allocated memory/MB and tokens processed per second. OOM means that a CUDA out-of-memory error occurs.

Weak scaling
Strong scaling limits the upper bound of batch size and sequence length within a single device, so we mainly discuss weak scaling on BERT Large here. We scale the batch size and sequence length separately when increasing the number of nodes. We fixed the PP size as 8. In Table 3, SP achieved almost constant memory usage when scaling along with the global batch size, which outperforms TP by a large margin. As for weak scaling along the sequence length, our method still uses much less memory with comparable throughput.

Discussion
GShard and GSPMD are two libraries built for the TensorFlow community to partition model parameters in distributed training; GSPMD is developed based on GShard. These two methods rely on the static computation graph of TensorFlow to train larger models, while we provide a plug-and-play tool based on PyTorch's dynamic computation graph to train on longer sequences. The difference in the computation paradigms makes them unsuitable as our baseline.

Conclusion
In this paper, we proposed sequence parallelism for training Transformer-based models with longer sequences. Sequence parallelism is designed to break the limitation of sequence length on a single device (i.e., GPU). We have shown that sequence parallelism can handle longer sequences and is more memory-efficient than the SoTA. In particular, sequence parallelism achieves 3.0× the maximum sequence length and 13.7× the maximum batch size of tensor parallelism when scaling up to 64 GPUs. Unlike both tensor and pipeline parallelism, sequence parallelism is not limited by the smaller hyperparameters (e.g., number of attention heads, number of layers). Therefore, our sequence parallelism can be applied as long as the sequence length is divisible by the sequence parallel size. We used a language model (i.e., BERT) to evaluate our system. However, sequence parallelism can also be adapted to computer vision tasks. This work paves the way to processing large images [Hou et al., 2019] with ViT [Dosovitskiy et al., 2020], as a larger image means more patches or longer sequences. In the future, we plan to integrate data, pipeline, tensor and sequence parallelism to construct 4D parallelism. This would enable us to train extremely large models with very long sequences.

Self-attention
For parallel computing, $q$, $k$ and $v$ of all tokens are combined into three matrices: $Q$, $K$ and $V$. The self-attention of an input sentence $X$ is computed by the following formula:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
where $d_k$ is the dimension of the key. Multi-head attention is designed to jointly consider the information from different subspaces of embedding. Compared with the self-attention above, multi-head attention has $h$ query, key and value embeddings instead of a single one, where $h$ denotes the number of heads. We obtain these embeddings with identical shapes by linear transformations. The multi-head attention can be described as:
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)W^O, \quad \mathrm{head}_i = \mathrm{Attention}(QW^Q_i, KW^K_i, VW^V_i)$$

Communication overhead
Megatron-LM uses all-reduce in its MLP layer and self-attention layer, while the communication overhead in sequence parallelism mainly lies in the self-attention layer. Using the same notation as given in Section 3, we can calculate the amount of data transferred in sequence parallelism and tensor parallelism.
In sequence parallelism, calculating $QK^T$ and calculating $AV$ (where $A$ in $AV$ denotes the attention score matrix) each involve one collective communication in the forward pass and two collective communications in the backward pass. For each of the two calculations, the amount of data transferred is $2(N-1) \cdot B \cdot Z \cdot (L/N) \cdot A$ in the forward pass and $4(N-1) \cdot B \cdot Z \cdot (L/N) \cdot A$ in the backward pass. The combined amount of data transferred in calculating $QK^T$ and $AV$ is therefore $12(N-1) \cdot B \cdot Z \cdot (L/N) \cdot A$.
In the tensor parallelism of Megatron-LM, the amount of data transferred per collective communication is the same in the forward pass and the backward pass, namely $2(N-1) \cdot B \cdot Z \cdot (L/N) \cdot A$. Since there are 4 collective communications in total in the forward and backward passes of the MLP layer and the self-attention layer, the total communication cost is $8(N-1) \cdot B \cdot Z \cdot (L/N) \cdot A$.
Thus, sequence parallelism has 1.5 times the communication overhead of tensor parallelism in Megatron-LM. However, sequence parallelism has better compatibility with pipeline parallelism. In tensor parallelism, to save communication bandwidth between pipeline stages, which are often on different nodes, the tensor is split before being transmitted to the next stage and all-gathered after transmission. In sequence parallelism, as the tensor has already been split along the sequence dimension, there is no need to split and all-gather between pipeline stages. Thus, sequence parallelism requires one less all-gather operation per pipeline stage.
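The communication volumes above can be tallied with a short script that mirrors the stated counts: 2 and 4 units per pass for each of the two RSA steps, and 4 collectives of 2 units each for tensor parallelism. The function names and the example configuration are illustrative assumptions.

```python
def unit_volume(N, B, Z, L, A):
    # Common factor (N - 1) * B * Z * (L / N) * A; assumes N divides L.
    return (N - 1) * B * Z * (L // N) * A


def sp_comm_volume(N, B, Z, L, A):
    u = unit_volume(N, B, Z, L, A)
    forward, backward = 2 * u, 4 * u  # per ring step
    return 2 * (forward + backward)   # key step and value step


def tp_comm_volume(N, B, Z, L, A):
    u = unit_volume(N, B, Z, L, A)
    return 4 * (2 * u)  # 4 collectives over the MLP and attention layers


cfg = dict(N=4, B=64, Z=12, L=512, A=64)  # hypothetical configuration
sp, tp = sp_comm_volume(**cfg), tp_comm_volume(**cfg)
ratio = sp / tp
```

Because both totals share the same common factor, the ratio does not depend on the chosen configuration.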

Convergence performance
We verified the convergence performance of sequence parallelism. We used the Wikipedia dataset [Devlin et al., 2018] and evaluated Megatron and our model on the development set every 1k iterations. We trained the BERT Large model for 50k iterations with the default hyper-parameters used by Megatron. Our goal here is to verify the correctness of our implementation, so we trained the model for fewer steps. We set the parallel size as 4 for tensor parallelism in Megatron and for sequence parallelism in our model. Pipeline parallelism was not used for either model. In Figure 6, our sequence parallelism shows good convergence on both the masked language modeling (MLM) loss and the sentence order prediction (SOP) loss. Compared with Megatron, sequence parallelism has a similar convergence trend and achieved lower values for both the MLM loss and the SOP loss over 50k iterations.

Maximum batch size on BERT Large
We show experiments on the BERT Large model here. Similar to the main text, we trained the model for 150 iterations in total and calculated the mean number of tokens processed per second within the last 100 iterations.
Scaling with sequence/tensor parallelism The only difference in the BERT Large setting is that the tensor parallel size is at most 16 for the BERT Large model in Megatron-LM.
Our method achieved a 2.7 times larger batch size for BERT Large on 16 GPUs, as shown in Figure 7a, and the batch size of sequence parallelism on 64 GPUs is 10.2 times larger than that of tensor parallelism on 16 GPUs. In Figure 7b, we can observe that our sequence parallelism achieved comparable throughput at the same parallel size, and that our system can extend to a larger parallel size to achieve better performance.
Scaling with pipeline parallelism For BERT Large, sequence parallelism still outperforms tensor parallelism on the maximum batch size, as shown in Figure 8a. Sequence parallelism also achieves higher throughput when using more pipeline stages, as shown in Figure 8b.

Maximum sequence length on BERT Large
Similarly, we compared with tensor parallelism with and without pipeline parallelism for BERT Large. We fixed the batch size as 16 for BERT Large and did not use pipeline parallelism. As shown in Figure 9, when we scale up to 64 GPUs, we can achieve around 2× maximum sequence length on BERT Large and scale better by splitting a sequence into multiple chunks. Also, to investigate the maximum sequence length our system can handle on the cluster with 64 P100 GPUs, we set both the data and pipeline parallel size as 1 and the global batch size as 16. Please note that we set the batch size as 64 in Section 4.3. We select BERT Base as the Transformer-based model. As shown in Figure 10, our sequence parallelism can even handle sequences with over 5000 tokens using full multi-head attention.