PermuteFormer: Efficient Relative Position Encoding for Long Sequences

A recent variant of Transformer, Performer, scales Transformer to longer sequences with a linear attention mechanism. However, it is not compatible with relative position encoding, which has advantages over absolute position encoding. In this paper, we discuss possible ways to add relative position encoding to Performer. Based on the analysis, we propose PermuteFormer, a Performer-based model with relative position encoding that scales linearly on long sequences. PermuteFormer applies a position-dependent transformation on queries and keys to encode positional information into the attention module. This transformation is carefully crafted so that the final output of self-attention is not affected by the absolute positions of tokens. PermuteFormer introduces negligible computational overhead by design, so it runs as fast as Performer. We evaluate PermuteFormer on Long-Range Arena, a benchmark for long sequences, as well as WikiText-103, a language modeling dataset. Experiments show that PermuteFormer uniformly improves the performance of Performer with almost no computational overhead and outperforms vanilla Transformer on most of the tasks.


Introduction
The Transformer architecture (Vaswani et al., 2017) has achieved state-of-the-art results in various fields of research, including natural language processing (Devlin et al., 2019; Raffel et al., 2020), speech processing (Baevski et al., 2020) and image processing (Dosovitskiy et al., 2020; Tan and Bansal, 2019). But Transformer does not scale well to long sequences, because the time and memory complexity of its attention module are both quadratic in the sequence length. Recently, several efficient Transformers (Kitaev et al., 2020; Wang et al., 2020; Zaheer et al., 2020; Xiong et al., 2021) have been proposed that reduce the model from quadratic to linear complexity without significant performance loss. Generally, they utilize efficient algorithms to approximate attention. §2 briefly introduces these efficient Transformers; a more thorough review can be found in Tay et al. (2020c).
Among these efficient Transformers, Performer is suggested to be the fastest one (Tay et al., 2020b). In this paper, we denote as Performer the whole family of kernel-based efficient Transformers, e.g., Katharopoulos et al. (2020); Peng et al. (2021); Kasai et al. (2021), not only the original model itself. Performer utilizes the kernel method to avoid explicit calculation of attention weights. It applies a non-linear feature map to queries and keys to obtain query features and key features respectively, and then multiplies query features, key features, and values together directly, without applying softmax. With an appropriate ordering of matrix multiplications, Performer achieves complexity linear in the sequence length. Moreover, some implementations of unidirectional Performer even reduce the memory footprint to a constant at both training and inference time.
Although Performer accelerates attention to linear complexity, existing relative position encodings (Shaw et al., 2018; Dai et al., 2019; Raffel et al., 2020) still have quadratic complexity with respect to the sequence length. So Performer cannot benefit from relative position encoding, which has already become common practice in state-of-the-art Transformers (Yang et al., 2019; Raffel et al., 2020; He et al., 2020). Relative position encoding has several advantages over absolute position encoding. (1) It can be applied to sequences of arbitrary length, with no limitation imposed by the training data. (2) It is more efficient and effective than absolute position encoding (Shaw et al., 2018). Besides Performer, existing relative position encodings also do not fit other efficient Transformers. Some (Raffel et al., 2020) add a bias to the attention matrix, and others (Shaw et al., 2018; Dai et al., 2019) add a relative-position-dependent bias to key vectors. Both require explicit calculation of dot-products between query vectors and key vectors. This conflicts with the second and third categories of efficient Transformers described in Section 2, which reduce computational complexity precisely by avoiding the explicit calculation of those dot-products. As for the first category, the LSH in Kitaev et al. (2020) may fail to locate major attention weights in the presence of relative position encoding, while Zaheer et al. (2020) and Beltagy et al. (2020) rely heavily on global tokens, whose relative positions to other tokens are not defined.
In this paper, we propose a Performer-compatible relative position encoding that scales linearly on long sequences. Performer with this novel relative position encoding is named PermuteFormer. PermuteFormer applies a position-aware transformation on query features and key features to encode positional information. More specifically, we choose a random permutation π : {1, 2, · · · , d} → {1, 2, · · · , d}, where d is the dimension of query / key features per attention head, and apply the permutation i times to the i-th token's query / key feature. In this way, positional information is encoded into the attention weights. We prove that, although the transformation applied to a token's query feature and key feature depends on its absolute position, the effects of absolute position on query features and key features cancel each other out when their dot-product is calculated. Thus, the final attention weights do not depend on absolute positions, and PermuteFormer encodes relative position only.
PermuteFormer is as efficient as Performer, with negligible computational overhead. Permuting query features and key features can be implemented efficiently, with computational cost proportional to their size. Since this cost is far less than the overall computational cost of the model, the permutation in PermuteFormer is negligible compared to the total cost of Performer. This analysis is also confirmed by the experimental results.
We evaluate PermuteFormer on Long-Range Arena (Tay et al., 2020b) for the bidirectional case and on WikiText-103 (Merity et al., 2017) for the unidirectional case. Long-Range Arena is a benchmark designed to evaluate efficient Transformers on long sequences. We find that the new relative position encoding improves the performance of PermuteFormer significantly on Long-Range Arena. It not only performs better than Performer but also outperforms the vanilla Transformer, as well as other efficient Transformers, e.g., Kitaev et al. (2020); Wang et al. (2020); Xiong et al. (2021). WikiText-103 is a language modeling dataset. PermuteFormer reduces the performance gap between Performer and Transformer on WikiText-103. It also speeds up the convergence of the model.

Contributions
The main contributions of this paper are summarized as follows.
• We discuss possible ways to add relative position encoding to Performer. We theoretically derive three properties that a Performer-compatible relative position encoding should satisfy.
• We introduce PermuteFormer, a Performer model with relative position encoding that scales linearly to long sequences. It permutes elements of query features and key features to encode positional information. It is, as far as we know, the only Performer-compatible relative position encoding with linear complexity. PermuteFormer is as efficient as Performer.

Figure 1: Attention modules of Transformer, Performer, and PermuteFormer. Although attention is multi-headed in all of them, only one head is illustrated for clarity. Transformer applies softmax to the dot-products of queries and keys to get the attention matrix, and then multiplies the attention matrix and values to obtain the outputs of the attention module. Performer applies a feature map, a non-linear projection, to queries and keys to get query features and key features. Then, it multiplies query features, key features and values from right to left. PermuteFormer applies a position-aware permutation to query features and key features first, and then does the multiplications the same way as Performer. Each token's query / key feature is illustrated as a row of blocks in the figure, and its elements are marked with different colors. The position-aware permutation permutes the elements of each token's query / key feature along the head size dimension in each attention head. Depending on the token's position, the permutation applied to the query / key feature differs. Note that for Performer and PermuteFormer, only the numerator in Equation 11 is illustrated, as the denominator is simpler than the numerator.
Related Work

Existing efficient Transformers can be roughly divided into three categories. The first category omits the calculation of part of the attention matrix, exploiting its sparsity. Kitaev et al. (2020) groups queries into buckets by locality-sensitive hashing and computes intra-bucket attention weights only. Zaheer et al. (2020) and Beltagy et al. (2020) limit the attention matrix to specific sparse shapes. The second category lowers the matrix rank to reduce computation. Wang et al. (2020) projects keys and values to a constant length independent of the sequence length. Tay et al. (2020a) generates attention weights without keys. The third category, named Performer in this paper, leverages kernel methods to speed models up. Peng et al. (2021) and others view attention weights as a kernel function of queries and keys, so that they can be approximated by random features. Katharopoulos et al. (2020) adopts a simple non-negative feature map to linearize attention.

Concurrent Work Su et al. (2021) introduce RoFormer with a new kind of relative position encoding named RoPE, which is interoperable with Performer. Briefly, RoPE is a multiplicative sinusoidal absolute position embedding that rotates query (feature) vectors and key (feature) vectors according to their positions.
However, to make RoPE independent of absolute position, they sacrifice the property of attention matrices that every row sums to one. Moreover, they only discuss the possibility of integrating RoPE with Performer; no experimental result is reported on such a model.
On the other hand, PermuteFormer's position encoding preserves the property of attention matrices mentioned above. In this paper, we compare the performance of PermuteFormer with RoFormer through experiments. The results show that PermuteFormer fits the data better than RoFormer.

Methods
We propose an efficient relative position encoding that is compatible with the Performer architecture. Performer with this new relative position encoding is named PermuteFormer, because it permutes elements of the query feature and key feature to encode positional information. The differences among vanilla Transformer, Performer and PermuteFormer are illustrated in Figure 1.
In this section, we first introduce Transformer and Performer briefly, and then describe details of PermuteFormer. For brevity and clarity, discussions in this section focus on a single head in multi-head attention. They can be directly applied to the whole multi-head attention.

Transformer and Performer
We briefly introduce the attention modules of Transformer and Performer in this section. Other parts of the Transformer architecture (Vaswani et al., 2017) are omitted, as they are unmodified in Performer and PermuteFormer.
The attention module in Transformer maps a sequence of vectors to another sequence of vectors with the same length L. In the attention module, the input vectors x_i are first linearly mapped to three representations, named query, key and value. Formally,

q_i = W_q x_i,  k_i = W_k x_i,  v_i = W_v x_i,   (1)

where W_q, W_k, W_v are transformation matrices for query, key and value, respectively. Then, similarities between queries and keys are calculated and normalized to produce attention weights

α_ij = sim(q_i, k_j) / Σ_{j'=1}^{L} sim(q_i, k_{j'}),   (2)

where sim(q_i, k_j) is the similarity of vector q_i and vector k_j. Finally, output vectors x_i^out are obtained as the weighted sum of values with weights {α_ij}_{i,j=1}^{L}:

x_i^out = Σ_{j=1}^{L} α_ij v_j.   (3)
Vanilla Transformer (Vaswani et al., 2017) adopts the following function as the similarity metric of queries and keys:

sim(q_i, k_j) = exp(q_i^⊤ k_j / √d).   (4)

To reduce computation and memory cost, Performer approximates the similarity function with the kernel trick:

sim(q_i, k_j) ≈ φ(q_i)^⊤ φ(k_j),   (5)
where φ(·) is a non-linear feature map from R d to R m for some model-specific m, so that the attention module can be expressed as follows.
x_i^out = Σ_{j=1}^{L} φ(q_i)^⊤ φ(k_j) v_j / Σ_{j=1}^{L} φ(q_i)^⊤ φ(k_j) = φ(q_i)^⊤ (Σ_{j=1}^{L} φ(k_j) v_j^⊤) / (φ(q_i)^⊤ Σ_{j=1}^{L} φ(k_j)).   (6)

We call φ(q_i) the query feature and φ(k_i) the key feature.
In this way, the O(L^2) attention weight matrix is never explicitly calculated, so the attention module costs only O(L) time and memory, rather than the O(L^2) complexity of vanilla Transformer. Different Performers differ in the choice of the feature map φ(·). A simple working choice is the ReLU function φ(x) = max(x, 0).
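The reordering in Equation 6 is the core of Performer's speedup and can be checked numerically. Below is a minimal NumPy sketch; the function name and the ReLU-plus-ε feature map are illustrative choices, with ε keeping the denominator nonzero as in our experimental setup:

```python
import numpy as np

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0.0) + 1e-3):
    # Q, K: (L, m) queries / keys; V: (L, d_v) values.
    Qf, Kf = phi(Q), phi(K)              # query features and key features
    KV = Kf.T @ V                        # (m, d_v): sum_j phi(k_j) v_j^T
    Z = Kf.sum(axis=0)                   # (m,):    sum_j phi(k_j)
    # Multiplying right to left avoids the L x L attention matrix: O(L * m * d_v).
    return (Qf @ KV) / (Qf @ Z)[:, None]
```

Computing `phi(Q) @ phi(K).T` first would reproduce the quadratic attention matrix; the two orderings give identical outputs up to floating-point error.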

Relative Position Encoding for Performer
In this section, we discuss adding relative position encoding to Performer. We choose to modify the similarity function (Equation 5) to encode positional information. Specifically, we introduce an additional layer of position-dependent linear transformation over query features and key features. Now, the similarity function becomes

sim(q_i, k_j) = (M_i φ(q_i))^⊤ (N_j φ(k_j)),   (7)

where M_i, N_j ∈ R^{m×m} are matrices parameterized by the token positions i and j.
To ensure that the similarity function depends only on relative positions rather than absolute ones, M_i and N_j must satisfy the following property.

Property 1. For all i, j, k ∈ Z,

M_i^⊤ N_j = M_{i+k}^⊤ N_{j+k}.   (8)
To prevent the similarity function from exploding as the sequence length grows, we further require boundedness.

Property 2. The similarity sim(q_i, k_j) remains bounded over all valid attention positions (all i, j for bidirectional models, and j ≤ i for unidirectional models).

Additionally, the similarity function should be positive; otherwise, the model would be numerically unstable. If the similarity function alternates between positive and negative values, the denominator in Equation 2 may be zero while its numerator is not, driving the output of the attention module toward infinity. To keep the similarity function positive, one simple but effective solution is to make all elements of query features and key features positive (Katharopoulos et al., 2020), which leads to our third property.

Property 3. The transformations M_i and N_j preserve the positivity of query features and key features, i.e., they map entrywise-positive vectors to entrywise-positive vectors.
We prove that M_i and N_j must be of a specific form to fulfill the requirement of Property 1.

Proposition 1. Let {M_i}_{i=−∞}^{∞} be a series of l × m matrices and {N_i}_{i=−∞}^{∞} be a series of l × n matrices. Then, M_i^⊤ N_j depends only on i − j, if and only if there exist an integer l′, matrices R ∈ R^{l′×m}, Q ∈ R^{l′×n}, and an invertible matrix P ∈ R^{l′×l′}, such that

M_i = (P^⊤)^{−i} R,  N_j = P^j Q.   (9)

The proof is given in the Appendix. Although this proposition does not impose any additional constraint on M_i and N_j, it suggests that effectively we only need to consider the case where M_i and N_j are powers of a single invertible matrix applied to fixed matrices R and Q.

PermuteFormer
Based on the analysis of the previous section, we introduce PermuteFormer by selecting specific P, Q, R in Equation 9. To meet the constraints imposed by Property 2 and Property 3, we choose the following solution for PermuteFormer:

R = Q = I,  P = r^{−1} P_π,   (10)

where r = 1 for bidirectional models and 0 < r < 1 for unidirectional models, π : {1, 2, · · · , m} → {1, 2, · · · , m} is a permutation, and P_π is the corresponding permutation matrix. (A permutation matrix is a square binary matrix that has exactly one entry of 1 in each row and each column and 0s elsewhere. For a permutation π, the corresponding permutation matrix P_π is the matrix with P_π,ij = 1 if π(i) = j and P_π,ij = 0 otherwise.) Note that different attention heads may have different P_π and r, so that both long-term and short-term dependencies are captured. Since P_π is orthogonal, (P^⊤)^{−i} = r^i P_π^i. Substituting Equations 9 and 10 into Equation 7, we get the attention output of PermuteFormer:

x_i^out = Σ_j r^{i−j} (P_π^i φ(q_i))^⊤ (P_π^j φ(k_j)) v_j / Σ_j r^{i−j} (P_π^i φ(q_i))^⊤ (P_π^j φ(k_j)).   (11)

Because (P_π^i)^⊤ P_π^j = P_π^{j−i}, the attention weights depend only on relative positions. PermuteFormer can encode relative positions up to the order of the permutation π. Goh and Schmutz (1991) prove that the order of a random permutation grows exponentially with the head size. For a model of the same size as BERT-base (Devlin et al., 2019), the dimension of queries / keys per attention head is 64, corresponding to an average order of over 3000. To further extend PermuteFormer's ability to encode long sequences, we choose different permutations for different attention heads, so that the longest distance PermuteFormer can encode is the least common multiple of all the permutations' orders, which can be up to 10^27 for a model with head size 64.
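The cancellation of absolute positions in Equation 11 can be verified numerically. The sketch below (a random π and r = 0.95 are illustrative choices) applies π to a token's feature once per position index and checks that the similarity is unchanged when both positions are shifted by the same offset:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 8
perm = rng.permutation(m)   # a random permutation pi over feature dimensions
r = 0.95                    # decay for the unidirectional case

def permute_feature(f, pos):
    # Apply pi `pos` times, i.e. multiply by the permutation matrix P_pi^pos.
    for _ in range(pos):
        f = f[perm]
    return f

def sim(q_feat, k_feat, i, j):
    # One summand of Equation 11: r^(i-j) * (P_pi^i q)^T (P_pi^j k).
    return r ** (i - j) * permute_feature(q_feat, i) @ permute_feature(k_feat, j)

q, k = rng.random(m), rng.random(m)     # positive features (Property 3)
# Shifting both positions by the same offset leaves the similarity unchanged:
assert np.isclose(sim(q, k, 5, 2), sim(q, k, 9, 6))
```

The assertion holds because permuting both vectors the same number of extra times preserves their dot-product, while the decay factor depends only on i − j.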
PermuteFormer introduces two additional parameters, π and r. As π is a discrete parameter that cannot be optimized by gradient-based methods, we treat it as a hyper-parameter of the model. We randomly sample π at the initialization of the neural network and fix its value during the whole training process. Although the model might perform better if π were trained, we find that a random permutation is good enough for PermuteFormer to work, so we do not tune π, to save energy. The parameter r, on the other hand, could be optimized by gradient-based methods, but we also treat it as a hyper-parameter.
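The order of a permutation is the least common multiple of its cycle lengths, which is what makes the average order for head size 64 exceed 3000, and the least common multiple across heads far larger. A small helper (hypothetical, not part of the paper's code) computes the order:

```python
import math
from functools import reduce

def permutation_order(perm):
    # perm[i] = pi(i). The order of pi is the lcm of its cycle lengths.
    n = len(perm)
    seen = [False] * n
    lengths = []
    for start in range(n):
        if seen[start]:
            continue
        length, cur = 0, start
        while not seen[cur]:        # walk one cycle of the permutation
            seen[cur] = True
            cur = perm[cur]
            length += 1
        lengths.append(length)
    return reduce(math.lcm, lengths, 1)

# Cycles (0 1) and (2 3 4) give order lcm(2, 3) = 6.
assert permutation_order([1, 0, 3, 4, 2, 5]) == 6
```

Sampling a fresh random permutation per attention head and taking the least common multiple of the resulting orders reflects the multi-head construction described above.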

Computational Cost
We analyze the computational cost of PermuteFormer in this section. PermuteFormer is as fast as Performer, which is, to our knowledge, the most efficient Transformer (Tay et al., 2020b).
Let L denote the length of the sequence, H denote the number of heads in the model, and m denote the per-head hidden dimension of query features and key features.
The computational overhead introduced by PermuteFormer includes the computation of P_π^i, the application of the linear transformation P_π^i to query features and key features, as well as the calculation of powers of r.
Multiplication of permutation matrices is equivalent to composition of the corresponding permutations. In our case, this reads

P_π^i = P_{π^i},

where π^i is the i-th power of the permutation π, defined by π^i(x) = π(π^{i−1}(x)) and π^0(x) = x.
We can compute these π^i and cache them before training and inference. This takes O(LHm) time and O(LHm) memory. As P_π^i is a permutation matrix, there is no need for a cumbersome matrix-vector multiplication. Instead, a gather operation on query features and key features is enough. The memory and time complexity of this gather operation is equal to the size of the query features and key features, i.e., O(LHm).
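The caching and gather steps above can be sketched as follows, assuming a NumPy implementation with illustrative function names; applying P_π^i per position is just an index lookup, never a matrix multiplication:

```python
import numpy as np

def precompute_permutations(perm, max_len):
    # idx[i] stores pi^i as an index array; idx[0] is the identity.
    m = len(perm)
    idx = np.empty((max_len, m), dtype=np.int64)
    idx[0] = np.arange(m)
    for i in range(1, max_len):
        idx[i] = idx[i - 1][perm]   # compose pi with pi^(i-1)
    return idx

def apply_position_permutation(features, idx):
    # features: (L, m). Row t is permuted by pi^t via a gather, costing O(L * m).
    L = features.shape[0]
    return np.take_along_axis(features, idx[:L], axis=1)
```

In a real model the table would be computed once per head and reused across layers, batches, and training steps.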
Powers of the scalar r can be calculated easily. Thus, the total overhead introduced by PermuteFormer is O(LHm). Since the complexity of attention in Performer is O(LHm^2), this overhead is negligible.

Trick for Two-Dimensional Case
As Transformer-based models are getting popular in fields other than natural language processing these days, it is worth noting that PermuteFormer is also applicable to 2D inputs like images and multi-modal documents (Xu et al., 2020).
One naive way to deal with two-dimensional inputs is to follow the convention of the benchmark (Tay et al., 2020b). Pixels in the 2D space are first flattened to a 1D sequence before being fed into the model. However, this causes problems for relative position encoding. It makes the rightmost pixel in the first row adjacent to the leftmost pixel in the second row, so the relative position of these two distant pixels is extremely close in the 1D sequence, which is incorrect. It is almost impossible for the model to learn something meaningful from the wrong relative positions.
To remedy this, we adapt PermuteFormer's attention for 2D inputs. We permute some elements of the query / key feature according to a pixel's horizontal position, and the others according to its vertical position. More precisely, we modify the permuted features in Equation 11 as follows:

sim(q_i, k_j) = (P_{π_x}^{x_i} P_{π_y}^{y_i} φ(q_i))^⊤ (P_{π_x}^{x_j} P_{π_y}^{y_j} φ(k_j)),   (12)

where (x_i, y_i) and (x_j, y_j) are the coordinates of the i-th and j-th pixels, respectively, and π_x and π_y are two permutations commutative with each other.
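One simple way to obtain two commuting permutations is to let π_x and π_y act on disjoint halves of the feature vector; this split is an illustrative assumption rather than necessarily the exact construction used in the paper. A minimal NumPy sketch:

```python
import numpy as np

def permute_feature_2d(feat, x, y, perm_x, perm_y):
    # First half of the feature follows the pixel's horizontal position x,
    # second half its vertical position y; acting on disjoint halves
    # guarantees that the two permutations commute.
    half = feat.shape[0] // 2
    fx, fy = feat[:half].copy(), feat[half:].copy()
    for _ in range(x):
        fx = fx[perm_x]
    for _ in range(y):
        fy = fy[perm_y]
    return np.concatenate([fx, fy])
```

Because the order of applying the horizontal and vertical permutations does not matter, the resulting similarity depends only on the relative offsets (x_i − x_j, y_i − y_j).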

Experiments
We evaluate bidirectional PermuteFormer on Long-Range Arena, which consists of several long-sequence tasks. Unidirectional PermuteFormer is evaluated on WikiText-103, a language modeling task.

Long-Range Arena
Long-Range Arena (Tay et al., 2020b) is a benchmark for efficient Transformers. It concentrates on efficient Transformers' performance on long sequences. The benchmark consists of five subtasks from various domains: byte-level text classification, byte-level document retrieval, image classification on sequences of pixels, Pathfinder, and long ListOps. We follow the evaluation protocol of Tay et al. (2020b), except that we exclude the long ListOps task from the benchmark, because a simple classifier on the first token performs on par with the best model reported in Tay et al. (2020b). Of the four selected tasks, image classification has 10 labels, while the others are binary classification tasks.

Setup and Implementations
We compare our PermuteFormer with the vanilla Transformer and Performer. A version of Su et al. (2021) is also implemented on Performer for comparison. In addition, we list the performance of other efficient Transformers from Xiong et al. (2021), including Reformer (Kitaev et al., 2020), Linformer (Wang et al., 2020) and Nyströmformer (Xiong et al., 2021). Conventional relative position encodings, such as Shaw et al. (2018), are not included, as it is almost computationally infeasible to apply them to such long sequences. For efficiency, we choose a simple feature map for both Performer and PermuteFormer. A small constant ε is added to the features to ensure that the denominator in Equation 2 is not zero. We set ε = 0.001.
In this paper, all neural networks are trained from scratch. Learning rates are manually tuned on Transformer to match the results reported by other papers. Then, these hyper-parameters are kept fixed when training Performer and PermuteFormer. Model sizes are the same as those described in Tay et al. (2020b). The hidden dimension of query features and key features is four times that of queries and keys. Absolute position embedding is disabled for PermuteFormer. Models are optimized with Adam (Kingma and Ba, 2015). More details of the hyper-parameters can be found in the appendix. Each experiment is run five times and the average accuracy is reported. Experiments are done on machines with 8 V100 GPUs.

Results
Performance The results are summarized in Table 1. It shows that the relative position encoding in PermuteFormer significantly improves the performance of Performer in all the tasks, including both language tasks and vision tasks. It not only achieves better accuracy than existing efficient Transformers without relative position encoding, but also performs better than vanilla Transformer, as well as Performer with Su et al. (2021)'s relative position encoding.
Efficiency We record the training time of each model on all the tasks, as well as their inference latency. The results are listed in Table 2 and Table 3. They show that Performer runs around two to three times faster than Transformer. The second and third lines of the table indicate that PermuteFormer's speed is almost the same as that of Performer. This aligns with our analysis in §3.4 that the overhead of PermuteFormer is negligible compared to the computational cost of Performer itself.
We take T5 (Raffel et al., 2020) as an example to illustrate that existing relative position encodings are computationally infeasible for long sequences. We train Performer with T5's relative position encoding for a few iterations to estimate the running time of one epoch. The result is shown in the last line of Table 2. It indicates that T5's relative position encoding is significantly slower than Transformer, not to say Performer.

Table 4: Perplexity on WikiText-103. Transformer (Vaswani et al., 2017): 30.18.

Ablation Study
We evaluate whether 2D relative position encoding is useful for PermuteFormer. We train PermuteFormer with 1D relative position encoding, and the result is shown in the second-to-last line of Table 1. As expected, its performance drops significantly on tasks with 2D inputs. Thus, 1D relative position encoding is harmful to vision tasks, as discussed in §3.5. We also verify that Property 3 is necessary for PermuteFormer, i.e., the transformation should preserve the positivity of query features and key features. We train a PermuteFormer with the permutation matrix P_π replaced by a random orthogonal matrix. The result is listed in the last line of Table 1: PermuteFormer without Property 3 does not converge on most of the tasks.

WikiText-103
We evaluate unidirectional PermuteFormer on WikiText-103 (Merity et al., 2017). It is a language modeling dataset with about 103 million tokens extracted from verified articles on Wikipedia.

Setup and Implementations
We compare PermuteFormer with the vanilla Transformer and Performer. Models are implemented with fairseq (Ott et al., 2019). We adopt the hyper-parameters suggested by fairseq: 6 layers, a hidden dimension of 512, a feed-forward dimension of 1024, and 8 attention heads. The feature map is the same as in the Long-Range Arena experiments. r takes its value in [0.88, 0.99]. For comparison with absolute position encoding, we set the sequence length to 512. Perplexity is measured on the test set. To avoid predicting tokens with little context at the beginning of a sequence, only the last 256 tokens are counted in the results. The effects of r and P_π are measured separately through ablation studies, i.e., removing r or P_π in Equation 11.

Results
The results for WikiText-103 are listed in Table 4. We also plot the perplexity curves during training in Figure 2. They show that PermuteFormer narrows the performance gap between Transformer and Performer. It also speeds up the convergence of the model.
The last two lines of Table 4 indicate that the performance of PermuteFormer drops without r or P_π. Thus, both r and P_π are crucial for PermuteFormer. r may help PermuteFormer focus on local context, while P_π is responsible for encoding relative positional information.

Conclusions
We discuss possible ways to add relative position encoding to Performer, a family of efficient Transformers that scales linearly. Based on the analysis, we propose PermuteFormer, a variant of Performer with a position-aware permutation to encode relative positional information. While improving performance, this novel relative position encoding introduces negligible overhead compared to the overall computational cost of Performer. Experiments show that it runs as fast as Performer.
Extensive experiments are conducted on PermuteFormer, including the byte-level text tasks and pixel-level image classification of Long-Range Arena, as well as language modeling on WikiText-103. Bidirectional PermuteFormer is used for the former tasks, while unidirectional PermuteFormer is adopted for the latter. Results show that PermuteFormer uniformly improves the performance of Performer, accelerates convergence, and achieves state-of-the-art on some tasks.

Ethical Considerations
This paper does not introduce new datasets. All the experiments and discussions are based on public datasets, which have been widely used for years. This paper focuses on speeding up NLP models generally. It is not directly connected to specific real-world applications.
The purpose of this paper is to reduce the computational cost of Transformer without a performance drop. We hope our work will reduce energy consumption for future work in NLP. We also try our best to reduce the carbon cost of experiments, for example by minimizing hyper-parameter tuning. It takes about 10 days on 8 V100 GPUs to produce all the figures in this paper.

Appendix

Lemma A.1. Assume that M_i^⊤ N_j depends only on i − j. Then, there exist series {M̃_i}, {Ñ_i} with trivial kernels and matrices P̃, Q̃ such that M_i^⊤ N_j = (M̃_i P̃)^⊤ (Ñ_j Q̃) for all i, j.

Proof. We distinguish three cases.

Case 1) The series does not satisfy Equation 24. Without loss of generality, assume there is a unit vector x ∈ ker(M_i) for some i; the argument is symmetric to Case 2, with the roles of M and N exchanged.

Case 2) It satisfies Equation 24, but not Equation 25. Without loss of generality, assume there is a unit vector x ∈ ker(N_i) for some i. Then, for any j and k,

M_j^⊤ N_k x = M_{j+i−k}^⊤ N_i x = 0.

Therefore, replacing each N_j by N_j (I_n − x x^⊤) leaves all products M_i^⊤ N_j unchanged, reducing the effective dimension n by one.

Case 3) It satisfies both Equation 24 and Equation 25. There is nothing to prove.
Lemma A.2. Assume that M_i^⊤ N_j depends only on i − j and that every M_i and N_i has trivial kernel. Then, the dimension l can be reduced so that l ≤ min(m, n). Without loss of generality, we only need to discuss the case n < l.

If n < l, then im(N_0) ≠ R^l. On the other hand, we may assume Σ_{i=−∞}^{∞} im(N_i) = R^l; otherwise, restrict all matrices to this subspace. So there is a column of N_p, for some p ≠ 0, that lies in R^l ∖ im(N_0). More generally, there are a vector e ∈ R^n and an integer p such that N_p e ∈ R^l ∖ im(N_0).
By induction, we obtain M̃_i, Ñ_i, P̃, Q̃ such that, for all i, j, k ∈ Z,

M̃_i^⊤ Ñ_j = M̃_{i+k}^⊤ Ñ_{j+k},  M_i^⊤ N_j = (M̃_i P̃)^⊤ (Ñ_j Q̃),  rank(M̃_i) = rank(Ñ_i) = l′.

Proposition A.1. Let {M_i}_{i=−∞}^{∞} be a series of l × m matrices and {N_i}_{i=−∞}^{∞} be a series of l × n matrices. Then, M_i^⊤ N_j depends only on i − j, if and only if there exist an integer l′, matrices P ∈ R^{l′×m}, Q ∈ R^{l′×n}, and an invertible matrix A ∈ R^{l′×l′}, such that M_i = (A^⊤)^{−i} P and N_j = A^j Q.

Proof. (⇐) If part. M_i^⊤ N_j = ((A^⊤)^{−i} P)^⊤ (A^j Q) = P^⊤ A^{j−i} Q depends only on i − j.

(⇒) Only-if part. By Lemma A.1 and Lemma A.2, we may assume that every M̃_i and Ñ_i is an invertible l′ × l′ matrix. Since M̃_0^⊤ Ñ_{i−1} = M̃_1^⊤ Ñ_i, we have Ñ_i = A Ñ_{i−1} = A^i Ñ_0, where A = M̃_1^{−⊤} M̃_0^⊤ ∈ R^{l′×l′} is invertible. Similarly, M̃_i = B^i M̃_0 for some invertible B. Substituting these into the shift invariance above, and since A, B, Ñ_0 and M̃_0 are invertible, (B^k)^⊤ A^k = I for all k ∈ Z. Thus, B = A^{−⊤}, so M̃_i = (A^⊤)^{−i} M̃_0 and Ñ_j = A^j Ñ_0. Therefore M_i^⊤ N_j = ((A^⊤)^{−i} P)^⊤ (A^j Q), where P = M̃_0 P̃ and Q = Ñ_0 Q̃.