Exploring Better Relative Position Embeddings from Encoding Perspective for Transformer Models

Relative position embedding (RPE) is a successful method to explicitly and efficaciously encode position information into Transformer models. In this paper, we investigate the potential problems in Shaw-RPE and XL-RPE, which are the most representative and prevalent RPEs, and propose two novel RPEs called Low-level Fine-grained High-level Coarse-grained (LFHC) RPE and Gaussian Cumulative Distribution Function (GCDF) RPE. LFHC-RPE is an improvement of Shaw-RPE, which enhances the perception ability at medium and long relative positions. GCDF-RPE utilizes the excellent properties of the Gaussian function to amend the prior encoding mechanism in XL-RPE. Experimental results on nine authoritative datasets demonstrate the effectiveness of our methods empirically. Furthermore, GCDF-RPE achieves the best overall performance among five different RPEs.


Introduction
Recently, the fully attention-based Transformer model (Vaswani et al., 2017) has achieved state-of-the-art results across a range of natural language processing (NLP) tasks, including reading comprehension (Yu et al., 2018), machine translation (Raffel et al., 2020), natural language inference (Guo et al., 2019), unsupervised pre-training (Devlin et al., 2019; Yang et al., 2019; Liu et al., 2019), etc. Since the self-attention blocks in the vanilla Transformer are entirely invariant to sequence order, which is one of the most important features of natural language, how to explicitly encode position information is crucial for current Transformer-based models.
The original method is to use absolute position embedding (APE), such as pre-defined sinusoidal functions (Vaswani et al., 2017) or fully data-driven learnable parameter embeddings (Devlin et al., 2019; Radford et al., 2019), to integrate position information into the contextual representation. Although APE can significantly help the Transformer model learn the contextual representation of tokens at different positions, Ke et al. (2020) pointed out that the coupled method in APE is unreasonable. Besides, APE itself has many defects, such as the limitation of processing long sequences and the gradual loss of position information (Al-Rfou et al., 2019).
To address the drawbacks of APE mentioned above, Shaw et al. (2018) and Dai et al. (2019) further proposed the relative position embedding (RPE), which incorporates a carefully designed temporal bias term into the self-attention module to encode the relative distance between any two tokens. RPE has been proven to be more effective than APE, and thus it is adopted by many excellent pre-trained language models (Yang et al., 2019; Song et al., 2020; Dai et al., 2020). Despite the success of RPE, the existing methods are not perfect. Although Huang et al. (2020) have made improvements to RPE, these improvements focus only on the interaction perspective rather than the encoding perspective. Moreover, to the best of our knowledge, there is currently no unified and comprehensive evaluation of various RPEs. Since almost every RPE is proposed for specific tasks, it is unknown whether these RPEs really have high universality and generalization ability.
In this paper, we focus on the most widely adopted Shaw-RPE (Shaw et al., 2018) and XL-RPE (Dai et al., 2019), and improve each of them from the encoding perspective. Concretely, for Shaw-RPE, to overcome its weak ability to perceive relative positions at medium and long distances, we design an ingenious Low-level Fine-grained High-level Coarse-grained (LFHC) embedding strategy without changing the number of parameters. For XL-RPE, we recognize the potential problems of its prior sinusoidal encoding functions under the relative position setting and propose a more reasonable encoding mechanism based on the Gaussian Cumulative Distribution Function (GCDF). We conduct a unified evaluation of five RPEs on nine authoritative datasets, covering language modeling, question generation, and text classification. The experimental results show that both LFHC-RPE and GCDF-RPE outperform their respective baselines, and GCDF-RPE achieves the best overall performance among the five methods.

Vanilla Self-attention
The self-attention layer is the core component of Transformer, which provides a bridge for semantic interaction between tokens. In this layer, Transformer performs scaled dot-product self-attention over the input sequence with H individual attention heads and then concatenates the summary outputs of all heads. For simplicity, we ignore the head index in the following formulas. The summary output of each head is calculated as follows:

O = softmax(A / √d_head) (I W_v) (1)

A_ij = (x_i W_q)(x_j W_k)^⊤ (2)

where I is the input sequence representations and x_i is the representation of the i-th token. W_q, W_k, W_v ∈ R^{d_model × d_head} are three independent linear transformation matrices, and d_head is the dimension of each head, which satisfies d_head = d_model / H. (The code and training scripts will be released at https://github.com/menghuanlater/LFHC-GCDF-RPE.)
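As an illustration, the per-head computation above can be sketched in NumPy (the function and variable names are ours, not from any released code):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_head(I, W_q, W_k, W_v):
    """Scaled dot-product self-attention for a single head.

    I            : (L, d_model) input sequence representations.
    W_q/W_k/W_v  : (d_model, d_head) projection matrices.
    Returns the (L, d_head) summary output of the head.
    """
    Q, K, V = I @ W_q, I @ W_k, I @ W_v
    d_head = Q.shape[-1]
    A = Q @ K.T                      # (L, L) raw attention scores, Eq. (2)
    return softmax(A / np.sqrt(d_head)) @ V   # Eq. (1)
```

In the multi-head case, this function would be applied H times with independent projections and the outputs concatenated along the feature dimension.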

Relative Position Embeddings
Shaw-RPE (Shaw et al., 2018) is the earliest proposed RPE method. As shown in Figure 1(a), it employs fully data-driven embedding to represent different relative positions and incorporates them into the attention mechanism. In Shaw-RPE, Eq.
(2) is revised as follows:

A_ij = x_i W_q (x_j W_k + w_clip(j−i, k))^⊤ (3)

clip(x, k) = max(−k, min(k, x)) (4)

where k is the maximum absolute value of the relative distance and w_i ∈ R^{d_head} is the learnable embedding of relative position i. XL-RPE (Dai et al., 2019) offers a different derivation. It utilizes the sinusoidal encoding functions (Vaswani et al., 2017) to generate a prior vector embedding for each relative position (as shown in Figure 2(a)). In XL-RPE, Eq. (2) is revised as follows:

A_ij = (x_i W_q + u)(x_j W_k)^⊤ + (x_i W_q + v)(R_{i−j} W_r)^⊤ (5)

R_{i−j, 2m} = sin((i − j) / 10000^{2m/d_model}) (6)

R_{i−j, 2m+1} = cos((i − j) / 10000^{2m/d_model}) (7)

where W_r ∈ R^{d_model × d_head} and u, v ∈ R^{d_head} are trainable parameters, and R_{i−j} is the prior sinusoidal encoding vector of relative position i − j.
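A minimal sketch of the Shaw-RPE score computation, assuming the relative key embedding is simply added to the key before the dot product (the names and the nested-loop form are illustrative only; real implementations vectorize this):

```python
import numpy as np

def clip(x, k):
    # Eq. (4): truncate the relative distance to [-k, k].
    return max(-k, min(k, x))

def shaw_rpe_scores(I, W_q, W_k, w, k):
    """Shaw-RPE attention scores (pre-softmax, including the 1/sqrt(d) scale).

    w : (2k + 1, d_head) relative position embeddings, where
        w[clip(j - i, k) + k] is the embedding for offset j - i.
    """
    Q, K = I @ W_q, I @ W_k
    L, d_head = Q.shape
    A = np.empty((L, L))
    for i in range(L):
        for j in range(L):
            rel = w[clip(j - i, k) + k]          # shared beyond distance k
            A[i, j] = Q[i] @ (K[j] + rel) / np.sqrt(d_head)
    return A
```

With w set to all zeros, this reduces exactly to the vanilla scaled dot-product scores, which makes the role of the relative term easy to isolate.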

Low-level Fine-grained High-level Coarse-grained Embedding
In Shaw-RPE, the authors discovered that precise relative position information is not useful beyond a certain distance, and this phenomenon has also been confirmed in subsequent work. Therefore, Shaw-RPE sets the maximum relative distance to a relatively small value. However, this phenomenon is more likely to be caused by: (1) more independent embedding parameters increase the difficulty of model optimization;
(2) the greater the relative embedding distance, the more serious the optimization imbalance problem of this embedding strategy (see Appendix A.1). Moreover, it is necessary to distinguish relative positions at medium and long distances most of the time, especially for learning long-term dependency. To improve the model's ability to perceive medium and long relative distances without changing the number of parameters, inspired by the analysis conclusions of many works (Jawahar et al., 2019; Ethayarajh, 2019) on Transformer that the lower layers learn local syntactic features and the higher layers capture global semantic features, we propose the LFHC embedding strategy. Concretely, as shown in Figure 1(b), each embedding represents a relative position range instead of a single position. At the low layers, the range is small and the embedding granularity is fine, which keeps the maximum relative distance consistent with Shaw-RPE. As the layer level increases, the range becomes larger and the embedding granularity becomes coarser, which gradually expands the maximum relative distance. In LFHC-RPE, Eq. (4) in the l-th layer is revised as follows:

clip_l(x) = max(−k, min(k, [x / g_l]))

where [·] rounds toward zero and g_l ≥ 1 is the granularity (the size of the relative position range sharing one embedding) at layer l: g_1 = 1 reproduces Shaw-RPE at the lowest layer, and g_l grows with l, expanding the maximum representable relative distance to k · g_l.
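The range-based clipping can be sketched as follows; the granularity parameter g and the round-toward-zero bucketing are our own illustrative choices for how the ranges might be formed, not necessarily the paper's exact scheme:

```python
def lfhc_index(rel, k, g):
    """Map a signed relative distance to an embedding index under LFHC.

    rel : relative distance j - i
    k   : maximum index magnitude (same parameter count as Shaw-RPE)
    g   : granularity at this layer; g = 1 reproduces Shaw-RPE's clip,
          while larger g groups g consecutive distances into one bucket
          and expands the representable range to k * g.
    """
    # int() truncates toward zero, keeping buckets symmetric around 0.
    bucket = int(rel / g)
    return max(-k, min(k, bucket))
```

With k = 4, a low layer (g = 1) distinguishes distances up to ±4 exactly, while a higher layer (g = 4) covers distances up to ±16 at coarser granularity using the same 2k + 1 embeddings.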

Gaussian Cumulative Distribution Function Encoding
Intuitively and empirically, for RPEs using a prior encoding mechanism, the following two properties are important:

Property 1. For an offset k and two relative positions i and j where 0 ≤ i < j, the proximity between the prior encoding vectors satisfies the following condition:

‖R_i − R_{i+k}‖ > ‖R_j − R_{j+k}‖

Property 2. For two relative positions i and j where 0 ≤ i < j, the Euclidean distance ‖R_i − R_j‖ between the prior encoding vectors increases monotonically with the interval j − i, but the increasing trend should gradually stabilize.

However, the prior sinusoidal encoding mechanism in XL-RPE does not satisfy either of these properties, especially Property 1 (see Appendix A.2). To design a prior encoding mechanism that can satisfy the above properties, we propose the GCDF encoding mechanism. Specifically, each dimension of all relative positions is encoded by the GCDF with a different variance. As shown in Figure 2(b), the higher the dimension, the greater the variance. In GCDF-RPE, Eq. (6) and Eq. (7) are revised as follows:

R_{i, m} = Φ(i / σ_m)

where Φ denotes the standard Gaussian cumulative distribution function, σ_m is the standard deviation assigned to dimension m (increasing with m), and λ is the scale factor controlling the σ_m's, with a default value of 4.
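A small sketch of GCDF-style prior encoding together with a numerical check of the two properties; the concrete per-dimension standard deviations (powers of two here) are an assumption for illustration:

```python
import math

def gcdf(x):
    # Standard normal cumulative distribution function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gcdf_encoding(i, sigmas):
    # One prior vector: dimension m uses a Gaussian CDF with std sigmas[m];
    # larger sigma means a slower-varying (coarser) dimension.
    return [gcdf(i / s) for s in sigmas]

def sq_dist(a, b):
    # Squared Euclidean distance between two encoding vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))
```

Because the Gaussian density peaks at zero and decays in the tails, the same offset k yields a larger divergence near the origin than far from it (Property 1), and the divergence grows with the interval while flattening out (Property 2).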

Experiments
In this section, we evaluate the performance of five different RPEs (Shaw-RPE, XL-RPE, T5-RPE (Raffel et al., 2020), LFHC-RPE, and GCDF-RPE). A 10-head, 410-dimension Transformer encoder is adopted. All our experiments are conducted on 4 RTX 2080Ti GPUs or a single V100 GPU. To eliminate randomness, we run each experiment ten times and report the average performance. For more detailed experimental settings, please see Appendix A.3.

Main Results
The performance of the five RPEs on text classification is shown in Table 1. The dev set performance on question generation is shown in Table 2. Table 3 reports the test perplexity on language modeling. As can be seen from these results, both LFHC-RPE and GCDF-RPE outperform their respective baselines on all datasets. On the long-term-dependency language modeling task, LFHC-RPE improves significantly over Shaw-RPE, which fully proves the effectiveness of the LFHC embedding strategy. Even on datasets with relatively short sentence lengths, such as SST-2, SNLI and QQP, LFHC-RPE does not lose accuracy but obtains a certain degree of improvement. GCDF-RPE improves consistently over XL-RPE on all datasets and achieves the best overall performance among the five RPEs, demonstrating the reasonableness of the Gaussian prior encoding mechanism. Besides, from an overall point of view, RPEs based on a prior encoding mechanism are clearly better than purely data-driven RPEs, especially on SST-2 and MNLI.

Discussion
From a qualitative point of view, each type of RPE has its advantages and disadvantages. For purely data-driven RPEs (e.g., Shaw-RPE, LFHC-RPE), all positional embedding parameters are learned autonomously by the neural network according to the characteristics of the data, so in theory their solution space has a very high degree of freedom and can flexibly adapt to different tasks or datasets. However, in traditional machine learning and deep learning, a high degree of freedom usually means that the model easily falls into overfitting and obtains a locally suboptimal solution (the experimental results on SST-2 and MNLI corroborate this phenomenon).

For RPEs based on a prior encoding mechanism (e.g., XL-RPE, GCDF-RPE), the optimization of the positional parameters is constrained by the prior encoding mechanism, which implicitly regularizes the freedom of the parameter space, thus reducing the complexity of the model space and enhancing the generalization of the obtained model. The overall experimental results show that RPEs based on prior encoding mechanisms achieve better performance. However, if the prior hypothesis deviates too much from reality, adverse effects will appear (e.g., the poor performance on the QQP dataset).

From a quantitative point of view, it is evident from the experimental results that no single RPE performs best on all datasets. Even for GCDF-RPE, which has the best overall performance, there is still a considerable gap between its performance and the optimal results on the QQP dataset. Therefore, it is still very challenging and necessary to design an RPE capable of handling all tasks for Transformer models. We hope that our LFHC-RPE and GCDF-RPE will give some impetus to this direction.

Conclusion and Future Work
In this paper, we explore better RPEs from the encoding perspective for Transformer models. For purely data-driven RPEs, we propose LFHC-RPE to strengthen the sensitivity at medium and long relative positions. For RPEs based on prior encoding mechanisms, we present GCDF-RPE with stronger generalization. Extensive experimental results on nine datasets show the effectiveness of our methods. We leave adapting our methods to different kinds of pre-trained language models as future work.

A.1 Optimization Imbalance Problem
For Shaw-RPE, if truncation is not considered, i.e., k is set to the maximum relative distance in the training set, then for an input token sequence of length L, when performing self-attention, as shown in Figure 3, the frequency of each relative position gradually decreases as the absolute value of the distance increases. Since the embedding parameters of each relative position are independent in Shaw-RPE, this frequency decline may lead to an optimization imbalance within a single sequence; we refer to this as the internal optimization imbalance problem. On the other hand, due to the unbalanced distribution of the input sequence length L itself (as shown in Figure 4, the length distributions on six different datasets all exhibit characteristics similar to a long-tailed distribution), the number of samples available to optimize the medium and long relative positions is relatively small, which makes the relevant parameters prone to overfitting. We refer to this as the external optimization imbalance problem.
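The frequency decline is easy to verify: in an L-token sequence, relative distance d occurs exactly L − |d| times across all ordered token pairs. A tiny sketch:

```python
def rel_pos_frequency(L):
    """Count how often each relative distance j - i occurs when an
    L-token sequence attends over all ordered pairs (i, j)."""
    freq = {}
    for i in range(L):
        for j in range(L):
            d = j - i
            freq[d] = freq.get(d, 0) + 1
    return freq
```

Since freq[d] = L − |d|, the embedding of the largest offset receives L times fewer updates than the embedding of offset 0, which is exactly the imbalance described above.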
The above two optimization imbalance problems have a greater impact on pure data-driven Shaw-RPE and LFHC-RPE when truncation is not considered. However, RPEs based on prior encoding mechanisms hardly suffer from these problems because the learnable parameters of these RPEs are shared for all relative positions. Besides, although T5-RPE is also purely data-driven, it is less affected because its parameters are only bias scalars. Perhaps in the future, we can learn from Baevski and Auli (2019) to combine Shaw-RPE and T5-RPE.

A.2 Prior Encoding Mechanism
As shown in Section 3.2, a good prior encoding mechanism should satisfy Property 1 and Property 2. Property 1 represents translation attenuation: for the same interval between two relative positions, the divergence between two relative positions at a close distance is greater than that at a long distance. Property 2 means that as the interval increases, the divergence between two different relative positions becomes larger, but the increasing trend should gradually stabilize. Both properties are summarized from intuition and various previous research works on representation learning. The core of these two properties is that the attention mechanism is more sensitive to relative position changes at close distances and less sensitive to relative position changes at long distances. For example, the discrepancy between R_1 and R_5 should be higher than the discrepancy between R_101 and R_105.
In XL-RPE, sinusoidal functions with different periods are used as the prior encoding matrix. For an offset k and a relative position i where i ≥ 0, the divergence (squared Euclidean distance) between R_i and R_{i+k} is formulated as follows:

‖R_i − R_{i+k}‖² = Σ_{m=0}^{d_model/2−1} [(sin(ω_m i) − sin(ω_m (i + k)))² + (cos(ω_m i) − cos(ω_m (i + k)))²] = Σ_{m=0}^{d_model/2−1} 2(1 − cos(ω_m k)), where ω_m = 1/10000^{2m/d_model} (18)

From Eq. 18, it is extremely obvious that the sinusoidal prior encoding mechanism is translation invariant: the divergence depends only on the offset k and not on the position i, which completely violates Property 1. According to this equation, we plot the divergence change curve between R_0 and the other relative position encoding vectors in Figure 5. Although the sinusoidal encoding mechanism conforms to Property 2 on the whole, it can be clearly seen that there are a lot of burrs on the curve, and there is serious jitter at medium and long intervals. In our GCDF-RPE, Eq. 18 is revised as follows:

‖R_i − R_{i+k}‖² = Σ_m (Φ((i + k)/σ_m) − Φ(i/σ_m))² = Σ_m (∫_i^{i+k} φ_m(x) dx)², where φ_m is the Gaussian density with standard deviation σ_m

By converting the integral to the area under the Gaussian density, it can be easily concluded that GCDF-RPE satisfies both Property 1 and Property 2: as i increases, the interval [i, i + k] moves into the tail of the density, so the enclosed area, and hence the divergence, shrinks.

The detailed hyperparameter settings per dataset are as follows:

             IMDB  SST-2  SNLI  QNLI   QQP  MNLI  SQuAD  CMRC  WT103
batch size     64     64    64    64    64    64     32    32     60
FFN size     2048   2048  2048  2048  2048  2048   3072  3072   2100
lr
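The translation invariance stated in Eq. 18 can be checked numerically (a self-contained sketch; the interleaved sin/cos layout is the standard one from Vaswani et al. (2017)):

```python
import math

def sinusoid(i, d_model):
    # Sinusoidal position encoding: dimension pair m uses frequency
    # 1 / 10000^(2m / d_model).
    v = []
    for m in range(d_model // 2):
        w = 1.0 / (10000 ** (2 * m / d_model))
        v += [math.sin(w * i), math.cos(w * i)]
    return v

def sq_dist(a, b):
    # Squared Euclidean distance between two encoding vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))
```

For a fixed offset, the squared distance comes out the same whether the pair sits near position 0 or near position 100, confirming that the sinusoidal mechanism cannot express the translation attenuation required by Property 1.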

A.3 Detailed Experimental Setup
For text classification tasks, we utilize the Stanford CoreNLP toolkit for word segmentation and employ pre-trained GloVe (Pennington et al., 2014) word vectors to initialize the word embedding matrix. Concretely, for words with a frequency greater than three that occur in the GloVe vocabulary, the initial parameters are the pre-trained word vectors; other words are treated as out-of-vocabulary words and uniformly marked as [UNK]. For datasets where the input context is a single sentence, we use the max-pooled representation of the last layer's output as the classification feature. For datasets where the input context is composed of two independent sentences, we adopt the same input construction method as BERT (Devlin et al., 2019), and the representation of the [CLS] token in the last layer is chosen as the classification feature.
For question generation tasks, we employ the regular sequence-to-sequence structure (Sutskever et al., 2014). It should be noted that we test the performance of different RPEs only in the decoder, which means the encoders are identical. For the SQuAD dataset, we utilize bert-base as the encoder. For the CMRC dataset, we choose roberta-base-wwm-ext as the encoder. Besides, beam search, the copy mechanism (Gu et al., 2016), length penalty, tri-gram blocking, and token embedding sharing (Inan et al., 2017) are also adopted. We set the beam width to 5 and the length penalty to 0.6.
For the auto-regressive language modeling task, we keep the same experimental setup as Transformer-XL (Dai et al., 2019). In training, the memory length is set to 150. In validation, we follow Transformer-XL's strategy of validating the perplexity with memory lengths of 150 and 640, and the better perplexity is chosen as the final result.

A.4 Results with Different K
In this section, we report the full results of Shaw-RPE and LFHC-RPE with different k. Table 5 shows the results on text classification tasks. Table 6 shows the results on question generation tasks. Table 7 shows the results on language modeling.