SHAPE: Shifted Absolute Position Embedding for Transformers

Position representation is crucial for building position-aware representations in Transformers. Existing position representations suffer from either a lack of generalization to test data with unseen lengths or a high computational cost. We investigate shifted absolute position embedding (SHAPE) to address both issues. The basic idea of SHAPE is to achieve shift invariance, which is a key property of recent successful position representations, by randomly shifting absolute positions during training. We demonstrate that SHAPE is empirically comparable to its counterpart while being simpler and faster.


Introduction
Position representation plays a critical role in self-attention-based encoder-decoder models (Transformers) (Vaswani et al., 2017), enabling the self-attention mechanism to recognize the order of input sequences. Position representations fall into two categories (Dufter et al., 2021): absolute position embedding (APE) (Gehring et al., 2017; Vaswani et al., 2017) and relative position embedding (RPE) (Shaw et al., 2018). With APE, each position is represented by a unique embedding, which is added to the input. RPE represents a position based on the relative distance between two tokens in the self-attention mechanism.
RPE outperforms APE on sequence-to-sequence tasks (Narang et al., 2021; Neishi and Yoshinaga, 2019) owing to extrapolation, i.e., the ability to generalize to sequences that are longer than those observed during training (Newman et al., 2020). Wang et al. (2021) reported that one of the key properties contributing to RPE's superior performance is shift invariance (also known as translation invariance), the property of a function to not change its output even if its input is shifted. However, unlike APE, RPE's formulation strongly depends on the self-attention mechanism. This motivated us to explore a way to incorporate the benefit of shift invariance into APE. The code is available at https://github.com/butsugiri/shape.
A promising approach to achieving shift invariance while using absolute positions is to randomly shift positions during training. A similar idea can be seen in several contexts, e.g., computer vision (Goodfellow et al., 2016) and question answering in NLP (Geva et al., 2020). APE is no exception; a random shift should force Transformer to capture relative positional information from absolute positions. However, the effectiveness of a random shift for incorporating shift invariance into APE is yet to be demonstrated. Thus, we formulate APE with a random shift as a variant of position representation, namely, Shifted Absolute Position Embedding (SHAPE; Figure 1c), and conduct a thorough investigation. In our experiments, we first confirm that Transformer with SHAPE learns to be shift-invariant. We then demonstrate that SHAPE achieves performance comparable to RPE in machine translation. Finally, we reveal that Transformer equipped with shift invariance shows not only better extrapolation ability but also better interpolation ability, i.e., it can better predict rare words at positions observed during training.

Figure 1 gives an overview of the position representations compared in this paper. We denote a source sequence X as a sequence of I tokens, namely, X = (x_1, ..., x_I). Similarly, let Y represent a target sequence of J tokens, Y = (y_1, ..., y_J).

Absolute Position Embedding (APE)
APE provides each position with a unique embedding (Figure 1a). Transformer with APE computes the input representation as the sum of the word embedding and the position embedding for each token x_i ∈ X and y_j ∈ Y.
Sinusoidal positional encoding (Vaswani et al., 2017) is a deterministic function of the position and the de facto standard APE for Transformer. Specifically, for the i-th token, the m-th element of the position embedding PE(i, m) is defined as

PE(i, m) = sin(i / 10000^(m/D))      if m is even,
PE(i, m) = cos(i / 10000^((m-1)/D))  if m is odd,    (1)

where D denotes the model dimension.
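As a concrete illustration, Equation 1 can be sketched in a few lines of pure Python (the function name is ours; the even/odd dimension pairing follows the definition above):

```python
import math

def sinusoidal_pe(i, D):
    """Sinusoidal position embedding for position i with model dimension D.

    Even dimensions use sine and odd dimensions use cosine; each pair of
    dimensions (2t, 2t+1) shares the same frequency.
    """
    pe = []
    for m in range(D):
        t = m // 2                            # frequency index of the pair
        angle = i / (10000 ** (2 * t / D))    # wavelength grows geometrically
        pe.append(math.sin(angle) if m % 2 == 0 else math.cos(angle))
    return pe
```

For instance, position 0 yields alternating zeros and ones, since sin(0) = 0 and cos(0) = 1 in every dimension pair.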

Relative Position Embedding (RPE)
RPE (Shaw et al., 2018) incorporates position information by considering the relative distance between two tokens in the self-attention mechanism (Figure 1b). For example, Shaw et al. (2018) represent the relative distance between the i-th and j-th tokens with relative position embeddings a^Key_{i-j}, a^Value_{i-j} ∈ R^D. These embeddings are then added to the key and value representations, respectively.
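A minimal sketch of the key-side computation, assuming a toy dictionary `a_rel` mapping (clipped) relative distances to embeddings; all names, dimensions, and the clipping constant here are illustrative, and the value-side embedding a^Value_{i-j} is handled analogously:

```python
import math

def rpe_attention_logit(q_i, k_j, a_rel, i, j, clip=16):
    """Attention logit with a Shaw et al. (2018)-style relative position
    embedding: the key vector is shifted by a_rel[i - j], indexed by the
    clipped relative distance, before the scaled dot product."""
    d = max(-clip, min(clip, i - j))                      # clip the distance
    key = [k + a for k, a in zip(k_j, a_rel[d])]          # k_j + a^Key_{i-j}
    dot = sum(qc * kc for qc, kc in zip(q_i, key))        # q_i . key
    return dot / math.sqrt(len(q_i))                      # scale by sqrt(D)
```

Note that `a_rel` must be consulted inside every query-key pair, which is the source of RPE's extra computation relative to APE.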
RPE outperforms APE on out-of-distribution data in terms of sequence length owing to its innate shift invariance (Rosendahl et al., 2019; Neishi and Yoshinaga, 2019; Narang et al., 2021; Wang et al., 2021). However, the self-attention mechanism of RPE involves more computation than that of APE. More importantly, RPE requires a modification of the architecture, while APE does not. Specifically, RPE strongly depends on the self-attention mechanism; thus, it is not necessarily compatible with studies that attempt to replace self-attention with a more lightweight alternative (Kitaev et al., 2020; Choromanski et al., 2021; Tay et al., 2020). The original RPE of Shaw et al. (2018) has many variants in the literature (Dai et al., 2019; Raffel et al., 2020; Huang et al., 2020; Wang et al., 2021; Wu et al., 2021). These variants aim to improve the empirical performance or the computational speed of the original RPE. However, the original RPE remains strong in terms of performance: Narang et al. (2021) conducted a thorough comparison on multiple sequence-to-sequence tasks and reported that the performance of the original RPE is comparable to, and sometimes better than, that of its variants. Thus, we exclusively use the original RPE in our experiments.

Shifted Absolute Position Embedding (SHAPE)

Given the drawbacks of RPE, we investigate SHAPE (Figure 1c) as a way to equip Transformer with shift invariance without any architecture modification or computational overhead compared with APE. During training, SHAPE shifts every position index of APE by a random offset. This prevents the model from using absolute positions to learn the task and instead encourages the use of relative positions, which we expect to eventually lead to the learning of shift invariance.
Let k represent an offset drawn from a discrete uniform distribution U{0, K} for each sequence and for every iteration during training, where K ∈ N is the maximum shift. SHAPE only replaces PE(i, m) of APE in Equation 1 with

PE(i + k, m).    (2)

We independently sample k for the source and target sequences. SHAPE can thus be incorporated into any model using APE with virtually no computational overhead since only the input is modified. Note that SHAPE is equivalent to the original APE if we set K = 0; in fact, we set K = 0 during inference. Thus, SHAPE can be seen as a natural extension of APE that incorporates shift invariance. SHAPE can be interpreted from multiple viewpoints. For example, SHAPE can be seen as a regularizer that prevents Transformer from overfitting to absolute positions; such overfitting is undesirable not only for extrapolation (Neishi and Yoshinaga, 2019) but also for APE with length constraints (Takase and Okazaki, 2019; Oka et al., 2020, 2021). In addition, SHAPE can be seen as a data augmentation method because the randomly sampled k shifts each instance into a different subspace during training.
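Because SHAPE only changes the position index fed to Equation 1, a sketch needs nothing more than one shared random offset per sequence (function names are ours; the sinusoidal form follows Equation 1):

```python
import math
import random

def shape_offset(K, training=True):
    """Sample a SHAPE offset k ~ U{0, K} per sequence during training;
    at inference the offset is 0, recovering plain APE."""
    return random.randint(0, K) if training else 0

def shape_embed(positions, D, K, training=True):
    """Shifted absolute position embedding: identical to sinusoidal APE
    except that every position index is shifted by one shared random offset.
    Source and target sequences would draw independent offsets."""
    k = shape_offset(K, training)
    return [[math.sin((i + k) / 10000 ** (m / D)) if m % 2 == 0
             else math.cos((i + k) / 10000 ** ((m - 1) / D))
             for m in range(D)]
            for i in positions]
```

Setting K = 0 (or running in inference mode) makes the function coincide with the original APE, matching the equivalence noted above.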

Experiments
Using machine translation benchmark data, we first confirmed that Transformer trained with SHAPE learns shift invariance (Section 3.2). Then, we compared SHAPE with APE and RPE to investigate its effectiveness (Section 3.3).

Experimental Configuration
Dataset We used the WMT 2016 English-German dataset for training and followed Ott et al. (2018) for tokenization and subword segmentation (Sennrich et al., 2016). We used newstest2010-2013 and newstest2014-2016 as the validation and test sets, respectively.
Our experiments consist of the following three distinct dataset settings:
(i) VANILLA: Identical to previous studies (Vaswani et al., 2017; Ott et al., 2018).
(ii) EXTRAPOLATE: Shift-invariant models are typically evaluated in terms of extrapolation ability (Wang et al., 2021;Newman et al., 2020). We replicated the settings of Neishi and Yoshinaga (2019); the training set excludes pairs whose source or target sequence exceeds 50 subwords, while the validation and test sets are identical to VANILLA.
(iii) INTERPOLATE: We also evaluate the models from the viewpoint of interpolation, which we define as the ability to generate sequences whose lengths were seen during training. Specifically, we evaluate interpolation using long sequences because, first, the generation of long sequences is an important research topic in NLP (Zaheer et al., 2020; Maruf et al., 2021) and, second, in datasets with long sequences, the position distribution of each token becomes increasingly sparse. In other words, tokens in the validation and test sets become unlikely to have been observed at the corresponding positions in the training set; we expect shift invariance to be crucial for addressing such position sparsity.
In this study, we artificially generate a long sequence by simply concatenating independent sentences in a parallel corpus. Specifically, given ten neighboring sentences of VANILLA, i.e., X_1, ..., X_10 and Y_1, ..., Y_10, we concatenate the sentences with a special separator token sep. We also apply the same operation to the validation and test sets.

Table 1: BLEU score on the sub-sampled training data of INTERPOLATE (10,000 pairs). In Original and Swapped, the order of the input sequence is X_1, ..., X_10 and X_2, ..., X_10, X_1, respectively.
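The concatenation step can be sketched as follows (the separator string and function name are illustrative, not the exact preprocessing script):

```python
def make_interpolate_pairs(src_sents, tgt_sents, group=10, sep="<sep>"):
    """Build long INTERPOLATE-style examples by joining `group` neighboring
    sentence pairs with a separator token on both source and target sides."""
    pairs = []
    for start in range(0, len(src_sents) - group + 1, group):
        src = f" {sep} ".join(src_sents[start:start + group])
        tgt = f" {sep} ".join(tgt_sents[start:start + group])
        pairs.append((src, tgt))
    return pairs
```

Each resulting example contains ten sentences and nine separator tokens per side, so token positions stretch far beyond those of single-sentence examples.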
Evaluation We evaluate the performance with sacreBLEU (Post, 2018). Throughout the experiments, we apply the Moses detokenizer to the system output and then compute the detokenized BLEU.

Experiment 1: Shift Invariance

Quantitative Evaluation: BLEU on Training Data We first evaluated whether the model is robust to the order of sentences in each sequence. We used the sub-sampled training data (10,000 pairs) of INTERPOLATE to eliminate the effect of unseen sentences; in this way, we can isolate the effect of sentence order. Given a sequence in the original order (Original), X_1, ..., X_10, we generated a swapped sequence (Swapped) by moving the first sentence to the end, i.e., X_2, ..., X_10, X_1. The model then generates two sequences, Y_1, ..., Y_10 and Y_2, ..., Y_10, Y_1. Finally, we evaluated the BLEU score of Y_1. The result is shown in Table 1.

Qualitative Evaluation: Similarities of Representations SHAPE also exhibits shift invariance at the representation level, as shown in Figure 2. The figure illustrates how the offset k changes the encoder representations of the trained APE and SHAPE models. Given the two models and an input sequence X, we computed the encoder hidden states of the input sequence for each k ∈ {0, 100, 250, 500}. For each position i, we computed the cosine similarity (sim) of the hidden states obtained from two offsets k_1 and k_2, i.e., h^{k_1}_i, h^{k_2}_i ∈ R^D, and averaged it across positions:

(1/I) Σ_{i=1}^{I} sim(h^{k_1}_i, h^{k_2}_i).

As shown in Figure 2, SHAPE builds a shift-invariant representation; regardless of the offset k, the cosine similarity is almost always 1.0. Such invariance is nontrivial because the similarity of APE does not show similar characteristics (additional figures are available in Appendix C).
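The averaged similarity can be computed as follows (a pure-Python sketch; hidden states are given as lists of position-wise vectors, and the function name is ours):

```python
import math

def mean_cosine_similarity(h1, h2):
    """Average cosine similarity between two sequences of hidden states,
    one vector per position. Values near 1.0 for every offset pair
    indicate shift-invariant encoder representations."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v)
    # Average the position-wise similarities over all I positions.
    return sum(cos(u, v) for u, v in zip(h1, h2)) / len(h1)
```

Comparing the states of the same input under two offsets k_1 and k_2 with this function reproduces the quantity plotted in Figure 2.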

Experiment 2: Performance Comparison
We compared the overall performance of the position representations on the validation and test sets, as shown in Table 2. Figure 3 shows the BLEU improvement of RPE and SHAPE over APE with respect to the source sequence length.
On VANILLA, the three models show comparable results. APE being comparable to RPE is inconsistent with the result reported by Shaw et al. (2018); we assume that this is due to a difference in implementation. In fact, Narang et al. (2021) have recently reported that improvements in Transformer often do not transfer across implementations.
On EXTRAPOLATE, RPE (29.86) outperforms APE (29.22) by approximately 0.6 BLEU points on the test set; this is consistent with the result reported by Neishi and Yoshinaga (2019). Moreover, SHAPE achieves comparable test performance to RPE (29.80). According to Figure 3a, both RPE and SHAPE have improved extrapolation ability, i.e., better BLEU scores on sequences longer than those observed during training. In addition, Figure 3a shows the performance of SHAPE with the maximum shift K = 40 that was chosen on the basis of the BLEU score for the validation set. This model outperforms RPE, achieving BLEU scores of 23.12 and 29.86 on the validation and test sets, respectively. These results indicate that SHAPE can be a better alternative to RPE.
On INTERPOLATE, we were unable to train RPE because its training was prohibitively slow 9 . Similarly to EXTRAPOLATE, SHAPE (39.09) outperforms APE (38.23) on the test set. Figure 3b shows that SHAPE consistently outperformed APE for every sequence length. From this result, we find that the shift invariance also improves the interpolation ability of Transformer.

Analysis
This section provides a deeper analysis of how the model with translation invariance improves the performance. We hereinafter exclusively focus on APE and SHAPE because SHAPE achieves comparable performance to RPE, and we were unable to train RPE on the INTERPOLATE dataset as explained in footnote 9.
As discussed in Section 3.3, Figure 3 demonstrated that SHAPE outperforms APE in terms of BLEU score. However, BLEU evaluates two concepts simultaneously: the token precision via n-gram matching and the output length via the brevity penalty (Papineni et al., 2002). Thus, the actual source of the improvement remains unclear. We therefore analyzed the precision of token prediction exclusively. Specifically, we computed token-wise scores assigned to gold references and then compared them across the models; given a sequence pair (X, Y) and a trained model, we computed a score (i.e., log probability) s_j for each token y_j in a teacher-forcing manner. Here, a higher score for the gold token means better model performance. We used the validation set for this comparison. Figure 4 shows the ratio at which SHAPE assigns a higher score to a gold token than APE, compared for each position of the decoder.

Better extrapolation means better token precision Figure 4a shows that SHAPE outperforms APE, especially in the right part of the heat map. This area corresponds to sequences longer than those observed during training. This result indicates that better extrapolation in terms of BLEU score corresponds to better token precision.

Interpolation is particularly effective for rare tokens As shown in Figure 4b, SHAPE consistently outperforms APE, and the performance gap is especially significant in the low-frequency region (bottom part). This indicates that SHAPE predicts rare words better than APE. One plausible explanation for this observation is that SHAPE carries out data augmentation in the sense that, in each epoch, the same sequence pair is assigned different positions depending on the offset k. Rare words typically have sparse position distributions in the training data and thus benefit from the extra position assignments during training.
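The token-wise comparison can be sketched as follows (function names are ours; `log_probs` stands for the model's per-step log-probability distributions obtained under teacher forcing):

```python
def gold_token_scores(log_probs, gold_ids):
    """Teacher-forcing scores: pick the log probability assigned to each
    gold token y_j from the model's per-step distributions."""
    return [step[y] for step, y in zip(log_probs, gold_ids)]

def win_ratio(scores_a, scores_b):
    """Fraction of positions where model A scores the gold token strictly
    higher than model B (the SHAPE-vs-APE comparison behind Figure 4)."""
    wins = sum(1 for a, b in zip(scores_a, scores_b) if a > b)
    return wins / len(scores_a)
```

Aggregating `win_ratio` over buckets of decoder position (Figure 4a) or gold-token frequency (Figure 4b) yields the heat-map comparisons described above.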

Conclusion
We investigated SHAPE, a simple variant of APE with shift invariance. We demonstrated that SHAPE is empirically comparable to RPE yet imposes almost no computational overhead compared with APE. Our analysis revealed that SHAPE is effective both at extrapolating to unseen lengths and at interpolating rare words. SHAPE can be incorporated into an existing codebase with a few lines of code and no risk of a performance drop from APE; thus, we expect SHAPE to be used as a drop-in replacement for APE and RPE.

A Summary of Datasets
We summarize the statistics, preprocessing, and evaluation metrics of the datasets used in our experiments in Table 3. The length statistics are shown in Figure 5.

B Hyperparameters
We present the list of hyperparameters used in our experiments in Table 4. Hyperparameters for training Transformer follow the recipe available on the official documentation page of OpenNMT-py (https://opennmt.net/OpenNMT-py/FAQ).

C Similarities of Representations
In Section 3.2, we presented Figure 2 to qualitatively demonstrate that the representation of SHAPE is shift-invariant. We present ten additional figures, created from ten additional instances, in Figure 6. The characteristics of these figures are consistent with those observed in Figure 2; the representation of SHAPE is shift-invariant, whereas that of APE is not.

D Detailed BLEU Scores
We report the BLEU score on each of newstest2010-2016 in Table 5. In addition, we report the performance of APE, RPE, and SHAPE with respect to the source sequence lengths in Figure 7.

E Learning Curve of Each Model
We present the learning curve of each model (APE, RPE, SHAPE) trained on the different datasets (VANILLA, EXTRAPOLATE, INTERPOLATE). Figures 8 and 9 present the validation perplexity against the number of gradient steps and the wall-clock time, respectively. From these figures, we made the following observations. First, according to Figure 8, the speed of convergence is similar across the models in terms of the number of gradient steps; in other words, in our experiments (Section 3), we never compare models whose degrees of convergence differ. Second, Figure 9 demonstrates that RPE requires more time to complete training than APE and SHAPE do. As explained in Section 2.2, RPE incurs computational overhead because it needs to compute attention for the relative position embeddings. The amount of time required to complete training is presented in Table 6.

F Sanity Check of the Baseline Performance
Building a strong baseline is essential for trustable results (Denkowski and Neubig, 2017).

Table 3: Summary of statistics, preprocessing, and evaluation metric of datasets used in our experiment.

Checkpointing: Save a checkpoint every 5,000 steps and take the average of the last 10 checkpoints.
Maximum Offset K (for SHAPE): We set K = 500 for most of the experiments. We manually tuned K on validation BLEU for EXTRAPOLATE over the range {10, 20, 30, 40, 100, 500}, and report the score of K = 40 in addition to K = 500. We used a single random seed for the tuning.
Relative Distance Limit (for RPE): 16, following Neishi and Yoshinaga (2019).
GPU Hardware Used: DGX-1 and DGX-2.

Figure 9 illustrates the corresponding learning curve.

Table 7: BLEU score on newstest2010-2016. We report the average result of five distinct trials with different random seeds.