A Simple and Effective Positional Encoding for Transformers

Transformer models are permutation equivariant. To supply the order and type information of the input tokens, position and segment embeddings are usually added to the input. Recent works proposed variations of positional encodings, with relative position encodings achieving better performance. Our analysis shows that the gain actually comes from moving the positional information from the input to the attention layers. Motivated by this, we introduce Decoupled Positional Attention for Transformers (DIET), a simple yet effective mechanism to encode position and segment information into Transformer models. The proposed method has faster training and inference time, while achieving competitive performance on the GLUE, XTREME and WMT benchmarks. We further generalize our method to long-range transformers and show performance gains.


Introduction
Transformers are sequence-to-sequence models that achieve state-of-the-art performance in many Natural Language Processing (NLP) tasks, such as machine translation, language modeling and question answering (Vaswani et al., 2017; Devlin et al., 2018; Yang et al., 2019; Liu et al., 2020). Transformers have two major components: self-attention and a position-wise feed-forward layer. Both are permutation equivariant and are not sensitive to the order of the input tokens. To make these models position-aware, the position information of the input words is typically added as an additional embedding to the input token embeddings (Vaswani et al., 2017). For example, the input embedding (W) of a sentence is added to the position embeddings (P), resulting in the input W + P to the Transformer. These position embeddings depend only on the location at which the word appears. For multi-segment tasks, additional segment embeddings can be added just like the position embeddings (Devlin et al., 2018).

* The authors contributed equally to this paper. Corresponding author email: puchin@google.com
There have been multiple works exploring different ways to include position information in Transformers (Shaw et al., 2018; Yang et al., 2019; Raffel et al., 2020). Many of them note the advantages of relative position encoding schemes over absolute position encodings (see also Fig 1). However, what causes this difference is not clear. Yun et al. (2020) have shown that Transformers with absolute position encodings are universal approximators of all sequence-to-sequence functions, proving that absolute position encodings can capture position information. What, then, causes the superiority of relative position encodings? A systematic study and understanding of the benefits and drawbacks of different position encoding methods is missing. Ke et al. (2020) hypothesised that the cross correlation between word and position embeddings while computing attention could be the cause of the poor performance of absolute position encodings. However, such cross terms are present in some of the relative position encoding methods (Shaw et al., 2018; Yang et al., 2019), and these methods perform on par with or better than the other position encoding schemes (see §4).
Figure 1: Performance effect of different positional encoding methods for Transformers (see §2) on two Natural Language Inference datasets from GLUE (Wang et al., 2019), XTREME (Hu et al., 2020) and one Neural Machine Translation dataset, WMT 18 (Bojar et al., 2018). Absolute positional encoding (DIET-ABS) can achieve better performance than the relative counterpart (DIET-REL), showing the importance of designing the right position encoding method.

In this paper we undertake a systematic study to understand different position encoding methods. We argue that absolute position embeddings mainly suffer from being added at the input. We show, with our experiments on classification, question answering and machine translation tasks, that absolute position encodings added to the attention matrices, with different parameters for each head, improve significantly over absolute position encodings added to the input. This highlights that where the position information is included in the Transformer is important, providing an explanation for the gap in performance between absolute and relative position encodings. We also compare different position encodings and the effect of sharing position encodings across different heads and layers of a Transformer. Based on these observations we propose decoupled positional attention and a new segment encoding approach (for tasks with multiple segments), and empirically show its superiority. We summarize our contributions in this paper below.
• We theoretically and empirically analyze the limitations of absolute position embeddings added to the input. For both absolute and relative position information, we show that encoding the position in the attention matrix per-head results in superior performance.
• We propose a simple and efficient way to encode position and segment information. The proposed encoding matches the SoTA methods on multiple standard NLP tasks while having a simpler model with lower training/inference costs.
• Our proposed method can be easily applied to long sequence models (DIET-ABS LIN ) and improves all metrics compared with Linformer (Wang et al., 2020).
• We present ablation studies comparing different position encoding methods and ways of sharing position encoding parameters across heads and layers in Transformer.

Position Encoding for Transformers
In this section, we briefly review the Transformer model (Vaswani et al., 2017), discuss previous improvements to position encoding, and analyze the limitations of the additive position embedding proposed in the original and widely-adopted Transformer model.

Transformer
A Transformer block consists of two types of layers: 1) Self-attention layer and 2) Feed forward layers.
Self-Attention Module Given input sequence length n, hidden size d and multi-head query-key down-projection size d_h, we define the hidden-layer input to an attention head as X ∈ R^{n×d}, the query projection matrix as W_Q^i ∈ R^{d×d_h}, the key projection matrix as W_K^i ∈ R^{d×d_h} and the value projection matrix as W_V^i ∈ R^{d×d_h}, for head i of h heads. Usually, d_h < d as we do multi-head attention with a smaller representation per head (d_h = d/h). With that we can write the dot-product attention score:

A^i = X W_Q^i (X W_K^i)^T = X W_Q^i (W_K^i)^T X^T.

This attention score is used to compute the output for each head, after scaling and per-row normalization using softmax:

head^i = softmax(A^i / √d_h) X W_V^i.

The outputs of all attention heads in a layer are concatenated and passed to the next feed-forward layer, which is applied token-wise.
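For concreteness, the computation above can be sketched in NumPy (a minimal single-layer sketch with random stand-in weights; the output projection applied after concatenation is omitted):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V):
    """X: (n, d); W_Q, W_K, W_V: (h, d, d_h). Returns concatenated heads, (n, h*d_h)."""
    d_h = W_Q.shape[-1]
    outputs = []
    for Wq, Wk, Wv in zip(W_Q, W_K, W_V):
        A = (X @ Wq) @ (X @ Wk).T            # (n, n) dot-product attention scores
        S = softmax(A / np.sqrt(d_h))        # scale and row-normalize
        outputs.append(S @ (X @ Wv))         # (n, d_h) per-head output
    return np.concatenate(outputs, axis=-1)  # heads are concatenated

rng = np.random.default_rng(0)
n, d, h = 6, 8, 2
d_h = d // h
X = rng.normal(size=(n, d))
W_Q, W_K, W_V = (rng.normal(size=(h, d, d_h)) for _ in range(3))
out = multi_head_attention(X, W_Q, W_K, W_V)  # shape (n, h * d_h) = (6, 8)
```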

Position Aware Self Attention
Many NLP tasks, such as machine translation and language modeling, are sensitive to the ordering of the input words. Since Transformers are permutation equivariant, we usually include the position information additionally in the input. Below we discuss some of the popular position encoding methods.

Absolute Position Encodings
Absolute position encodings are computed in the input layer and are summed with the input token embeddings. Vaswani et al. (2017) proposed this for Transformers and it has been a popular choice in follow-up works (Radford et al., 2018; Devlin et al., 2018). There are two common variations of absolute position encodings: fixed and learned.
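The fixed variant uses the sinusoidal construction of Vaswani et al. (2017), where each position maps to a vector of sines and cosines at geometrically spaced frequencies. A minimal sketch (function name ours):

```python
import numpy as np

def sinusoidal_position_embeddings(n, d):
    """Fixed (non-learned) absolute position embeddings of Vaswani et al. (2017):
    P[pos, 2i] = sin(pos / 10000^(2i/d)), P[pos, 2i+1] = cos(pos / 10000^(2i/d))."""
    pos = np.arange(n)[:, None]                   # (n, 1) positions
    i = np.arange(d // 2)[None, :]                # (1, d/2) frequency indices
    angles = pos / np.power(10000.0, 2 * i / d)   # (n, d/2)
    P = np.zeros((n, d))
    P[:, 0::2] = np.sin(angles)                   # even dimensions: sine
    P[:, 1::2] = np.cos(angles)                   # odd dimensions: cosine
    return P

P = sinusoidal_position_embeddings(n=16, d=8)     # each row embeds one position
```

These embeddings are then summed with the token embeddings, X + P, before the first Transformer layer.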

Relative Position Encodings
One drawback of absolute position encodings is that they require a fixed maximum input sequence length and do not directly capture relative positions between words.
To address these problems, several relative position schemes have been proposed. Shaw et al. (2018) proposed using relative position encodings instead of absolute position encodings, adding position embeddings to the key and, optionally, the value projections instead of the input. They showed that this new way of encoding position information leads to better performance on machine translation tasks. Yang et al. (2019) simplified this by removing the position embeddings in the value projections and showed better performance on language modeling tasks. Both approaches use a vector representation to encode position information.
Raffel et al. (2020) use scalars to encode the relative position between query and key indices and add them directly to the attention score matrix. They further use logarithmic binning of the position information into a fixed number of buckets. All these relative position methods additionally share the position encoding parameters, across heads or across layers.
Recently, Ke et al. (2020) hypothesised that the cross correlation between position and token embeddings can result in weaker performance of additive absolute position embeddings and instead proposed to add both absolute and relative positional information based attention directly in each head. However, such cross terms are present in the method proposed by Shaw et al. (2018), which performs competitively with other approaches. We instead hypothesise that position encodings at the input limit the rank of the positional attention matrix, leading to poor performance.

Limitations of the Input Additive Position Embedding
In this section we discuss some limitations of the de facto way of adding absolute position encodings to the input token embeddings. We first compare the representational power, in terms of the rank of the attention matrices, achievable with different position encodings.

Figure 2: Comparison of the rank of attention matrices for absolute position embeddings at the input vs. per-head (DIET-ABS). With additive positional embedding at the input, the attention matrices have much lower rank, limiting the representative power. This is alleviated by DIET-ABS.
Theorem 1. Let P ∈ R^{n×d} be the input position embedding and P̂ ∈ R^{n×d_p} be the layer-wise position embedding. Let W_Q, W_K ∈ R^{d×d_h} be the query and key projection matrices with head projection size d_h, and let d_h < d_p, d and n ≥ d_h + d_p. Let

A_a = (X + P) W_Q W_K^T (X + P)^T and A_r = X W_Q W_K^T X^T + P̂ P̂^T

be the attention matrices computed using input and layer-wise position embeddings respectively. Then for any X, P,

rank(A_a) ≤ d_h.

There exists a choice of X, P̂, W_Q, W_K such that

rank(A_r) = d_h + d_p.

Remarks. This theorem shows that the rank of the attention matrix is constrained when absolute position encodings are added at the input, whereas using per-head position encodings, by adding the position information to the attention matrix directly, allows for higher-rank attention. See §B for the proof.
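The rank gap in Theorem 1 can be checked numerically; the following sketch uses random Gaussian matrices as stand-ins for the learned embeddings and projections:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, d_h, d_p = 32, 16, 4, 8          # satisfies n >= d_h + d_p

X = rng.normal(size=(n, d))            # token embeddings
P = rng.normal(size=(n, d))            # input-additive position embeddings
P_hat = rng.normal(size=(n, d_p))      # per-head (layer-wise) position embeddings
W_Q = rng.normal(size=(d, d_h))
W_K = rng.normal(size=(d, d_h))

# Input-additive: rank is capped by the head projection size d_h.
A_a = (X + P) @ W_Q @ W_K.T @ (X + P).T
# Decoupled: the positional term contributes up to d_p extra rank.
A_r = X @ W_Q @ W_K.T @ X.T + P_hat @ P_hat.T

rank_a = np.linalg.matrix_rank(A_a)    # at most d_h = 4
rank_r = np.linalg.matrix_rank(A_r)    # can reach d_h + d_p = 12
```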
Adding the position encodings directly to the input further places a constraint on the training dynamics by forcing the gradients to be the same for the input token and position embeddings (see §B). Relative position encodings discussed earlier, while addressing some of these concerns, suffer from slower training/inference times (see Table 1) and complex implementations (Shaw et al., 2018; Ke et al., 2020). In the next section, we present simple position encoding methods that avoid these limitations.

Proposed Position and Segment Encodings
In the previous section, we discussed the limitations of the input additive positional embeddings and of existing works. Based on these observations, we propose two minimal and efficient ways to incorporate (absolute/relative) positional encodings, along with a novel absolute segment encoding approach. By decoupling position and segment from the token embeddings, we match the SoTA performance while improving training/inference time (see §3.3).

Decoupled Absolute Positional Attention
We propose the following simple absolute position encoding method, which adds the position information to the token attention matrix directly in each attention head. We also add the segment information to the token attention instead of the input embeddings. This way we can set the rank of the position encodings independently, resulting in a higher-rank attention matrix and addressing the limitations discussed earlier.

DIET-ABS
A_{i,j} = X_{i:} W_Q W_K^T (X_{j:})^T + P_{Q,i:} (P_{K,j:})^T + (E_S)_{i,j},

where P_Q, P_K ∈ R^{n×d_p} are low-rank position embedding matrices and E_S is the absolute segment attention that models interactions between segments. Please note the notation used in the above equation: A_{i,j} denotes the (i, j) entry of a matrix A, and X_{i:} and X_{:j} denote the i-th row and j-th column of X respectively. We will follow this notation in the remainder of the paper.
By default, we set d_p to the same value as d_h. This already results in a potentially rank-(d_p + d_h) attention matrix, as shown in Theorem 1. To illustrate this, we compare the rank of the attention matrices in the first layer of a baseline BERT model and a DIET-ABS model for a sampled batch in Figure 2. The figure shows that the attention matrices of DIET-ABS have higher ranks than those of the baseline BERT. Our detailed experimental results in §4 also show that DIET-ABS performs noticeably better. This confirms our earlier observation in Theorem 1 that additive position embeddings at the input can constrain the model, and that adding the position embeddings per-head removes this constraint and results in better performance.
With the decoupled positional embedding, we can increase d_p to any width k to break the low-rank bottleneck shown in Theorem 1. We call such a model DIET-ABS-Rank-k. We also address the efficiency issue introduced by the one additional matrix multiplication (P_Q P_K^T). As the positional embeddings are independent of the input, we only need to compute this matrix product once per training batch, and we can cache the computed matrix before running inference. As a result, we observe a negligible increase in training and inference cost in this model variant.
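The caching argument can be sketched as follows (a hypothetical single-head class; names and initialization are ours, not from a released implementation):

```python
import numpy as np

class DecoupledAbsolutePositionalAttention:
    """Sketch of a DIET-ABS-style attention score for one head: the positional
    term P_Q @ P_K.T is input-independent, so it is computed once and cached,
    then reused for every batch and at inference time."""

    def __init__(self, n, d, d_h, d_p, seed=0):
        rng = np.random.default_rng(seed)
        self.W_Q = rng.normal(size=(d, d_h)) * 0.02
        self.W_K = rng.normal(size=(d, d_h)) * 0.02
        P_Q = rng.normal(size=(n, d_p)) * 0.02
        P_K = rng.normal(size=(n, d_p)) * 0.02
        self.pos_scores = P_Q @ P_K.T      # cached (n, n) positional attention

    def scores(self, X):
        """X: (n, d). Only the token term depends on the input."""
        token_scores = (X @ self.W_Q) @ (X @ self.W_K).T
        return token_scores + self.pos_scores

attn = DecoupledAbsolutePositionalAttention(n=8, d=16, d_h=4, d_p=4)
S0 = attn.scores(np.zeros((8, 16)))  # zero input isolates the cached term
```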

Decoupled Relative Positional Attention
To incorporate relative position inductive bias, we consider a simplified version of the position encoding proposed in T5 (Raffel et al., 2020) without log-binning and per-layer parameter sharing. We further also incorporate our per-head segment encoding as in DIET-ABS. The model can be written as:

DIET-REL
A_{i,j} = X_{i:} W_Q W_K^T (X_{j:})^T + r_{i-j} + (E_S)_{i,j},

where r_{i-j} is a learned scalar for each relative offset i − j. We show an example of this model with two segments in Figure 3.
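A T5-style relative scalar term can be materialized as an (n, n) additive bias from a vector of 2n − 1 learned scalars, one per offset; a sketch (helper name ours):

```python
import numpy as np

def relative_scalar_bias(n, r):
    """Expand a vector r of 2n - 1 scalars (one per relative offset j - i in
    [-(n-1), n-1]) into an (n, n) additive attention bias, as in the
    simplified T5-style scheme without log-binning."""
    idx = np.arange(n)
    offsets = idx[None, :] - idx[:, None] + (n - 1)  # map offsets to [0, 2n-2]
    return r[offsets]

r = np.arange(7.0)                  # toy "learned" scalars for n = 4
B = relative_scalar_bias(4, r)      # B[i, j] = r[j - i + 3]
```

The bias B is simply added to the token attention scores; every diagonal of B is constant, which is exactly the relative-position structure.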

Training and Inference Costs
We next show that the proposed models introduce little computational overhead compared to the baseline model, making our model more practical than the alternatives. We consider two different models: the BERT BASE model and a smaller model, BERT SMALL, which has hidden size 512, 4 layers and 8 attention heads.
In Table 1 we report training and inference step times for the different position encoding approaches. We notice that the simplicity of the proposed methods indeed translates to savings in both training and inference times compared to other position encoding approaches. The savings in step times are even more significant for smaller models (BERT SMALL) and during inference.
Note that the discrepancy between training and inference speed-ups is likely because gradient updates dominate the cost at training time (Lan et al., 2020). Inference involves only the forward pass, which corresponds to the cost of using such models in real systems.

Application to Long-range Transformers
Another advantage of our proposed approaches is that they easily extend to long-range Transformer models. For long sequence inputs, Transformers suffer from a quadratic dependence of computational complexity on the sequence length. A class of methods reduces this complexity by using a low-rank projection of the input sequence for the attention computation (Wang et al., 2020; Choromanski et al., 2021; Dai et al., 2020). However, such methods use the default input position encodings, and there has not been much work on incorporating position information per-head without reintroducing the quadratic dependence on the input sequence length. We illustrate the applicability of our methods to such settings by applying DIET-ABS to Linformer (Wang et al., 2020), which projects the attention key and value matrices to a lower dimension k during the attention computation.
DIET-ABS LIN The proposed method can be written as:

A_{i,j} = X_{i:} W_Q W_K^T ((E X)_{j:})^T + P_{Q,i:} (P_{K,j:})^T,

where E ∈ R^{k×n}, P_Q ∈ R^{n×d}, P_K ∈ R^{k×d}.
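A shape-level sketch of this computation (using the dimensions stated above; the scaling, softmax and value path are omitted, and the function name is ours):

```python
import numpy as np

def diet_abs_lin_scores(X, W_Q, W_K, E, P_Q, P_K):
    """Sketch of attention scores for DIET-ABS applied to a Linformer-style
    layer. Keys are down-projected to length k by E, and the positional term
    P_Q @ P_K.T is also (n, k), so no (n, n) matrix is ever formed.
    Shapes: X (n, d); W_Q, W_K (d, d_h); E (k, n); P_Q (n, d); P_K (k, d)."""
    token_scores = (X @ W_Q) @ (E @ X @ W_K).T  # (n, k)
    pos_scores = P_Q @ P_K.T                    # (n, k); input-independent, cacheable
    return token_scores + pos_scores

rng = np.random.default_rng(0)
n, d, d_h, k = 16, 8, 4, 4
S = diet_abs_lin_scores(
    X=rng.normal(size=(n, d)),
    W_Q=rng.normal(size=(d, d_h)),
    W_K=rng.normal(size=(d, d_h)),
    E=rng.normal(size=(k, n)),
    P_Q=rng.normal(size=(n, d)),
    P_K=rng.normal(size=(k, d)),
)  # S has shape (n, k)
```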

Experiments
In this section, we present our experimental results comparing the different position and segment encoding approaches discussed in earlier sections. We conduct experiments in three different settings to cover a wide range of use cases. First, we examine the results of a popular transfer learning approach, from masked-LM pretraining to the end tasks in GLUE (Devlin et al., 2018). Second, we study the zero-shot cross-lingual transferability of multilingual pretrained models (Hu et al., 2020) to classification and question answering tasks in the XTREME benchmark (Hu et al., 2020). Lastly, we consider training Transformer models from scratch for machine translation. We compare the following positional encoding approaches: absolute positional embedding (Devlin et al., 2018), relative positional embedding (Shaw et al., 2018), combined absolute and relative positional encoding (Ke et al., 2020), the relative scalar approach (Raffel et al., 2020), and our proposed DIET-ABS and DIET-REL per-head positional encoding approaches. We denote methods that add position/segment information directly to the input token embeddings with input, and methods that add position/segment information directly in the attention layer with per-head. For the complete experimental setup, see Appendix A.

English Transfer Learning Results
Datasets and Model For pre-training, we use the English Wikipedia and Books datasets (Devlin et al., 2018). For fine-tuning tasks we use the datasets from the GLUE benchmark (Wang et al., 2019). We apply sub-word tokenization on the raw text data using WordPiece (Wu et al., 2016).

Table 2: GLUE: Results on the GLUE dev set of the fine-tuned models based on a pre-trained model with a 12-layer BERT BASE architecture. We report the median of the maximum accuracy over all checkpoints among five runs. We notice that the shared DIET-ABS with rank 128 performs competitively with existing relative positional embedding SoTA models without the inductive bias of relative positions. The proposed method also improves performance in the low-rank long-range transformer setting of Linformer (Wang et al., 2020), where relative positional embedding approaches are inefficient to use.

Table 3: XTREME: Fine-tuning of the cross-lingual model on the English training set (cross-lingual transfer). Performance is measured by accuracy for classification, and F1 score / exact match for question answering. In agreement with the results in Table 2, we see that using per-head position encodings is strictly better than absolute position encodings at the input. With layer-wise sharing, DIET-ABS with rank 128 outperforms all SoTA models.

Table 4: Machine Translation: We report results comparing different position encoding methods for Transformers on the machine translation tasks en-de, de-en, en-cs and cs-en from the Newstest 2018 dataset. We notice that all per-head position encoding schemes (all except the absolute input encoding of Vaswani et al. (2017)) do better than the absolute position embeddings added at the input. Further, the proposed simple DIET-REL approach is competitive with other position encoding approaches.

Results
We examine how different ways of encoding position and segment affect the transfer learning ability of the pre-trained English BERT models by fine-tuning on the GLUE benchmark (Wang et al., 2019), and present the results in Table 2. We first notice that all the approaches that encode position features explicitly at the per-head level perform better than the baseline additive position encodings at the input (Devlin et al., 2018). All models incorporating relative positions (Shaw et al., 2018; Raffel et al., 2020; Ke et al., 2020), despite their modeling differences, have very similar average scores. We show further gains (84.9 to 85.2 for DIET-REL) by moving the segment features to per-head.
Interestingly, we notice that the proposed absolute position encoding method DIET-ABS, with layer-wise sharing, is on par with all previous SoTA relative positional encodings. This shows that even absolute position encodings can perform better when included per-head instead of at the input. We present a detailed ablation study varying the rank and sharing methods of absolute positional attention (DIET-ABS) in Tables 8 and 9 in Appendix D. For long-range input, we consider Linformer (Wang et al., 2020) with a projection dimension of 32. Due to the down-projection, we see a non-trivial performance drop compared to a standard Transformer. Even in this setting, our absolute positional attention DIET-ABS can be used to improve the model's performance.

Cross-lingual Model Results
Datasets and Model For our multilingual experiments, we pre-train the models on the Wikipedia corpus in 100 languages, similar to Lample and Conneau (2019), for 125K steps with a sequence length of 512, and then fine-tune on downstream XTREME tasks (Hu et al., 2020). We use a language-independent tokenizer, the SentencePiece (Kudo and Richardson, 2018) model, with a 120,000-token vocabulary to encode the input text.
Classification We conduct 5 trials of fine-tuning for each model on the MultiNLI (Williams et al., 2018) training data, then perform zero-shot predictions on XNLI (Conneau et al., 2018), reporting the median accuracy.

Results
We present our results on the classification and question answering fine-tuning tasks in XTREME for the different position and segment encoding methods in Table 3. Again, all per-head position encoding methods outperform input additive position encodings. Interestingly, our simple DIET-ABS turns out to be the best model, better than the other models using relative position features. Layer-wise sharing and per-head segment attention allow DIET-ABS to outperform DIET-REL. We present a detailed ablation study in Table 5 to understand the effect of decoupled positional attention variants. Finally, we notice similar advantages in using DIET-ABS with the Linformer (Wang et al., 2020) model in the long-range setting.

Translation Results
Datasets and Model For the machine translation task we consider two language pairs (both directions) for training: WMT 2018 English-to-German (en-de), German-to-English (de-en), English-to-Czech (en-cs) and Czech-to-English (cs-en) (Bojar et al., 2018). We test the corresponding models on the Newstest 2018 datasets and report the BLEU score output by SacreBLEU (Post, 2018) with default settings. Our setup follows Vaswani et al. (2017) closely and uses their Tensor2Tensor framework (Vaswani et al., 2018). Following Vaswani et al. (2017), we use a 6-layer Transformer with an encoder-decoder architecture. For more details of our experimental setup please see Appendix A.

Results
We report the BLEU scores of the models in Table 4. We observe that moving the positional information from the input to the per-head attention layer improves BLEU scores. Different variations of per-head positional attention do not make much difference, with DIET-REL being competitive with Shaw et al. (2018).

Ablation Study
In this section, we share our findings of key factors that affect performance of decoupled positional attention.
Sharing the Positional Encoding Previous works (Raffel et al., 2020; Ke et al., 2020; Shaw et al., 2018) used different sharing methods for the positional encodings to reduce the number of model parameters. We present a detailed study of different forms of sharing positional encodings and their effect on performance. In particular, we compare the following variations in sharing the position encoding parameters across the different heads and layers of the Transformer.
• head-wise -Same parameters are used for all heads in a layer, with different layers using different parameters (Shaw et al., 2018;Ke et al., 2020).
• layer-wise -Sharing of position encoding parameters across layers with different parameters for each head (Raffel et al., 2020).
• none -Every layer and head uses different position encoding parameters.
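To make the parameter accounting concrete, the following sketch counts only the per-offset scalar parameters of a DIET-REL-style encoding under each sharing scheme (an illustration with assumed shapes; the totals in Table 6 include other model parameters):

```python
def num_relative_params(n, L, h, sharing):
    """Count scalar relative-position parameters for sequence length n,
    L layers, h heads, one scalar per relative offset (2n - 1 offsets)."""
    offsets = 2 * n - 1
    if sharing == "head-wise":    # shared across heads, distinct per layer
        return L * offsets
    if sharing == "layer-wise":   # shared across layers, distinct per head
        return h * offsets
    if sharing == "none":         # distinct per layer and per head
        return L * h * offsets
    raise ValueError(f"unknown sharing scheme: {sharing}")
```

For a BERT BASE-like configuration (n = 512, L = 12, h = 12), "none" costs L × h ≈ 144 copies of the offset vector versus 12 copies under either sharing scheme, which is why sharing saves relatively little in absolute terms: even the unshared count is small next to the embedding and feed-forward parameters.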
We present results comparing the different sharing methods in Table 5 for the XTREME tasks. We make the following observations: 1) head-wise sharing is consistently worse than layer-wise sharing; 2) sharing hurts the performance of DIET-REL, whereas it improves the performance of DIET-ABS. We summarize the key settings along with the number of model parameters in Table 6. For DIET-REL, sharing brings little saving in parameters and hurts performance. Hence, we recommend no sharing for relative positional encodings (DIET-REL). On the other hand, it is necessary to share parameters for DIET-ABS in order to keep the number of parameters low. Interestingly, sharing has a regularization effect on DIET-ABS, making the model perform better. We choose layer-wise sharing over head-wise sharing for its better performance.

Table 6: Model Parameters: We list the number of model parameters and performance for different position encoding approaches. We observe that sharing hurts the performance of DIET-REL with negligible benefit in the number of parameters. On the contrary, the regularization effect of sharing makes DIET-ABS more stable, with fewer parameters, while achieving competitive performance.
Segment Encoding Our novel segment encoding design further improves the model performance, as shown in Table 5. Both relative and absolute decoupled positional attention models benefit from moving the segment encoding from the input to per-head: DIET-REL (+0.4%), layer-wise shared DIET-REL (+0.1%), DIET-ABS (+0.2%), layer-wise shared DIET-ABS (+0.1%). See Appendix D for the results on the GLUE benchmark and Appendix C for the segment attention visualization.

Rank of Absolute Positional Attention
The design of DIET-ABS allows the model to learn higher-rank attention matrices, as shown in Theorem 1. To understand the effect of the absolute positional attention rank (d_p) in practice, we conduct experiments varying the rank from d_p = 64 to d_p = 512. We present the results in Table 5. We notice that the performance improves as we increase the rank from 64 to 128. However, performance saturates when further increasing it to 512. We present a visualization of the rank of the positional attention matrix in Appendix C.

Positional Attention Pattern Visualization
We next visualize the learned positional attention patterns of DIET-ABS in Figure 4. We first note that DIET-ABS has learned to capture the relative positional relations between inputs. Also note that, for index zero (the [CLS] token), decoupled absolute positional attention usually learns a special pattern (cf. Wang and Chen (2020)), which may not generalize across tasks.

Conclusion
In this paper we theoretically and empirically examined the limitations of additive position embeddings at the input and showed that having per-head position embeddings results in better performance. We argued that the superior performance of some relative position encoding methods comes from their per-head addition to the attention matrix, rather than from the position information being relative vs. absolute. Indeed, we showed that using absolute position encodings per-head results in better performance. Motivated by this, we proposed a simple per-head position and segment attention method that achieves state-of-the-art performance on multiple NLP tasks and is more computationally efficient than existing approaches.

A Experimental setup
In this section we present more details of our experimental setup.
Pre-training We pre-train the models using a masked LM task (Devlin et al., 2018) and do not use the Next Sentence Prediction (NSP) loss as suggested in RoBERTa (Liu et al., 2019). Each input is constructed with full sentences from documents, and packed up to the maximum sequence length. We use the same architecture as BERT BASE (Devlin et al., 2018) (L = 12, H = 768, A = 12) for our experiments.
Fine-tuning Some downstream tasks provide different groups of full sentences as inputs. For those tasks (e.g. MNLI, CoLA, XNLI, SQuAD), we fine-tune models with the supplemental segment encoding discussed in §3. We leave the models for other tasks unchanged from their pre-trained counterparts.
Hyper-parameters Hyper-parameters we use are presented in Table 7.

B Proofs
Proof of Theorem 1. The first claim follows by observing that the rank of a product of matrices is upper bounded by the minimum of the individual ranks:

rank(A_a) = rank((X + P) W_Q W_K^T (X + P)^T) ≤ min(rank(X + P), rank(W_Q), rank(W_K)) ≤ d_h.

The last inequality follows from rank(W_Q) ≤ d_h, since W_Q ∈ R^{d×d_h}.

To prove the second claim we follow a construction approach. Let us first take W_Q = W_K to be the same matrix, with the first d_h rows being the identity matrix and the remaining d − d_h rows being all zeros. Choose d = n > d_h and let X = I. Then

X W_Q W_K^T X^T = diag(I_{d_h}, 0_{n−d_h}).

Now choosing P̂ ∈ R^{n×d_p} with zeros in the first n − d_p rows and the identity in the last d_p rows gives

P̂ P̂^T = diag(0_{n−d_p}, I_{d_p}).

Combining these two gives us

A_r = X W_Q W_K^T X^T + P̂ P̂^T = diag(I_{d_h}, 0_{n−d_h−d_p}, I_{d_p}),

which has rank d_h + d_p since n ≥ d_h + d_p.

Let X ∈ R^{n×d} be the input word embeddings in dimension d with sequence length n. We have trainable position embeddings P ∈ R^{n×d}, which are added to the input sequence before feeding it into the model g. For a given input X and label y, the objective with a loss function ℓ is as follows:

L(X, P) = ℓ(g(X + P), y).    (5)

Theorem 2. Let X and P be trainable embedding matrices in R^{n×d}. Then the gradients of the loss function in equation (5), at any point (X, y), and for any differentiable functions ℓ and g, are the same for X and P.
Remarks. This theorem shows that the gradients are the same for the input token embeddings and the position embeddings. While in standard NLP tasks the inputs X differ between steps, since different input tokens are present in each mini-batch, the result still suggests that additive position embeddings can limit the model from learning the relative importance of position encodings with respect to token embeddings based on the training task at hand.
Proof of Theorem 2. The theorem follows by computing the gradients and showing they are equal.

The gradients of the objective in equation (5) w.r.t. X and P are as follows:

∇_X L = ∇_Z ℓ(g(Z), y) |_{Z=X+P},  ∇_P L = ∇_Z ℓ(g(Z), y) |_{Z=X+P}.

The above computation of the gradients follows from the chain rule: L depends on X and P only through the sum Z = X + P, and ∂Z/∂X = ∂Z/∂P = I. This shows that the gradients of L w.r.t. X and P are the same.
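Theorem 2 is easy to confirm numerically with finite differences on a toy differentiable loss standing in for ℓ(g(X + P), y):

```python
import numpy as np

# Toy stand-in for l(g(X + P), y): any function of X + P works,
# since the theorem only uses that the loss depends on the sum.
def loss(X, P):
    return np.sum(np.tanh(X + P) ** 2)

def finite_diff_grad(f, M, eps=1e-6):
    """Central-difference gradient of f() w.r.t. the array M (mutated in place)."""
    G = np.zeros_like(M)
    for i in np.ndindex(M.shape):
        M[i] += eps
        hi = f()
        M[i] -= 2 * eps
        lo = f()
        M[i] += eps            # restore the original entry
        G[i] = (hi - lo) / (2 * eps)
    return G

rng = np.random.default_rng(0)
X, P = rng.normal(size=(3, 4)), rng.normal(size=(3, 4))
gX = finite_diff_grad(lambda: loss(X, P), X)   # gradient w.r.t. X
gP = finite_diff_grad(lambda: loss(X, P), P)   # gradient w.r.t. P
assert np.allclose(gX, gP, atol=1e-5)          # the two gradients coincide
```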

C Attention Visualization
In this section, we examine the model internals to understand how the proposed model works. We first visualize the model internals of different modeling alternatives to argue our proposed model is sensible.
Why We Remove the Input Embedding To understand whether it is sensible to remove the input additive embedding after adding position scalars per-head, we add an additive position embedding to our DIET-ABS model. Then, we examine the position embeddings of the BERT model and of our DIET-ABS variant with additive position embedding. Figure 5 shows that, when the model has both absolute scalar and additive absolute position embeddings, the position embedding encodes almost no information: all position embeddings at the input are similar.

The Effect of Segment Attention We also examine the effect of adding segment attention on top of the position attention. Figure 6 shows some representative patterns. We observe that segment attention enables the model to attend more to the parts of the sequence that belong to certain segments.
Figure 6: We consider an input of length 32 with two segments; the second segment starts at index 16. We observe the attention patterns in the DIET-REL model without token-to-token attention: (a) attending to the second segment; (b) down-weighting the relative position attention.
Shifting Pattern Learned from Absolute Positional Attention Using relative position encodings generally gives better results, though the improvement is smaller than that obtained from moving the feature encoding per-head.
To understand this, we visualize the attention patterns of the absolute positional attention and find two representative patterns of DIET-ABS in Figure 7. We observe that, even given absolute position features, the model learns a "shifting pattern" for the most part. In contrast to Wang and Chen (2020), who claimed that absolute position encodings only learn local patterns, we show that the position attention can actually attend to longer context. However, the shifting pattern can be modeled directly by relative positions. Thus, DIET-REL can be a better model choice, with fewer parameters and a more accurate inductive bias, in some applications.

Rank of Positional Attention Matrices
In Figure 8, we present a comparison of the rank of the position attention matrices for a BERT BASE model with absolute position embeddings at the input (P W_Q W_K^T P^T) vs. absolute position embeddings per-head (DIET-ABS, P_Q P_K^T, where P_Q, P_K ∈ R^{n×d_p}). With additive positional embeddings at the input, the position attention matrices have much lower rank, limiting the representative power. This is alleviated by DIET-ABS.

D Additional Ablation Study on GLUE
Earlier we presented an ablation study on XTREME in Table 5 for the decoupled positional attention variants, comparing DIET-REL and DIET-ABS against the baseline (Devlin et al., 2018). We now present a similar study on the GLUE benchmark in Table 8 and observe similar results.

Segment Encoding In Table 8, we again observe that moving the segment encoding from the input to per-head improves performance for both DIET-REL and DIET-ABS.
Sharing Strategies Sharing plays an important role for DIET-ABS. In Table 9, we find that sharing degrades the performance of DIET-REL (-0.2% layer-wise, -0.3% head-wise). For DIET-ABS, sharing makes the model more stable and able to compete with DIET-REL.

Table 9: Sharing ablation study on GLUE: We run an ablation study to understand the effect of sharing position encoding parameters across layers and heads. We notice that sharing improves the performance of DIET-ABS, but hurts the performance of DIET-REL with both layer-wise and head-wise sharing.