The Case for Translation-Invariant Self-Attention in Transformer-Based Language Models

Mechanisms for encoding positional information are central to transformer-based language models. In this paper, we analyze the position embeddings of existing language models, finding strong evidence of translation invariance, both for the embeddings themselves and for their effect on self-attention. The degree of translation invariance increases during training and correlates positively with model performance. Our findings lead us to propose translation-invariant self-attention (TISA), which accounts for the relative position between tokens in an interpretable fashion without needing conventional position embeddings. Our proposal has several theoretical advantages over existing position-representation approaches. Proof-of-concept experiments show that it improves on regular ALBERT on GLUE tasks, while adding orders of magnitude fewer positional parameters than conventional position embeddings.


Introduction
The recent introduction of transformer-based language models by Vaswani et al. (2017) has set new benchmarks in language processing tasks such as machine translation (Lample et al., 2018; Gu et al., 2018; Edunov et al., 2018), question answering (Yamada et al., 2020), and information extraction (Wadden et al., 2019; Lin et al., 2020). However, because of the non-sequential and position-independent nature of the internal components of transformers, additional mechanisms are needed to enable models to take word order into account. Liu et al. (2020) identified three important criteria for ideal position encoding: approaches should be inductive, meaning that they can handle sequences and linguistic dependencies of arbitrary length; data-driven, meaning that positional dependencies are learned from data; and efficient in terms of the number of trainable parameters. Separately, Shaw et al. (2018) argued for translation-invariant positional dependencies that depend on the relative distances between words rather than their absolute positions in the current text fragment. It is also important that approaches be parallelizable, and ideally also interpretable. Unfortunately, none of the existing approaches for modeling positional dependencies satisfy all these criteria, as shown in Table 1 and in Sec. 2. This is true even for recent years' state-of-the-art models such as BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), ALBERT (Lan et al., 2020), and ELECTRA (Clark et al., 2020), which require many positional parameters but still cannot handle arbitrary-length sequences.
This paper makes two main contributions: First, in Sec. 3, we analyze the learned position embeddings in major transformer-based language models. Second, in Sec. 4, we leverage our findings to propose a new positional-dependence mechanism that satisfies all desiderata enumerated above. Experiments verify that this mechanism can be used alongside conventional position embeddings to improve downstream performance. Our code is available.

Background
Transformer-based language models (Vaswani et al., 2017) have significantly improved modeling accuracy over previous state-of-the-art models like ELMo (Peters et al., 2018). However, the non-sequential nature of transformers created a need for other mechanisms to inject positional information into the architecture. This is now an area of active research, which the rest of this section will review.
The original paper by Vaswani et al. (2017) proposed summing each token embedding with a position embedding and using the resulting embedding as the input to the first layer of the model. BERT (Devlin et al., 2019) achieved improved performance by training data-driven d-dimensional embeddings for each position in text snippets of at most n tokens. A family of models has tweaked the BERT recipe to improve performance, including RoBERTa (Liu et al., 2019) and ALBERT (Lan et al., 2020), where the latter has layers share the same parameters to achieve a more compact model. All these recent data-driven approaches are restricted to a fixed maximum sequence length of n tokens (typically n = 512). Longformer (Beltagy et al., 2020) showed modeling improvements by increasing n to 4096, suggesting that the cap on sequence length limits performance. However, the Longformer approach also increased the number of positional parameters 8-fold, as the number of parameters scales linearly with n; cf. Table 2.
Clark et al. (2019) and Htut et al. (2019) analyzed BERT attention, finding some attention heads to be strongly biased to local context, such as the previous or the next token. Wang and Chen (2020) found that even simple concepts such as word order and relative distance can be hard to extract from absolute position embeddings. Shaw et al. (2018) independently proposed using relative position embeddings that depend on the signed distance between words instead of their absolute position, making local attention easier to learn. They reached improved BLEU scores in machine translation, but their approach (and the refinements by Huang et al. (2019)) is hard to parallelize, which is unattractive in a world driven by parallel computing. Zeng et al. (2020) used relative attention in speech synthesis, letting each query interact with separate matrix transformations for each key vector, depending on their relative-distance offset. Raffel et al. (2020) directly model position-to-position interactions by splitting relative-distance offsets into q bins. These relative-attention approaches all facilitate processing sequences of arbitrary length, but can only resolve linguistic dependencies up to a fixed predefined maximum distance.
Tay et al. (2020) directly predicted both word and position contributions to the attention matrix without depending on token-to-token interactions. However, the approach is not inductive, as the size of the attention matrix is a fixed hyperparameter.
Liu et al. (2020) used sinusoidal functions with learnable parameters as position embeddings. They obtain compact yet flexible models, but use a neural ODE, which is computationally unappealing. Ke et al. (2021) showed that self-attention works better if word and position embeddings are untied to reside in separate vector spaces, but their proposal is neither inductive nor parameter-efficient. Su et al. (2021) propose rotating each embedding in the self-attention mechanism based on its absolute position, thereby inducing translational invariance, as the inner product of two vectors is conserved under rotations of the coordinate system. These rotations are, however, not learned.
The different position-representation approaches are summarized in Table 1. None of them satisfy all design criteria. In this article, we analyze the position embeddings in transformer models, leading us to propose a new positional-scoring mechanism that combines all desirable properties (final row).

Analysis of Existing Language Models
In this section, we inspect selected high-profile language models to gain insight into how they have learned to account for the effect of position.

Analysis of Learned Position Embeddings
First, we stack the position embeddings in the matrix $E_P \in \mathbb{R}^{n \times d}$ and inspect the symmetric matrix $P = E_P E_P^T \in \mathbb{R}^{n \times n}$, where $P_{i,j}$ represents the inner product between the $i$th and $j$th embedding vectors. If inner products are translation invariant, $P_{i,j}$ will only depend on the difference between the indices, $j - i$, giving a Toeplitz matrix, i.e., a matrix where each diagonal is constant.
Figure 1: Heatmaps visualizing the matrix $P = E_P E_P^T$ of position-embedding inner products for different models. The greater the inner product between the embeddings, the brighter the color. See appendix Figs. 4, 5 for more.
Fig. 1 visualizes the $P$-matrices for the position embeddings in a number of prominent transformer models, listed from oldest to newest, which is also the order of increasing performance. We note that a clear Toeplitz structure emerges from left to right. Translation invariance is also seen when plotting position-embedding cosine similarities, as done by Wang and Chen (2020) for transformer-based language models and by Dosovitskiy et al.
In Fig. 2 we further study how the degree of Toeplitzness (quantified by $R^2$, the fraction of the variance among matrix elements $P_{i,j}$ explained by the best-fitting Toeplitz matrix) changes for different ALBERT models. With longer training time (i.e., going from ALBERT v1 to v2), Toeplitzness increases, as the arrows show. This is associated with improved mean dev-set score. Such evolution is also observed in Wang and Chen (2020, Fig. 8).
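The Toeplitzness measure described above can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' released code: the function names are hypothetical, the embeddings are random stand-ins for a real model's position embeddings, and the best-fitting Toeplitz matrix is obtained by averaging each diagonal, which is the least-squares solution for a matrix that must be constant along its diagonals.

```python
import numpy as np

def best_toeplitz(A):
    """Least-squares Toeplitz fit: replace every diagonal of A by its mean."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    offsets = np.arange(n)[None, :] - np.arange(n)[:, None]  # entry (i, j) holds j - i
    T = np.empty_like(A)
    for k in range(-(n - 1), n):
        mask = offsets == k
        T[mask] = A[mask].mean()
    return T

def toeplitz_r2(A):
    """Fraction of the variance among the entries of A explained by the Toeplitz fit."""
    A = np.asarray(A, dtype=float)
    rss = np.sum((A - best_toeplitz(A)) ** 2)  # residual sum of squares
    tss = np.sum((A - A.mean()) ** 2)          # total sum of squares
    return 1.0 - rss / tss

# Random stand-in for learned position embeddings (illustrative only).
rng = np.random.default_rng(0)
E_P = rng.normal(size=(16, 8))
P = E_P @ E_P.T
print(toeplitz_r2(P))
```

For a real model, `E_P` would be the learned position-embedding matrix; a value close to 1 indicates that the inner products depend almost only on the offset between positions.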

Translation Invariance in Self-Attention
Next, we analyze how this translation invariance is reflected in self-attention. Recall that the self-attention of Vaswani et al. (2017) can be written as
$$\mathrm{att}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V, \tag{1}$$
and define position embeddings $E_P$, word embeddings $E_W$, and query and key transformation weight matrices $W_Q$ and $W_K$. By taking
$$QK^T = (E_W + E_P)\, W_Q W_K^T\, (E_W + E_P)^T \tag{2}$$
and replacing each row of $E_W$ by the average word embedding across the entire vocabulary, we obtain a matrix we call $F_P$ that quantifies the average effect of $E_P$ on the softmax in Eq. (1). Plots of the resulting $F_P$ for all 12 ALBERT-base attention heads in the first layer are in appendix Fig. 8. Importantly, these matrices also exhibit Toeplitz structure.
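The extraction step just described can be illustrated as follows. This is a hedged sketch with random toy matrices rather than real model weights; the function name and all sizes are hypothetical, and the product is left unscaled (dividing by $\sqrt{d_k}$ would give the argument of the softmax).

```python
import numpy as np

def average_positional_logits(E_P, E_W_vocab, W_Q, W_K):
    """Query-key products when every token is replaced by the average word embedding."""
    mean_word = E_W_vocab.mean(axis=0, keepdims=True)  # shape (1, d)
    X = mean_word + E_P                                # shape (n, d); the mean is broadcast to every row
    return (X @ W_Q) @ (X @ W_K).T                     # shape (n, n)

# Toy sizes and random weights (assumptions for illustration only).
rng = np.random.default_rng(0)
n, d, d_k, vocab_size = 6, 8, 4, 50
E_P = rng.normal(size=(n, d))
E_W_vocab = rng.normal(size=(vocab_size, d))
W_Q = rng.normal(size=(d, d_k))
W_K = rng.normal(size=(d, d_k))

F_hat = average_positional_logits(E_P, E_W_vocab, W_Q, W_K)
print(F_hat.shape)  # (6, 6)
```

Because the word content is the same in every row, any variation across the entries of the result comes from the position embeddings alone.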

Proposed Self-Attention Mechanism
We now introduce our proposal for parameterizing the positional contribution to self-attention in an efficient and translation-invariant manner, optionally removing the position embeddings entirely.

Leveraging Translation Invariance for Improved Inductive Bias
Our starting point is the derivation of Ke et al. (2021). They expand $QK^T$ while ignoring cross terms, yielding
$$QK^T \approx E_W W_Q W_K^T E_W^T + E_P W_Q W_K^T E_P^T, \tag{3}$$
an approximation they support by theory and empirical evidence. They then "untie" the effects of words and positions by using different $W$-matrices for the two terms in Eq. (3). We agree with separating these effects, but also see a chance to reduce the number of parameters. Concretely, we propose to add a second term $F_P \in \mathbb{R}^{n \times n}$, a Toeplitz matrix, inside the parentheses of Eq. (1). $F_P$ can either a) supplement or b) replace the effect of position embeddings on attention in our proposed model. For case a), we simply add $F_P$ to the existing expression inside the softmax, while for case b), a term $\sqrt{d_k} F_P$ is inserted in place of the term $E_P W_Q W_K^T E_P^T$ in Eq. (3). This produces two new self-attention equations:
$$\mathrm{att} = \begin{cases} \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}} + F_P\right) V & \text{a)} \\ \mathrm{softmax}\left(\frac{Q_W K_W^T}{\sqrt{d_k}} + F_P\right) V_W & \text{b)} \end{cases} \tag{4}$$
where the inputs $Q_W$, $K_W$, and $V_W$ (defined by $Q_W = E_W W_Q$, and similarly for $K_W$ and $V_W$) do not depend on the position embeddings $E_P$. Case a) is not as interpretable as TISA alone (case b), since the resulting models have two terms, $E_P$ and $F_P$, that share the task of modeling positional information. Our two proposals apply to any sequence model with a self-attention that follows Eq. (1), where the criteria in Table 1 are desirable.
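The two variants described above, adding the Toeplitz term alongside position embeddings or using it with word embeddings alone, can be sketched for a single attention head as follows. This is an illustrative sketch, not the authors' implementation: the function names, the toy sizes, the random weights, and the hand-made bias matrix are all assumptions.

```python
import numpy as np

def softmax(x):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention_with_positional_bias(X, W_Q, W_K, W_V, F_P):
    """Single-head self-attention with an additive positional term inside the softmax."""
    d_k = W_Q.shape[1]
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    return softmax(Q @ K.T / np.sqrt(d_k) + F_P) @ V

# Toy sizes and random weights (hypothetical, for illustration only).
rng = np.random.default_rng(0)
n, d, d_k = 5, 8, 4
E_W = rng.normal(size=(n, d))  # word embeddings of the tokens in the sequence
E_P = rng.normal(size=(n, d))  # position embeddings
W_Q, W_K, W_V = (rng.normal(size=(d, d_k)) for _ in range(3))

# A hand-made Toeplitz bias: entry (i, j) depends only on j - i.
idx = np.arange(n)
F_P = -0.5 * np.abs(idx[None, :] - idx[:, None])

out_a = attention_with_positional_bias(E_W + E_P, W_Q, W_K, W_V, F_P)  # case a)
out_b = attention_with_positional_bias(E_W, W_Q, W_K, W_V, F_P)        # case b)
```

In case b) the only source of positional information is the additive term, so its rows can be read directly as how strongly each relative offset is favored.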

Positional Scoring Function
Next, we propose to parameterize the Toeplitz matrix $F_P$ using a positional scoring function $f_\theta(\cdot)$ on the integers $\mathbb{Z}$, such that $(F_P)_{i,j} = f_\theta(j - i)$. $f_\theta$ defines $F_P$-matrices of any size $n$. The value of $f_\theta(j - i)$ directly models the positional contribution for how the token at position $i$ attends to position $j$. We call this translation-invariant self-attention, or TISA. TISA is inductive and can be simplified down to arbitrarily few trainable parameters. Let $k = j - i$. Based on our findings for $F_P$ in Sec. 3, we seek a parametric family $\{f_\theta\}$ that allows both localized and global attention, without diverging as $|k| \to \infty$. We here study one such family, given in Eq. (5), whose members are built from $S$ kernels with three trainable parameters each. With one scoring function for each of the $H$ heads in each of the $L$ layers, the total number of positional parameters of TISA is then $3SHL$. As seen in Table 2, this is several orders of magnitude fewer than the embeddings in prominent language models.
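To make the construction concrete, the sketch below builds Toeplitz matrices of different sizes from one scoring function and counts parameters. The kernel shape is an assumption made for illustration: each kernel is taken to be a Gaussian-shaped bump with an amplitude `a`, a width `b`, and a center `c`, which gives three parameters per kernel and decays to zero for large offsets. The function names and parameter values are hypothetical, and the count assumes $H = 12$ heads and $L = 12$ layers as in ALBERT base.

```python
import numpy as np

def scoring_function(k, a, b, c):
    """Sum of S bump-shaped kernels evaluated at relative offsets k (assumed kernel form)."""
    a, b, c = (np.asarray(p, dtype=float) for p in (a, b, c))
    k = np.asarray(k, dtype=float)[..., None]  # trailing axis broadcasts over the S kernels
    return np.sum(a * np.exp(-np.abs(b) * (k - c) ** 2), axis=-1)

def toeplitz_from_scores(n, a, b, c):
    """Positional matrix of size n x n whose entry (i, j) is the score of offset j - i."""
    idx = np.arange(n)
    offsets = idx[None, :] - idx[:, None]
    return scoring_function(offsets, a, b, c)

def num_positional_parameters(S, H, L):
    """Three parameters per kernel, S kernels per head, H heads, L layers."""
    return 3 * S * H * L

# Two hypothetical kernels: a sharp peak on the previous token and a broad one around offset 0.
a, b, c = [1.0, 0.5], [0.5, 0.01], [-1.0, 0.0]
F_short = toeplitz_from_scores(4, a, b, c)
F_long = toeplitz_from_scores(9, a, b, c)

print(num_positional_parameters(S=5, H=12, L=12))  # 2160
```

The same parameters define a matrix for any sequence length, which is what makes the mechanism inductive; with no kernels at all ($S = 0$) the positional term vanishes.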
The inductivity and localized nature of TISA suggest the possibility of rapidly pre-training models on shorter text excerpts (small n), scaling up to longer n later in training and/or at application time, similar to the two-stage training scheme used by Devlin et al. (2019), but without risking the undertraining artifacts visible for BERT at n > 128 in Figs. 1 and 4. However, we have not conducted any experiments on the performance of this option.

Experiments
The main goal of our experiments is to illustrate that TISA can be added to models to improve their performance (Table 3a), while adding a minuscule number of extra parameters. We also investigate the performance of models without position embeddings (Table 3b), comparing TISA to a bag-of-words baseline (S = 0). All experiments use pretrained ALBERT base v2 implemented in Huggingface (Wolf et al., 2020). Kernel parameters $\theta^{(h)}$ for the functions in Eq. (5) were initialized by regression to the $F_P$ profiles of the pretrained model (see Appendix C for details); example plots of resulting scoring functions are provided in Fig. 3. We then benchmark each configuration with and without TISA for 5 runs on GLUE tasks (Wang et al., 2018), using jiant (Phang et al., 2020) and standard dataset splits to evaluate performance. Our results in Table 3a show relative error reductions between 0.4 and 6.5% when combining TISA and conventional position embeddings. These gains are relatively stable regardless of S. We also note that Lan et al. (2020) report 92.9 on SST-2 and 84.6 on MNLI, meaning that our contribution leads to between 1.3 and 2.8% relative error reductions over their scores. The best performing architecture (S = 5) gives improvements over the baseline on 7 of the 8 tasks considered and on average increases the median F1 score by 0.4 points. All these gains have been realized using a very small number of added parameters, and without pre-training on any data after adding TISA to the architecture. The only joint training happens on the training data of each particular GLUE task.
Table 3: GLUE task dev-set performance (median over 5 runs) with TISA (S kernels) and without (baseline). ∆ is the maximum performance increase in a row and ∆% is the corresponding relative error reduction rate.
Results for TISA alone, in Table 3b, are not as strong. This could be because these models are derived from an ALBERT model pretrained using conventional position embeddings, since we did not have the computational resources to tune from-scratch pretraining of TISA-only language models. Figs. 3 and 6 plot scoring functions of different attention heads from the initialization described in Appendix C. Similar patterns arose consistently and rapidly in preliminary experiments on pretraining TISA-only models from scratch. The plots show heads specializing in different linguistic aspects, such as the previous or next token, or multiple tokens to either side, with other heads showing little or no positional dependence. This mirrors the visualizations of ALBERT base attention heads in Figs
Interestingly, the ALBERT baseline on STS-B in Table 3a is only 1.3 points ahead of the bag-of-words baseline in Table 3b. This agrees with experiments shuffling the order of words (Pham et al., 2020; Sinha et al., 2021), which find that modern language models tend to focus mainly on higher-order word co-occurrences rather than word order, and suggests that word-order information is underutilized in state-of-the-art language models.

Conclusion
We have analyzed state-of-the-art transformer-based language models, finding that translation-invariant behavior emerges during training. Based on this we proposed TISA, the first positional information processing method to simultaneously satisfy the six key design criteria in Table 1. Experiments demonstrate competitive downstream performance. The method is also applicable to transformer models outside language modeling, such as modeling time series in speech or motion synthesis, or describing dependencies between pixels in computer vision.
A Visualizing $E_P E_P^T$ for Additional Language Models
Fig. 1 shows the inner product between different position embeddings for the models BERT base uncased, RoBERTa base, ALBERT base v1 as well as ALBERT xxlarge v2. Building on the translation invariance found in the matrix $E_P E_P^T$ for these pretrained networks, we investigate the generality of this phenomenon by visualizing the same matrix for additional existing large language models. We find that similar Toeplitz patterns emerge for all investigated networks.

B Coefficient of Determination $R^2$
The coefficient of determination, $R^2$, is a widely used concept in statistics that measures the fraction of the variance in a dependent variable that can be explained by an independent variable. Denoting the residual sum of squares by $RSS$ and the total sum of squares by $TSS$, we have that
$$R^2 = 1 - \frac{RSS}{TSS}, \tag{6}$$
where $R^2 = 0$ means that the dependent variable is not at all explained, and $R^2 = 1$ means that the variance is fully explained by the independent variable. Applied to a matrix $A \in \mathbb{R}^{n \times n}$ to determine its degree of Toeplitzness, we get $RSS$ by finding the Toeplitz matrix $A_T \in \mathbb{R}^{n \times n}$ that minimizes the following expression:
$$RSS = \min_{A_T} \sum_{i=1}^{n} \sum_{j=1}^{n} (A - A_T)_{i,j}^2. \tag{7}$$
Furthermore, we can compute $TSS$ as:
$$TSS = \sum_{i=1}^{n} \sum_{j=1}^{n} \left( A_{i,j} - \frac{1}{n^2} \sum_{p=1}^{n} \sum_{q=1}^{n} A_{p,q} \right)^2. \tag{8}$$
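As a side note, the minimization over Toeplitz matrices has a closed-form solution. This is a standard least-squares fact rather than something stated in the text: since a Toeplitz matrix is constant along each diagonal and the diagonals do not share entries, the squared error is minimized by setting each diagonal to the mean of the corresponding diagonal of $A$. The diagonal with offset $j - i$ contains $n - |j - i|$ entries, so

```latex
(A_T)_{i,j} \;=\; \frac{1}{n - |j - i|}
  \sum_{\substack{1 \le p,\, q \le n \\ q - p \,=\, j - i}} A_{p,q}
```

No iterative optimization is therefore needed to evaluate the degree of Toeplitzness of a given matrix.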

C Extracting ALBERT positional scores
To extract the positional contributions to the attention scores from ALBERT, we disentangle the positional and word-content contributions from Eq. (3) and remove any dependencies on the text sequence through $E_W$. To this end, we approximate $E_W \approx \bar{E}_W$, where $\bar{E}_W$ holds the average word embedding over the entire vocabulary in every row.
This way, we can disentangle and extract the positional contributions from the ALBERT model.

Initialization of Position-Aware Self-Attention
Using this trick, we initialize $F_P$ with formula (12). Since $F_P$ only generates the positional scores, which are independent of context, a separate positional scorer network can be trained to predict the positional contributions in the ALBERT model. Updating only 2,160 parameters (see Table 2) significantly reduces the computational load. This pretraining initialization scheme converges in less than 20 seconds on a CPU.
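As a rough illustration of fitting a scoring function to an extracted positional profile, the sketch below uses a deliberately simplified variant of such a regression. It is not the procedure used in the experiments: it assumes Gaussian-shaped kernels, keeps their widths `b` and centers `c` fixed, and solves only for the amplitudes by linear least squares. The function name and the synthetic target profile are hypothetical.

```python
import numpy as np

def fit_amplitudes(offsets, target, b, c):
    """Least-squares amplitudes for kernels with fixed widths b and centers c (simplification)."""
    b, c = np.asarray(b, dtype=float), np.asarray(c, dtype=float)
    k = np.asarray(offsets, dtype=float)[:, None]
    design = np.exp(-np.abs(b) * (k - c) ** 2)  # one column per kernel
    a, *_ = np.linalg.lstsq(design, np.asarray(target, dtype=float), rcond=None)
    return a

# Synthetic target profile generated from known amplitudes (illustrative only).
offsets = np.arange(-8, 9)
b = np.array([1.0, 0.1, 0.5])
c = np.array([-1.0, 0.0, 2.0])
a_true = np.array([2.0, -0.5, 1.0])
target = np.exp(-np.abs(b) * (offsets[:, None] - c) ** 2) @ a_true

a_fit = fit_amplitudes(offsets, target, b, c)
```

Fitting widths and centers as well makes the problem nonlinear and would call for an iterative optimizer, but the small number of parameters per head keeps either variant cheap.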
Removing Position Embeddings
When removing the effect of position embeddings, we calculate the average position embedding and replace all position embeddings with it. This removes the variation between position embeddings, while conserving the average value of the original input vectors $E_W + E_P$.
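The averaging step can be checked numerically with a small sketch. The function name and the toy matrices are hypothetical and stand in for a real model's embeddings.

```python
import numpy as np

def remove_position_information(E_P):
    """Replace every position embedding by the mean position embedding."""
    mean_embedding = E_P.mean(axis=0, keepdims=True)
    return np.repeat(mean_embedding, E_P.shape[0], axis=0)

# Toy embeddings (random, for illustration only).
rng = np.random.default_rng(0)
E_W = rng.normal(size=(6, 8))  # word embeddings of one input sequence
E_P = rng.normal(size=(6, 8))  # position embeddings

E_P_flat = remove_position_information(E_P)

# The average input vector is unchanged, but positions are no longer distinguishable.
print(np.allclose((E_W + E_P).mean(axis=0), (E_W + E_P_flat).mean(axis=0)))  # True
```

After this replacement the model input carries no information about word order, which corresponds to the bag-of-words setting when no other positional mechanism is present.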

Extracted Attention Score Contributions
Building on the translation invariance found in large language models, we visualize the scoring functions as a function of the relative distance offset between tokens. Fig. 3 shows the implied scoring functions for 4 attention heads at 5 different absolute positions. Figs. 6, 7 show all 12 attention heads of ALBERT base v2 with TISA.

D Number of Positional Parameters of Language Models
In this paper, we define positional parameters as those modeling only positional dependencies. In most BERT-like models, these are the position embeddings only (typically $n \times d$ parameters). Ke et al. (2021) propose to separate position and content embeddings, yielding more expressive models with separate parts of the network for processing separate information sources. In doing so, they introduce two weight matrices specific to positional information processing, $U_Q \in \mathbb{R}^{d \times d}$ and $U_K \in \mathbb{R}^{d \times d}$, totaling $nd + 2d^2$ positional parameters.
Figure 4: Visualizations of the inner-product matrix $P = E_P E_P^T \in \mathbb{R}^{n \times n}$ for different BERT, ELECTRA, and RoBERTa models. We see that ELECTRA and RoBERTa models show much stronger signs of translational invariance than their BERT counterparts. Most BERT models follow the pattern noted by Wang and Chen (2020), where the Toeplitz structure is much more pronounced for the first 128 × 128 submatrix, reflecting how these models mostly were trained on 128-token sequences, and only scaled up to n = 512 for the last 10% of training (Devlin et al., 2019). Position embeddings 385 through 512 of the BERT cased models show a uniform color, suggesting that these embeddings are almost completely untrained.
Figure 5: Visualizations of the inner-product matrix $P = E_P E_P^T \in \mathbb{R}^{n \times n}$ for different ALBERT models (Lan et al., 2020). We plot both v1 and v2 to show the progression towards increased Toeplitzness during training.
Hyperparameter Selection
We performed a manual hyperparameter search starting from the hyperparameters that Lan et al. (2020) report in https://github.com/google-research/albert/blob/master/run_glue.sh. Our hyperparameter config files can be found with our code.
Figure: Rows from the positional attention matrices $F_P$ for all ALBERT base v2 attention heads, centered on the main diagonal. Note that the vertical scale generally differs between plots. The plots are essentially aligned sections through the matrices in Fig. 8, but zoomed in to show details over short relative distances, since this is where the main peak(s) are located and the highest values are by far the most influential on softmax attention.

E Reproducibility
Experiments were run on a GeForce RTX 2080 machine with 8 GPU-cores. Each downstream experiment took about 2 hours to run.
Figure: Some heads are seen to be sensitive to position, while others are not. Note that these visualizations deliberately use a different color scheme from other (red) matrices, to emphasize the fact that the matrices visualized here represent a different phenomenon and are not inner products.