The Impact of Positional Encodings on Multilingual Compression

In order to preserve word-order information in a non-autoregressive setting, transformer architectures tend to include positional knowledge, for instance by adding positional encodings to token embeddings. Several modifications have been proposed over the sinusoidal positional encodings used in the original transformer architecture; these include, for instance, separating position encodings and token embeddings, or directly modifying attention weights based on the distance between word pairs. We first show that, surprisingly, while these modifications tend to improve monolingual language models, none of them result in better multilingual language models. We then explain why: sinusoidal encodings were explicitly designed to facilitate compositionality by allowing linear projections over arbitrary time steps. The higher variance of multilingual training distributions requires higher compression, in which case compositionality becomes indispensable. Learned absolute positional encodings (e.g., in mBERT) tend to approximate sinusoidal embeddings in multilingual settings, but more complex positional encoding architectures lack the inductive bias to effectively learn cross-lingual alignment. In other words, while sinusoidal positional encodings were designed for monolingual applications, they are particularly useful in multilingual language models.


Introduction
Multiple recent papers have attempted to pinpoint precisely what components of multilingual language models enable cross-lingual transfer. Pires et al. (2019) show that although wordpiece overlap tends to improve cross-lingual transfer performance, even languages with different scripts (and no shared subwords) may enable zero-shot transfer. Wu and Dredze (2019) report similar results on a wider range of tasks. Artetxe et al. (2020) show that neither a shared vocabulary nor joint multilingual pre-training is necessary to train successful multilingual models. K et al. (2020) find that model depth is a contributor to transfer performance, but that reducing the number of self-attention heads does not have much of an effect.

Table 1: We compare six positional encodings and their impact on cross-lingual generalization in multilingual language models.
Our starting point is Dufter and Schütze (2021), who claim that a) multilingual compression is caused by forced parameter sharing across languages, and that b) positional encodings play a significant role in the creation of a multilingual space, even in the absence of shared subwords and shared special tokens, like delimiters.
Contributions We build on Dufter and Schütze (2021) and demonstrate, through a series of experiments on synthetic and real data, that the choice of positional encoding mechanism has a significant effect on cross-lingual model performance: while many positional encodings have been proposed in monolingual settings as improvements over the sinusoidal or absolute positional encodings originally proposed in Vaswani et al. (2017) and Devlin et al. (2019), including untied positional encodings (TUPE; Ke et al., 2020) and relative positional encodings (Shaw et al., 2018; Huang et al., 2020), none of these better facilitate cross-lingual compression or sharing. In fact, multilingual language models trained with untied or relative positional encodings exhibit much worse cross-lingual performance. We show that this is because sinusoidal embeddings facilitate compositionality, which we argue is particularly important for cross-lingual compression. We present a method for quantifying the compositionality of positional encodings, and find additional evidence for this hypothesis in word-position correlations and ablation studies. We are, to the best of our knowledge, the first to show this asymmetry between monolingual and multilingual language model training. Our experiments rely on the protocols in Dufter and Schütze (2021), but in addition to simple experiments with their Bible data, we also replicate all our experiments on Wikipedia data. Rather than relying on deterministic perturbations of data, as in Dufter and Schütze (2021) and Sinha et al. (2021), we make novel use of Galactic Dependencies (Wang and Eisner, 2016) in our experiments. Based on our experiments, we recommend caution when adopting methods developed for monolingual language models when training multilingual models, and we recommend that future work on positional encoding mechanisms also provide evaluations in multilingual settings.

Positional encodings
Positional encodings have been a mainstay of non-autoregressive transformer-based models ever since Vaswani et al. (2017) first proposed the transformer architecture. The motivation is that since transformers are order-invariant (as opposed to recurrent or convolutional networks), there must be some injection of word order into the encoder. Rather than using conventional "embeddings", Vaswani et al. (2017) use fixed sinusoidal position encodings, where each dimension characterises a sinusoidal waveform of a fixed frequency. Specifically, each encoding p is given as:

p_(pos, 2i) = sin(pos / 10000^(2i/d_model))
p_(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

where pos is the position and i is the dimension. They add these encodings to token representations before passing the sum to the first layer of the self-attention mechanism.
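In code, the fixed encodings can be sketched as follows (a minimal NumPy version of this construction; the dimensionality and maximum length are illustrative choices):

```python
import numpy as np

def sinusoidal_encodings(max_len: int, d_model: int) -> np.ndarray:
    """Fixed sinusoidal encodings: one row per position, alternating
    sine/cosine dimensions with geometrically decreasing frequencies."""
    positions = np.arange(max_len)[:, None]        # (max_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]  # (1, d_model/2)
    angles = positions / np.power(10000.0, even_dims / d_model)
    enc = np.zeros((max_len, d_model))
    enc[:, 0::2] = np.sin(angles)   # even dimensions: sine
    enc[:, 1::2] = np.cos(angles)   # odd dimensions: cosine
    return enc

enc = sinusoidal_encodings(512, 64)
```

These vectors are simply added to the (suitably scaled) token embeddings before the first attention layer.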
Several alternatives to sinusoidal encodings have been proposed since Vaswani et al. (2017). Most multilingual models tend to use BERT-style (Devlin et al., 2019) learnt absolute positional encodings, where a unique vector is learned and assigned to each position; these vectors are then added to word representations before being passed to the self-attention mechanism.
As an alternative to such position representations, where every position is represented by a unique vector, relative positional encodings have been proposed (Shaw et al., 2018; Huang et al., 2020). Rather than assigning representations to tokens based on their position, relative positional encoding involves assigning representations to position-position pairs; typically, these encodings are calculated separately and added to the attention matrix. We evaluate both the encodings proposed in Shaw et al. (2018) and those proposed in Huang et al. (2020) in our experiments below. He et al. (2021) propose eliminating position-position correlations, and using separate parameters for word and position representations; Wang et al. (2019) propose using dependency trees instead of raw sequential positions. Ke et al. (2020) recommend eliminating the addition operation in BERT-style representations; they argue that word-position correlations are effectively nil, and that the addition introduces unnecessary noise. We evaluate two untied positional encodings proposed in Ke et al. (2020) (TUPE). TUPE modifies absolute representations by a) untying word-position correlations; b) using a separate set of parameters for positional attention; and c) untying [CLS] tokens from positions.
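For intuition, the relative scheme can be sketched as an offset-indexed bias added to the attention logits. This is a simplification of Shaw et al. (2018): one scalar bias per clipped offset rather than learned vectors, and all tensor names and sizes here are illustrative:

```python
import numpy as np

def scores_with_relative_bias(q, k_mat, bias):
    """q, k_mat: (seq, d) projected queries/keys; bias: (2*max_off + 1,)
    learned scalars indexed by the clipped offset j - i."""
    seq, d = q.shape
    content = q @ k_mat.T / np.sqrt(d)          # content-content term
    max_off = (len(bias) - 1) // 2
    i = np.arange(seq)[:, None]
    j = np.arange(seq)[None, :]
    offsets = np.clip(j - i, -max_off, max_off) + max_off
    return content + bias[offsets]              # offset-specific bias term

rng = np.random.default_rng(0)
q = rng.normal(size=(6, 8))
km = rng.normal(size=(6, 8))
bias = rng.normal(size=(9,))                    # offsets clipped to [-4, 4]
scores = scores_with_relative_bias(q, km, bias)
```

Note that the added bias depends only on the offset j − i, so it is constant along each diagonal of the score matrix.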
We refer to recent surveys (Wang et al., 2021) for a more detailed treatment of position encoding methods. We provide a summary of our methods in Table 1. W_Q,l and W_K,l represent the query/key weights for the attention mechanism at some layer l, and a_ij or b_(j−i) are learnt vectors corresponding to the offset j − i. Note that the untied position-position term (p_i U_Q)(p_j U_K)^T is added at every layer.
The above positional encodings have been introduced in the context of monolingual pretrained language models, and there has been only a limited amount of work addressing the effect of positional encodings on multilingual models. Liu et al. (2020a) find that positional information tends to hurt machine translation, as the encoder learns a word-order bias towards the source languages. Artetxe et al. (2020) find that language-specific positional representations help in an adapter-based training scenario. Ding et al. (2020) attempt to account for structural differences between languages by using bracketing transduction grammar trees to reorder position labels (and find that it helps). Liu et al. (2020b) find that models that are relatively agnostic to word order tend to perform better in cross-lingual settings; they hypothesise that large multilingual encoders, being trained on languages with drastic differences in word orders, tend to have order-agnostic positional encodings, and thus discourage fine-tuning positional encodings downstream. Contemporaneous with this work, Sinha et al. (2021) show that positional information is important for monolingual models even given unnatural, randomly shuffled word ordering.
Dufter and Schütze (2021) present a set of experiments training smaller language models on bilingual corpora, consisting of the same corpus in English and "fake-English", which is English with a shifted BPE vocabulary. They evaluate retrieval and translation scores at different layers; gold alignments are easy to derive given that the corpora are effectively parallel corpora, and that the vocabularies for both halves are effectively the same. As we build on these experiments, we adopt slightly simplified notation, and denote vocabulary-shifted corpora with square brackets, e.g., [EN].

Experiments
Galactic Dependencies A drawback of the multilingual experiments presented in Dufter and Schütze (2021) is that EN and [EN] effectively have the same structure. While the authors attempt to control for this in an additional experiment where word order in [EN] is completely reversed, this does not resemble realistic differences across languages. Using true multilingual corpora is, however, difficult: our retrieval and translation tasks are easy to bootstrap precisely because we have faux-parallel corpora, with effectively pre-aligned vocabulary.
To induce structural diversity in our corpora, therefore, we reorder our corpora using Galactic Dependencies (GD) models (Wang and Eisner, 2016). Briefly, GD models sample ordering statistics for the dependants of verbs and/or nouns from some superstrate language XX; when applied to sentences in some substrate language (in the context of our experiments, EN), the models reorder dependants of VERB and/or NOUN nodes to match the ordering statistics of the superstrate language they were trained on. We opt to reorder both nominal and verbal arguments, and follow the authors in denoting the sampling operation with a ∼, giving us, e.g., EN∼XX for an English language corpus with dependant order statistics adapted from some language XX. Table 2 contains an example sentence and some of its reorderings.
Note that GD reordering only works for projective sentences, and rather than retain un-reordered non-projective sentences, we exclude them from all our corpora. This approach, while simple and useful, does have several limitations. Predominantly, because our reordering is fundamentally syntactic/structural, our fake languages still maintain both the morphology of the source language (English in our case) and the same vocabulary distribution. Thus, although scrambling ought to affect context and neighbourhoods, an English token and its corresponding fake token have exactly the same unigram distribution.
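As a toy illustration of this kind of structural perturbation (not the GD models themselves, which sample dependant-ordering statistics from treebanks), consider forcing every head's dependants to its left, which yields an SOV-like linearisation of an English dependency tree:

```python
def linearise(node):
    """Depth-first linearisation of a dependency subtree.
    A node is (word, left_dependants, right_dependants)."""
    word, left, right = node
    return ([w for c in left for w in linearise(c)]
            + [word]
            + [w for c in right for w in linearise(c)])

def make_head_final(node):
    """Toy reordering: move all dependants to the head's left,
    giving an SOV-like order (a crude stand-in for GD reordering)."""
    word, left, right = node
    children = [make_head_final(c) for c in left + right]
    return (word, children, [])

# "the dog chased the cat": chased -> dog (left), cat (right)
sent = ("chased",
        [("dog", [("the", [], [])], [])],
        [("cat", [("the", [], [])], [])])
```

Here `" ".join(linearise(make_head_final(sent)))` yields "the dog the cat chased": the vocabulary and morphology are untouched, only the constituent order changes.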
Training Our model of choice is an underparameterised BERT, as in Dufter and Schütze (2021). We train multiple such underparameterised BERT models, each with a different encoding mechanism from Section 2, on two bilingual corpora:

EN + [EN] - a bilingual corpus comprised of English, and a fake vocab-shifted English.
EN + [EN∼XX] -a bilingual corpus comprised of English, and a fake English that has had its constituents reordered to match the distribution of some language XX.
We reorder our English starting point according to seven different faux-languages (just "languages" for brevity): Arabic, German, Basque, Finnish, French, Hindi and Swedish. Given that our starting point was English, there was no way for us to control for morphological differences; as such, languages with freer word order (like Basque) are likelier to make our English corpora ambiguous.
We use two corpora in this work: the first uses the Bible splits from Dufter and Schütze (2021), with the English easy-to-read Bible as the training split and the KJV Bible as validation. The second uses the English Wikipedia as the training split and Common Crawl as validation. We present corpus statistics in Table 3. For each corpus, we learn and apply a BPE vocabulary of size 2048.

Table 3: Corpus statistics.

            Train    Validation
Bible       30,602   9,080
Wikipedia   50,000   20,000

Following Dufter and Schütze (2021), our BERT models all have a single head and 12 layers. We reduce the dimensionality of the encoder layers to 64, and the feed-forward layers to 256. Each model is trained for 100 epochs with three different random seeds (0, 42 and 100), giving us a total of 7 languages × 6 encoding methods × 3 seeds × 2 corpora = 252 models. We implement our code in the transformers library (Wolf et al., 2020). For learned absolute and the two relative encoding models, we use the default implementations, which scale attention operations by a factor of 1/√d. For our untied models, we adjust the scaling factor to 1/√(2d), as in the original paper (Ke et al., 2020). For sinusoidal representations, while Vaswani et al. (2017) multiply token embeddings by √d to avoid drowning them out with the [−1, 1] sinusoidal encoding range, we find that our default embedding size is too small for this to have an effect, and instead scale up token embeddings by 2√d before adding positional encodings.
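For reference, such an underparameterised BERT can be instantiated roughly as follows with the transformers library (a sketch: the argument values follow the hyperparameters stated in the text, and everything else is left at library defaults):

```python
from transformers import BertConfig, BertForMaskedLM

config = BertConfig(
    vocab_size=2048,              # BPE vocabulary size used here
    hidden_size=64,               # reduced encoder dimensionality
    num_hidden_layers=12,
    num_attention_heads=1,        # single head, after Dufter and Schütze (2021)
    intermediate_size=256,        # reduced feed-forward dimensionality
    max_position_embeddings=512,
)
model = BertForMaskedLM(config)
```

Swapping in the other positional encoding mechanisms requires modifying the embedding and attention modules, which the default config does not expose.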
For all parameterised encoding models except TUPE (relative), we use a maximum of k = 512 positions; the specific transformers implementation of the relative methods means that this gives us 1023 total offsets. For TUPE (relative), we use a maximum of k = 128 positions, divided into 32 bins with logarithmically increasing bin sizes; this follows the original implementation in Ke et al. (2020).

Evaluation
We adopt Dufter and Schütze's (2021) evaluation pipeline, evaluating each of our models at layers 0 and 8; we also compute a multilingual score, defined as the average accuracy on the retrieval and translation tasks at layers 0 and 8. We also measure perplexity, both on the monolingual first half of the corpus and on both halves combined. Note that true perplexities for masked language models are intractable (Wang and Cho, 2019; Salazar et al., 2020). We use a trivial approximation and calculate perplexity based on the prediction loss for each masked token; while these values suffice for comparison purposes, they are not true perplexities and should not be taken as such outside the context of these experiments.
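The approximation reduces to exponentiating the mean masked-token loss; a minimal sketch (the per-token negative log-likelihoods below are illustrative stand-ins for values read off the model):

```python
import math

def pseudo_perplexity(masked_token_nlls):
    """Exponentiated mean negative log-likelihood over masked tokens.
    Not a true MLM perplexity (Wang and Cho, 2019; Salazar et al., 2020),
    but adequate for comparing models on the same masked corpus."""
    return math.exp(sum(masked_token_nlls) / len(masked_token_nlls))

# illustrative per-token losses for four masked positions
ppl = pseudo_perplexity([2.1, 1.7, 3.0, 2.4])
```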
We present our results (averaged over faux-languages) in Figure 1, with full results in Appendix C. As expected, the more recent positional encodings are superior to sinusoidal or absolute positional encodings in the monolingual setting; but, somewhat surprisingly, sinusoidal and absolute positional encodings clearly outperform the more recent approaches in the multilingual setting. We also note that the gap in multilingual performance only grows larger when a different word order is imposed on the target language; see the bottom row of Figure 1. Interestingly, switching to structurally different L2s can sometimes reduce the language modelling perplexity of the L1: this could be due to regularisation induced by structural differences.
Typological differences We discuss "typology" with a caveat: our experiments with GD only alter word order, which means that all our altered-structure experiments still have English morphology. As such, it is impossible to talk about non-English languages; only about non-English word-order tendencies, when induced in English. Having said that, when we measure performance variation across languages (Figure 2), our results are more or less what one would expect: performance is decent for relatively rigid word-order languages, and poorer for languages that have complex morphology.
Interestingly, SVO languages consistently tend to perform better than our three non-SVO languages (Basque, Hindi and Arabic); this could be due to VSO/SOV languages requiring morphology to disambiguate between adjacent nominals (Levshina, 2019). Another explanation could be that these languages have a very different "default" word order from English; this would further motivate Ding et al.'s (2020) use of cross-lingually reordered position markers.
Real-world results While we conduct most of our analyses on our toy models, we also ran a series of experiments to verify that our results would hold with larger models. As such, we pre-trained full-size BERT models (base, not large) for two epochs, on a corpus consisting of 8.5M, 9.3M and 800k sentences in English, German and Hindi respectively. We then fine-tuned these models for three epochs on (English) MultiNLI (Williams et al., 2018), and evaluated on held-out XNLI test sets for our three languages (Conneau et al., 2018); the process took approximately 4 days per model, on a single V100 GPU. We trained two models (seeds 0 and 42) per method, for three different positional encoding methods: a) absolute positional encodings, as these are used in the original BERT, b) sinusoidal encodings, as these were the original transformer encodings, and c) TUPE (absolute), as the most recent innovation. Our real-world results appear to validate our toy experiments: performance on English, the language the model was fine-tuned on, is highest with TUPE, while cross-lingual transfer suffers, both on German and to a lesser extent on Hindi.

Analyses
In an attempt to explain the significantly improved cross-lingual performance of absolute positional encodings, we tried to examine precisely what sort of encoding was being learnt. Part of the original motivation behind sinusoidal encodings was that they would allow for compositionality: for any fixed offset k, there exists a linear transformation from p_pos to p_(pos+k), making it easier to learn to attend to relative offsets; the proof of this is in Appendix A. We examined our absolute positional encodings to see whether or not they were being induced to learn some specific function, plotting each dimension of the encoding vectors generated for positions 0 to 31. Interestingly, it appears that absolute representations converge to waveforms that somewhat resemble sinusoids, while neither of the untied experiments do so (cf. Appendix B).
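The linear-transformation property can be checked numerically: for each frequency ω, a 2×2 rotation maps (sin ωt, cos ωt) to (sin ω(t+k), cos ω(t+k)), so a block-diagonal matrix of such rotations maps every p_pos to p_(pos+k) exactly. A sketch with illustrative dimensions:

```python
import numpy as np

def sinusoidal(max_len, d):
    pos = np.arange(max_len)[:, None]
    ang = pos / np.power(10000.0, np.arange(0, d, 2)[None, :] / d)
    enc = np.zeros((max_len, d))
    enc[:, 0::2], enc[:, 1::2] = np.sin(ang), np.cos(ang)
    return enc

def offset_transform(k, d):
    """Block-diagonal rotation T with T @ p_pos = p_{pos+k}."""
    T = np.zeros((d, d))
    for i in range(0, d, 2):
        w = 1.0 / 10000.0 ** (i / d)
        c, s = np.cos(w * k), np.sin(w * k)
        # sin(w(t+k)) =  c*sin(wt) + s*cos(wt)
        # cos(w(t+k)) = -s*sin(wt) + c*cos(wt)
        T[i:i + 2, i:i + 2] = [[c, s], [-s, c]]
    return T

enc = sinusoidal(128, 64)
T = offset_transform(5, 64)
err = np.abs(enc[:-5] @ T.T - enc[5:]).max()   # exact up to float error
```

The residual `err` is at machine-precision level, confirming that attending to a fixed relative offset only requires a (learnable) linear map over sinusoidal encodings.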
We hypothesize that absolute representations converge to waveforms because of increased pressure for compositionality when trained on structurally different languages. To test this, we quantify the extent to which the absolute, relative and untied encodings are compositional, in the sense that there is a linear transformation from p_pos to p_(pos+k) for different k.
To this end, we use Procrustes analysis (Stegmann and Gomez, 2002) to learn a linear transformation for each k, based on the representations of p pos and p pos+k . Specifically, we apply orthogonal Procrustes analyses (Schönemann, 1966), which avoid scaling and translation.
First, we minimise argmin_T ||p_pos − T p_(pos+k)||². Next, we apply T to a different randomly selected position pos′, i.e., we calculate L = ||p_pos′ − T p_(pos′+k)||². The higher the final loss L, the less our encodings facilitate compositionality. To make learning T simpler, rather than selecting representations for single positions pos and pos′, we select chunks of arbitrary size C, and stack their positions into a matrix. Note that for sinusoidal representations, the loss is close to zero regardless of span.
The losses are plotted over a range of offsets for both absolute representations and for TUPE(a) in Figure 5; we include a control model trained on a monolingual corpus. Losses are averaged over 125 runs per offset, with random values of pos, pos′ and C. While both forms of representation appear to be similar (and relatively non-sinusoidal) when trained on the monolingual corpus, introducing bilingualism leads to a clear difference between the two: absolute positional representations tend to be a lot closer to sinusoidal representations than untied ones. Note, also, that this gap is clearest for the (simpler) EN + [EN] experiment; this is unsurprising, as EN + [EN] is still perceived as bilingual due to the shifted vocabulary. The structural similarity between the two halves, however, makes it easier to build compositional representations by relying on offsets, as the model only needs to learn to represent one language, structurally speaking. We observe a similar gap when comparing pretrained BERT models: bert-base-multilingual-cased exhibits more sinusoidal representations over a range of offsets when compared to bert-base-cased, although the gap is narrower than with our toy models.

Correlations in multilingual settings A key motivation for eliminating word-position correlations, presented in Ke et al. (2020), is the fact that these correlations are effectively zero, and thus provide no additional information to the model. Figure 6 captures word-position correlations from three of our trained models (with an additional model trained on a purely monolingual corpus); note that while these correlations are very close to zero for monolingual corpora, there is a visible "banding" phenomenon in the multilingual corpora, which only grows stronger when a different grammar is sampled. A similar banding phenomenon is visible when we compare multilingual and monolingual pre-trained BERT models (Appendix B), albeit with reduced magnitude.
We hypothesize that the pressure for compositionality induces these correlations.

Figure 7: Ablation experiments, averaged over languages (for perplexity and ML score). Procrustes losses calculated as in §5, for the EN∼FI model (seed 0).
Ablation studies Finally, we ran a series of ablation experiments on absolute positional encodings to support the above analysis. Three of the experiments involved removing position-position correlations, position-word correlations, and word-position correlations; a fourth involved using separate parameters for word and position attention. Results are presented in Figure 7; we also include the median Procrustes loss. We note that the removal of both position-word and word-position correlations has an effect on both perplexity and ML score. Interestingly, removing word-position correlations ((p_i W_Q)(w_j W_K)^T) does not have the same effect as the inverse: perplexity is lower than with position-word correlations removed, but so is the ML score, indicating a difference between the role position plays as a key and as a query.
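The terms being ablated come from expanding the attention logits when position encodings are added to word embeddings; the decomposition can be sketched with random stand-in matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
seq, d = 8, 16
w = rng.normal(size=(seq, d))     # word embeddings (stand-ins)
p = rng.normal(size=(seq, d))     # absolute position embeddings (stand-ins)
WQ = rng.normal(size=(d, d))
WK = rng.normal(size=(d, d))

full = ((w + p) @ WQ) @ ((w + p) @ WK).T   # BERT-style attention logits

ww = (w @ WQ) @ (w @ WK).T                 # word-word
wp = (w @ WQ) @ (p @ WK).T                 # word as query, position as key
pw = (p @ WQ) @ (w @ WK).T                 # position as query, word as key
pp = (p @ WQ) @ (p @ WK).T                 # position-position
```

By bilinearity, `full` equals `ww + wp + pw + pp` exactly; each ablation above corresponds to zeroing out one of these cross terms.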
On relative representations Given our previous claims about offsets aiding compositionality, why, then, do our relative representations, which explicitly calculate offsets, perform poorly in multilingual settings? We speculate that the reason relative encodings appear to hurt multilingual compression is that offset-specific bias terms sparsify the learning signal for (and thereby hinder the alignment of) disjoint vocabularies; in compensating for this, relative positional encodings sacrifice their compositionality. Relative representations encode compositionality directly, providing a bias term derived from the distance between a word pair. As shown above, absolute representations learn similar biases; however, being actively forced to learn such biases could encourage models to jointly learn alignment and compositionality.
Further, offset representations are also effectively "hard", i.e., derived from the exact distance between the two tokens. The interaction between w_i and w_j is not wholly mediated by the distance i − j; however, this correlation is forced by the product term (x_i W_Q) a_ij^T. The term (x_i W_Q)(p_j W_K)^T, on the other hand, could effectively attend to multiple offsets: p_j W_K is fixed for position j; given the sinusoidal nature of p, the product term could induce a "soft" positional representation with subspaces attending to different offsets; the relevant offset mix could then be indexed into by x_i W_Q.

Discussion
The main contribution of our work is practical, namely showing that findings about positional encodings in the context of monolingual language models do not apply straightforwardly to multilingual language models. In answering why sinusoidal embeddings are superior to more recent alternatives in the multilingual setting, we also found the compositionality of positional encodings to be predictive of multilingual compression in such models. While relative positional encodings seem designed for compositionality, they prevent efficient alignment of multilingual vocabularies. Sinha et al. (2021) show that word order matters little for monolingual language model pretraining, and that pretrained language models seem to rely mostly on higher-order word co-occurrence statistics. Our work shows that this finding does not generalize to pretraining multilingual language models. In the multilingual setting, word order clearly matters, as also shown in previous work (Ke et al., 2020; Dufter and Schütze, 2021), and compositional positional encodings seem to facilitate effective multilingual compression. This aligns with the observation that syntactic reordering à la Ding et al. (2020) is in some cases an effective way to encourage compositional cross-lingual representations.
In general, our results illustrate how methods developed for monolingual language models should not be blindly adopted when training multilingual models, which potentially require different architectures. Conversely, we would encourage future work on new positional encoding mechanisms for non-autoregressive models to also evaluate these mechanisms in multilingual settings.

Conclusion
Through a series of synthetic and real experiments with training multilingual language models, we showed that a) sinusoidal positional encodings perform better in multilingual settings than more recent alternatives (that have been shown to perform better in monolingual settings); and b) this is likely because of an increased pressure for compositionality. We devised a method for quantifying the compositionality of positional encodings, and strengthened our results by also considering word-position correlations and ablation studies.


Appendix A

For a fixed frequency ω and offset k, the rotation matrix

R_k = [[cos(ωk), sin(ωk)], [−sin(ωk), cos(ωk)]]

satisfies

R_k [sin(ωt), cos(ωt)]^T = [sin(ω(t + k)), cos(ω(t + k))]^T,

implying that for a fixed frequency ω, there exists a rotation matrix R_k that can induce a rotational offset of k; applied block-wise over all frequencies, this yields a linear transformation from p_pos to p_(pos+k).