Fast, Effective, and Self-Supervised: Transforming Masked Language Models into Universal Lexical and Sentence Encoders

Previous work has indicated that pretrained Masked Language Models (MLMs) are not effective as universal lexical and sentence encoders off-the-shelf, i.e., without further task-specific fine-tuning on NLI, sentence similarity, or paraphrasing tasks using annotated task data. In this work, we demonstrate that it is possible to turn MLMs into effective lexical and sentence encoders even without any additional data, relying simply on self-supervision. We propose an extremely simple, fast, and effective contrastive learning technique, termed Mirror-BERT, which converts MLMs (e.g., BERT and RoBERTa) into such encoders in 20-30 seconds with no access to additional external knowledge. Mirror-BERT relies on identical and slightly modified string pairs as positive (i.e., synonymous) fine-tuning examples, and aims to maximise their similarity during “identity fine-tuning”. We report huge gains over off-the-shelf MLMs with Mirror-BERT both in lexical-level and in sentence-level tasks, across different domains and different languages. Notably, in sentence similarity (STS) and question-answer entailment (QNLI) tasks, our self-supervised Mirror-BERT model even matches the performance of the Sentence-BERT models from prior work which rely on annotated task data. Finally, we delve deeper into the inner workings of MLMs, and suggest some evidence on why this simple Mirror-BERT fine-tuning approach can yield effective universal lexical and sentence encoders.


Introduction
Transfer learning with pretrained Masked Language Models (MLMs) such as BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019) has been widely successful in NLP, offering unmatched performance in a large number of tasks (Wang et al., 2019a). Despite the wealth of semantic knowledge stored in the MLMs (Rogers et al., 2020), they do not produce high-quality lexical and sentence embeddings when used off-the-shelf, without further task-specific fine-tuning (Feng et al., 2020; Li et al., 2020). In fact, previous work has shown that their performance is sometimes even below static word embeddings and specialised sentence encoders (Cer et al., 2018) in lexical and sentence-level semantic similarity tasks (Reimers and Gurevych, 2019; Vulić et al., 2020b; Litschko et al., 2021). In order to address this gap, recent work has trained dual-encoder networks on labelled external resources to convert MLMs into universal language encoders. Most notably, Sentence-BERT (SBERT, Reimers and Gurevych 2019) further trains BERT and RoBERTa on Natural Language Inference (NLI, Bowman et al. 2015; Williams et al. 2018) and sentence similarity data (Cer et al., 2017) to obtain high-quality universal sentence embeddings.

Figure 1: Illustration of the main concepts behind the proposed self-supervised Mirror-BERT method. The same text sequence can be observed from two additional "views": 1) by performing random span masking in the input space, and/or 2) by applying dropout (inside the BERT/RoBERTa MLM) in the feature space, yielding identity-based (i.e., "mirrored") positive examples for fine-tuning. A contrastive learning objective is then applied to encourage such "mirrored" positive pairs to obtain more similar representations in the embedding space relative to negative pairs.
Recently, SapBERT  self-aligns phrasal representations of the same meaning using synonyms extracted from the UMLS (Bodenreider, 2004), a large biomedical knowledge base, obtaining lexical embeddings in the biomedical domain that reach state-of-the-art (SotA) performance in biomedical entity linking tasks. However, both SBERT and SapBERT require annotated (i.e., human-labelled) data as external knowledge: it is used to instruct the model to produce similar representations for text sequences (e.g., words, phrases, sentences) of similar/identical meanings.
In this paper, we fully dispose of any external supervision, demonstrating that the transformation of MLMs into universal language encoders can be achieved without task-labelled data. We propose a fine-tuning framework termed Mirror-BERT, which simply relies on duplicating and slightly augmenting the existing text input (or their representations) to achieve the transformation, and show that it is possible to learn universal lexical and sentence encoders with such "mirrored" input data through self-supervision (see Fig. 1). The proposed Mirror-BERT framework is also extremely efficient: the whole MLM transformation can be completed in less than one minute on two 2080Ti GPUs.
Our findings further confirm a general hypothesis from prior work (Ben-Zaken et al., 2020, inter alia) that fine-tuning exposes the wealth of (semantic) knowledge stored in the MLMs. In this particular case, we demonstrate that the Mirror-BERT procedure can rewire the MLMs to serve as universal language encoders even without any external supervision. We further show that data augmentation in both the input space and the feature space is key to the success of Mirror-BERT, and that the two provide a synergistic effect.
Contributions. 1) We propose a completely self-supervised approach that can quickly transform pretrained MLMs into capable universal lexical and sentence encoders, greatly outperforming off-the-shelf MLMs in similarity tasks across different languages and domains. 2) We investigate the rationales behind why Mirror-BERT works at all, aiming to understand the impact of data augmentation in the input space as well as in the feature space. We release our code and models at https://github.com/cambridgeltl/mirror-bert.

Mirror-BERT: Methodology
Mirror-BERT consists of three main parts, described in what follows. First, we create positive pairs by duplicating the input text ( §2.1). We then further process the positive pairs by simple data augmentation operating either on the input text or on the feature map inside the model ( §2.2). Finally, we apply standard contrastive learning, 'attracting' the texts belonging to the same class (i.e., positives) while pushing away the negatives ( §2.3).

Training Data through Self-Duplication
The key to success of dual-network representation learning (Henderson et al., 2019; Reimers and Gurevych, 2019; Humeau et al., 2020; Liu et al., 2021, inter alia) is the construction of positive and negative pairs. While negative pairs can be easily obtained from randomly sampled texts, positive pairs usually need to be manually annotated. In practice, they are extracted from labelled task data (e.g., NLI) or from knowledge bases that store relations such as synonymy or hypernymy (e.g., PPDB, Pavlick et al. 2015; BabelNet, Ehrmann et al. 2014; WordNet, Fellbaum 1998; UMLS).
Mirror-BERT, however, does not rely on any external data to construct the positive examples. In a nutshell, given a set of non-duplicated strings X, we assign an individual label y_i to each string and build a dataset D = {(x_i, y_i) | x_i ∈ X, y_i ∈ {1, ..., |X|}}. We then create the self-duplicated training data D' simply by repeating every element in D. In other words, let X = {x_1, x_2, ...}. We then have D = {(x_1, y_1), (x_2, y_2), ...} and D' = {(x_1, y_1), (x'_1, y'_1), (x_2, y_2), (x'_2, y'_2), ...}, where x'_1 = x_1, y'_1 = y_1, x'_2 = x_2, y'_2 = y_2, and so on. In §2.2, we introduce data augmentation techniques (in both the input space and the feature space) applied on D'. Each positive pair (x_i, x'_i) yields two different points/vectors in the encoder's representation space (see again Fig. 1), and the distance between these points should be minimised.
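The self-duplication step can be sketched in a few lines of Python (an illustrative sketch, not the authors' released implementation):

```python
def build_mirror_dataset(strings):
    """Assign one label per unique string (D = {(x_i, y_i)}), then repeat
    every element so that each string forms a "mirrored" positive pair
    with its own copy (D')."""
    d = [(x, i) for i, x in enumerate(strings)]
    return [pair for pair in d for _ in range(2)]

pairs = build_mirror_dataset(["bank", "river"])
# pairs == [("bank", 0), ("bank", 0), ("river", 1), ("river", 1)]
```

Before augmentation (§2.2), the two copies in each pair are byte-identical; the augmentations are what turn them into two distinct "views" of the same string.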

Data Augmentation
We hypothesise that applying certain 'corruption' techniques to (i) parts of input text sequences, (ii) their representations, or even (iii) both in combination, does not change their (captured) meaning. We present two such 'corruption' techniques, as illustrated in Fig. 1. First, we can directly mask parts of the input text. Second, we can erase (i.e., drop out) parts of their feature maps. Both techniques are rather simple and intuitive: (i) even when parts of an input sentence are masked, humans can usually reconstruct its semantics; (ii) when a small subset of neurons or representation dimensions is dropped, the representations of a neural network will not drift too much.
Input Augmentation: Random Span Masking. The idea is inspired by random cropping in visual representation learning (Hendrycks et al., 2020). In particular, starting from the mirrored pairs (x i , y i ) and (x i , y i ), we randomly replace a consecutive string of length k with [MASK] in either x i or x i . The example (Fig. 2) illustrates the random span masking procedure with k = 5.
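A minimal sketch of random span masking follows. The paper describes replacing a consecutive string of length k with [MASK]; for illustration this sketch operates at the character level (the actual implementation may work over subword tokens instead):

```python
import random

def random_span_mask(text, k=5, mask_token="[MASK]"):
    # Replace one randomly chosen span of k consecutive characters
    # with a single [MASK] token; inputs shorter than k are untouched.
    if len(text) <= k:
        return text
    start = random.randrange(len(text) - k + 1)
    return text[:start] + mask_token + text[start + k:]

print(random_span_mask("a marathon is a long race", k=5))
```

Only one side of each positive pair is masked, so the model must match a corrupted string against its intact copy.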
Feature Augmentation: Dropout. The random span masking technique, operating directly on the text input, can be applied only to sentence/phrase-level input; word-level tasks involve only short strings, usually represented as a single token under the sentence-piece tokeniser. However, data augmentation in the feature space based on dropout, introduced below, can be applied to any input text.
Dropout (Srivastava et al., 2014) randomly drops neurons from a neural net during training with a probability p. In practice, it results in the erasure of each element with probability p. It has mostly been interpreted as implicitly bagging a large number of neural networks which share parameters at test time (Bouthillier et al., 2015). Here, we take advantage of the dropout layers in BERT/RoBERTa to create augmented views of the input text. Given a pair of identical strings x_i and x'_i, their representations in the embedding space slightly differ due to the multiple dropout layers in the BERT/RoBERTa architecture (Fig. 6). The two data points in the embedding space can be seen as two augmented views of the same text sequence, which can be leveraged for fine-tuning. It is possible to combine input augmentation via random span masking and feature augmentation via dropout; this variant is also evaluated later.
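The effect of dropout as feature-space augmentation can be simulated without loading an actual MLM. The sketch below uses a hypothetical stand-alone dropout helper (not BERT's own layers) to show two passes over the same vector producing two distinct views:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, p=0.1):
    # Inverted dropout: zero each element with probability p and
    # rescale the survivors by 1/(1-p), as done at training time.
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

x = np.ones(768)                      # stand-in for a token representation
view1, view2 = dropout(x), dropout(x)
# Identical input, two stochastic passes -> two different "views"
assert not np.allclose(view1, view2)
```

Inside BERT/RoBERTa the same phenomenon arises for free: feeding the same string twice in training mode yields two slightly different embeddings.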

Contrastive Learning
Let f(·) denote the encoder model. The encoder is then fine-tuned on the data constructed in §2.2. The dropout augmentations are naturally a part of the BERT/RoBERTa network, i.e., no further action is needed to implement them. Note that random span masking is applied on only one side of the positive pair, while dropout is applied to all data points.
Figure 3: As the same vector goes through the same dropout layer separately, the outcomes are independent. Consequently, two fully identical strings fed to the single BERT/RoBERTa model yield different representations in the MLM embedding space.
Given a batch of data D_b, we leverage the standard InfoNCE loss (Oord et al., 2018) to attract the positive pairs together and push away the negative pairs in the embedding space:

L = − Σ_{(x_i, x'_i) ∈ D_b} log [ exp(cos(f(x_i), f(x'_i)) / τ) / ( exp(cos(f(x_i), f(x'_i)) / τ) + Σ_{x_j ∈ N_i} exp(cos(f(x_i), f(x_j)) / τ) ) ]    (1)

τ denotes a temperature parameter; N_i denotes all negatives of x_i, which include all x_j and x'_j with i ≠ j in the current data batch. Intuitively, the numerator is the similarity of the self-duplicated pair (the positive example), and the denominator sums the similarities between x_i and all other strings in the batch (the negatives), along with the positive pair itself.

For biomedical entity linking (BEL), we use NCBI, BC5-d, BC5-c, AskAPatient (Limsopatham and Collier, 2016), and COMETA (stratified-general split, Basaldella et al. 2020) as our evaluation datasets. The first three datasets are in the scientific domain (i.e., the data have been extracted from scientific papers), while the latter two are in the social media domain (i.e., extracted from online forums discussing health-related topics). We report Spearman's rank correlation coefficient (ρ) for word similarity; accuracy@1/@5 is the standard evaluation measure in the BEL task.
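The loss can be sketched with NumPy as follows. Note that, for brevity, this sketch uses only the second-view embeddings of the other in-batch items as negatives, whereas Eq. (1) also includes their first views:

```python
import numpy as np

def info_nce(view1, view2, tau=0.04):
    """view1[i] and view2[i] embed the two copies of x_i (shape (N, d)).
    For each anchor view1[i], view2[i] is the positive; the remaining
    rows of view2 act as in-batch negatives."""
    a = view1 / np.linalg.norm(view1, axis=1, keepdims=True)
    b = view2 / np.linalg.norm(view2, axis=1, keepdims=True)
    sims = (a @ b.T) / tau                    # temperature-scaled cosines
    sims -= sims.max(axis=1, keepdims=True)   # stabilise the softmax
    log_prob = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return -log_prob.diagonal().mean()        # positives on the diagonal

# Well-aligned positives yield a lower loss than mismatched ones:
loss_aligned = info_nce(np.eye(4), np.eye(4))
loss_mismatched = info_nce(np.eye(4), np.roll(np.eye(4), 1, axis=0))
assert loss_aligned < loss_mismatched
```

In practice the same computation runs on GPU tensors inside the training loop, with gradients flowing back into the encoder.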
Mirror-BERT: Training Resources. For fine-tuning (general-domain) lexical representations, we use the top 10k most frequent words in each language. For biomedical name representations, we randomly sample 10k names from the UMLS. In sentence-level tasks, for STS, we sample 10k sentences (without labels) from the training set of the STS Benchmark; for Spanish, Arabic, and Russian, we sample 10k sentences from the WikiMatrix dataset (Schwenk et al., 2021). For QNLI, we sample 10k sentences from its training set.
Training Setup and Details. The hyperparameters of word-level models are tuned on SimLex-999 (Hill et al., 2015); biomedical models are tuned on COMETA (zero-shot-general split). Sentence-level models are tuned on the dev set of STS-b. τ in Eq. (1) is set to 0.04 for biomedical and sentence-level models, and to 0.2 for word-level models. The dropout rate is p = 0.1. Sentence-level models use a random span masking rate of k = 5, while k = 2 is used for biomedical phrase-level models; we do not employ span masking for word-level models (an analysis is in the Appendix). All lexical models are trained for 2 epochs with a max token length of 25. Sentence-level models are trained for 1 epoch with a max sequence length of 50. All models use AdamW (Loshchilov and Hutter, 2019) as the optimiser, with a learning rate of 2e-5 and a batch size of 200 (400 after duplication). In all tasks, for all 'Mirror-tuned' models, unless noted otherwise, we create final representations using [CLS] instead of another common option: mean-pooling (mp) over all token representations in the last layer (Reimers and Gurevych, 2019).

Results and Discussion

Lexical-Level Tasks
Word Similarity (Tab. 1). SotA static word embeddings such as fastText (Mikolov et al., 2018) typically outperform off-the-shelf MLMs on word similarity datasets (Vulić et al., 2020a). However, our results demonstrate that the Mirror-BERT procedure indeed converts the MLMs into much stronger word encoders. The Multi-SimLex results on 8 languages from Tab. 1 suggest that the fine-tuned +Mirror variant substantially improves the performance of base MLMs (both monolingual and multilingual ones), even beating fastText in 5 out of the 8 evaluation languages. We also observe that it is essential to have a strong base MLM. While Mirror-BERT does offer substantial performance gains with all base MLMs, the improvement is more pronounced when the base model is strong (e.g., EN, ZH).
Biomedical Entity Linking (Tab. 2). The goal of BEL is to map a biomedical name mention to a controlled vocabulary (usually a node in a knowledge graph). Considered a downstream application in BioNLP, the BEL task also helps evaluate and compare the quality of biomedical name representations: it requires pairwise comparisons between the biomedical mention and all surface strings stored in the biomedical knowledge graph.
The results from Tab. 2 suggest that our +Mirror transformation achieves very strong gains on top of the base MLMs, both BERT and PubMedBERT (Gu et al., 2020). We note that PubMedBERT is the current SotA MLM in the biomedical domain, and performs significantly better than BERT, both before and after +Mirror fine-tuning. This highlights the necessity of starting from a domain-specific model when possible. On scientific datasets, the self-supervised PubMedBERT+Mirror model is very close to SapBERT, which fine-tunes PubMedBERT with more than 10 million synonyms extracted from the external UMLS knowledge base. However, in the social media domain, PubMedBERT+Mirror still cannot match the performance of the knowledge-guided SapBERT. This in fact reflects the nature and complexity of the task domain. For the three datasets in the scientific domain (NCBI, BC5-d, BC5-c), strings with similar surface forms tend to be associated with the same concept. On the other hand, in the social media domain, the semantics of very different surface strings might be the same. This suggests that Mirror-BERT adapts PubMedBERT into a very good surface-form encoder for biomedical names, but dealing with more difficult synonymy relations (e.g., as found in social media) does require external knowledge.

(Language codes: see the Appendix for a full listing.)

Question-Answer Entailment (Fig. 4). The results indicate that our +Mirror fine-tuning consistently improves the underlying MLMs. The RoBERTa+Mirror variant even shows a slight edge over the supervised SBERT model (.709 vs. .706).

Cross-Lingual Tasks
We observe huge gains across all language pairs in CLWS (Tab. 5) and BLI (Tab. 6) after running the Mirror-BERT procedure. For language pairs that involve Hebrew, the improvement is usually smaller. We suspect that this is due to mBERT itself containing poor semantic knowledge for Hebrew. This finding aligns with our prior argument that a strong base MLM is still required to obtain prominent gains from running Mirror-BERT.

Further Discussion and Analyses
Running Time. The Mirror-BERT procedure is extremely time-efficient. While fine-tuning on NLI (SBERT) or UMLS (SapBERT) data can take hours, Mirror-BERT with 10k positive pairs completes the conversion from MLMs to universal language encoders within a minute on two NVIDIA RTX 2080Ti GPUs. On average, 10-20 seconds is needed for 1 epoch of the Mirror-BERT procedure.

Input Data Size (Fig. 5). In our main experiments in §4.1-§4.3, we always use 10k examples for Mirror-BERT tuning. In order to assess the importance of the fine-tuning data size, we run a relevant analysis for a subset of base MLMs, on a subset of English tasks. The results indicate that the performance in all tasks reaches its peak in the region of 10k-20k examples and then gradually decreases, with a steeper drop on the word-level task.

Random Span Masking + Dropout? (Tab. 7).
We conduct our ablation studies on the English STS tasks. First, we experiment with turning off dropout, random span masking, or both. With both techniques turned off, we observe large performance drops for RoBERTa+Mirror and BERT+Mirror (see also the Appendix). Span masking appears to be the more important factor: its absence causes a larger decrease. However, the best performance is achieved when both dropout and random span masking are leveraged, suggesting a synergistic effect when the two augmentation techniques are used together.

Regularisation or Augmentation? (Tab. 8).
When using dropout, is it possible that we are simply observing the effect of adding/removing regularisation, rather than a genuine augmentation benefit? To answer this question, we design a simple probe that attempts to disentangle the effect of regularisation from that of augmentation: we turn off random span masking but leave dropout on (so that the regularisation effect remains). However, instead of assigning independent dropouts to every individual string (rendering each string slightly different), we control the dropouts applied to a positive pair to be identical. As a result, it holds that f(x_i) = f(x'_i) for all i ∈ {1, ..., |D|}. We denote this as "controlled dropout". In Tab. 8, we observe that, during the +Mirror fine-tuning, controlled dropout largely underperforms standard dropout, and is even worse than not using dropout at all. As the only difference between controlled and standard dropout is the augmented features for positive pairs in the latter case, this suggests that the gain from +Mirror indeed stems from the data augmentation effect rather than from regularisation.

(Drophead rates for BERT and RoBERTa are set to the default values of 0.2 and 0.05, respectively. Besides drophead-based feature-space augmentation, in our side experiments we also tested input-space augmentation techniques such as whole-word masking, random token masking, and word reordering; they typically yield performance similar to or worse than random span masking. We also point to very recent work that explores text augmentation in a different context (Wu et al., 2020). We leave a thorough search for optimal augmentation techniques to future work.)

Mirror-BERT Improves Isotropy? (Fig. 7). We argue that the gains with Mirror-BERT largely stem from its reshaping of the embedding space geometry. Isotropy (i.e., uniformity in all orientations) of the embedding space has been shown to be a favourable property for semantic similarity tasks (Arora et al., 2016; Mu and Viswanath, 2018).
However, Ethayarajh (2019) shows that (off-the-shelf) MLMs' representations are anisotropic: they reside in a narrow cone of the vector space, and the average cosine similarity of (random) data points is extremely high. Sentence embeddings induced from MLMs without fine-tuning thus suffer from spatial anisotropy (Li et al., 2020; Su et al., 2021). Is Mirror-BERT then improving the isotropy of the embedding space? To investigate this, we inspect (1) the distributions of cosine similarities and (2) an isotropy score, as defined by Mu and Viswanath (2018).
First, we randomly sample 1,000 sentence pairs from the Quora Question Pairs (QQP) dataset. In Fig. 7, we plot the distributions of pairwise cosine similarities of BERT representations before (Figs. 7a and 7b) and after the +Mirror tuning (Fig. 7c). The overall cosine similarities (regardless of positive/negative) are greatly reduced, and the positives/negatives become easily separable.

(Some preliminary evidence from Tab. 7 already leads in this direction: we observe large gains over the base MLMs even without any positive examples, that is, when both span masking and dropout are not used, so that it always holds that x_i = x'_i and f(x_i) = f(x'_i). During training, this leads to a constant numerator in Eq. (1). In this case, learning collapses to the scenario where all gradients come solely from the negatives: the model simply pushes all data points away from each other, resulting in a more isotropic space.)

We also leverage a quantitative isotropy score (IS), proposed in prior work (Arora et al., 2016; Mu and Viswanath, 2018), and defined as follows:

IS(V) = min_{c ∈ C} Σ_{v ∈ V} exp(c^⊤ v) / max_{c ∈ C} Σ_{v ∈ V} exp(c^⊤ v)    (2)

where V is the set of vectors and C is the set of all possible unit vectors (i.e., any c such that ||c|| = 1) in the embedding space. In practice, C is approximated by the set of eigenvectors of V^⊤ V (V being the stacked embeddings of V). The larger the IS value, the more isotropic the embedding space is (a perfectly isotropic space obtains an IS score of 1). The IS scores in Tab. 9 confirm that the +Mirror fine-tuning indeed makes the embedding space more isotropic. Interestingly, with both data augmentation techniques switched off, a naive expectation is that IS will increase, as the gradients now come solely from negative examples, pushing apart points in the space. However, we observe an increase in IS only for word-level representations. This hints at more complex dynamics between isotropy and gradients from positive and negative examples, where positives might also contribute to isotropy in some settings.
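The isotropy score of Mu and Viswanath (2018) is straightforward to compute; a minimal NumPy sketch (illustrative, not the evaluation code used in the paper) follows:

```python
import numpy as np

def isotropy_score(V):
    # IS(V) = min_c Z(c) / max_c Z(c), with Z(c) = sum_v exp(c.v) and c
    # ranging over the eigenvectors of V^T V (Mu and Viswanath, 2018).
    _, eigvecs = np.linalg.eigh(V.T @ V)    # columns are eigenvectors
    z = np.exp(V @ eigvecs).sum(axis=0)     # one partition value per c
    return z.min() / z.max()

rng = np.random.default_rng(0)
centred = rng.normal(size=(500, 8))   # roughly isotropic point cloud
shifted = centred + 4.0               # adds one dominant common direction
assert isotropy_score(shifted) < isotropy_score(centred) <= 1.0
```

A cloud with one dominant shared direction (as in anisotropic MLM spaces) scores far below a centred, roughly spherical cloud.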
We will examine these dynamics further in future work. (In Eq. (2), V comprises the corresponding text data used for Mirror-BERT fine-tuning: 10k items for each task type.)

Learning New Knowledge or Exposing Available Knowledge? Running Mirror-BERT for more epochs, or with more data (see Fig. 5), does not result in performance gains. This hints that, instead of gaining new knowledge from the fine-tuning data, Mirror-BERT in fact 'rewires' existing knowledge in the MLMs (Ben-Zaken et al., 2020). (Introducing positive examples also naturally yields stronger task performance, as the original semantic space is better preserved; Gao et al. (2021) provide an insightful analysis of the balance between learning uniformity and preserving alignment, based on the method of Wang and Isola (2020).) To further verify this, we run Mirror-BERT with random 'zero-semantics' words, generated by uniformly sampling English letters and digits, and evaluate on (EN) Multi-SimLex. Surprisingly, even these data can transform off-the-shelf MLMs into effective word encoders: we observe a large improvement over the base MLM in this extreme scenario, from ρ = .267 to .481 (Tab. 10). We ran a similar experiment on the sentence level and observed similar trends. However, we note that using actual English text for fine-tuning still performs better, as it is more 'in-domain' (with further evidence and discussion in the following paragraph).
Selecting Examples for Fine-Tuning. Using raw text sequences from the end task should be the default option for Mirror-BERT fine-tuning, since they are in-distribution by default; semantic similarity models tend to underperform when faced with a domain shift (Zhang et al., 2020). In the general-domain STS tasks, we find that using sentences extracted from the STS training set, Wikipedia articles, or NLI datasets all yield similar STS performance after running Mirror-BERT (though the optimal hyperparameters differ). However, porting BERT+Mirror trained on STS data to QNLI results in an AUC drop from .674 to .665. This suggests that slight or large domain shifts do affect task performance, as further corroborated by our findings from fine-tuning with fully random strings (see above).
Further, Fig. 8 shows a clear tendency that more frequent strings are more likely to yield good task performance. There, we split the 100k most frequent words from English Wikipedia into 10 equally sized fine-tuning buckets of 10k examples each, and run +Mirror fine-tuning on BERT with each bucket. In sum, using frequent in-domain examples seems to be the optimal choice.

Related Work
Self-supervised text representations have a large body of literature. Here, due to space constraints, we provide a highly condensed summary of the most related work. Even prior to the emergence of large pretrained LMs (PLMs), most representation models followed the distributional hypothesis (Harris, 1954) and exploited the co-occurrence statistics of words/phrases/sentences in large corpora (Mikolov et al., 2013a,b; Pennington et al., 2014; Kiros et al., 2015; Hill et al., 2016; Logeswaran and Lee, 2018). Recently, DeCLUTR (Giorgi et al., 2021) followed the distributional hypothesis and formulated sentence embedding training as a contrastive learning task where span pairs sampled from the same document are treated as positive pairs. Very recently, there has been growing interest in using individual raw sentences for self-supervised contrastive learning on top of PLMs. Wu et al. (2020) explore input augmentation techniques for sentence representation learning with contrastive objectives. However, they use them as an auxiliary loss during full-fledged MLM pretraining from scratch (Rethmeier and Augenstein, 2021). In contrast, our post-hoc approach offers a lightweight and fast self-supervised transformation from any pretrained MLM to a universal language encoder at the lexical or sentence level. Carlsson et al. (2021) use two distinct models to produce two views of the same text, whereas we rely on a single model: we propose to use dropout and random span masking within the same model to produce the two views, and demonstrate their synergistic effect. Our study also extends to word-level and phrase-level representations and tasks, and to domain-specialised representations (e.g., for the BEL task).
SimCSE (Gao et al., 2021), a work concurrent to ours, adopts the same contrastive loss as Mirror-BERT, and also indicates the importance of data augmentation through dropout. However, they do not investigate random span masking as data augmentation in the input space, and limit their model to general-domain English sentence representations only, effectively rendering SimCSE a special case of the Mirror-BERT framework. Other concurrent papers explore a similar idea, such as Self-Guided Contrastive Learning (Kim et al., 2021), ConSERT, and BSL, inter alia. They all create two views of the same sentence for contrastive learning, with different strategies in feature extraction, data augmentation, model updating, or choice of loss function. However, they offer less complete empirical findings compared to our work: we additionally evaluate on (1) lexical-level tasks, (2) tasks in a specialised biomedical domain, and (3) cross-lingual tasks.

Conclusion
We proposed Mirror-BERT, a simple, fast, self-supervised, and highly effective approach that transforms large pretrained masked language models (MLMs) into universal lexical and sentence encoders within a minute, and without any external supervision. Mirror-BERT, based on simple unsupervised data augmentation techniques, demonstrates surprisingly strong performance in (word-level and sentence-level) semantic similarity tasks, as well as in biomedical entity linking. Large gains over the base MLMs are observed for different languages with different scripts, and across diverse domains. Moreover, we dissected and analysed the main causes behind Mirror-BERT's efficacy.

B Additional Training Details
Most Frequent 10k/100k Words by Language. The most frequent 10k words in each language were selected based on the following list: https://github.com/oprogramador/most-common-words-by-language. The most frequent 100k English words in Wikipedia can be found here: https://gist.github.com/h3xx/1976236.
[CLS] or Mean-Pooling? For MLMs, the consensus in the community, also validated by our own experiments, is that mean-pooling performs better than using [CLS] as the final output representation. However, for Mirror-BERT models, we found that [CLS] (before pooling) generally performs better than mean-pooling. The exception is BERT on sentence-level tasks, where we found mean-pooling performs better than [CLS]. In sum, sentence-level BERT+Mirror models are fine-tuned and tested with mean-pooling, while all other Mirror-BERT models are fine-tuned and tested with [CLS]. We also tried representations after the pooling layer, but found no improvement.

Training Stability. All task results are reported as averages over three runs with different random seeds (if applicable). In general, fine-tuning is very stable and the fluctuations with different random seeds are very small. For instance, on the sentence-level STS task, the standard deviation is < 0.002. On the word level, the standard deviation is a bit higher, but generally < 0.005.

Table 14: Full table for MVN and IS of word-, phrase-, and sentence-level models. Higher is better (i.e., more isotropic) for IS, while the opposite holds for MVN (lower scores mean more isotropic representation spaces).
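The two output-representation choices compared above can be sketched as follows (illustrative; hidden_states stands for the last-layer token matrix of one sequence):

```python
import numpy as np

def cls_pooling(hidden_states):
    # [CLS] is by convention the first token of the sequence
    return hidden_states[0]

def mean_pooling(hidden_states, attention_mask):
    # Average the token vectors, ignoring padded positions
    mask = attention_mask[:, None].astype(float)
    return (hidden_states * mask).sum(axis=0) / mask.sum()

h = np.array([[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]])  # (seq_len=3, dim=2)
m = np.array([1, 1, 0])                             # last token is padding
assert np.allclose(cls_pooling(h), [0.0, 1.0])
assert np.allclose(mean_pooling(h, m), [1.0, 2.0])
```

Masking out padded positions matters in practice: averaging over padding vectors would dilute the sentence representation for short inputs in a batch.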

E Mean-Vector l 2 -Norm (MVN)
To supplement the quantitative evidence already provided by the Isotropy Score (IS) in the main paper, we additionally compute the mean-vector l2-norm (MVN) of embeddings. In the word embedding literature, mean-centering has been a widely studied post-processing technique for inducing better semantic representations. Mu and Viswanath (2018) point out that mean-centering essentially increases spatial isotropy by shifting the centre of the space to the region where the actual data points reside. Given a set of representation vectors V, we define MVN as follows:

MVN(V) = || (1/|V|) Σ_{v ∈ V} v ||_2    (3)

The lower MVN is, the more mean-centered an embedding space is. As shown in Tab. 14, MVN aligns with the trends observed with IS. This further confirms our intuition that +Mirror tuning makes the space more isotropic and shifts the centre of the space close to the centre of the data points. Very recently, Cai et al. (2021) defined more metrics to measure spatial isotropy. Rajaee and Pilehvar (2021) also used Eq. (2) for analysing the isotropy of sentence embeddings.
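Reading MVN as the l2-norm of the mean embedding vector, it can be computed in one line:

```python
import numpy as np

def mvn(V):
    # l2-norm of the mean vector: 0 for perfectly mean-centered
    # embeddings, larger when a common offset dominates the space.
    return np.linalg.norm(V.mean(axis=0))

V = np.array([[1.0, 0.0], [-1.0, 0.0]])
assert mvn(V) == 0.0           # mean-centered set of vectors
assert mvn(V + 3.0) > mvn(V)   # a shared offset raises MVN
```

A large MVN indicates that all embeddings share a common offset, i.e., the space centre is far from the data centre.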

F Evaluation Dataset Details
All datasets used, and links to download them, can be found in the code repository provided. The Russian STS dataset is provided by https://github.com/deepmipt/deepPavlovEval.

G Pretrained Encoders
A complete listing of URLs for all used pretrained encoders is provided in Tab. 15. For monolingual MLMs of each language, we made the best effort to select the most popular one (based on download counts). For computational tractability of the large number of experiments conducted, all models are BASE models (instead of LARGE).

H Full Tables
Here, we provide the complete sets of results. In these tables, we include MLMs with features extracted using both mean-pooling ("mp") and [CLS] ("CLS").
For full multilingual word similarity results, view Tab

I Number of Model Parameters
All BERT/RoBERTa models in this paper have ≈110M parameters.

J Hyperparameter Optimisation
Tab. 21 lists the hyperparameter search space. Note that the chosen hyperparameters yield the overall best performance, but might be suboptimal on any single setting (e.g. different base model).

K Software and Hardware Dependencies
All our experiments are implemented using PyTorch 1.7.0 and huggingface.co transformers 4.4.2, with Automatic Mixed Precision (AMP) turned on during training. Please refer to the GitHub repo for details. The hardware we use is listed in Tab. 22.

Table 21: Hyperparameters along with their search grid. * marks the values used to obtain the reported results. The hyperparameters are not always optimal in every setting but generally perform (close to) the best.
Hardware specification:
RAM: 128 GB
CPU: AMD Ryzen 9 3900x 12-core processor × 24
GPU: NVIDIA GeForce RTX 2080 Ti (11 GB) × 2

Table 22: Hardware specifications of the used machine. When encountering an out-of-memory error, we also used a second server with two NVIDIA GeForce RTX 3090 (24 GB).