Discovering Low-rank Subspaces for Language-agnostic Multilingual Representations

Large pretrained multilingual language models (ML-LMs) have shown remarkable capabilities for zero-shot cross-lingual transfer, without direct cross-lingual supervision. While these results are promising, follow-up works found that, within the multilingual embedding spaces, there exists strong language identity information which hinders the expression of linguistic factors shared across languages. For semantic tasks like cross-lingual sentence retrieval, it is desirable to remove such language identity signals to fully leverage semantic information. In this work, we provide a novel view of projecting away language-specific factors from a multilingual embedding space. Specifically, we discover that there exists a low-rank subspace that primarily encodes information irrelevant to semantics (e.g., syntactic information). To identify this subspace, we present a simple but effective unsupervised method based on singular value decomposition with multiple monolingual corpora as input. Once the subspace is found, we can directly project the original embeddings onto the null space to boost language agnosticism without finetuning. We systematically evaluate our method on various tasks including the challenging language-agnostic QA retrieval task. Empirical results show that applying our method consistently leads to improvements over commonly used ML-LMs.

While these ML-LMs offer practical solutions for cross-lingual tasks, there is an enduring debate about why ML-LMs work. From a positive perspective, Pires et al. (2019) conduct an exploratory study on mBERT (Devlin et al., 2019), suggesting that cross-lingual transfer is possible even to languages in different scripts. Chi et al. (2020) probe mBERT for structural phenomena and find that its representations can recover syntactic tree distances in languages other than English. These findings present evidence that pretrained multilingual representations do capture cross-lingual properties in various aspects. On the flip side, a line of research shows that pretrained ML-LMs encode strong language-specific signals. This causes their multilingual representations to cluster by language identity instead of semantic meaning (Wu and Dredze, 2019; Roy et al., 2020; Libovický et al., 2020). This property largely hinders the expression of linguistic signals shared across languages. For applications like cross-lingual sentence retrieval that mainly consider semantic information, ML-LMs with strong language-specific signals tend to retrieve answers from specific languages, regardless of their semantic meaning (Roy et al., 2020).
Motivated by previous findings about language identity information, we aim to locate language-specific factors captured by pretrained ML-LMs in order to recover a language-agnostic embedding space. Inspired by advances in domain generalization (Muandet et al., 2013; Motiian et al., 2017; Piratla et al., 2020), we explore a simple but effective approach, LSAR, to discover a Low-rank Subspace for language-Agnostic Representations within an ML-LM. The subspace primarily encodes information irrelevant to semantics and can be identified, without any translation pairs, via singular value decomposition. Once the subspace is found, we can directly factor out language-specific factors from the multilingual embeddings by projecting them onto the null space, without finetuning.
To evaluate LSAR, we focus on semantic tasks for multilingual sentence embeddings. On standard cross-lingual zero-shot transfer tasks, including classification and sentence retrieval, LSAR consistently achieves significant improvements. In particular, applying LSAR leads to significant gains for pretrained ML-LMs on LAReQA (Roy et al., 2020), a challenging benchmark targeting strong language agnosticism.
We further examine what information the subspace contains. By performing correlation analysis between structural language similarities obtained from the URIEL database (Littell et al., 2017) and the language similarities captured on the subspace, we observe that the subspace encodes a great deal of syntactic information. This implies that LSAR successfully erases linguistic signals that are redundant for semantic tasks, facilitating language agnosticism.
To conclude, our main contributions are: • We present one of the pioneering efforts to discover that there exist low-rank subspaces of pretrained ML-LMs' embeddings that mainly encode language-specific signals.
• To identify the subspace in an ML-LM, we present a simple unsupervised approach called LSAR based on singular value decomposition. By projecting embeddings onto the null space, LSAR can exclude the unwanted factors to facilitate language agnosticism.
• Empirical results show that LSAR is surprisingly effective for a variety of semantic tasks. We also show, through careful experimental analysis, that the subspace encodes strong syntactic signals.

Related Work
Understanding Pretrained Multilingual Representations Recently, there has been a surge of interest in probing pretrained ML-LMs like mBERT (Devlin et al., 2019). Pires et al. (2019) present an exploratory study on the cross-linguality of mBERT, showing that mBERT exhibits strong zero-shot performance for typologically similar languages. Libovický et al. (2020) find that the original mBERT embeddings can be decomposed into a language-specific component and a language-neutral component. Chi et al. (2020) probe mBERT for universal grammatical relations and show that mBERT does encode fine-grained syntactic distinctions across languages. Muller et al. (2021) find that mBERT operates as the stacking of two sub-networks and that mainly the lower part of the model is crucial for cross-lingual transfer.
Language-agnostic Representations To further facilitate semantic downstream tasks like text classification, retrieval, and question answering, it is appealing to remove language-specific signals from the original embeddings without destroying the intrinsic semantic meaning. LASER (Artetxe and Schwenk, 2019) utilizes parallel data to train a BiLSTM-based multilingual sentence encoder. Zhao et al. (2021) obtain language-agnostic embeddings from mBERT and XLM-R by explicitly aligning word pairs and further normalizing the latent spaces to zero mean and unit variance. Yang et al. (2021) regard the top principal components of each language's embedding space as the primary source of language bias and propose to project them away to boost language agnosticism.
Our work bears resemblance to Yang et al. (2021), but with clear distinctions: 1) we model language-specific signals jointly in the multilingual embedding space instead of locating them separately within each language; 2) we further verify which linguistic signals are identified, presenting evidence that LSAR primarily removes syntactic information. A few previous works (Gonen et al., 2020; Liang et al., 2021; Chang et al., 2022) also attempt to locate language-agnostic embeddings in subspaces of ML-LMs. Apart from the methodological dissimilarity, we focus on sentence-level instead of token-level tasks and provide evidence that the identified subspace exhibits a strong correlation with syntactic information.

Low-rank Subspaces in Other Applications
Low-rank subspaces have been employed in various applications. In face recognition, the most expressive features for face representations are located via subspace analysis methods like PCA (Turk and Pentland, 1991; Wang and Tang, 2004). For domain adaptation and domain generalization, a typical idea is to uncover a shared subspace on which the distribution mismatch between domains is reduced (Muandet et al., 2013; Pan et al., 2011; Motiian et al., 2017). Recent advances in probing Generative Adversarial Networks (GANs) also observe meaningful latent subspaces that enable precise control of GAN generation (Wang and Ponce, 2021; Zhu et al., 2021). These findings to some extent motivate this paper.

Figure 1: Conceptual illustration of our alignment method LSAR. There exists strong language identity information in the original pretrained multilingual representations. By projecting away language-specific components that reside in a low-rank subspace discovered in the identification process (top-right), we can produce a language-agnostic embedding space via language agnosticism rectification (bottom). The probing procedure (colored in blue-grey) and the inference procedure (colored in yellow) can be done separately.

Methodology
In this section, we first introduce our method to identify the low-rank language-specific subspace in an unsupervised manner. Once the subspace is found, we can suppress the language identity of the original multilingual embeddings, achieving language agnosticism rectification by projecting them onto the null space. This post-training alignment procedure can largely benefit downstream tasks like cross-lingual retrieval, which solely utilize semantic-related information.

Multilingual Embedding Decomposition
To locate the language-specific factors, we follow previous works (Pires et al., 2019; Libovický et al., 2020; Yang et al., 2021) in hypothesizing that each multilingual embedding e_l ∈ R^d in language l can be decomposed in an additive form:

e_l = s_l + a_l,

where s_l ∈ R^d and a_l ∈ R^d represent the language-specific component to remove and the language-agnostic component to keep, respectively.
Built on the above assumption, previous unsupervised approaches extract the language identity information separately for each language space. Given an ML-LM (e.g., mBERT), the extracted embeddings E_l := {e_l^i}_{i=1}^n from n samples of task training data or external monolingual corpora contain mixed linguistic information, with both semantic-relevant and semantic-irrelevant signals about language l. Libovický et al. (2020) take the mean embedding of E_l as s_l. Yang et al. (2021) use the top-k principal components C_l = PCA(E_l) ∈ R^{d×k} to encode language identity signals, and propose to factor them out with s_l = C_l C_l^⊤ e_l to facilitate language agnosticism. In spite of their promising results for semantic-related tasks, these methods fall short of comprehensively discovering cross-lingual relationships in the latent space. For each language l, both of them leverage solely E_l to locate language-specific information, which fails to distinguish it from semantic signals since the characteristics of other languages are unknown. Without careful tuning, this can lead to unexpected semantic information loss (Khodak et al., 2018). Besides, it is also unclear what exactly language-specific signals are captured by these approaches.

Low-rank Subspace Identification
To alleviate the above issues, we attempt to globally capture language-specific information from the multilingual latent space. Inspired by previous works in domain adaptation and domain generalization (Muandet et al., 2013; Motiian et al., 2017; Piratla et al., 2020), we present a simple approach that identifies a low-rank subspace of the original multilingual latent space, M_s ∈ R^{d×r}, spanned by r components. Intuitively, the subspace encodes language-specific signals by measuring the latent discrepancy among languages.
To be specific, we first extract the mean embedding μ_l = (1/n) Σ_{i=1}^{n} e_l^i of each language l, in the same spirit as previous approaches. Concatenating the μ_l of L languages column by column results in the mean embedding matrix M ∈ R^{d×L}. As discussed in Section 3.1, the mean embeddings can unexpectedly mix the desired language-specific signals with semantic information. To avoid removing the semantic information shared among languages, we decompose M into two components: 1) a vector μ representing what is commonly shared across languages in the latent space; 2) a matrix M_s specifying a low-rank subspace on which different languages express different linguistic signals. With the orthogonality constraint, our objective is:

min_{μ, M_s, Γ} ‖M − μ1^⊤ − M_s Γ^⊤‖_F^2   s.t.   M_s^⊤ M_s = I_r,   (1)

where Γ ∈ R^{L×r} gives the coordinates of language-specific signals along the subspace's r components and 1 ∈ R^L contains all ones.
The optimal solution of Equation 1 can be computed efficiently via Singular Value Decomposition (SVD), as proved in Appendix A. Algorithm 1 presents the detailed procedure. The only hyperparameter r < L controls the amount of language-specific information captured by the identified subspace: the larger r is, the more language-specific signals we can identify.
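The identification step can be sketched in a few lines of NumPy. This is a simplified reading of our own (the function name is ours): Algorithm 1 additionally arranges for μ ⊥ Span(M_s) through a rank-(r+1) construction, whereas this sketch simply takes μ as the cross-language mean and the top-r left singular vectors of the centered mean-embedding matrix as the subspace basis.

```python
import numpy as np

def identify_subspace(M: np.ndarray, r: int):
    """Sketch of the subspace identification step.

    M : (d, L) matrix whose columns are per-language mean embeddings.
    r : rank of the language-specific subspace, r < L.

    Returns the shared component mu (d,) and an orthonormal basis
    M_s (d, r) spanning the language-specific subspace.
    """
    mu = M.mean(axis=1, keepdims=True)               # component shared across languages
    U, S, Vt = np.linalg.svd(M - mu, full_matrices=False)
    M_s = U[:, :r]                                   # top-r left singular vectors
    return mu.ravel(), M_s
```

Because U comes from an SVD, the returned basis is orthonormal by construction, which is what the rectification step below relies on.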

Language Agnosticism Rectification
Once we find the low-rank subspace with semantically irrelevant information encoded, we can improve language agnosticism by projecting multilingual embeddings onto the null space of M_s:

a = (I_d − M_s M_s^⊤) e.

Given that usually l ≪ d, the information removed is restricted to aspects that emerge as language-specific and will not lead to dimensional collapse.

Algorithm 1: Language-specific Subspace Identification. In: languages' mean embeddings M, rank of subspace r. Out: language-agnostic component μ, language-specific subspace M_s.
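In code, the rectification is a single matrix identity. The function name is ours for illustration; M_s is assumed to have orthonormal columns, as the SVD guarantees.

```python
import numpy as np

def project_to_null_space(E: np.ndarray, M_s: np.ndarray) -> np.ndarray:
    """Remove language-specific components from sentence embeddings.

    E   : (n, d) embeddings (one row per sentence).
    M_s : (d, r) orthonormal basis of the language-specific subspace.

    Equivalent to right-multiplying by (I_d - M_s M_s^T).
    """
    return E - (E @ M_s) @ M_s.T     # subtract each row's component inside the subspace
```

The projection is idempotent, so applying it more than once changes nothing, and the rectified embeddings have no component along any basis vector of M_s.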

Experiments
We systematically evaluate our method on various tasks, followed by further analyses 1, with the purpose of understanding: 1) whether the proposed approach can benefit downstream tasks; 2) what exactly the identified low-rank subspace captures.
To begin with, we describe our evaluation protocol for the alignment methods, which largely follows Yang et al. (2021) but with a broader scope to include more base models, as listed in Appendix B. Given one of the pretrained ML-LMs, we first randomly collect 10,000 sentences for each language from the OSCAR corpus (Ortiz Suárez et al., 2020), covering all the evaluated languages and their web crawl texts 2. The sentence embeddings extracted by the pretrained model are then used for finding the low-rank subspace described in Equation 1. Unless otherwise indicated, we consistently report LSAR with r = l − 1, where l is the number of the evaluated languages. We evaluate language agnosticism over commonly used pretrained ML-LMs, as described in Appendix B. Detailed results are listed in Appendix C.3.

Table 1: Top-1 retrieval accuracy (%) on Tatoeba for mBERT, XLM, XLM-R, and LABSE (cross-lingual zero-shot transfer, w/o finetuning; Original: 37.53 / 28.13 / 57.68 / 95.47). We use "-" to indicate results that are not reported in the references and use "+%" to report relative improvements.

Baselines
Apart from Original, which keeps the pretrained ML-LM intact, we compare LSAR with the following baselines. The baselines share the same setting as ours in that they require no parallel text and aim at removing language-specific factors in a post-training manner.
Centered Libovický et al. (2020) extract language-neutral embeddings from the original pretrained multilingual sentence encoders via subtracting the mean embedding for each language.
The mean embeddings are calculated from the multi-monolingual OSCAR corpus.
LIR Yang et al. (2021) propose to project away the top-k principal components of each language's embeddings to facilitate language agnosticism, where k is a hyperparameter. Again, the top principal components are extracted from the multi-monolingual corpus.
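For reference, both unsupervised baselines can be sketched in NumPy. The function names are ours, and the mean-centering before PCA in the LIR sketch is our assumption of a standard PCA pipeline rather than a detail confirmed by the reference.

```python
import numpy as np

def centered(E: np.ndarray) -> np.ndarray:
    """Centered baseline: subtract the language's mean embedding.
    E: (n, d) embeddings of a single language."""
    return E - E.mean(axis=0, keepdims=True)

def lir(E: np.ndarray, k: int) -> np.ndarray:
    """LIR-style baseline: project away the top-k principal components
    of a single language's embedding space."""
    X = E - E.mean(axis=0, keepdims=True)            # standard PCA centering (assumed)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    C = Vt[:k].T                                     # (d, k) top-k principal directions
    return E - (E @ C) @ C.T                         # factor out s = C C^T e
```

Note the contrast with LSAR: both functions look only at one language's E, whereas LSAR fits a single subspace to the mean embeddings of all languages jointly.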

Sentence Retrieval
Tatoeba (Artetxe and Schwenk, 2019) is a commonly used dataset for evaluating ML-LMs. It comprises up to 1,000 sentences for each language along with their English translations. We follow the evaluation procedure of XTREME (Hu et al., 2020), which covers 36 languages. For each language pair, we go through each sentence in the source language and find the closest sentence in the target language using cosine similarity.
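The retrieval protocol itself is straightforward to express. A minimal sketch (function name ours), assuming row i of the target matrix is the gold translation of row i of the source matrix:

```python
import numpy as np

def top1_retrieval_accuracy(src: np.ndarray, tgt: np.ndarray) -> float:
    """For each source sentence, retrieve the nearest target sentence by
    cosine similarity; row i of tgt is the gold translation of row i of src.

    src, tgt : (n, d) sentence-embedding matrices.
    """
    src_n = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt_n = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    preds = (src_n @ tgt_n.T).argmax(axis=1)   # nearest neighbour per source row
    return float((preds == np.arange(len(src))).mean())
```

The same function scores both the Original embeddings and their rectified counterparts, so the effect of any alignment method can be read off directly.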
The top-1 retrieval accuracy results are shown in Table 1. For mBERT (Devlin et al., 2019), XLM (Conneau and Lample, 2019), and XLM-R (Conneau et al., 2020a), applying LSAR brings significant performance gains of up to 19% relative improvement. Compared with Centered and LIR, which separately remove information for each language, LSAR jointly utilizes the encoded information from all the languages to better locate language-specific factors. Furthermore, we observe that LSAR consistently achieves the best results with the hyperparameter r (the rank of the low-rank subspace) equal to the number of the evaluated languages, as shown in Appendix C.1. As the languages are diversely distributed, it is reasonable that each language possesses its own linguistic characteristics, resulting in a larger language-specific subspace to factor out. In contrast, we find that LIR is vulnerable to its hyperparameter k (the number of removed principal components), as best shown in Figure 7.
For LABSE (Feng et al., 2022), all the methods fail to provide a marked enhancement. This can be mainly attributed to the fact that LABSE already uses parallel corpora to explicitly align multilingual embeddings. Although the improvement is marginal, it remains promising to combine LSAR with existing pretraining objectives to produce better language-agnostic embeddings.
We also include several representative baselines that finetune either mBERT or XLM-R for better cross-lingual transfer results. Although these methods are not directly comparable to ours, we believe including them provides additional valuable findings. Full-Model-FS and S 4 -Tuning finetune XLM-R on full English labeled examples and then on K-shot data over target languages (K = 64/128).
For Full-Model, the pretrained models are finetuned on the English SQuAD data. On mBERT, LSAR outperforms Full-Model by a large margin.
We also observe on XLM-R that LSAR is competitive with finetuning the full model on 128-shot data, as well as with finetuning a dedicated language sub-network (S 4 -Tuning) on 64-shot data. The results are quite promising given that we obtain better performances with the original encoders intact and no task-relevant training data.

Language-agnostic Answer Retrieval
While Tatoeba reveals cross-lingual transferability across English-centric language pairs, it is restricted to monolingual pools (i.e., the set of candidates is restricted to a certain language). Therefore, it fails to thoroughly evaluate whether texts with similar semantic meaning are grouped together in the latent space, regardless of their languages.
With that in mind, we further examine the alignment methods on LAReQA (Roy et al., 2020), a challenging cross-lingual answer retrieval task. Unlike Tatoeba, the targets of LAReQA must be retrieved from a large multilingual candidate pool. It consists of two sub-datasets, XQuAD-R and MLQA-R, whose candidate pools cover 11 and 7 languages respectively.
We follow Yang et al. (2021) to evaluate the alignment methods on two models, mBERT (En-En) and mBERT (X-X). Specifically, mBERT (En-En) finetunes the original mBERT model on the English QA pairs collected from the SQuAD v1.1 dataset. mBERT (X-X) employs the same training procedure but with an extended dataset where each sample is translated into the 11 XQuAD languages. Since all positive samples for finetuning are within the same language as the question query, both models exhibit strong self-language bias while preserving the weak alignment property. For evaluation, we use the dot product of embeddings to score a QA pair, which accords with the finetuning protocol. The retrieval performance is measured by mean Average Precision (mAP).
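A minimal mAP computation consistent with this protocol (dot-product scoring, one relevant answer per language in the pool) might look as follows; the function name and data layout are our own illustration.

```python
import numpy as np

def mean_average_precision(q: np.ndarray, cands: np.ndarray, relevant) -> float:
    """Score QA pairs by dot product and compute mean Average Precision.

    q        : (m, d) question embeddings.
    cands    : (c, d) candidate-answer embeddings (the multilingual pool).
    relevant : list of sets; relevant[i] holds the indices of correct
               answers for question i (one per language in LAReQA).
    """
    aps = []
    for qi, rel in zip(q, relevant):
        order = np.argsort(-(cands @ qi))        # rank candidates by descending score
        hits, precisions = 0, []
        for rank, idx in enumerate(order, start=1):
            if idx in rel:
                hits += 1
                precisions.append(hits / rank)   # precision at each relevant hit
        aps.append(sum(precisions) / len(rel))
    return float(np.mean(aps))
```

When language bias pushes same-language answers to the top, the cross-language relevant answers sink in the ranking and mAP drops, which is exactly what this metric is meant to penalize.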
Table 2 reports our LAReQA results. We observe that applying LSAR again results in significant improvements, nearly doubling the mAP of mBERT (X-X) on XQuAD-R. Since each language contributes one of the relevant answers to the candidate pool, better retrieval performance directly indicates better language agnosticism. Centered and LIR (k = 1) also show impressive performance, suggesting that in weakly aligned multilingual systems, the mean embeddings and principal components do encode language-specific signals. For LIR, however, removing just the first principal component consistently leads to the best performance, which is the opposite of what we observe on Tatoeba, where the optimal k is around 15.
To further illustrate the degree of language agnosticism, we project an English question (What theory best explains gravity?) as well as all candidates and the ground-truth answers in English, Thai, and Mandarin via PCA. As plotted in Figure 2, candidates in English are retrieved from mBERT (X-X) with higher priority than those in Thai and Mandarin. Applying LSAR effectively eliminates the strong language identity information residing in the original embedding space and draws the question and answers from different languages closer. LIR with k = 1, however, falls short of rectifying language-specific signals, as illustrated by the embedding spectrum in Figure 2b.

Zero-shot Classification
We also include the Amazon Reviews classification task (Prettenhofer and Stein, 2010) to assess zero-shot cross-lingual transfer. The dataset consists of product reviews in English, French, German, and Japanese. Each review is labeled as positive or negative, making it a binary classification task. We use the same procedure to extract sentence embeddings as in Section 4.2, and normalize them to make regularization hyperparameters more consistent across languages. Appendix C.1 specifies how we select hyperparameters. Following Yang et al. (2021), the performance is evaluated by training a logistic regression classifier 3 on the English training data and then evaluating it on the test sets of all four languages.
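A sketch of this evaluation loop with scikit-learn (the function name and the default regularization strength are ours; in the experiments the strength is selected as described in Appendix C.1):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def zero_shot_transfer(train_en, y_en, test_by_lang, C=1.0):
    """Train a logistic-regression classifier on normalized English
    sentence embeddings and evaluate it on every language's test set.

    train_en     : (n, d) English embeddings; y_en: (n,) binary labels.
    test_by_lang : dict mapping language code -> (X_test, y_test).
    Returns a dict mapping language code -> accuracy.
    """
    def normalize(X):
        return X / np.linalg.norm(X, axis=1, keepdims=True)

    clf = LogisticRegression(C=C, max_iter=1000).fit(normalize(train_en), y_en)
    return {lang: clf.score(normalize(X), y)
            for lang, (X, y) in test_by_lang.items()}
```

The classifier never sees non-English labels, so any accuracy above chance on the other test sets is attributable to the cross-lingual alignment of the embedding space.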
From Table 3, we observe that the classifier trained on English data benefits from LSAR for classifying reviews based on semantics, as the language-specific factors are effectively erased. Another interesting observation is that, unlike sentence retrieval, removing more directions does not result in better performance. This indicates that classification tasks can be more sensitive to semantic information.

Analysis
In this section, we analyze from a variety of angles what language-specific information LSAR captures.

Language-specific Signals are Rectified
From previous findings, we conjecture that our method achieves impressive cross-lingual performance by effectively removing language identity signals. To quantitatively verify this, we measure the strength of language identity information via clustering quality: if the embeddings cluster by language, language-specific signals still play a prominent role in the multilingual latent space. We perform K-Means clustering on sentence representations of Tatoeba with the number of clusters equal to the number of languages, and then evaluate the resulting clusters using the Normalized Mutual Information (NMI) metric (Jawahar et al., 2019), computed with sklearn.metrics.normalized_mutual_info_score. As shown in Table 4, the original pretrained embeddings have relatively high NMI scores, suggesting the existence of strong language identity information. Our method consistently achieves smaller NMI scores. This indicates that the embeddings have a lower tendency to group by language, since LSAR successfully winnows down language-specific information.
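This clustering probe is easy to reproduce with scikit-learn (function name ours):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

def language_clustering_nmi(embeddings: np.ndarray, lang_labels, seed=0) -> float:
    """Cluster sentence embeddings with K-Means (one cluster per language)
    and measure agreement with language identities via NMI.

    Lower NMI means the embeddings are less organized by language,
    i.e. weaker language identity signal.
    """
    n_langs = len(set(lang_labels))
    preds = KMeans(n_clusters=n_langs, n_init=10,
                   random_state=seed).fit_predict(embeddings)
    return normalized_mutual_info_score(lang_labels, preds)
```

Running it once on the Original embeddings and once on the rectified ones gives the before/after comparison reported in Table 4.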
The same conclusion can be drawn from the limit-to-one-target setting of LAReQA (Roy et al., 2020). Specifically, we remove 10 targets from the multilingual pool of XQuAD-R to evaluate on each target separately. We choose the most biased X-X variant as the base model. The heatmaps in Figure 3 show, for each question language (row), the retrieval mAP on the pool containing just one target in different answer languages (column). Since X-X has a strong self-language bias, Original shows better performance on the diagonal than off-diagonal. After applying LSAR, we observe a significant increase in average off-diagonal performance (23.76% vs. 5.89%), without sacrificing much on-diagonal performance (81.57% vs. 84.57%). This again verifies that applying LSAR effectively removes language-specific information.

Removed Components Form Groups of Language Families
We next examine whether the removed components found via the low-rank subspace are truly language-specific. This is demonstrated by plotting the removed components for different languages along the top basis vectors of the subspace. For ease of visualization, we group them by language family.
Figure 4 shows the histograms of removed components along the top two basis vectors extracted from mBERT on the 36 languages of Tatoeba, according to Equation 1. We can observe that the removed components disperse in groups of language families along these directions. This implies that the identified subspace does capture language-specific signals, and hence removing them along the basis vectors can narrow down the latent discrepancy.

The Identified Subspace Primarily Encodes Syntactic Information
Finally, given that the removed components are language-specific, we investigate to what extent the low-rank subspace encodes typological relations among languages. Specifically, we use the URIEL database (Littell et al., 2017) to collect distances between English and other languages, set out by experts based on certain typological information (e.g., syntax and phonology). We then compare the typological distances with the language similarities obtained from the removed language-specific embeddings s_L as well as the resulting language-agnostic embeddings a_L, by calculating the cosine similarity between languages' mean embeddings. Among all types of typological signals listed in URIEL, we find that the removed language-specific factors are mostly correlated with syntactic information. Table 5 shows the Pearson correlations on English and 36 other languages from Tatoeba. The removed language-specific component s_L is highly correlated with syntactic information, whereas the correlation is much smaller in the language-agnostic embedding space with s_L removed. This finding is in line with previous works (Chi et al., 2020; Zhao et al., 2021) observing that pretrained multilingual models encode rich syntactic information.

Table 5: Pearson correlations between syntactic language similarities obtained from the URIEL database, and the language similarities obtained from language-specific s_L as well as language-agnostic a_L.

      mBERT    XLM     XLM-R    LABSE
s_L   0.6910   0.6378  0.7526   0.6894
a_L  -0.2711   0.2239  0.1338  -0.2362
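A sketch of this correlation analysis (function names ours; in the real analysis the typological similarities come from URIEL and the vectors are per-language removed components):

```python
import numpy as np

def pearson_r(x, y):
    """Plain Pearson correlation between two equal-length vectors."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

def correlation_with_typology(s_by_lang, en_vec, typo_sim_by_lang):
    """Correlate cosine similarity between English's and each language's
    removed component s_L with an expert typological similarity score.

    s_by_lang        : dict lang -> (d,) removed language-specific vector.
    en_vec           : (d,) English removed component.
    typo_sim_by_lang : dict lang -> scalar typological similarity to English.
    """
    langs = sorted(s_by_lang)
    cos = [float(s_by_lang[l] @ en_vec /
                 (np.linalg.norm(s_by_lang[l]) * np.linalg.norm(en_vec)))
           for l in langs]
    typo = [typo_sim_by_lang[l] for l in langs]
    return pearson_r(cos, typo)
```

Applying the same routine to the language-agnostic components a_L instead of s_L yields the second row of Table 5.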
We find no prominent correlation between the removed components along certain basis vectors of the subspace and typological information.As we do not presuppose any correspondence between basis vectors and linguistic signals, a specific basis vector falls short of individually encoding languagespecific information.

Conclusion
We present a simple yet effective approach called LSAR to boost language agnosticism for pretrained multilingual encoders. LSAR identifies, in an unsupervised manner via singular value decomposition, a low-rank subspace residing in a pretrained model that primarily encodes language-specific signals. Once the subspace is discovered, it can be used to efficiently project away the language identity information. Empirical results demonstrate the great effectiveness of LSAR on semantic tasks and shed light on its ability to locate syntactic relations between languages.

Limitations
Our method LSAR is designed and evaluated for semantic tasks. For future work, we are interested in continuing our study to locate more fine-grained linguistic information, which could potentially benefit a larger variety of downstream tasks. While the simplicity of the proposed LSAR is appealing, it also opens up directions for future work, such as generalizing the first-moment mean embeddings to higher-moment statistics and combining LSAR with pretraining objectives in more sophisticated ways.
Theorem 1. For any matrix M ∈ R^{d×L}, Algorithm 1 returns μ ∈ R^d, M_s ∈ R^{d×r}, Γ ∈ R^{L×r} that minimize Equation 1, where μ ⊥ Span(M_s).
Proof. Algorithm 1 first obtains the best approximation of M with rank r + 1 and 1 in its row space (Lines 1-3). The orthogonality constraint μ ⊥ Span(M_s) is then enforced while preserving the low-rank property (Lines 4-5).
To begin with, note that the optimization problem in Equation 1 is equivalent to the following:

min_{μ′, N : rank(N) ≤ r} ‖M − μ′1^⊤ − N‖_F^2.   (2)

Let M = UΣV^⊤ be the SVD of M. Denote by U_r Σ_r V_r^⊤ the top-r component of UΣV^⊤, by σ_i(A) the i-th largest singular value of A, and by A_i the best rank-i approximation of A.
The first step is to show that μ′1^⊤ + U_r Σ_r V_r^⊤ minimizes the objective in Equation 2. Following the proof of the Eckart-Young-Mirsky theorem for low-rank approximation (Schmidt, 1907; Eckart and Young, 1936), let M̃ := M − M̄ with any feasible M̄ fixed, where the minimum is taken over all M̄ with rank(M̄) = i + r and 1 ∈ Span(M̄^⊤). Next, we find μ and M_s that meet the orthogonality constraint while preserving the low-rank structure.

B Base Models
We evaluate the alignment methods on a number of established pretrained multilingual models. We mainly build on the Transformers library (Wolf et al., 2020) for our experiments.
mBERT 5 Multilingual BERT (Devlin et al., 2019) is a transformer model (Vaswani et al., 2017) pretrained on Wikipedia, with the objective of Masked Language Modeling (MLM) and a shared vocabulary across all languages.
XLM 6 XLM (Conneau and Lample, 2019) also uses the MLM objective and the monolingual Wikipedia corpus for pretraining, with a larger model and a larger vocabulary.
LABSE 8 LABSE (Feng et al., 2022) is the state-of-the-art multilingual sentence encoder that leverages bilingual sentence pairs for pretraining.

Following previous works (Jawahar et al., 2019; Ruder et al., 2021) that observe certain intermediate layers of Transformers consistently outperform the last layer for cross-lingual tasks, we use the 8th layer for mBERT and XLM, and the 12th layer for XLM-R. We apply mean-pooling to obtain sentence embeddings, as is widely used (Conneau et al., 2020b; Muller et al., 2021). For LABSE, as well as the mBERT (X-X) and mBERT (En-En) models used in LAReQA, we evaluate the alignment methods on the original sentence embeddings.
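Masked mean-pooling over a chosen layer's hidden states can be sketched as follows (function name ours; the hidden states and attention mask would come from the Transformers model's intermediate-layer outputs):

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Mean-pool one layer's token embeddings into sentence embeddings,
    ignoring padding positions.

    token_embeddings : (batch, seq_len, d) hidden states.
    attention_mask   : (batch, seq_len) with 1 for real tokens, 0 for padding.
    """
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)   # sum over real tokens only
    counts = mask.sum(axis=1)                        # number of real tokens per sentence
    return summed / counts
```

The mask term matters: averaging over padded positions would dilute short sentences' embeddings and skew the per-language statistics that LSAR builds on.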

C Supplementary Results
In this section, we provide supplementary experimental results.

C.1 Hyperparameter Selection
For the considered baselines, we do not conduct a sophisticated hyperparameter search, given that it is non-trivial for LIR. To provide a fair comparison for LIR and LSAR, which both have a single hyperparameter (the number of top principal components k and the number of basis vectors spanning the low-rank subspace r, respectively), we exhaustively enumerate all values within a scope and report the best performances on the test data. Figure 7 shows the trend of accuracy on Tatoeba as the hyperparameters change.

C.2 Wiki-40B Results
In this section we list the results of LAReQA (Table 6) and Amazon Reviews (Tables 8-11) with Wiki-40B (Guo et al., 2020) 9 as the text resource.
For Amazon Reviews, we also report the performances obtained in the last layers to reproduce those in Yang et al. (2021).
For Amazon Reviews, we determine the L2 regularization strength using a hyperparameter sweep with a 5-fold cross-validation routine, over the range between 1e-4 and 1e4 with 10 logarithmically spaced steps. This training procedure is implemented using the Scikit-Learn library (Buitinck et al., 2013).
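This sweep can be sketched with scikit-learn (function name ours; scikit-learn's `C` is an inverse-regularization strength, which we assume here is the swept quantity):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def select_l2_strength(X, y, n_steps=10, folds=5):
    """Pick the inverse-regularization strength C by mean 5-fold CV
    accuracy, sweeping n_steps logarithmically spaced values in
    [1e-4, 1e4]."""
    grid = np.logspace(-4, 4, num=n_steps)
    scores = [cross_val_score(LogisticRegression(C=c, max_iter=1000),
                              X, y, cv=folds).mean()
              for c in grid]
    return grid[int(np.argmax(scores))]          # best C under CV accuracy
```

The selected value is then used to train the final classifier on the full English training set before zero-shot evaluation.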

C.3 OSCAR Results
The detailed results with OSCAR are provided in this section.
Tatoeba We report the results for all languages on Tatoeba in Tables 17-20. Additionally, the complete set of results for clustering performance is shown in Table 12.
LAReQA We report the detailed results on LAReQA in Table 7. We omit listing all languages due to limited space.
Amazon Reviews We provide the results for all languages on Amazon Reviews in Tables 13-16.

Figure 7: Accuracy on Tatoeba (averaged over all 36 languages) with different hyperparameters (k for LIR and r for LSAR). We observe that removing more principal components within each language for LIR does not result in better performance and can instead lead to information loss. For mBERT and XLM, the best k is found to be 17, whereas it is 14 for XLM-R. LSAR, however, consistently achieves the best results with r = 36, as larger subspaces encode more language-specific signals.

Table 20: Retrieval accuracy (%) on Tatoeba for each language (LABSE), using OSCAR as the text resource.
Figure 2: 2D PCA visualization on LAReQA. We display the embeddings collected from mBERT (X-X) on the XQuAD-R sub-dataset. Embeddings of the candidate answers (C) in English, Thai, and Mandarin are shown as small scatters. Embeddings of the question (Q) in English and the ground-truth answers (A) in English, Thai, and Mandarin are shown as large scatters. Higher opacity indicates higher predicted ranking.

Figure 3: Answer retrieval mAP on XQuAD-R broken down by question language (row) and answer language (column), with model mBERT (X-X). Only one correct answer is included in the multilingual candidate pool.

Figure 4: Removed components along the top two basis vectors of the identified low-rank subspace on mBERT.

Figure 5: Language similarity obtained from syntactic signals vs. language similarity measured by language-specific s_L of mBERT. Each point is a language.

Figure 6: Retrieval accuracy on Tatoeba (averaged over all 36 languages) at different layers.

Table 3: Classification accuracy (%) on Amazon Reviews (averaged over English, French, German, and Japanese). We exclude Centered as the embeddings are already normalized and hence Centered produces the same results as Original. The results of LABSE are placed in Appendix C due to limited space.

Table 4: Clustering performance (NMI) of embeddings obtained by mBERT on Tatoeba. The results of LABSE are placed in Appendix C due to limited space.

Table 6: Answer retrieval mAP (%) on XQuAD-R and MLQA-R of LAReQA (averaged over all languages), using Wiki-40B as the text resource.

Table 7: Answer retrieval mAP (%) on XQuAD-R and MLQA-R of LAReQA (averaged over all languages), using OSCAR as the text resource.

Table 13: Classification accuracy (%) on Amazon Reviews (mBERT), using OSCAR as the text resource.

Table 14: Classification accuracy (%) on Amazon Reviews (XLM), using OSCAR as the text resource.

Table 15: Classification accuracy (%) on Amazon Reviews (XLM-R), using OSCAR as the text resource.

Table 17: Retrieval accuracy (%) on Tatoeba for each language (mBERT), using OSCAR as the text resource.