A Joint Matrix Factorization Analysis of Multilingual Representations

We present an analysis tool based on joint matrix factorization for comparing latent representations of multilingual and monolingual models. An alternative to probing, this tool allows us to analyze multiple sets of representations in a joint manner. Using this tool, we study to what extent and how morphosyntactic features are reflected in the representations learned by multilingual pre-trained models. We conduct a large-scale empirical study spanning 33 languages and 17 morphosyntactic categories. Our findings demonstrate variations in the encoding of morphosyntactic information across upper and lower layers, with category-specific differences influenced by language properties. Hierarchical clustering of the factorization outputs yields a tree structure that is related to phylogenetic trees manually crafted by linguists. Moreover, we find that the factorization outputs exhibit strong associations with performance observed across different cross-lingual tasks. We release our code to facilitate future research.


Introduction
Pre-trained multilingual models (Conneau and Lample, 2019a; Conneau et al., 2020; Liu et al., 2020; Xue et al., 2021) have gained widespread adoption in recent years. They are initially pre-trained on many languages and subsequently fine-tuned for specific downstream tasks. The aim is to leverage the linguistic knowledge acquired from similar languages, thereby benefiting low-resource languages and enabling zero-shot cross-lingual transfer. While numerous prior works have demonstrated that these models have such abilities (Gerz et al., 2018; Ziser and Reichart, 2018; Aharoni et al., 2019; K et al., 2020; Muller et al., 2021; Fujinuma et al., 2022; Qiu et al., 2023), there are still open questions about the nature of the linguistic knowledge these models possess and the extent to which they acquire and incorporate linguistic information in their multilingual representations. Previous work has used singular vector canonical correlation analysis (SVCCA; Raghu et al., 2017) and other similarity statistics such as centered kernel alignment (CKA; Kornblith et al., 2019) to analyze multilingual representations (Singh et al., 2019; Kudugunta et al., 2019; Muller et al., 2021). However, such methods can only compare one pair of representation sets at a time. In contrast, we analyze all multilingual representations simultaneously using parallel factor analysis 2 (PARAFAC2; Harshman, 1972b), a method that allows us to factorize a set of representations jointly by decomposing it into multiple components that can be analyzed individually and then recombined to understand the underlying structure and patterns in the data. More precisely, we extend the sub-population analysis method recently presented by Zhao et al. (2022), who compare two models as an alternative to probing: a control model trained on the data of interest, and an experimental model that is identical to the control model but is trained on additional data from different sources. By treating the multilingual experimental model as a shared component in multiple comparisons with different control models (each a monolingual model trained on a subset of the multilingual model's training data), we can better analyze the multilingual representations. Our code is available at https://github.com/zsquaredz/joint_multilingual_analysis/.
As an alternative to probing, our representation analysis approach: a) enables standardized comparisons across languages within a multilingual model, circumventing the need for external performance upper bounds when interpreting performance metrics; b) directly analyzes model representations, avoiding the need for auxiliary probing models and the potential biases of specific probing classifier architectures; and c) compares multilingual versus monolingual representations for any inputs, avoiding reliance on labelled probing datasets.
We use PARAFAC2 to directly compare representations learned by multilingual models and their monolingual counterparts. We apply this efficient paradigm to answer the following research questions about multilingual models: Q1) How do multilingual language models encode morphosyntactic features in their layers? Q2) Are our findings robust across language subsets and low-resource settings? Q3) Are morphosyntactic typology and downstream task performance reflected in the factorization outputs?
We experiment with two kinds of models, XLM-R (Conneau et al., 2020) and RoBERTa (Liu et al., 2019). We apply our analysis tool to the multilingual and monolingual representations from these models to examine morphosyntactic information in 33 languages from the Universal Dependencies treebanks (UD; Nivre et al., 2017a). Our analysis reinforces recent findings on multilingual representations, such as the presence of language-neutral subspaces in multilingual language models (Foroutan et al., 2022), and yields the following key insights:
• The encoding of morphosyntactic information is influenced by language-specific factors such as the writing system and the number of unique characters.
• Multilingual representations demonstrate distinct encoding patterns in subsets of languages with low language proximity.
• Representation of low-resource languages benefits from the presence of related languages.
• The utility of our factorization method is demonstrated by hierarchical clustering that recovers phylogenetic trees and by the prediction of cross-lingual task performance.

Background and Motivation
In this paper, we propose to use PARAFAC2 for multilingual analysis. By jointly decomposing a set of matrices representing the cross-covariance between multilingual and monolingual representations, PARAFAC2 allows us to compare the representations across languages and their relationship to a multilingual model. For an integer n, we use [n] to denote the set {1, ..., n}. For a square matrix Σ, we denote by diag(Σ) its diagonal vector.
PARAFAC2 Let ℓ index a set of matrices such that A_ℓ = E[X_ℓ Z^⊤] is the matrix of cross-covariance between X_ℓ and Z, which are random vectors of dimensions d and d′, respectively. This means that for any ℓ and any two vectors a ∈ R^d and b ∈ R^{d′}, the following holds due to the linearity of expectation:

a^⊤ A_ℓ b = E[(a^⊤ X_ℓ)(Z^⊤ b)].

PARAFAC2 on the set of matrices {A_ℓ}_ℓ in this case finds a set of transformations {U_ℓ}_ℓ, V and a set of diagonal matrices {Σ_ℓ}_ℓ such that:

A_ℓ ≈ U_ℓ Σ_ℓ V^⊤.    (1)

We call the elements on the diagonal of Σ_ℓ pseudo-singular values, in relationship to singular value decomposition, which decomposes a single matrix in a similar manner. The decomposition in Eq. 1 jointly decomposes the matrices such that each A_ℓ is decomposed into a sequence of three transformations: first transforming Z into a latent space (V), scaling it (Σ_ℓ), and then transforming it into the specific ℓth-indexed X_ℓ space (U_ℓ). Unlike singular value decomposition, which decomposes a matrix into a similar sequence of transformations with orthonormal matrices, PARAFAC2 does not guarantee that U_ℓ and V are orthonormal, and hence they do not represent an orthonormal basis transformation. However, Harshman (1972b) showed that a solution can still be found and is unique if we add the constraint that U_ℓ^⊤ U_ℓ is constant for all ℓ. In our use of PARAFAC2, we follow this variant. We provide an illustration of PARAFAC2 in Figure 1.

Sub-population Analysis Following Zhao et al. (2022), we compare representations derived from an experimental model, contrasting them with representations derived from a set of control models. In our case, the experimental model is a jointly-trained multilingual pre-trained language model, and the control models are monolingual models trained separately for a set of languages. Formally, there is an index set of languages [L] and a set of models consisting of the experimental model E and the control models C_1, ..., C_L. We assume a set of inputs to which we apply our analysis, X = ∪_{ℓ=1}^{L} X_ℓ. Each set X_ℓ = {x_{ℓ,1}, ..., x_{ℓ,m}} represents a set of inputs for the ℓth language. While we assume, for simplicity, that all sets have the same size m, this does not have to be the case. In our case, each X_ℓ is a set of input words from language ℓ, which results in a set of representations as follows. For each ℓ ∈ [L] and i ∈ [m], we apply the model E and the model C_ℓ to x_{ℓ,i} to get two corresponding representations y_{ℓ,i} ∈ R^d and z_{ℓ,i} ∈ R^{d_ℓ}. Here, d is the dimension of the multilingual model and d_ℓ is the dimension of the representation of the monolingual model for the ℓth language. Stacking these sets of vectors separately into two matrices (per language ℓ), we obtain the set of paired matrices Y_ℓ ∈ R^{m×d} and Z_ℓ ∈ R^{m×d_ℓ}. We further calculate the cross-covariance matrix Ω_ℓ ∈ R^{d_ℓ×d}, defined as:

Ω_ℓ = (1/m) Z_ℓ^⊤ Y_ℓ.

Use of PARAFAC2 Given an integer k smaller than or equal to the dimensions of the covariance matrices, we apply PARAFAC2 on the set of joint matrices, decomposing each Ω_ℓ into:

Ω_ℓ ≈ U_ℓ Σ_ℓ V^⊤,    (2)

where U_ℓ ∈ R^{d_ℓ×k} and V ∈ R^{d×k}.
To provide some intuition on this decomposition, consider Eq. 2 for a fixed ℓ. If we were following SVD, such a decomposition would give two projections that map the monolingual and multilingual representations into a joint latent space (by applying U_ℓ to the zs and V to the ys, respectively). When applying PARAFAC2 jointly on the set of L matrices, we enforce the matrix V to be identical for all decompositions (rather than separately defined, as it would be if we applied SVD to each matrix separately) and allow U_ℓ to vary for each language. We are now approximating the Ω_ℓ matrix, which by itself can be thought of as transforming vectors from the multilingual space to the monolingual space (and vice versa), in three transformation steps: first into a latent space (V), scaling it (Σ_ℓ), and then specializing it monolingually (U_ℓ).
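As a concrete illustration, the cross-covariance Ω_ℓ for a single language can be computed as follows; this is a minimal sketch in which the shapes and the random data are placeholders, not our actual representations:

```python
import numpy as np

m, d, d_l = 10_000, 768, 768  # tokens, multilingual dim, monolingual dim
rng = np.random.default_rng(0)

# Placeholder representation matrices; in practice, Y stacks the multilingual
# model's vectors for the m tokens and Z the monolingual model's vectors.
Y = rng.normal(size=(m, d))
Z = rng.normal(size=(m, d_l))

# Empirical cross-covariance Omega in R^{d_l x d}: it maps vectors from the
# multilingual space to the monolingual one.
Omega = Z.T @ Y / m
print(Omega.shape)  # (768, 768)
```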
The diagonal of Σ_ℓ can now be readily used to describe a signature of the ℓth language's representations in relation to the multilingual model (see also Dubossarsky et al., 2020). This signature, which we denote by sig(ℓ) = diag(Σ_ℓ), can be used to compare the nature of representations between languages and their commonalities in relationship to the multilingual model. In our case, this PARAFAC2 analysis is applied to different slices of the data. We collect tokens in different languages (passing them through both the multilingual model and the monolingual models) and then slice them by specific morphosyntactic category, each time applying PARAFAC2 on a subset of them.
For some of our analyses, we also use a condensed value derived from sig(ℓ). We follow a similar averaging approach to that used by SVCCA (Raghu et al., 2017), a popular representation analysis tool, where they argue that the single condensed SVCCA score represents the average correlation across aligned directions and serves as a direct multidimensional analogue of Pearson correlation. In our case, each signature value within sig(ℓ) from the PARAFAC2 algorithm corresponds to a direction, all of which are normalized in length, so the signature values reflect their relative strength. Thus, taking the average of sig(ℓ) provides an intensity measure of the representation of a given language in the multilingual model. We provide additional discussion in §5.1.
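To make the full pipeline concrete, here is a minimal sketch using TensorLy's parafac2 (the library we use for the PARAFAC2 computation; see Appendix B.2). The matrices and sizes are random toy placeholders, and the exact return convention of parafac2 may differ slightly across TensorLy versions:

```python
import numpy as np
from tensorly.decomposition import parafac2

rng = np.random.default_rng(0)
d, k = 128, 16                 # multilingual dim, latent rank (toy sizes)
d_ls = [128, 128, 96]          # monolingual dims; may differ per language

# One cross-covariance matrix per language; all slices share the column
# dimension d (the multilingual space), while row counts may vary.
omegas = [rng.normal(size=(d_l, d)) for d_l in d_ls]

# Joint factorization: Omega_l ~= U_l @ diag(sig(l)) @ V.T with V shared.
weights, (A, B, C), projections = parafac2(
    omegas, rank=k, n_iter_max=100, random_state=0
)

# In TensorLy's convention, row l of A holds the per-slice scaling values
# (the pseudo-singular values sig(l), up to the global weights vector);
# C plays the role of the shared V, and U_l = projections[l] @ B.
for l in range(len(omegas)):
    sig_l = weights * A[l]
    print(f"language {l}: condensed signature = {sig_l.mean():.4f}")
```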

Experimental Setup
Data We use CoNLL's 2017 Wikipedia dump (Ginter et al., 2017) to train our models. Following Fujinuma et al. (2022), we downsample all Wikipedia datasets to an identical number of sequences so that we use the same amount of pre-training data for each language. In total, we experiment with 33 languages. For morphosyntactic features, we use treebanks from UD 2.1 (Nivre et al., 2017a). These treebanks contain sentences annotated with morphosyntactic information and are available for a wide range of languages. We obtain a representation for every word in the treebanks using our pre-trained models. We provide further details on our pre-training data and how we process morphosyntactic features in Appendix B.1.
Task For pre-training our models, we use masked language modeling (MLM). Following Devlin et al. (2019), we mask 15% of the tokens. To fully control our experiments, we follow Zhao et al. (2022) and train our models from scratch.
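For illustration, HuggingFace's standard MLM collator implements this masking scheme; this is a sketch rather than our exact training code, and the input sentence is a placeholder:

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# XLM-R's SentencePiece tokenizer, shared by all our configurations (see §3).
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

# Standard MLM collator: dynamically masks 15% of tokens per batch.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

batch = collator([tokenizer("A placeholder pre-training sentence.")])
print(batch["input_ids"].shape, batch["labels"].shape)
```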

Models
We have two kinds of models: the multilingual model E, trained using all L languages available, and the monolingual models C_ℓ for ℓ ∈ [L], each trained using only the ℓth language. We use the XLM-R (Conneau et al., 2020) architecture for the multilingual model E and RoBERTa (Liu et al., 2019) for the monolingual models C_ℓ. We use the base variant for both kinds of models. For all our experiments, we use XLM-R's vocabulary and the SentencePiece tokenizer (Kudo and Richardson, 2018) provided by Conneau et al. (2020). This enables us to support all the languages we analyze and ensures a fair comparison across all configurations. We provide additional details about our models and training in Appendix B.2.

Experiments and Results
This section outlines our research questions (RQs), experimental design, and obtained results.

Morphosyntactic and Language Properties
Here, we address RQ1: How do multilingual language models encode morphosyntactic features in their layers? While, in broad strokes, previous work (Hewitt and Manning, 2019; Jawahar et al., 2019; Tenney et al., 2019) showed that syntactic information tends to be captured in the lower to middle layers of a network, we ask a more refined question here and inspect whether different layers are specialized for specific morphosyntactic features, rather than providing an overall picture of all morphosyntax in a single layer. As mentioned in §3, we have a set of signatures, sig(ℓ) for ℓ ∈ [L], each describing the ℓth language representation for the corresponding morphosyntactic category we examine and the extent to which it utilizes information from each direction within the rows of V.
PARAFAC2 identifies a single transformation V that maps a multilingual representation into a latent space. Following that, the signature vector scales specific directions based on their importance for the final monolingual representation being produced. Therefore, the signature can be used to analyze whether similar directions in V are important for the transformation to the monolingual space. By using signatures of different layers in a joint factorization, we can identify comparable similarities for all languages. Analogous to the SVCCA similarity score (Raghu et al., 2017), we condense each signature vector into a single value by taking the average of the signature. This value encapsulates the intensity of the use of directions in V. A high average indicates that the corresponding language is well-represented in the multilingual model. We expect these values to exhibit a general trend (either decreasing or increasing) going from lower to upper layers, as lower layers are more general and upper layers are known to be more task-specific (Rogers et al., 2020). In addition, the trend may differ across languages and morphosyntactic features.
Language Signatures Across Layers We begin by presenting the distribution of average sig(ℓ) values for all languages across all layers for all lexical tokens in Figure 2a. We observe a gradual decrease in the mean of the distribution as we transition from lower to upper layers. This finding is consistent with that of Singh et al. (2019), who found that the similarity between representations of different languages steadily decreases up to the final layer in a pre-trained mBERT model. We used the Mann-Kendall (MK) statistical test (Mann, 1945; Kendall, 1948) for individual languages across all layers. The MK test is a rank-based non-parametric method used to assess whether a set of data values is increasing or decreasing over time, with the null hypothesis being that there is no clear trend. Since we perform multiple tests (33 in total, one per language), we also control the false discovery rate (FDR; at level q = 0.05) by correcting the p-values (Benjamini and Hochberg, 1995). We found that all languages except Arabic, Indonesian, Japanese, Korean, and Swedish exhibit significant monotonically decreasing trends from lower to upper layers, with FDR-adjusted p-values p < 0.05. Figure 2a also shows that the spread of the distribution for each layer (measured by variance) decreases steadily up until layer 6. From there onward, the spread increases again. A small spread indicates that the average intensity of scaling from a multilingual representation to the monolingual representation is similar among all languages. This provides evidence of the multilingual model aligning languages into a language-neutral subspace in the middle layers, with the upper layers becoming more task-focused (Merchant et al., 2020). This result is also supported by the findings of Muller et al. (2021): the similarity between different languages' representations in mBERT constantly increases up to a mid-layer and then decreases.
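A minimal sketch of this significance testing, assuming the pymannkendall and statsmodels packages (the signature trajectories below are random placeholders with an injected downward trend, not our measured values):

```python
import numpy as np
import pymannkendall as mk
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
n_languages, n_layers = 33, 13  # embedding layer + 12 transformer layers

# Placeholder average sig(l) per language across layers: a slight downward
# trend plus noise.
signatures = -0.02 * np.arange(n_layers) + 0.3 + rng.normal(
    scale=0.02, size=(n_languages, n_layers)
)

# Mann-Kendall test per language: the null hypothesis is "no monotonic trend".
p_values = [mk.original_test(sig).p for sig in signatures]

# Benjamini-Hochberg correction controls the FDR at q = 0.05.
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")
print(f"{reject.sum()} of {n_languages} languages show a significant trend")
```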

Logogram vs. Phonogram
In Figure 2a, we observe a long bottom tail in the average sig(ℓ) plots for all languages, with Chinese and Japanese showing lower values compared to the other languages, which are clustered together, suggesting that our models have learned distinct representations for those two languages. We investigated whether this relates to the logographic writing systems of these languages, which rely on symbols to represent words or morphemes rather than phonetic elements. We conducted an ablation study in which we romanized our Chinese and Japanese data into Pinyin and Romaji, respectively, and retrained our models. One might ask why we did not normalize the other languages in our experiment to use the Latin alphabet.
There are two reasons for this: 1) the multilingual model appears to learn them well, as evidenced by their signature values being similar to those of other languages; 2) our primary focus is on investigating the impact of logographic writing systems, and Chinese and Japanese are the only languages in our set employing logograms, while the others use phonograms. Figure 2b shows that, apart from the embedding layer, the average sig(ℓ) values are more closely clustered together after the ablation. Our findings suggest that logographic writing systems may present unique challenges for multilingual models, warranting further research to understand how these models process them. Although not explored further here, writing systems should be considered when developing and analyzing multilingual models.
Language Properties In addition to data size, we explore the potential relationship between language-specific properties and the average sig(ℓ).
We consider two language properties: the number of unique characters and the type-token ratio (TTR), a commonly used linguistic metric for assessing a text's vocabulary diversity. TTR is calculated by dividing the number of unique words (measured in lemmas) by the total number of words (measured in tokens), both obtained from the UD annotation metadata. Typically, a higher TTR indicates a greater degree of lexical variation. We present the Pearson correlation, averaged across all layers, in Figure 4.
To provide a comprehensive comparison, we include the results for data size as well. The detailed results for each layer can be found in Appendix C.
Examining the overall dataset, we observe a strong negative correlation between the number of unique characters and signature values. The TTR exhibits a similar negative correlation, indicating that higher language variation corresponds to lower signature values. When analyzing individual categories, we consistently find a negative correlation for both the number of unique characters and the TTR. This further supports our earlier finding that Chinese and Japanese have lower signature values compared to other languages, as they possess a higher number of unique characters and a higher TTR.
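For illustration, the TTR computation and its correlation with the condensed signatures might look as follows; this is a sketch with placeholder values (in practice the lemma and token counts come from the UD metadata and the signatures from our factorization):

```python
import numpy as np
from scipy.stats import pearsonr

def type_token_ratio(lemmas: list[str], tokens: list[str]) -> float:
    """TTR: number of unique words (lemmas) divided by total words (tokens)."""
    return len(set(lemmas)) / len(tokens)

# Toy example of the TTR computation.
lemmas = ["run", "run", "dog", "cat"]
tokens = ["runs", "running", "dogs", "cat"]
print(f"toy TTR = {type_token_ratio(lemmas, tokens):.2f}")

rng = np.random.default_rng(0)
n_languages = 33

# Placeholder per-language statistics with a built-in negative relationship.
ttr = rng.uniform(0.1, 0.6, size=n_languages)
avg_signature = 0.4 - 0.3 * ttr + rng.normal(scale=0.03, size=n_languages)

r, p = pearsonr(ttr, avg_signature)
print(f"Pearson r = {r:.2f} (p = {p:.3g})")
```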
Generalization to Fully Pre-trained Models To ensure equal data representation for all languages in our controlled experiments, we downsampled the Wikipedia dataset and used an equal amount of data for pre-training our multilingual models.
To check whether our findings also hold for multilingual pre-trained models trained on full-scale data, we conducted additional experiments using a public XLM-R checkpoint. The setup remained the same, except that we used representations obtained from this public XLM-R instead of our own trained XLM-R. We observe that the trends in signature values were generally similar, except for the embedding and final layers, where the values were very low. This was expected, as the cross-covariance was calculated with our monolingual models. The similar trend among the middle layers further supports the idea that these layers learn language- and data-agnostic representations. Furthermore, the Pearson correlations between the average sig(ℓ) for the overall dataset and the number of unique characters, TTR, and data size were -0.65, -0.28, and -0.02, respectively. These values are nearly identical to those shown in Figure 4, confirming the robustness of our method and its data-agnostic nature.

Language Proximity and Low-resource Conditions
Here, we address RQ2: Are our findings robust across language subsets and low-resource settings? In RQ1, our analysis was based on the full set of pre-training languages available for each morphosyntactic category we examine. Here, we aim to explore subsets of representations derived from either a related or a diverse set of pre-training languages, and whether such choices alter the findings established in RQ1. Furthermore, we extend our analysis to low-resource settings and explore potential changes in results for low-resource languages, particularly when these languages can receive support from other languages within the same language family. We also explore the potential benefits of employing language sampling techniques to enhance the representation of low-resource languages.

Language Proximity We obtain each related set of languages by grouping all languages that are from the same linguistic family and genus (full information is available in Appendix A). In total, we obtained three related sets: Germanic languages, Romance languages, and Slavic languages. There are other related sets, but we do not include them in our experiments since they are very small. For the diverse set of languages, we follow Fujinuma et al. (2022) and choose ten languages from different language genera with a diverse set of scripts: Arabic, Chinese, English, Finnish, Greek, Hindi, Indonesian, Russian, Spanish, and Turkish. We use the χ² variance test to check whether the variance of the diverse set's average signatures from a particular layer differs significantly from that of a related set, given a morphosyntactic category. We test layers 0 (the embedding layer), 6, and 12, covering the lower, middle, and upper layers of the model. We first find that, for the overall dataset, the variance of the diverse set's average signatures is significantly different (at α = 0.05) from all three related-set variances for all three layers. This suggests that, in general, multilingual representations are encoded differently for subsets of languages with low language proximity. For the attributes of number, person, and tense, the variance within the diverse set significantly differs from the variances within the three related sets across all three layers, at a significance level of α = 0.05. This finding is sensible, as all three attributes show distinctions within the diverse set of languages. For example, Arabic has dual nouns to denote the special case of two persons, animals, or things, and Russian has a special plural form of nouns when they occur after numerals. On the other hand, for attributes like gender, we do not observe a significant difference between the diverse set and the related sets, since there are only four possible values (masculine, feminine, neuter, and common) in the UD annotation for gender. We speculate that this low number of values leads to low variation among languages and thus the non-significant difference. This finding concurs with Stanczak et al. (2022), who observed a negative correlation between the number of values per morphosyntactic category and the proportion of language pairs with significant neuron overlap. Hence, the lack of significant differences in variance between the diverse and related sets can be attributed to the substantial overlap of neurons across language pairs.

Low-resource Scenario To simulate a low-resource scenario, we curtailed the training data for selected languages, reducing it to only 10% of its original size. The chosen low-resource languages were English, French, Korean, Turkish, and Vietnamese. English and French were selected due to the availability of other languages within the same language family, while the remaining languages were chosen for their absence of such familial relationships. Notably, Korean was specifically selected as it uses a distinct script, Hangul. To examine the impact of low-resource conditions on each of the selected languages, we re-trained our multilingual model with each individual language designated as low-resource. To address potential confounding factors, we also retrained the corresponding monolingual models on the reduced dataset. Additionally, we explored a sampling technique (Devlin, 2019) to enhance low-resource languages. Further details can be found in Appendix D.
Our analysis reveals the impact of low-resource conditions on signature values. English and French, which benefit from languages within the same language family, exhibit minimal changes in signature values, indicating that the effects of low-resource conditions on their representations are mitigated. The remaining languages, without such support, experience a significant decline in signature values (dropping from 0.3 to nearly 0), particularly after the embedding layer. This implies that low-resource languages struggle to maintain robust representations without assistance from related languages. Additionally, our findings suggest that language sampling techniques offer limited improvement in the signature values of low-resource languages.
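For completeness, here is a minimal sketch of the χ² variance test used in the language-proximity analysis above, treating a related set's variance as the reference value; the exact test variant and all data below are illustrative assumptions, not our analysis code:

```python
import numpy as np
from scipy.stats import chi2

def chi_square_variance_test(sample: np.ndarray, sigma2_ref: float) -> float:
    """Two-sided chi-square test of H0: Var(sample) == sigma2_ref."""
    n = len(sample)
    statistic = (n - 1) * sample.var(ddof=1) / sigma2_ref
    # Two-sided p-value from the chi-square distribution with n-1 dof.
    p_lower = chi2.cdf(statistic, df=n - 1)
    p_upper = chi2.sf(statistic, df=n - 1)
    return 2 * min(p_lower, p_upper)

rng = np.random.default_rng(0)
diverse = rng.normal(0.3, 0.08, size=10)  # avg signatures, diverse set (toy)
related_var = 0.02 ** 2                   # reference variance, related set

p = chi_square_variance_test(diverse, related_var)
print(f"p = {p:.3g}")
```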

Utility of Our Method
Here we address RQ3: Are morphosyntactic typology and downstream task performance reflected in the factorization outputs? Having conducted quantitative analyses of our proposed analysis tool thus far, our focus now shifts to exploring the tool's ability to unveil morphosyntactic information within multilingual representations and to establish a relationship between the factorization outputs and downstream task performance. To investigate these aspects, we conduct two additional experiments utilizing the signature vectors obtained from our analysis tool. First, we construct a phylogenetic tree using cosine distance matrices of all signature vectors. Then, we examine the correlations between the results of the XTREME benchmark (Hu et al., 2020) and the sig(ℓ) values.

Phylogenetic Tree
We first compute cosine distance matrices using all signature vectors for all 33 languages and 12 layers for each morphosyntactic attribute. Then, from each distance matrix, we use an agglomerative (bottom-up) hierarchical clustering method, the unweighted pair group method with arithmetic mean (UPGMA; Sokal and Michener, 1958), to construct a phylogenetic tree. We show the distance matrices between all language pairs and their signature vectors based on overall representations obtained from layers 0, 6, and 12 in Figure 5. We can observe that the signatures for Arabic, Chinese, Hindi, Japanese, and Korean are always distant from those of the other languages across layers. We present our generated trees and a discussion in Appendix E.1. In short, the constructed phylogenetic tree resembles linguistically-crafted trees.
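A minimal sketch of this clustering step with SciPy, whose "average" linkage criterion corresponds to UPGMA; the signature vectors and language codes below are placeholders:

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
languages = ["ar", "de", "en", "es", "fr", "hi", "ja", "ru", "zh"]

# Placeholder signature vectors (one per language, k = 32 directions).
signatures = rng.normal(size=(len(languages), 32))

# Condensed cosine-distance matrix between all language pairs.
distances = pdist(signatures, metric="cosine")

# UPGMA = agglomerative clustering with average linkage.
tree = linkage(distances, method="average")
dendrogram(tree, labels=languages, no_plot=True)  # drop no_plot to draw it
```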
Performance Prediction To establish a robust connection between our factorization outputs and downstream task performance, we conducted an analysis using the XTREME benchmark, which includes several models: mBERT (Devlin et al., 2019), XLM (Conneau and Lample, 2019b), XLM-R, and MMTE (Arivazhagan et al., 2019). This benchmark encompasses nine tasks spanning four categories: classification, structured prediction, question answering, and retrieval. These tasks demand reasoning on multiple levels of meaning. To evaluate the relationship between each task's metrics and our average sig(ℓ) across all available languages for that task, we calculated the Pearson correlation. For each task's performance metrics, we use the results reported by Hu et al. (2020). The correlation values obtained using signature values from the last layer are presented in Table 1, along with pertinent details about each task, such as the number of available languages and the metrics employed. For a comprehensive analysis, we also provide results using sig(ℓ) from every layer in Appendix E.2. Observing the results, it becomes evident that the XLM-R model exhibits the highest correlation, which is expected since the sig(ℓ) values obtained from our factorization process are also computed using the same architecture.
Furthermore, for most tasks, the highest correlation is observed in the final layers, which is reasonable considering their proximity to the output. Notably, we consistently observe high correlations across all layers for straightforward tasks like POS and PAWS-X that operate at the representation level. However, for complex reasoning tasks like XNLI, only the final layer achieves reasonable correlation. These results suggest that the factorization outputs can serve as a valuable indicator of downstream task performance, even without fine-tuning or the availability of task-specific data.

Related Work
Understanding the information within NLP models' internal representations has drawn increasing attention in the community. Singh et al. (2019) applied canonical correlation analysis (CCA) to the internal representations of a pre-trained mBERT and revealed that the model partitions representations for each language rather than using a shared interlingual space. Kudugunta et al. (2019) used SVCCA to investigate massively multilingual neural machine translation (NMT) representations and found that encoder representations of different languages group together based on linguistic similarity. Libovický et al. (2019) showed that mBERT representations can be split into a language-specific component and a language-neutral component by centering mBERT representations and using the centered representations on several probing tasks to evaluate their language neutrality. Similarly, Foroutan et al. (2022) employed the lottery ticket hypothesis to discover subnetworks within mBERT and found that mBERT comprises language-neutral and language-specific components, with the former having a greater impact on cross-lingual transfer performance. Muller et al. (2021) presented a novel layer-ablation approach and demonstrated that mBERT can be viewed as the stacking of two sub-networks: a multilingual encoder followed by a task-specific, language-agnostic predictor.
Probing (see Belinkov 2022 for a review) is a widely-used method for analyzing multilingual representations and quantifying the information they encode by training a parameterized model, but its effectiveness can be influenced by model parameters and evaluation metrics (Pimentel et al., 2020). Choenni and Shutova (2020) probed representations from multilingual sentence encoders and discovered that typological properties are persistently encoded across layers in mBERT and XLM-R. Liang et al. (2021) demonstrated with probing that language-specific information is scattered across many dimensions and can be projected into a linear subspace. Intrinsic probing, on the other hand, explores the internal structure of linguistic information within representations (Torroba Hennigen et al., 2020). Stanczak et al. (2022) conducted a large-scale empirical study over two multilingual pre-trained models, mBERT and XLM-R, and investigated whether morphosyntactic information is encoded in the same subsets of neurons across languages. Their findings reveal that there is considerable cross-lingual overlap between neurons, but that its magnitude varies among categories and depends on language proximity and pre-training data size. Other methods, such as matrix factorization techniques, are available for analyzing representations (Raghu et al., 2017; Morcos et al., 2018; Kornblith et al., 2019) and even modifying them through model editing (Olfat and Aswani, 2019; Shao et al., 2023a; Kleindessner et al., 2023; Shao et al., 2023b). When applied to multilingual analysis, these methods are limited to pairwise language comparisons, whereas our proposed method enables joint factorization of multiple representations, making it well-suited for multilingual analysis.

Conclusions
We introduce a representation analysis tool based on joint matrix factorization. We conduct a large-scale empirical study over 33 languages and 17 morphosyntactic categories and apply our tool to compare the latent representations learned by the multilingual and monolingual models from the study. Our findings show variations in the encoding of morphosyntactic information across different layers of multilingual models. Language-specific differences contribute to these variations, influenced by factors such as writing systems and linguistic relatedness. Furthermore, the factorization outputs exhibit strong correlations with cross-lingual task performance and produce a phylogenetic tree structure resembling those constructed by linguists. These findings contribute to our understanding of language representation in multilingual models and have practical implications for improving performance on cross-lingual tasks. In future work, we would like to extend our analysis tool to examine representations learned by multimodal models.

Limitations
Our research has several limitations. First, we only used RoBERTa and its multilingual variant, XLM-R, for our experiments. While these models are widely used in NLP research, there are other options available, such as BERT, mBERT, T5, and mT5, which we have yet to explore due to a limited budget of computational resources. Second, to ensure equal data representation for all the languages we experimented with, we downsampled Wikipedia, resulting in a corpus of around 200MB per language. While we validated our findings against a publicly available XLM-R checkpoint trained on a much larger resource, further verification is still necessary. Third, our analyses are limited to morphosyntactic features; in the future, we aim to expand our scope to include other linguistic aspects, such as semantics and pragmatics. Note that in this list of languages and throughout our work, we omit a language for a particular morphosyntactic category if it has fewer than 100 instances labeled for that category.

B Additional Details for Experiments

B.1 Details for Data
We use CoNLL's 2017 Wikipedia dump (Ginter et al., 2017) to train our models. Following Fujinuma et al. (2022), we downsample all Wikipedia datasets to an identical number of sequences in order to use the same amount of pre-training data for every language. The downsampled dataset size is standardized to that of the Hindi corpus, which is the smallest among all the languages we examine. Each language's pre-training data contains about 30M tokens (approximately 200MB). In total, we experiment with 33 languages. We provide the full list of languages used for our experiments in Appendix A. We also create a validation set of 1K sequences (about 512 tokens per sequence) to measure model loss (cross-entropy) during pre-training.
For morphosyntactic features, we use treebanks from UD 2.1 (Nivre et al., 2017a). These treebanks contain sentences annotated with morphosyntactic information and are available for a wide range of languages. We obtain a contextual representation for every word in the treebanks by feeding them to our multilingual/monolingual models. We then use the UniMorph schema (Kirov et al., 2018) to map each word to its part of speech and morphosyntactic properties. We provide the list of morphosyntactic categories we use in Appendix A. We follow Stanczak et al. (2022) and use the converter of McCarthy et al. (2018) to convert morphosyntactic annotations from UD v2.1 to the UniMorph schema.

B.2 Details for Models
We use the XLM-R (Conneau et al., 2020) architecture for the multilingual E model, and we use RoBERTa (Liu et al., 2019) for the monolingual C_ℓ models. We use the base variant for both kinds of models, which consists of 12 layers and 768 hidden dimensions, with 8 attention heads for RoBERTa and 12 attention heads for XLM-R. We use XLM-R's vocabulary and the SentencePiece tokenizer (Kudo and Richardson, 2018) provided by Conneau et al. (2020) for all our experiments, in order to support all the languages we analyze and enable a fair comparison across all configurations. We do not use the original RoBERTa vocabulary and tokenizer since they only support English. We pre-train all models for a maximum of 150K steps, and all models use the validation-set cross-entropy loss for early stopping. We use the AdamW optimizer (Loshchilov and Hutter, 2019) with a learning rate of 10^-4. Our monolingual models were trained on four NVIDIA GeForce GTX 1080 Ti GPUs with a batch size of two per GPU, and our multilingual models were trained on four NVIDIA A100 GPUs with a batch size of 16 per GPU. Both kinds of models take about two days to train. We use PyTorch (Paszke et al., 2019), the HuggingFace library (Wolf et al., 2020), and the TensorLy library (Kossaifi et al., 2019) for all model implementations and the PARAFAC2 computation.
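For illustration, instantiating the two kinds of models from scratch with the shared tokenizer might look as follows; this is a sketch under the configuration above, and any hyperparameter not set explicitly falls back to the HuggingFace defaults:

```python
from transformers import (
    AutoTokenizer,
    RobertaConfig,
    RobertaForMaskedLM,
    XLMRobertaConfig,
    XLMRobertaForMaskedLM,
)

# Shared SentencePiece tokenizer from XLM-R for all configurations.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

# Multilingual experimental model E: XLM-R base, randomly initialized.
e_config = XLMRobertaConfig(vocab_size=len(tokenizer))
e_model = XLMRobertaForMaskedLM(e_config)

# Monolingual control model C_l: RoBERTa base with 8 attention heads,
# sharing XLM-R's vocabulary.
c_config = RobertaConfig(vocab_size=len(tokenizer), num_attention_heads=8)
c_model = RobertaForMaskedLM(c_config)
```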

D Additional Results for RQ2
Adhering to the current standard practice of language sampling during the pre-training of multilingual models, we also experimented with a setting inspired by the approach described by Devlin (2019). Following their approach, we applied a sampling technique to boost the representation of lower-resource languages. This involves sampling examples based on the probability P(L) ∝ |L|^α, where P(L) is the probability of selecting text from a given language during pre-training, and |L| denotes the number of examples available in that language. For our study, we set α to 0.3.
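A minimal sketch of this exponentiated-smoothing computation (the example counts are placeholders):

```python
import numpy as np

def sampling_probabilities(counts: np.ndarray, alpha: float = 0.3) -> np.ndarray:
    """P(L) proportional to |L|^alpha, normalized over all languages."""
    weights = counts.astype(float) ** alpha
    return weights / weights.sum()

# Placeholder example counts for three languages, one low-resource.
counts = np.array([1_000_000, 500_000, 10_000])
print(sampling_probabilities(counts))  # low-resource share is boosted
print(counts / counts.sum())           # vs. plain proportional sampling
```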

E Additional Results for RQ3
We include in this section a phylogenetic tree analysis based on our approach, and a performance prediction experiment.

E.1 Phylogenetic Tree
The tree generated from layer 6's matrix is provided in Figure 8a.
There is ongoing discussion over the specifics of the linguistic evolutionary phylogenetic tree of languages, and a tree model has limitations because not all evolutionary connections are fully hierarchical and it is difficult to account for horizontal transmission (Singh et al., 2019). Despite this, we can still see that the constructed phylogenetic tree closely matches the language tree that linguists created to describe the relationships and development of human languages. We can see that, generally, Germanic, Romance, and Slavic languages are clustered in different sub-trees. In particular, West Slavic, South Slavic, and East Slavic languages are generally clustered together before being combined into the common Slavic language family. Also, the Eastern Romance language Romanian is merged with the Western Romance languages to form the Romance language family cluster. Similar to the findings of Singh et al. (2019), we also observe that trees generated across different layers are generally similar. They may have different structures, as the branching of the tree may differ, but languages within the same family or genus remain close in the tree.
So far, we have constructed trees based on the full slice of the data using representations from the PoS attribute.We also tried to generate trees using all other morphosyntactic attributes.However, since for most morphosyntactic attributes, some

Figure 1: A diagram of the matrix factorization that PARAFAC2 performs. For our analysis, A_ℓ ranges over the cross-covariance matrices between the multilingual model's representations and the ℓth monolingual model's representations.

Figure 2: Average signature violin plots for all layers and languages on (a) the original data and (b) the data with Chinese (zh) and Japanese (ja) romanized.

Figure 3: Pearson correlation between the average sig(ℓ) for all languages and their data size, for each morphosyntactic category, across all layers.

Figure 4: Pearson correlation between the average sig(ℓ) for all languages and the number of unique characters, type-token ratio (TTR), and data size, for each morphosyntactic category, averaged across all layers.

Figure 5: Cosine distance matrices between all language pairs and their signature vectors based on overall representations obtained from layers 0, 6, and 12. Darker color indicates a cosine distance close to 1.

Figure 6: Pearson correlation between the average sig(ℓ) for all languages and their number of unique characters, for each morphosyntactic category, across all layers.

Figure 7: Pearson correlation between the average sig(ℓ) for all languages and their type-token ratio (TTR), for each morphosyntactic category, across all layers.

Table 1: Pearson correlations between the final layer's sig(ℓ) and XTREME benchmark performance on various tasks.