Identifying the Correlation Between Language Distance and Cross-Lingual Transfer in a Multilingual Representation Space

Prior research has investigated the impact of various linguistic features on cross-lingual transfer performance. In this study, we investigate how this effect can be mapped onto the representation space. While past studies have focused on the impact of fine-tuning on cross-lingual alignment in multilingual language models (MLLMs), this study examines the absolute evolution of the respective languages' representation spaces produced by MLLMs. We place specific emphasis on the role of linguistic characteristics and investigate their inter-correlation with the impact on representation spaces and cross-lingual transfer performance. Additionally, this paper provides preliminary evidence of how these findings can be leveraged to enhance transfer to linguistically distant languages.


Introduction
It has been shown that language models implicitly encode linguistic knowledge (Jawahar et al., 2019; Otmakhova et al., 2022). In the case of multilingual language models (MLLMs), previous research has also extensively investigated the influence of these linguistic features on cross-lingual transfer performance (Lauscher et al., 2020; Dolicki and Spanakis, 2021; de Vries et al., 2022). However, limited attention has been paid to the impact of these factors on the language representation spaces of MLLMs.
Although state-of-the-art MLLMs such as mBERT (Devlin et al., 2019) and XLM-R (Conneau et al., 2020) use a shared vocabulary and are intended to project text from any language into a language-agnostic embedding space, empirical evidence has demonstrated that these models encode language-specific information across all layers (Libovický et al., 2020; Gonen et al., 2020). This makes it possible to identify distinct monolingual representation spaces within the shared multilingual representation space (Chang et al., 2022).
Past research has focused on the cross-linguality of MLLMs during fine-tuning, specifically on the alignment of representation spaces of different language pairs (Singh et al., 2019; Muller et al., 2021). Our focus, instead, is directed towards the absolute impact on the representation space of each language individually, rather than the relative impact on the representation space of one language compared to another. Isolating the impact for each language enables a more in-depth study of the inner modifications that occur within MLLMs during fine-tuning. The main objective of our study is to examine the role of linguistic features in this context, as previous research has shown their impact on cross-lingual transfer performance. More specifically, we examine the relationship between the impact on the representation space of a target language after fine-tuning on a source language and five different language distance metrics. We observe such relationships across all layers, with a trend of stronger correlations in the deeper layers of the MLLM and significant differences between language distance metrics.
Additionally, we observe an inter-correlation among language distance, impact on the representation space, and transfer performance. Based on this observation, we propose a hypothesis that may assist in enhancing cross-lingual transfer to linguistically distant languages and provide preliminary evidence to suggest that further investigation of our hypothesis is merited.

Related Work
In monolingual settings, Jawahar et al. (2019) found that, after pre-training, BERT encodes different linguistic features in different layers. Merchant et al. (2020) showed that language models do not forget these linguistic structures during fine-tuning on a downstream task. Conversely, Tanti et al. (2021) have shown that during fine-tuning in multilingual settings, mBERT forgets some language-specific information, resulting in a more cross-lingual model.
At the representation space level, Singh et al. (2019) and Muller et al. (2021) studied the impact of fine-tuning on mBERT's cross-linguality layer-wise. However, their research was limited to evaluating the impact on cross-lingual alignment by comparing the representation space of one language to another, rather than assessing the evolution of a language's representation space in isolation.

Experimental Setup
In this paper, we focus on the effect of fine-tuning on the representation space of the 12-layer multilingual BERT model (bert-base-multilingual-cased). We restrict our focus to the Natural Language Inference (NLI) task and fine-tune on each of the 15 languages of the XNLI dataset (Conneau et al., 2018) individually. We use the test set to evaluate the zero-shot cross-lingual transfer performance, measured as accuracy, and to generate the embeddings that define the representation space of each language. More details on the training process and its reproducibility are provided in Appendix A.
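To make the setup concrete, the following is a minimal sketch of fine-tuning and zero-shot evaluation using the Hugging Face transformers and datasets libraries. The hyperparameters, the choice of source language, and the target-language subset are illustrative assumptions, not the exact configuration used in the paper (see Appendix A for that).

import numpy as np
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

SOURCE_LANG = "en"  # the paper fine-tunes on each of the 15 XNLI languages in turn

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=3)

def tokenize(batch):
    return tokenizer(batch["premise"], batch["hypothesis"],
                     truncation=True, max_length=128)

train_data = load_dataset("xnli", SOURCE_LANG, split="train").map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mbert-xnli", num_train_epochs=2,
                           per_device_train_batch_size=32),
    train_dataset=train_data,
    tokenizer=tokenizer,
)
trainer.train()

# Zero-shot evaluation: predict on the XNLI test set of each target language.
for target_lang in ("de", "sw", "zh"):  # illustrative subset of the 15 languages
    test_data = load_dataset("xnli", target_lang, split="test").map(tokenize, batched=True)
    preds = trainer.predict(test_data)
    accuracy = (np.argmax(preds.predictions, axis=-1) == preds.label_ids).mean()
    print(f"{target_lang}: {accuracy:.3f}")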

Measuring the Impact on the Representation Space
We focus on measuring the impact on a language's representation space in a pre-trained MLLM during cross-lingual transfer. We accomplish this by measuring the similarity of hidden representations of samples from different target languages before and after fine-tuning in various source languages. For this purpose, we use the Centered Kernel Alignment (CKA) method (Kornblith et al., 2019). When using a linear kernel, the CKA score of two representation matrices X ∈ R^{N×m} and Y ∈ R^{N×m}, where N is the number of data points and m is the representation dimension, is given by

CKA(X, Y) = ∥Y^T X∥_F^2 / (∥X^T X∥_F ∥Y^T Y∥_F),

where ∥·∥_F is the Frobenius norm.
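A small numpy sketch of linear CKA as given above; this is a straightforward transcription of the formula from Kornblith et al. (2019), not the authors' exact implementation.

import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear CKA between two representation matrices of shape (N, m).

    Columns are mean-centered first, as in Kornblith et al. (2019).
    """
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, ord="fro")
    norm_y = np.linalg.norm(Y.T @ Y, ord="fro")
    return cross / (norm_x * norm_y)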
Notation. We define H^i_{S→T} ∈ R^{N×m} as the hidden representation of N samples from a target language T at the i-th attention layer of a model fine-tuned in the source language S, where m is the hidden layer output dimension. Similarly, we denote the hidden representation of N samples from a language L at the i-th attention layer of the pre-trained base model (i.e., before fine-tuning) as H^i_L ∈ R^{N×m}. More specifically, the representation space of each language is represented by the stacked hidden states of its samples.
We define the impact on the representation space of a target language T at the i-th attention layer when fine-tuning in a source language S as follows:

Impact^i_{S→T} = 1 − CKA(H^i_T, H^i_{S→T})
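As a hedged sketch of how this quantity could be computed in practice: extract the layer-i hidden states of the same N target-language samples from the pre-trained and the fine-tuned model, then apply the linear CKA function defined above. Mean-pooling over tokens and the helper names are our assumptions for illustration, not necessarily the authors' exact procedure.

import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

def layer_representations(model_name, texts, layer, device="cpu"):
    """Stack one pooled hidden vector per sample at attention layer `layer` (1..12)."""
    model = AutoModel.from_pretrained(model_name, output_hidden_states=True).to(device).eval()
    reps = []
    with torch.no_grad():
        for text in texts:
            inputs = tokenizer(text, return_tensors="pt", truncation=True).to(device)
            # hidden_states[0] is the embedding output; indices 1..12 are the attention layers
            hidden = model(**inputs).hidden_states[layer]             # (1, seq_len, m)
            reps.append(hidden.mean(dim=1).squeeze(0).cpu().numpy())  # assumption: mean-pool over tokens
    return np.stack(reps)                                             # (N, m)

def impact(base_model, finetuned_model, target_texts, layer):
    """Impact on the target language's representation space at `layer`: 1 - CKA(before, after)."""
    H_before = layer_representations(base_model, target_texts, layer)
    H_after = layer_representations(finetuned_model, target_texts, layer)
    return 1.0 - linear_cka(H_before, H_after)  # linear_cka as defined above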

Measuring Language Distance
In order to quantify the distance between languages, we use three types of typological distances, namely the syntactic (SYN), geographic (GEO) and inventory (INV) distance, as well as the genetic (GEN) and phonological (PHON) distance between source and target language. These distances are pre-computed and are extracted from the URIEL Typological Database (Littell et al., 2017) using lang2vec. For our study, such language distances based on aggregated linguistic features offer a more comprehensive representation of the relevant language distance characteristics than individual linguistic features. More information on these five metrics is provided in Appendix B.
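As an illustration, the pre-computed URIEL distances can be queried with the lang2vec package. The snippet below reflects lang2vec's distance() interface as we understand it and uses ISO 639-3 language codes; the language pair is an arbitrary example.

import lang2vec.lang2vec as l2v

# Query the five pre-computed URIEL distances between a source and a target language.
# Languages are identified by ISO 639-3 codes (e.g., "eng" for English, "swa" for Swahili).
distances = {
    name: l2v.distance(name, "eng", "swa")
    for name in ["syntactic", "geographic", "inventory", "genetic", "phonological"]
}
print(distances)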

Correlation Analysis
Relationship Between the Impact on the Representation Space and Language Distance. Given the layer-wise differences in mBERT's cross-linguality (Libovický et al., 2020; Gonen et al., 2020), we measure the correlation between the impact on the representation space and the language distances across all layers. Figure 1 shows almost no significant correlation between representation space impact and inventory or phonological distance. Geographic and syntactic distance mostly show significant correlation values at the last layers. Only the genetic distance correlates significantly across all layers with the impact on the representation space.

Figure 1: Pearson correlation coefficient between the impact on a target language's representation space when fine-tuning in a source language and different types of linguistic distances between the source and target language, for each layer. Same source-target language pair data points were excluded in order to prevent an overestimation of effects. (* p < 0.05, ** p < 0.01, two-tailed)
Relationship Between Language Distance and Cross-Lingual Transfer Performance. Table 1 shows that all distance metrics correlate with cross-lingual transfer performance, which is consistent with the findings of Lauscher et al. (2020). Furthermore, we note that the correlation strengths align with the previously established relationship between language distance and representation space impact, with higher correlation values observed for syntactic, genetic, and geographic distance than for inventory and phonological distance.

Relationship Between the Impact on the Representation Space and Cross-Lingual Transfer Performance. In general, cross-lingual transfer performance clearly correlates with the impact on the representation space of the target language, but this correlation tends to be stronger in the deeper layers of the model (Table 2; * p < 0.01, two-tailed).
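A sketch of the layer-wise correlation analysis, assuming the impact scores and lang2vec distances have already been collected into nested dictionaries; variable names and the language subset are illustrative, while the exclusion of same source-target pairs mirrors the description above.

from itertools import product

from scipy.stats import pearsonr

LANGS = ["en", "de", "fr", "zh", "sw"]  # illustrative subset of the 15 XNLI languages

def layerwise_correlation(impact, distance, layer):
    """Pearson correlation between impact[src][tgt][layer] and distance[src][tgt],
    excluding same source-target pairs to avoid overestimating effects."""
    xs, ys = [], []
    for src, tgt in product(LANGS, LANGS):
        if src == tgt:
            continue
        xs.append(impact[src][tgt][layer])
        ys.append(distance[src][tgt])
    r, p = pearsonr(xs, ys)
    return r, p

# e.g., correlation with genetic distance at every attention layer of mBERT:
# for layer in range(1, 13):
#     print(layer, layerwise_correlation(impact_scores, genetic_distance, layer))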
Does Selective Layer Freezing Allow Us to Improve Transfer to Linguistically Distant Languages?
In the previous section, we observed an inter-correlation between cross-lingual transfer performance, the linguistic distance between the target and source language, and the impact on the representation space. Given this observation, we investigate the possibility of using this information to improve transfer to linguistically distant languages. More specifically, we hypothesize that it may be possible to regulate cross-lingual transfer performance by selectively interfering with the previously observed correlations at specific layers. A straightforward strategy is to selectively freeze, during the fine-tuning process, those layers at which a significant negative correlation between the impact on their representation space and the distance between source and target languages has been observed. By freezing a layer, we manually set the correlation between the impact on the representation space and language distance to zero, which may simultaneously reduce the significance of the correlation between language distance and transfer performance.
Wu and Dredze (2019) already showed that freezing early layers of mBERT during fine-tuning may lead to increased cross-lingual transfer performance. With the same goal in mind, Xu et al. (2021) employ meta-learning to select layer-wise learning rates during fine-tuning. In what follows, however, we do not focus on pure overall transfer performance. Our approach is to specifically target transfer performance improvements for target languages that are linguistically distant from the source language, rather than trying to achieve equal transfer performance increases for all target languages.

Experimental Setup
For our pilot experiments, we focus on English as the source language. Additionally, we choose to carry out our pilot experiments on layers 1, 2, 5, and 6, as the representation space impact at these layers exhibits low correlation values with transfer performance (Table 2) and high correlations with different language distances (Figure 2 in Appendix C). This decision is made to mitigate the potential impact on the overall transfer performance, which could obscure the primary effect of interest, and to simultaneously target layers which might be responsible for the transfer gap to distant languages. We conduct three different experiments aiming to regulate correlations between specific language distances and transfer performance. In an attempt to diversify our experiments, we aim to decrease the transfer performance gap for both a single language distance metric (Experiment A) and multiple distance metrics (Experiment C). Furthermore, in another experiment, we deliberately aim to increase the transfer gap (Experiment B).
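A minimal sketch of freezing specific mBERT encoder layers during fine-tuning, assuming the Hugging Face transformers model structure; the frozen layer sets correspond to the pilot experiments described above, but the snippet itself is illustrative rather than the exact training code.

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=3)

def freeze_layers(model, layers_to_freeze):
    """Disable gradient updates for the given mBERT encoder layers (1-indexed)."""
    for idx in layers_to_freeze:
        for param in model.bert.encoder.layer[idx - 1].parameters():
            param.requires_grad = False

freeze_layers(model, layers_to_freeze=[2])          # Experiment A
# freeze_layers(model, layers_to_freeze=[5])        # Experiment B
# freeze_layers(model, layers_to_freeze=[1, 2, 6])  # Experiment C
# The partially frozen model is then fine-tuned on English XNLI as before.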

Results
Table 3 provides the results of all three experiments.

Experiment A. The 2nd layer shows a strong negative correlation (-0.66) between representation space impact and inventory distance to English. Freezing the 2nd layer during fine-tuning led to a less significant correlation between inventory distance and transfer performance (+0.0116).
Experiment B. The 5th layer shows a strong positive correlation (0.499) between representation space impact and phonological distance to English. Freezing the 5th layer during fine-tuning led to a more significant correlation between phonological distance and transfer performance (-0.012).
Experiment C. The 1st, 2nd and 6th layers show a strong negative correlation between the impact on the representation space and the syntactic (-0.618), inventory (-0.66) and phonological (-0.543) distance to English, respectively. Freezing the 1st, 2nd and 6th layers during fine-tuning led to a less significant correlation of transfer performance with syntactic (+0.0029) and phonological (+0.011) distance.

Conclusion
In previous research, the effect of fine-tuning on a language representation space was usually studied in relative terms, for instance by comparing the cross-lingual alignment between two monolingual representation spaces before and after fine-tuning. Our research, however, focused on the absolute impact on the language-specific representation spaces within the multilingual space and explored the relationship between this impact and language distance. Our findings suggest that there is an inter-correlation between language distance, impact on the representation space, and transfer performance, which varies across layers. Based on this finding, we hypothesize that selectively freezing, during fine-tuning, the layers at which specific inter-correlations are observed may help to reduce the transfer performance gap to distant languages. Although our hypothesis is only supported by three pilot experiments, we anticipate that it may stimulate further research that includes an assessment of our hypothesis.

Limitations
It is important to note that the evidence presented in this paper is not meant to be exhaustive, but rather to serve as a starting point for future research. Our findings are based on a set of 15 languages and a single downstream task and may not generalize to other languages or settings. Additionally, the proposed hypothesis has been tested through a limited number of experiments, and more extensive studies are required to determine its practicality and effectiveness.
Furthermore, in our study we limited ourselves to traditional correlation coefficients, which are restricted in terms of the relationships they can capture; it is possible that there are additional relationships that could further strengthen our results and conclusions.
B Language Distance Metrics

1. Syntactic Distance is the cosine distance between the syntactic feature vectors of languages, sourced from WALS, SSWL and Ethnologue.
2. Geographic Distance refers to the shortest distance between two languages on the surface of the earth's sphere, also known as the orthodromic distance.
3. Inventory Distance is the cosine distance between the inventory feature vectors of languages, sourced from the PHOIBLE database (Moran et al., 2019).
4. Genetic Distance is based on the Glottolog (Hammarström et al., 2015) tree of language families and is obtained by computing the distance between two languages in the tree.
5. Phonological Distance is the cosine distance between the phonological feature vectors of languages, sourced from WALS and Ethnologue.

C Additional Figures
Figure 2 provides Pearson correlation coefficients between the impact on the target language's representation space when fine-tuning in English and different types of linguistic distances between English and the target language, for each layer. English-English data points were excluded in order to prevent an overestimation of effects.
Table 2: Pearson correlation coefficients between cross-lingual transfer performance and the impact on the representation space of the target language.

Table 3: Pearson correlation coefficients quantifying the relationship between cross-lingual transfer performance and different language distance metrics after freezing different layers during fine-tuning. The first row contains baseline values for full-model fine-tuning. The last column provides the average cross-lingual transfer performance (CLTP), measured as accuracy, across all target languages. English was the only source language.