Exploring the Relationship between Alignment and Cross-lingual Transfer in Multilingual Transformers

Without any explicit cross-lingual training data, multilingual language models can achieve cross-lingual transfer. One common way to improve this transfer is to perform realignment steps before fine-tuning, i.e., to train the model to build similar representations for pairs of words from translated sentences. But such realignment methods were found to not always improve results across languages and tasks, which raises the question of whether aligned representations are truly beneficial for cross-lingual transfer. We provide evidence that alignment is actually significantly correlated with cross-lingual transfer across languages, models and random seeds. We show that fine-tuning can have a significant impact on alignment, depending mainly on the downstream task and the model. Finally, we show that realignment can, in some instances, improve cross-lingual transfer, and we identify conditions in which realignment methods provide significant improvements. Namely, we find that realignment works better on tasks for which alignment is correlated with cross-lingual transfer when generalizing to a distant language and with smaller models, as well as when using a bilingual dictionary rather than FastAlign to extract realignment pairs. For example, for POS-tagging, between English and Arabic, realignment can bring a +15.8 accuracy improvement on distilmBERT, even outperforming XLM-R Large by 1.7. We thus advocate for further research on realignment methods for smaller multilingual models as an alternative to scaling.


Introduction
With the more general aim of improving the understanding of Multilingual Large Language Models (MLLM), we study the link between the multilingual alignment of their representations and their ability to perform cross-lingual transfer learning, and investigate conditions for realignment methods to improve cross-lingual transfer.
MLLMs, like mBERT (Devlin et al., 2019) and XLM-R (Conneau et al., 2020a), are Transformer encoders (Vaswani et al., 2017) which show an effective ability to perform Cross-lingual Transfer Learning (CTL). Despite the absence of any explicit cross-lingual training signal, mBERT and XLM-R can be fine-tuned on a specific task in one language and then provide high accuracy when evaluated on another language on the same task (Pires et al., 2019; Wu and Dredze, 2019). By alleviating the need for training data for a specific task in all languages and for translation data, which is more often than not lacking for non-English languages, CTL with MLLMs could help bridge the gap in NLP between English and other languages.

Figure 1: Cross-lingual transfer between English and Arabic with and without realignment, using a bilingual dictionary. For some tasks, realignment can make small models competitive with a large baseline.
But the ability of MLLMs to generalize across languages is highly correlated with the similarity between the training language (often English) and the language to which we hope to transfer knowledge (Pires et al., 2019; Wu and Dredze, 2019). For distant and low-resource languages, CTL with mBERT can give worse results than fine-tuning a Transformer from scratch (Wu and Dredze, 2020a).
Realignment methods (Wu and Dredze, 2020b), sometimes called adjustment or explicit alignment, aim to improve the cross-lingual properties of an MLLM by trying to make similar words from different languages have closer representations. Realignment methods typically require a translation dataset and an alignment tool, like FastAlign (Dyer et al., 2013), to extract contextualized pairs of translated words that will be realigned.
Despite some encouraging results on specific tasks, current realignment methods might not consistently improve the cross-lingual zero-shot abilities of mBERT and XLM-R (Wu and Dredze, 2020b). When tested with several seeds on various fine-tuning tasks, the improvements brought by realignment are not always significant and do not compare with the gain brought by scaling the model, e.g., from XLM-R Base to XLM-R Large. However, these realignment methods were not tried on smaller models like distilmBERT, as we do here.
The mixed results of realignment methods raise the question of whether cross-lingual transfer is at all linked with multilingual alignment. If improving alignment does not necessarily improve CTL, then the two might not be correlated. Despite the ability of mBERT and XLM-R to perform CTL, there is no consensus on whether they actually hold aligned representations (Gaschi et al., 2022).
We thus investigate the link between alignment and CTL, with three contributions: (1) we find a high correlation between multilingual alignment and cross-lingual transfer for multilingual Transformers, (2) we show that, depending on the downstream task, fine-tuning on English can harm the alignment to different degrees, potentially harming cross-lingual transfer, and (3) we link our findings to realignment methods and identify conditions under which they seem to bring the most significant improvements to zero-shot transfer learning, particularly on smaller models, as shown in Fig. 1.

Related Work
Current realignment methods are applied to a pre-trained model before fine-tuning in one language (typically English). Common tasks are Natural Language Inference (NLI), Named Entity Recognition (NER), Part-of-speech tagging (POS-tagging) and Question Answering (QA). The model is then expected to generalize better to other languages for the task than without realignment. Realignment methods rely on pairs of words extracted from translated sentences using a word alignment tool, usually FastAlign (Dyer et al., 2013), but other tools like AWESOME-align (Dou and Neubig, 2021) could be used. Various realignment objectives are used to bring the contextualized embeddings of words in such pairs closer together: a linear mapping (Wang et al., 2019), an ℓ2-based loss with regularization to avoid degenerate solutions (Cao et al., 2020; Zhao et al., 2021), or a contrastive loss (Pan et al., 2021; Wu and Dredze, 2020b).
Existing realignment methods might not significantly improve cross-lingual transfer. Despite improvements on NLI (Cao et al., 2020; Zhao et al., 2021; Pan et al., 2021) or on dependency parsing (Wang et al., 2019), the results might not hold across tasks and languages. A comparative study by Kulshreshtha et al. (2020) showed that methods based on linear mapping are effective only on "moderately close languages", whereas an ℓ2-based loss improves results for "extremely distant languages". This ℓ2-loss was shown to work well on an NLI task, but not for all languages on a NER task, and to be even detrimental for QA tasks (Efimov et al., 2022). Finally, Wu and Dredze (2020b) compared linear-mapping realignment, ℓ2-based realignment and contrastive learning on several tasks, languages and models, performing several runs. They found that existing methods do not bring consistent improvements over no realignment.
Expecting realignment methods to succeed implies a direct link between the multilingual alignment of the representations produced by a model and its ability to perform CTL. However, there is no strong consensus on whether multilingual Transformers have well-aligned representations (Gaschi et al., 2022), let alone on whether better-aligned representations lead to better CTL.
Assessing the multilingual alignment of contextualized representations can take many forms. Pairs of words are extracted from translated sentences, usually with FastAlign or a bilingual dictionary (Gaschi et al., 2022). Then, after building contextualized representations of the words of each pair, the distribution of their similarities can be compared with that of random pairs of words (Cao et al., 2020). But this method can lead to incorrect conclusions (Efimov et al., 2022). A high overlap between the similarity distributions of related and random pairs means that random pairs can sometimes have higher similarities than related pairs. But since those pairs do not necessarily involve the same words, a high overlap does not mean that any word is closer to an unrelated word than to a related one. An alternative is to compare a related pair to its neighbors (Efimov et al., 2022), which shows that realignment methods do improve multilingual alignment and that fine-tuning can harm this alignment. A similar approach consists in designing a nearest-neighbor search criterion. This was done for sentence-level representations (Pires et al., 2019) and for word-level alignment (Conneau et al., 2020b; Gaschi et al., 2022), showing that MLLMs like mBERT have a multilingual alignment that is competitive with static embeddings (Bojanowski et al., 2017) explicitly aligned with a supervised alignment method (Joulin et al., 2018).

Method
To study the link between multilingual alignment and cross-lingual transfer (CTL), we need a way to evaluate both. We use a relative difference to evaluate CTL, discuss different methods for evaluating alignment, and describe the realignment method used in our experiments.

Evaluating cross-lingual transfer
A model has high CTL abilities when, after fine-tuning in one language, it can obtain a high evaluation score on other languages. To evaluate it for a given task, we compute the relative difference between the evaluation metric m_en on the English development set and the evaluation metric m_tgt on the target language:

CTL = (m_tgt − m_en) / m_en  (1)

The monolingual metric is a score between 0 and 1, like accuracy or F1-score, where higher is better. Our metric then gives scores between −1 and +∞. A negative score is obtained if and only if m_tgt < m_en, which should always be the case in practice. Values closer to 0 then indicate better CTL for a specific task and language.

It must be noted that for datasets where the target language test set is a translation of the English one, the normalization in Equation 1 makes the metric boil down roughly to minus the proportion of correct answers in English that were misclassified when translated, assuming there are not too many misclassified English examples that were correctly classified in the target language, which should be the case since there are few misclassified English examples in general.
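As a concrete sketch, the relative-difference metric of Equation 1 can be computed as follows (the function and variable names are ours):

```python
def ctl_score(m_en: float, m_tgt: float) -> float:
    """Relative-difference cross-lingual transfer score (Eq. 1).

    m_en, m_tgt: monolingual scores in [0, 1] (accuracy, F1, ...),
    measured on English and on the target language respectively.
    Returns a value in (-1, +inf); values closer to 0 indicate
    better transfer, and negative values (the usual case) measure
    the fraction of the English score lost in transfer.
    """
    return (m_tgt - m_en) / m_en
```

For instance, an accuracy of 0.90 in English and 0.72 in the target language gives a score of −0.2: a fifth of the English performance is lost in transfer.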

Evaluating alignment
To evaluate multilingual alignment, we use the same method as Gaschi et al. (2022) for extracting pairs of translated words with their context. Provided a source of related pairs of words from both languages, a fixed number of pairs is randomly selected and a nearest-neighbor search with cosine similarity is performed. The top-1 accuracy of the nearest-neighbor search is the alignment evaluation score.
To extract contextualized pairs of translated words from a translation dataset, FastAlign is the most widely used word aligner in realignment methods (Wu and Dredze, 2020b; Cao et al., 2020; Zhao et al., 2021; Wang et al., 2019), but it is prone to errors and thus generates noisy realignment training data (Pan et al., 2021; Gaschi et al., 2022). Following Gaschi et al. (2022), we use a bilingual dictionary to extract matching pairs of words in translated sentences, discarding any ambiguity to obtain the most accurate pairs possible.
It is worth noting that the accuracy of a nearest-neighbor search is not symmetric. We use the convention that an A-B alignment means that we look for the translation of each word of language A among its nearest neighbors. Two types of alignment can be evaluated: strong and weak alignment (Roy et al., 2020). Weak alignment is the usual way to compute alignment: when evaluating A-B weak alignment, we search for the translation of a given word of A only among nearest neighbors belonging to B. But with such an evaluation, there can be situations with high measured alignment where representations from both languages are far apart with respect to intra-language similarity. Strong alignment remedies this by including language A in the search space. With A-B strong alignment, we search for the translation of a given word of A among its nearest neighbors belonging to both languages B and A: for a given pair of related words to be considered close enough, the word from language B must be closer to its counterpart in A than any other word from B and A. We show in our experiments that strong alignment is more correlated with CTL than weak alignment.
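A minimal sketch of this nearest-neighbor evaluation, covering both weak and strong alignment (the function name and the NumPy layout are ours; we assume the pairs are given as two matrices of contextualized embeddings):

```python
import numpy as np

def alignment_accuracy(src, tgt, strong=False):
    """Top-1 nearest-neighbor accuracy for A-B alignment.

    src, tgt: (n, d) arrays where src[i] and tgt[i] are the
    contextualized embeddings of a pair of translated words.
    Weak alignment searches the translation of src[i] only among
    the tgt vectors; strong alignment adds the other src vectors
    (same-language distractors) to the search space.
    """
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    n, hits = len(src), 0
    for i in range(n):
        gold = float(src[i] @ tgt[i])
        distractors = [float(src[i] @ tgt[j]) for j in range(n) if j != i]
        if strong:
            distractors += [float(src[i] @ src[j]) for j in range(n) if j != i]
        hits += gold > max(distractors)
    return hits / n
```

Because the strong search space is a superset of the weak one, strong accuracy can never exceed weak accuracy on the same pairs.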

Realignment loss
A realignment task consists in making the representations of related pairs closer to each other. The method used to extract related pairs for alignment evaluation can be used for computing the realignment loss. Following Wu and Dredze (2020b), we minimize a contrastive loss using the framework of Chen et al. (2020), encouraging strong alignment for pairs within a batch. A batch is composed of a set of representations H of all words in a few pairs of translated sentences and a set P ⊆ H × H containing the pairs of translated words extracted with a bilingual dictionary (or a word aligner). The realignment loss can then be written as:

L_realign = − Σ_{(s,t)∈P} log [ exp(sim(s,t)/T) / Σ_{h∈H\{s}} exp(sim(s,h)/T) ]  (2)

For each pair (s, t), the cosine similarity (sim) is compared to negative pairs (all other representations in H) using a temperature parameter T.
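A sketch of this contrastive loss in NumPy (the batch layout, index-based pairs and default temperature are our assumptions):

```python
import numpy as np

def realignment_loss(H, pairs, temperature=0.1):
    """InfoNCE-style contrastive realignment loss, a sketch of Eq. 2.

    H: (n, d) array of contextualized word representations in a batch.
    pairs: list of (s, t) index pairs of translated words.
    Each pair's cosine similarity is pushed up relative to the
    similarity of s with every other representation in H, which
    encourages strong alignment within the batch.
    """
    H = H / np.linalg.norm(H, axis=1, keepdims=True)
    sims = (H @ H.T) / temperature
    loss = 0.0
    for s, t in pairs:
        logits = np.delete(sims[s], s)      # negatives: everything but s itself
        target = t if t < s else t - 1      # index of t after removing s
        loss += -logits[target] + np.log(np.sum(np.exp(logits)))
    return loss / len(pairs)
```

With this loss, a batch where translated pairs already coincide yields a lower value than one where the pairs are mismatched, as intended.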

Experimental details
We evaluate cross-lingual transfer with three multilingual tasks, the sizes of which are reported in Table 1:

• Part-of-speech tagging (POS-tagging) with the Universal Dependencies dataset (Zeman et al., 2020). Similarly to Wu and Dredze (2020b), we use the following treebanks: Arabic-PADT, English-EWT, Spanish-GSD, French-GSD, Russian-GSD, and Chinese-GSD.
It must be noted that XNLI is the only dataset with translated test sets, and thus the only one for which the cross-lingual transfer metric is strictly comparable across languages. In our experiments, high correlation will nonetheless be observed between CTL and alignment for the two other tasks, suggesting that the CTL metric is not much affected by differences in size and domain between the test sets. Further details about the implementation can be found in Appendix B and in the source code.

Correlation between alignment and CTL
We measure the correlation between multilingual alignment and cross-lingual transfer (CTL) across models, languages and seeds. We also compare how alignment measured before and after fine-tuning correlates with CTL, under different alignment measures. Spearman's rank correlation is measured between alignment (before or after fine-tuning) and CTL. The English-target alignment is computed for each target language with the method described in Section 3.2 and is compared with the transfer ability from English to that same target language, using the metric described in Section 3.1.
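Spearman's rank correlation can be computed directly with `scipy.stats.spearmanr`; for illustration, here is a tie-free pure-Python sketch of the rank-difference formula (names and sample data are ours):

```python
def spearman(x, y):
    """Spearman rank correlation via the rank-difference formula.

    Assumes no ties; with tied data, Pearson correlation on the
    ranks (as scipy.stats.spearmanr does) should be used instead.
    """
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# One point per (model, language, seed) run: alignment vs. CTL score.
# A rho close to +1 means better-aligned runs also transfer better.
```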
Table 2 shows correlations between CTL and different types of alignment. It is computed separately for each task (POS, NER, NLI), for the alignment at the last and second-to-last layers (last and penult), before and after fine-tuning on the given task, and with weak and strong alignment. Comparing other layers for models of different sizes is less relevant, since the correlation is computed across models with various numbers of layers, and a model-by-model analysis of the correlation with the alignment at various layers did not reveal contradictory results (cf. Appendix E). Each correlation value is obtained from 100 samples with four different models (distilmBERT, mBERT, XLM-R Base and Large), five target languages (Arabic, Spanish, French, Russian and Chinese) and five seeds for the initialization of the classification head and the shuffling of the fine-tuning data.
Results show that strong alignment is better correlated with cross-lingual transfer than weak alignment. With the exception of two tasks after fine-tuning (NER and NLI), strong alignment has a higher correlation with CTL. This is particularly noticeable when looking at alignment before fine-tuning on the last layer, going from a correlation between 0.51 and 0.69 for weak alignment to one ranging from 0.82 to 0.86 for strong alignment.
Tab. 2 also shows that for NLI, the alignment on the penultimate layer seems better correlated with cross-lingual transfer than the alignment on the last layer. A relatively large gap in correlation is measured between the last and the second-to-last layer in all cases except for strong alignment before fine-tuning. The fact that alignment on the penultimate layer correlates better than the last one for NLI can be explained by the sentence-level nature of the task. For sentence classification tasks, the classification head reads only the representation of the first token of the last layer, which is computed from the representations of all the tokens at the previous layer, effectively pooling the penultimate layer.
Despite the different values observed, there seems to be no significant difference between the correlations for alignment measured before and after fine-tuning, and a careful analysis of confidence intervals obtained with bootstrapping (Efron and Tibshirani, 1994) confirms this (cf. Appendix C for detailed results).
Fig. 2 shows the relation between CTL and English-target strong alignment measured after fine-tuning in four situations, to further illustrate the link between alignment and transfer.
Fig. 2b shows one of the cases with the highest correlation (0.92). The correlation seems to hold well across models (shapes) and languages (colors). However, for a given model and language, the random seed used for fine-tuning seems to be detrimental to the correlation, although at a small scale. Hence, alignment might not be the only factor affecting cross-lingual generalization, as the model initialization and the data shuffling seem to play a smaller role.
Fig. 3 shows a case with one of the lowest correlations between strong alignment and CTL (0.70). It seems that models and initialization seeds have a higher impact on alignment than on CTL. For example, in the case of English-French alignment (green), CTL is between 0.0 and -0.1 whatever the model and seed, not overlapping with other language pairs, but alignment varies between approximately 0.05 and 0.5, overlapping with all other language pairs. Interestingly, the penultimate layer has a higher correlation (0.82), suggesting that for NER the last layer is not necessarily the one for which alignment correlates the most with CTL.
For two of the three tested tasks (NER and POS-tagging), it must be noted that the CTL metric is not strictly comparable across languages, since the test sets for each language differ in domain and size (cf. Section 3.4). However, for the third task (NLI), each test set is a translation of the English one, and the CTL metric is thus strictly comparable in that case. This might explain why correlations are higher for the NLI task than for the others. Nevertheless, the observed correlation for the two other tasks is still significantly high, which suggests that the general tendency is not affected by the differences in domain and size of the test sets.

The impact of fine-tuning on alignment
To study the link between alignment and cross-lingual transfer (CTL), we also look at the impact of fine-tuning on alignment. We have already shown that strong alignment is highly correlated with CTL. However, we were not able to conclude whether alignment measured before or after fine-tuning was better correlated with CTL abilities. To understand the difference between both measures, we study in this section the impact of fine-tuning on the alignment of MLLM representations. We use the same fine-tuning runs as in the previous section (Section 4).
Tab. 3 shows the relative variation in alignment before and after fine-tuning for all tasks and models tested, and for three languages for clarity (complete results in Appendix D). The relative difference is built in the same way as the cross-lingual transfer evaluation (Eq. 1). Negative values indicate a drop in alignment. Alignment is measured at the last layer. Fig. 4 illustrates a few cases.
For certain combinations of models and tasks, fine-tuning is detrimental to multilingual alignment. distilmBERT and mBERT mainly show a decrease in alignment for POS-tagging and NER, and smaller improvements than other models on NLI. However, POS-tagging is the only one of the three tasks that shows dramatic drops, where alignment can be reduced by as much as 96%.
The drop in alignment can be explained by catastrophic forgetting.If the model is only trained on a monolingual task, it might not retain information about other languages or about the link between English and other languages.
What is more surprising is the increase in alignment obtained in other cases. XLM-R Base and Large, which are larger models than mBERT and distilmBERT, show a relative increase that can go as high as 25.36 on the NLI task for distant languages. And although these increases start from a small alignment measure, we still observe a large increase for middle layers, where the initial alignment is already quite high (cf. Fig. 5).
The fact that the alignment of larger models is less harmed by fine-tuning is coherent with the fact that those same larger models have been shown to have better CTL abilities. Fig. 4 shows that more layers seem to mitigate the potentially negative impact of fine-tuning on alignment, as it mainly affects the layers closest to the last one, and as the initial alignment measure is globally higher for XLM-R than for distilmBERT (before fine-tuning: ≈0.25 against ≈0.008).
Giving a definitive answer as to why different tasks have a different impact on alignment might need further research. But one could already argue that each task corresponds to a different level of abstraction in NLP. Tasks with a low level of abstraction, like POS-tagging, might rely on the word form itself and thus on more language-specific components of the representations, which, when enhanced, decrease alignment. On the other hand, NLI has a higher level of abstraction, requiring the meaning rather than the word form, which might be encoded in deeper layers (Tenney et al., 2019) that are more aligned.
Fine-tuning MLLMs on a downstream task has an impact on the multilingual alignment of the representations produced by the model. For "smaller" language models, as well as for certain tasks like POS-tagging, it is systematically detrimental. This might explain why some realignment methods do not work for all models or all tasks (Wu and Dredze, 2020b).

Impact of realignment on cross-lingual transfer
We have already shown that the correlation between multilingual alignment and cross-lingual transfer (CTL) is high (Section 4). But we do not know whether they are more directly linked. In this section, we try to identify the conditions under which improving alignment in multilingual models leads to improvements in CTL. Sequential realignment is the usual way to perform realignment: realignment steps are performed on the pre-trained model before fine-tuning. We propose to compare it with joint realignment, where we optimize simultaneously for the realignment and the downstream task (more details in Appendix A), to try and identify whether alignment before or after fine-tuning is more strongly related to CTL.
In the same settings as the previous experiments (tasks, models, languages and number of seeds), we fine-tune models in English with different realignment methods and evaluate CTL on different languages. Following a setting similar to Wu and Dredze (2020b), realignment data from the five pairs of languages (English-target) is interleaved to form a single multilingual realignment dataset. Models are fine-tuned for five epochs on POS-tagging and NER, and for two epochs on NLI because its training data is larger. We use the opus100 translation dataset (Zhang et al., 2020), from which we extract pairs of words using bilingual dictionaries. We also tested with the multiUN translation data (Ziemski et al., 2016), which conditioned our choice of languages, and with other ways to extract alignment pairs: FastAlign (Dyer et al., 2013) and AWESOME-align (Dou and Neubig, 2021). Changing the translation dataset does not fundamentally change the results, and using probabilistic alignment tools made realignment methods less effective. The results presented in this section were selected for the sake of clarity, but the reader can refer to Appendix F for complete results.
Condensed results are reported in Tab. 4, averaged over the five languages. A breakdown by language for the POS-tagging task and two models is shown in Tab. 5. It shows that realignment methods improve performance only on certain tasks, models and language pairs. Realignment methods, either sequential or joint, provide significant improvements for all models on the POS-tagging task, but less significant ones for NER, and no significant improvement for NLI. The positive impact of realignment on cross-lingual transfer seems to be mirrored by the negative impact of fine-tuning on alignment. Indeed, POS-tagging is also the task for which fine-tuning is the most detrimental to multilingual alignment, as shown in the previous section.
The same parallel can be drawn for models. distilmBERT is the model that benefits the most from realignment. It is also the one whose alignment suffers the most from fine-tuning. Smaller multilingual models seem to benefit more from realignment, just as they see their multilingual alignment reduced the most after fine-tuning. In the same way that fine-tuning mainly affects the deeper layers, it is possible that realignment might affect only those deeper layers. This would mean that most layers would have their alignment significantly improved for small models like distilmBERT (6 layers), while larger models might be only superficially realigned.
Finally, besides tasks and models, it can also be observed that the impact of realignment varies across language pairs (Tab. 5). Although we did not test on many language pairs, results are coherent with the idea that realignment methods tend to work better on distant pairs of languages (Kulshreshtha et al., 2020).
On a side note, our controlled experiment does not allow us to conclude whether it is more important to improve alignment before fine-tuning or after. Alignment measured before fine-tuning and alignment measured after it seem equally important to cross-lingual transfer.
Realignment methods unsurprisingly provide better results when alignment is lower, be it before or after fine-tuning. Distant languages and small models have lower alignment, and POS-tagging is a task where alignment decreases after fine-tuning. Realignment helps only up to a certain point, beyond which representations are already well aligned and CTL already gives good results. For distilmBERT on POS-tagging, for transfer from English to Arabic, it provides a +15.8 improvement over the baseline, even outperforming XLM-R Large by 1.7 points. In such conditions, realignment is an interesting alternative to scaling for multilingual models.
If realignment succeeds in some favorable conditions, then how can we explain that realignment methods were shown not to significantly improve CTL on several tasks, including POS-tagging (Wu and Dredze, 2020b)? Firstly, to the best of our knowledge, realignment was never tried on distilmBERT or other models of equivalent size. Secondly, Tab. 6 shows that it might be partly due to an element of realignment methods that was overlooked: the source of related pairs of words.
The way pairs are extracted seems to be crucial to the success of realignment methods. Tab. 6 shows the effect of different types of pair extraction in realignment methods. Realignment methods using pairs extracted with FastAlign or AWESOME-align do not provide significant improvements over the baseline, whereas using a bilingual dictionary does. Using a bilingual dictionary might be more accurate for extracting translated pairs (Gaschi et al., 2022). Another explanation could be the type of words contained in a dictionary: it might contain more lexical words holding meaning and fewer grammatical words.

Conclusion
We have shown that multilingual alignment, measured using a nearest-neighbor search among translated pairs of contextualized words, is highly correlated with the cross-lingual transfer abilities of multilingual models (or at least multilingual Transformers).Strong alignment was also revealed to be better correlated to cross-lingual transfer than weak alignment.
Then we investigated the impact of fine-tuning (necessary for cross-lingual transfer) on alignment, as well as the impact of realignment methods on cross-lingual transfer. Fine-tuning was revealed to have a very different impact on alignment depending on the downstream task and the model, with lower-level tasks having the most impact and smaller models being the most affected. Conversely, realignment methods were shown to work better on those same tasks and models. Ultimately, realignment unsurprisingly works better when the baseline alignment (before or after fine-tuning) is lower.
We also showed that using a bilingual dictionary for extracting pairs for realignment methods improves over the commonly used FastAlign and over a more precise neural aligner (AWESOME-align).
It is worth noting that realignment works particularly well for a small model like distilmBERT (66M parameters), allowing it in some cases to obtain results competitive with XLM-R Large (354M parameters). This advocates for further research on realignment for small Transformers to build more compute-efficient multilingual models.
Finally, further research is needed to investigate additional questions, like whether cross-lingual transfer is more directly linked to alignment before or after fine-tuning, or to alignment at certain layers for certain tasks.To answer these questions, more large-scale experiments could be performed on more tasks and especially on more languages to obtain correlation values with smaller confidence intervals.

Limitations
We worked with only five language pairs, all involving English and another language: Arabic, Spanish, French, Russian or Chinese. This is due to using the multiUN dataset (Ziemski et al., 2016) for evaluating alignment and performing realignment. We also used the opus100 dataset (Zhang et al., 2020), which contains more pairs and is the dataset eventually used in our paper, but we stuck to the same language pairs for a fair comparison with multiUN in Appendix F. This narrow choice of languages limits our ability to understand why realignment methods work well for some languages and not others. We believe that a similar analysis with many language pairs, not necessarily involving English, would be a good lead for further research investigating the link between the success of realignment methods and how two languages relate to each other.
We chose a strong alignment objective with contrastive learning for our realignment task. Several other objectives could have been tried, like learning an orthogonal mapping between representations (Wang et al., 2019) or simply using an ℓ2-loss to collapse representations together (Cao et al., 2020), but both methods require an extra regularization step (Wu and Dredze, 2020b) since they do not leverage any negative samples. For the sake of simplicity, we focused on a contrastive loss, as trying different methods would have led to an explosion in the number of runs for the controlled experiment. This also explains why we used the same hyperparameters and pre-processing steps as Wu and Dredze (2020b). A more thorough search for optimal hyperparameters and realignment losses might lead to better results.

Acknowledgements
We would like to thank the anonymous reviewers for their comments, as well as Shijie Wu, who kindly explained some details of the implementation of his paper (Wu and Dredze, 2020b). We are also grateful for the discussions and proof-reading provided by our colleagues at Posos: François Plesse, Xavier Fontaine and Baptiste Charnier.
Experiments presented in this paper were carried out using the Grid'5000 testbed, supported by a scientific interest group hosted by Inria and including CNRS, RENATER and several universities as well as other organizations.

A Joint realignment
Existing realignment methods proceed in a sequential manner. The pre-trained model is first optimized for the realignment loss, before any fine-tuning. This assumes that alignment before fine-tuning is positively linked to the cross-lingual transfer abilities of the model and that improving alignment before fine-tuning will improve transfer. However, fine-tuning itself might have an impact on alignment (Efimov et al., 2022).
To compare the importance of alignment before and after fine-tuning for CTL, we introduce a new realignment method where realignment and fine-tuning are performed jointly. We optimize simultaneously for a realignment loss and the fine-tuning loss. In practice, for each optimization step, we compute the loss L_task for a batch of the fine-tuning task and the loss L_realign for a batch of the alignment data. The total loss for each backward pass is then written as L = L_task + L_realign. This joint realignment can be framed as multi-task learning, with the fine-tuning task as the main task and the realignment task as an auxiliary one. There are more elaborate methods for training a model with an auxiliary task (Liebel and Körner, 2018; Du et al., 2018; Zhang et al., 2018; Liu et al., 2019), but our aim is to propose the simplest method possible to compare joint and sequential alignment in a controlled setting.
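A minimal sketch of this joint optimization loop in PyTorch is given below. The function names and loader interfaces are illustrative assumptions, not the paper's exact implementation; the key point is the single backward pass on the summed loss.

```python
import itertools
import torch

def joint_realignment_finetuning(model, task_loss_fn, realign_loss_fn,
                                 task_loader, realign_loader, optimizer):
    """One batch of each task per step, single backward on the summed loss.

    task_loss_fn(model, batch) and realign_loss_fn(model, batch) are
    assumed to return scalar losses (L_task and L_realign). The
    realignment loader is cycled, since the number of realignment
    samples is fixed by the number of fine-tuning steps.
    """
    realign_iter = itertools.cycle(realign_loader)
    for task_batch in task_loader:
        optimizer.zero_grad()
        loss = task_loss_fn(model, task_batch) \
             + realign_loss_fn(model, next(realign_iter))  # L = L_task + L_realign
        loss.backward()
        optimizer.step()
```

Because both losses share one optimizer and one backward pass, every fine-tuning step also pulls aligned representations together, in contrast to the sequential approach.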

B.2 Multilingual alignment data
From a translation dataset, pairs were extracted either with a bilingual dictionary, following Gaschi et al. (2022), with FastAlign (Dyer et al., 2013), or with AWESOME-align (Dou and Neubig, 2021). For FastAlign, we produced alignments in both directions and symmetrized them with the grow-diag-final-and heuristic provided with FastAlign, following the setting of Wu and Dredze (2020b). For all extraction methods, we kept only one-to-one alignments and discarded trivial cases where both words are identical, again following Wu and Dredze (2020b).
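The filtering described above can be sketched as follows; this is a minimal illustration under our own naming, where `links` mimics parsed FastAlign output such as "0-0 1-2".

```python
from collections import Counter

def filter_alignment_pairs(src_tokens, tgt_tokens, links):
    """Keep one-to-one links and drop pairs of identical words.

    links: list of (i, j) index pairs into src_tokens/tgt_tokens.
    A link is one-to-one when its source index and its target index
    each appear in exactly one link.
    """
    src_counts = Counter(i for i, _ in links)
    tgt_counts = Counter(j for _, j in links)
    pairs = []
    for i, j in links:
        if src_counts[i] == 1 and tgt_counts[j] == 1:
            # Discard trivial cases where both words are identical
            if src_tokens[i] != tgt_tokens[j]:
                pairs.append((src_tokens[i], tgt_tokens[j]))
    return pairs
```

Dropping identical pairs matters because shared vocabulary (numbers, named entities) is trivially aligned and would otherwise dominate the realignment signal.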

B.3 Experimental setup
We performed two experiments: 1. Fine-tuning all models on all tasks for 5 epochs and measuring alignment before and after fine-tuning. This experiment provided the results for Sections 4 and 5.
2. Performing different realignment methods before fine-tuning for 5 epochs (or 2 for XNLI), providing the results for Section 6.
For both experiments, we reused the experimental setup of Wu and Dredze (2020b). Fine-tuning on a downstream task is done with Adam, with a learning rate of 2e-5, a linear decay, and warmup for 10% of the steps. Fine-tuning is performed for 5 epochs with a batch size of 32, except for XNLI in the second experiment, where we trained for 2 epochs, which still leads to more fine-tuning steps than for either of the two other tasks (cf. Table 1).
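The learning-rate schedule above (linear warmup over the first 10% of steps, then linear decay to zero) can be sketched in pure PyTorch. This mirrors what the Transformers library's `get_linear_schedule_with_warmup` helper does, but the function below is our own illustration.

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

def linear_warmup_decay(optimizer, total_steps, warmup_frac=0.1):
    """Linear warmup over the first `warmup_frac` of steps, then linear decay to 0."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    def lr_lambda(step):
        if step < warmup_steps:
            return step / warmup_steps
        return max(0.0, (total_steps - step) / (total_steps - warmup_steps))
    return LambdaLR(optimizer, lr_lambda)

model = torch.nn.Linear(4, 2)  # stand-in for the fine-tuned encoder
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)  # peak lr from the setup above
scheduler = linear_warmup_decay(optimizer, total_steps=1000)
```

After the optimizer step, `scheduler.step()` is called once per batch so that the learning rate peaks at 2e-5 at the end of warmup and reaches zero at the last step.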
For the realignment methods, still following Wu and Dredze (2020b), we train in a multilingual fashion, where each batch contains examples from all target languages. However, we use the same learning rate and schedule as for fine-tuning, for a fair comparison between joint and sequential realignment, since the same optimizer is used for fine-tuning and realignment when performing joint realignment. We use a maximum sequence length of 96 like Wu and Dredze (2020b), but a batch size of 16 instead of 128 because of limited computing resources.

B.4 Discussion on the number of realignment samples
It is worth noting that our method uses fewer realignment samples. Since we alternate batches of 16 realignment samples and batches of 32 fine-tuning samples for joint realignment, this fixes, for a fair comparison, the number of realignment samples used for a specific downstream task. This gives 31,358 sentence pairs for POS-tagging, 50,000 for NER, and 392,703 for NLI. For comparison, Wu and Dredze (2020b) used 100k steps with batches of size 128. The number of realignment samples could have been a factor explaining why realignment works well for POS-tagging and less well for NER and NLI, and why Wu and Dredze (2020b) do not find that realignment methods significantly improve results on any task. It could be argued that training on too many realignment samples might hurt performance. However, when testing on the POS-tagging task, we found that the number of realignment samples did not have a significant impact on performance.

B.5 Computational budget
The first experiment was performed on Nvidia A40 GPUs for the equivalent of 3 days on a single GPU (including all models, tasks and seeds). For the second experiment, training (fine-tuning and/or realignment) was performed on various smaller GPUs (RTX 2080 Ti, GTX 1080 Ti, Tesla T4) for distilmBERT, mBERT and XLM-R Base, and on a Nvidia A40 for XLM-R Large. The experiment took more than 10 GPU-days on the smaller GPUs, combining all models, realignment methods (including the baseline), random seeds, translation datasets, and pair extraction methods. XLM-R Large, for which we only trained the baseline, still required 30 GPU-hours on a Nvidia A40.

C Confidence intervals for correlation
In Section 4, we compared correlations for different tasks, before and after fine-tuning, for English-target and target-English alignment, and for the last and penultimate layers. These correlations were computed across several models, languages and seeds. From these correlation statistics, we drew three conclusions: 1. Strong alignment is better correlated with cross-lingual transfer than weak alignment.
2. The NLI task, because of its sentence-level nature, has a cross-lingual transfer that correlates better with the penultimate layer than with the last one.
3. The results do not show a significantly higher correlation of cross-lingual transfer with alignment before or after fine-tuning, nor with English-target compared to target-English alignment.
We verify here that these conclusions hold when looking at the confidence intervals (Tab. 8 and Tab. 9). Confidence intervals are obtained using the Bias-Corrected and Accelerated (BCA) bootstrap method, where 2,000 subsets of our 100 points for each measure of the correlation coefficient are sampled to obtain an empirical distribution of the correlation, from which the confidence interval can be deduced (Efron and Tibshirani, 1994). Since we are dealing with ordinal data (the ranks in Spearman's rank correlation), bootstrap confidence intervals are expected to have better properties than methods based on assumptions about the distribution (Ruscio, 2008; Bishara and Hittner, 2017).

Is strong alignment significantly better correlated with cross-lingual transfer than weak alignment? Comparing both tables cell by cell reveals that the confidence intervals for the last layer before fine-tuning hardly ever overlap, and when they do, the overlap is small. So in the case of alignment of the last layer before fine-tuning, strong alignment is significantly better correlated with cross-lingual transfer than weak alignment. For other situations, confidence intervals overlap. But the fact that strong alignment almost systematically has a higher correlation keeps our conclusion relevant.
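The interval construction can be sketched with SciPy's bootstrap implementation. This is a minimal illustration (function name is ours, and the exact collection of the 100 points is omitted); `paired=True` resamples the (alignment, transfer) pairs jointly, as required for a correlation statistic.

```python
import numpy as np
from scipy import stats

def spearman_bca_ci(alignment, transfer, n_resamples=2000):
    """95% BCA bootstrap confidence interval for Spearman's rank
    correlation between alignment and cross-lingual transfer scores."""
    def spearman(x, y):
        return stats.spearmanr(x, y)[0]
    res = stats.bootstrap((alignment, transfer), spearman,
                          paired=True, vectorized=False,
                          n_resamples=n_resamples,
                          confidence_level=0.95, method="BCa")
    return res.confidence_interval.low, res.confidence_interval.high
```

Two correlations can then be called significantly different when their intervals do not overlap, which is the criterion applied in the comparisons below.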
Does the penultimate layer correlate better than the last one for NLI? For this task, we observe that the confidence intervals of the penultimate and last layers do not overlap when alignment is measured after fine-tuning. Before fine-tuning, we can still observe that the measured correlation for the penultimate layer is systematically above the confidence interval of the last layer, except for target-English strong alignment.
Confidence intervals overlap too much when comparing before and after fine-tuning, except in two cases. When looking at POS-tagging for the last layer, weak alignment after fine-tuning gives a significantly better correlation than before, but this does not hold for strong alignment, which correlates better with cross-lingual transfer overall. The same observation can be made about NLI for the penultimate layer. On the other hand, for the NER task, strong alignment after fine-tuning gives a significantly worse correlation than before. It is thus difficult to conclude whether alignment before or after fine-tuning is better correlated with cross-lingual transfer.
Finally, comparing target-English and English-target alignment does not give significant results. If all other parameters are kept identical, every situation leads to an overlap between confidence intervals, except for the last layer before fine-tuning for NLI, which might just be fortuitous since it is the penultimate layer that correlates better with cross-lingual transfer for this task.

D Detailed results for alignment drop
Tab. 10 contains the detailed results when measuring the relative drop in strong alignment after fine-tuning. This is a detailed version of Tab. 3 in Section 5, with standard deviations measured over 5 different seeds for model initialization and data shuffling during fine-tuning, and all tested languages. This confirms that the observed increases and decreases in alignment are significant. It also seems to show that alignment for distant languages (en-ar, en-zh) is more affected by fine-tuning than for other pairs.

E Breaking down correlation by models and layers
Tab. 12 shows a breakdown of the correlation between strong alignment and CTL across layers and models. These results tend to show that smaller models (distilmBERT and mBERT) have a better correlation at the last layer than larger models. It is also interesting to note that several correlation values are identical for alignment before fine-tuning. This might be explained by the fact that the fine-tuning seed unsurprisingly has no effect on alignment measured before fine-tuning, and by the possibility that alignment measured at one layer might be almost perfectly correlated with alignment at another, especially when the correlation is measured across few languages. However, drawing any conclusion from those figures might be irrelevant. By breaking down results by model, we measure correlation from only 25 samples (five languages and five seeds). Furthermore, those seeds have no effect on alignment measured before fine-tuning. Tab. 11 shows a focus on distilmBERT for the same results, with confidence intervals obtained with BCA bootstrapping. It demonstrates that the measured correlation is not precise enough to draw any conclusion on which layer has an alignment that is better correlated with CTL, or to determine whether alignment before or after fine-tuning is more relevant to CTL abilities. As a matter of fact, the results are so inconclusive that almost all correlation values in Tab. 12 lie within any of the confidence intervals in Tab. 11.

Layer | Before fine-tuning | After fine-tuning
last  | 0.89 (0.64-0.96)   | 0.83 (0.68-0.94)
-1    | 0.79 (0.63-0.89)   | 0.79 (0.64-0.88)
-2    | 0.79 (0.65-0.89)   | 0.78 (0.65-0.89)
-3    | 0.79 (0.64-0.89)   | 0.82 (0.69-0.91)
-4    | 0.79 (0.65-0.89)   | 0.76 (0.62-0.86)
-5    | 0.79 (0.64-0.89)   | 0.79 (0.64-0.91)
-6    | 0.79 (0.66-0.90)   | 0.77 (0.62-0.87)

Table 11: Correlation between strong English-target alignment and CTL from English to the target language for the POS-tagging task, with 95% confidence intervals.

F Detailed results of the controlled experiment
This section provides detailed results of realignment methods for POS-tagging and NER, for all tested models, languages, translation datasets, and methods of extraction for realignment data. It also contains results for XNLI, for which only one translation dataset (opus100) and one extraction method (bilingual dictionaries) were tested. Results are shown in Tab. 13 (POS, opus100), 14 (POS, multi-UN), 15 (NER, opus100), 16 (NER, multi-UN), and 17 (NLI, opus100).
A light gray cell indicates that the realignment method obtained an average score within one standard deviation of the baseline with the same model. A dark gray cell indicates that the realignment method causes a decrease w.r.t. the baseline that is larger than the standard deviation.
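For concreteness, the shading rule can be expressed as a small helper (our own formulation of the criterion, not code from the experiments):

```python
def shade_cell(realign_mean, baseline_mean, baseline_std):
    """Classify a realignment score against the baseline with the same model:
    'light gray' if within one baseline standard deviation,
    'dark gray' if more than one standard deviation below the baseline,
    unshaded otherwise (i.e., a significant improvement)."""
    if abs(realign_mean - baseline_mean) < baseline_std:
        return "light gray"
    if realign_mean < baseline_mean - baseline_std:
        return "dark gray"
    return "unshaded"
```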
Those detailed results reinforce the conclusions of Section 6. Using bilingual dictionaries seems to provide significant improvements more often than other methods of extracting realignment word pairs. This is particularly visible for the POS-tagging task, where realigning with a bilingual dictionary, with joint or sequential realignment, provides the best results. For the NER task, this is less visible, but we have already seen that, on average, bilingual dictionaries give better results (Tab. 5).
The detailed results also confirm that realignment methods work better for smaller models and certain tasks like POS-tagging. Realignment methods for POS-tagging on distilmBERT bring a significant improvement for all languages. When using a bilingual dictionary, realignment also brings a systematically significant improvement over the baseline for mBERT on POS-tagging. For NER, the improvement is less often significant, but realignment methods still obtain significant improvements for some languages like Arabic. For NLI, the only model for which there are significant improvements for some languages is distilmBERT.

Table 12: Correlation between strong English-target alignment and CTL from English to the target language for the POS-tagging task. -i indicates the depth of the layer, with -1 being the second-to-last layer, -2 the third-to-last, etc.
Using a supposedly higher-quality translation dataset like multi-UN does not provide improvements over using opus100, which is said to better reflect the average quality of translation datasets (Wu and Dredze, 2020b). It might even seem that multi-UN provides slightly worse results than opus100. There are more cases of insignificant improvements with multi-UN for POS-tagging and NER, and also more cases of apparently significant degradation with respect to the baseline. This might be explained by the fact that multi-UN is a corpus obtained from translated United Nations documents, which might lack diversity in content.
Finally, we observe that realignment methods, at least with the small number of realignment steps we performed here, do not impact evaluation in the fine-tuning language (English). Indeed, even if they sometimes cause a decrease, namely on POS-tagging, this decrease is small, rarely more than 0.1 points.

Figure 2: Plot of CTL abilities against the English-target strong alignment measured for the last and penultimate layers after fine-tuning on NLI.

Figure 3: CTL abilities against English-target strong alignment for the last layer after fine-tuning on NER.

Figure 4: Evolution across layers of English-Arabic alignment before and after fine-tuning of distilmBERT and XLM-R Large on POS-tagging, starting at 0 for the embedding layer.

Table 2: Spearman's rank correlation of CTL with the English-target alignment produced by the last and penultimate layers before and after fine-tuning. Evaluation is done across 5 languages, 5 seeds and 4 models (N = 100). All cells have p-value < 0.05.

Table 4: Condensed results of the controlled experiment comparing joint and sequential realignment using a bilingual dictionary. Light gray indicates a difference from the baseline smaller than its standard deviation. Dark gray indicates a score lower than the baseline minus its standard deviation.

Table 5: Breakdown of realignment results for some languages with distilmBERT and XLM-R.

Table 8: 95% confidence interval for the Spearman rank correlation between weak alignment and CTL, obtained with BCA bootstrapping with 2000 resamples.

Table 9: 95% confidence interval for the Spearman rank correlation between strong alignment and cross-lingual transfer, obtained with BCA bootstrapping with 2000 resamples.

Table 10: Relative variation of strong alignment at the last layer before and after fine-tuning for different fine-tuning tasks. "±" indicates standard deviation.