Can Cross-Lingual Transferability of Multilingual Transformers Be Activated Without End-Task Data?

Pretrained multilingual Transformers have achieved great success in cross-lingual transfer learning. Current methods typically activate the cross-lingual transferability of multilingual Transformers by fine-tuning them on end-task data. However, these methods cannot perform cross-lingual transfer when end-task data are unavailable. In this work, we explore whether the cross-lingual transferability can be activated without end-task data. We propose a cross-lingual transfer method named PLUGIN-X. PLUGIN-X disassembles monolingual and multilingual Transformers into sub-modules, and reassembles them into the multilingual end-task model. After representation adaptation, PLUGIN-X finally performs cross-lingual transfer in a plug-and-play style. Experimental results show that PLUGIN-X successfully activates the cross-lingual transferability of multilingual Transformers without accessing end-task data. Moreover, we analyze how the cross-model representation alignment affects the cross-lingual transferability.


Introduction
Annotated data is crucial for learning natural language processing (NLP) models, but it is mostly available only in high-resource languages, typically English, making NLP applications hard to access in other languages. This motivates studies on cross-lingual transfer, which aims to transfer knowledge from a source language to other languages. Cross-lingual transfer has greatly pushed the state of the art on NLP tasks in a wide range of languages (Conneau et al., 2020; Chi et al., 2021; Xue et al., 2021).
Advances in cross-lingual transfer can be substantially attributed to the cross-lingual transferability discovered in pretrained multilingual Transformers (Devlin et al., 2019; Conneau and Lample, 2019). Pretrained on large-scale multilingual text data, multilingual Transformers perform cross-lingual transfer surprisingly well on a wide range of tasks by simply fine-tuning them (Wu and Dredze, 2019; K et al., 2020; Hu et al., 2020). Based on this finding, follow-up studies further improve the transfer performance in two aspects, by (1) designing pretraining tasks and pretraining multilingual models with better cross-lingual transferability (Wei et al., 2021; Chi et al., 2021), or (2) developing fine-tuning methods with reduced cross-lingual representation discrepancy (Zheng et al., 2021; Yang et al., 2022).
Current methods typically activate the transferability of multilingual Transformers by fine-tuning them on end-task data. However, they cannot perform cross-lingual transfer when end-task data are unavailable. It is common that some publicly available models are trained with non-public in-house data. In this situation, one can access an already-trained end-task model but cannot access the in-house end-task data due to privacy policies or other legal issues. As a consequence, current methods cannot perform cross-lingual transfer for such models because of the lack of end-task data.
In this work, we study the research question: can the cross-lingual transferability of multilingual Transformers be activated without end-task data? We focus on the situation where we can access an already-trained monolingual end-task model but cannot access the in-house end-task data, and we would like to perform cross-lingual transfer for the model. To achieve this, we propose a cross-lingual transfer method named PLUGIN-X. PLUGIN-X disassembles the monolingual end-task model and multilingual models, and reassembles them into a multilingual end-task model. With cross-model representation adaptation, PLUGIN-X finally performs cross-lingual transfer in a plug-and-play style.
To answer the research question, we conduct experiments on cross-lingual transfer on the natural language inference and extractive question answering tasks. In the experiments, the multilingual model only sees unlabeled raw English text, so the performance of the reassembled model indicates whether the cross-lingual transferability is activated. Experimental results show that PLUGIN-X successfully transfers the already-trained monolingual end-task models to other languages. Moreover, we analyze how cross-model representation alignment affects the cross-lingual transferability of multilingual Transformers, and discuss the benefits of our work.
Our contributions are summarized as follows: • We investigate whether the cross-lingual transferability of multilingual Transformers can be activated without end-task data.
• We propose PLUGIN-X, which transfers already-trained monolingual end-task models to other languages without end-task data.
• Experimental results demonstrate PLUGIN-X successfully activates the transferability.

Related Work
Cross-lingual transfer aims to transfer knowledge from a source language to target languages. Early work on cross-lingual transfer focuses on learning cross-lingual word embeddings (CLWE; Mikolov et al. 2013) with shared task modules upon the embeddings, which has been applied to document classification (Schwenk and Li, 2018), sequence labeling (Xie et al., 2018), dialogue systems (Schuster et al., 2019), etc. Follow-up studies design algorithms to better align the word embedding spaces (Xing et al., 2015; Grave et al., 2019) or relax the bilingual supervision of lexicons and parallel sentences (Lample et al., 2018; Artetxe et al., 2018). Later studies introduce sentence-level alignment objectives and obtain better results (Conneau et al., 2018).
Most recently, fine-tuning pretrained language models (PLM; Devlin et al. 2019; Conneau and Lample 2019; Conneau et al. 2020) has become the mainstream approach to cross-lingual transfer. Benefiting from large-scale pretraining, pretrained multilingual language models are shown to possess cross-lingual transferability without explicit constraints (Wu and Dredze, 2019; K et al., 2020). Based on this finding, much effort has been made to improve transferability via (1) pretraining new multilingual language models (Wei et al., 2021; Chi et al., 2021; Luo et al., 2020; Ouyang et al., 2020), or (2) introducing extra supervision such as translated data into the fine-tuning procedure (Fang et al., 2021; Zheng et al., 2021; Yang et al., 2022). PLM-based methods have pushed the state of the art of cross-lingual transfer on a wide range of tasks (Goyal et al., 2021; Chi et al., 2022; Xue et al., 2021).

Methods
In this section, we first describe the problem definition. Then, we present how PLUGIN-X performs cross-lingual transfer with model reassembling and representation adaptation.

Problem Definition
For the common setting of cross-lingual transfer, the resulting multilingual end-task model is learned by finetuning pretrained multilingual Transformers:

θ_t = argmin_θ L_t(θ; D_t^en),  (1)

where D_t^en and L_t stand for the end-task training data in the source language and the loss function for learning the task t, respectively. The initial parameters of the end-task model are from a pretrained multilingual Transformer, i.e., θ_0 := θ_x.
Differently, we present the public-model-in-house-data (PMID) setting for cross-lingual transfer. Specifically, given an already-trained monolingual end-task model, we assume that the model is obtained by finetuning a publicly available pretrained monolingual Transformer, but the training data for the end task are non-public in-house data. Under the PMID setting, we can access a monolingual end-task model ω_t^en and its corresponding pretrained model before finetuning, ω^en. The goal of cross-lingual transfer is to obtain a multilingual end-task model from ω_t^en and ω^en, where using the easily-accessible unlabeled text data D_u^en is allowed. In what follows, we describe how PLUGIN-X performs cross-lingual transfer under the PMID setting.

Model Reassembling
Figure 1 illustrates the procedure of model reassembling by PLUGIN-X. PLUGIN-X disassembles monolingual and multilingual models and reassembles them into a new multilingual end-task model. The resulting model consists of three modules: the multilingual encoder, the cross-model connector, and the end-task module. The multilingual encoder and cross-model connector are assembled as a pipeline, which is then plugged into the end-task module.
Multilingual encoder To enable the monolingual end-task model to work with other languages, we use a pretrained multilingual language model as a new encoder. Inspired by the 'universal layer' phenomenon (Chi et al., 2021), we divide the pretrained model into two sub-modules at a middle layer and keep the lower module as the encoder, because it produces representations that are better aligned across languages (Jalili Sabet et al., 2020).
Cross-model connector Although the multilingual encoder provides language-invariant representations, they cannot be directly used by the monolingual end-task model, which has never seen them. Thus, we introduce a cross-model connector, which maps the multilingual representations to the representation space of the monolingual end-task model. We simply employ a stack of Transformer (Vaswani et al., 2017) layers as the connector, because: (1) pretrained contextualized representations have complex spatial structures, so a simple linear mapping is not applicable; (2) using the Transformer structure enables us to leverage the knowledge from the remaining pretrained parameters that are discarded by the multilingual encoder.
End-task module We plug the aforementioned two modules into a middle layer of the end-task model. The bottom layers are discarded, and the remaining top layers work as the end-task module.
Under the PMID setting, the end-task model is a white-box model, which means we can obtain its inner states and manipulate its compute graph. We reassemble the above three sub-modules as a pipeline. Formally, let f_x(·; θ_x), f_c(·; ω_c), and f_t(·; ω_t^en) denote the forward functions of the multilingual encoder, cross-model connector, and end-task module, respectively. The whole parameter set of the reassembled model is ω_t^x = {θ_x, ω_c, ω_t^en}. Given an input sentence x, the output ŷ of our model is computed as

ŷ = f_t(f_c(f_x(x; θ_x); ω_c); ω_t^en).  (3)

Representation Adaptation

PLUGIN-X activates the cross-lingual transferability by cross-model representation adaptation. It adapts the representations of the multilingual encoder to the representation space of the monolingual end-task module by tuning the cross-model connector. We employ masked language modeling (MLM; Devlin et al. 2019) as the training objective, which ensures that the training does not require the in-house end-task data but only unlabeled text data. To predict the masked tokens, we use the original pretrained model of ω_t^en as the end-task module, denoted by ω^en.
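The reassembled forward pass f_t(f_c(f_x(x; θ_x); ω_c); ω_t^en) can be sketched as follows. This is a toy illustration, not the actual implementation: the random tanh maps stand in for real Transformer layers, and the layer counts (a 12-layer multilingual model split at a middle layer, a 6-layer connector, the top half of the monolingual model) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_layer(d):
    # Stand-in for one Transformer layer: a fixed random linear map plus a
    # nonlinearity (the real model would use self-attention + FFN blocks).
    W = rng.standard_normal((d, d)) / np.sqrt(d)
    return lambda h: np.tanh(h @ W)

d = 8
multilingual_lower = [make_layer(d) for _ in range(6)]  # frozen encoder f_x
connector          = [make_layer(d) for _ in range(6)]  # trainable f_c
end_task_top       = [make_layer(d) for _ in range(6)]  # frozen f_t body

def run(layers, h):
    for layer in layers:
        h = layer(h)
    return h

def reassembled_forward(x):
    # y_hat = f_t(f_c(f_x(x; theta_x); omega_c); omega_t^en)
    h = run(multilingual_lower, x)  # language-invariant lower representations
    h = run(connector, h)           # map into the monolingual space
    return run(end_task_top, h)     # reuse the end-task module unchanged

x = rng.standard_normal((3, d))     # a toy batch of 3 token vectors
y = reassembled_forward(x)
print(y.shape)  # (3, 8)
```

Only the connector block would receive gradient updates during representation adaptation; the other two stacks stay frozen, which is what makes the final swap plug-and-play.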
However, it is infeasible to directly apply MLM because the reassembled model uses two different vocabularies for input and output. Therefore, we propose heterogeneous masked language modeling (HMLM) with different input and output vocabularies. As shown in Figure 2, given an input sentence x, we tokenize x into subword tokens with the vocabulary of the monolingual model ω^en. Then, we randomly select masked tokens as the labels. Next, we re-tokenize the text spans separated by the mask tokens using the vocabulary of the multilingual encoder. Finally, the re-tokenized spans and the mask tokens are concatenated into a whole sequence as the input, denoted by x̃. The final loss function is defined as

L = − Σ_{i ∈ M} log p(x_i | x̃),  (4)

where p stands for the predicted distribution over the multilingual vocabulary, and M is the set of mask positions. Notice that only the connector ω_c is updated during training, and the other two modules are frozen.

Evaluation We evaluate the reassembled models on two natural language understanding tasks, i.e., natural language inference and extractive question answering. The experiments are conducted under the PMID setting, where the models are not allowed to access end-task data but only an already-trained monolingual task model. On both tasks, we use finetuned RoBERTa (Liu et al., 2019) models as the monolingual task models to be transferred.
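The HMLM input construction described above can be sketched as follows. The two "tokenizers" are hypothetical stand-ins (whitespace splitting and fixed 3-character chunks) for the real monolingual and multilingual subword vocabularies, and `build_hmlm_input` is a name introduced here for illustration only.

```python
def mono_tokenize(text):
    # Stand-in for the monolingual (RoBERTa) subword tokenizer.
    return text.split()

def multi_tokenize(text):
    # Stand-in for the multilingual subword tokenizer.
    return [text[i:i + 3] for i in range(0, len(text), 3)] if text else []

def build_hmlm_input(text, mask_positions, mask_token="<mask>"):
    # 1) Tokenize with the monolingual vocabulary and pick the masked labels.
    mono_tokens = mono_tokenize(text)
    labels = [mono_tokens[i] for i in mask_positions]
    # 2) Re-tokenize the spans *between* masked tokens with the multilingual
    #    vocabulary, then concatenate spans and mask tokens into one sequence.
    inputs, span = [], []
    for i, tok in enumerate(mono_tokens):
        if i in mask_positions:
            inputs += multi_tokenize(" ".join(span))
            inputs.append(mask_token)
            span = []
        else:
            span.append(tok)
    inputs += multi_tokenize(" ".join(span))
    return inputs, labels

inputs, labels = build_hmlm_input("the cat sat on the mat", {2})
print(labels)  # ['sat']
print(inputs)  # spans re-tokenized into 3-char chunks around '<mask>'
```

The labels stay in the monolingual vocabulary while the surrounding context is expressed in the multilingual one, which is exactly the mismatch HMLM is designed to bridge.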

Plug-and-Play Transfer
Baselines We implement two cross-lingual transfer baselines that satisfy the PMID setting, and also include the direct finetuning method as a reference.

Table 1: Evaluation results on XNLI natural language inference under the PMID setting. We report the average results with three random seeds for baselines and PLUGIN-X. Results of FINETUNE are from Chi et al. (2021). Notice that the results are not comparable between the two settings.
(1) EMBMAP learns a linear mapping between the word embedding spaces of the monolingual RoBERTa model and the multilingual InfoXLM (Chi et al., 2021) model. Following Mikolov et al. (2013), the mapping is learned by minimizing the L2 distance. After mapping, we replace the word embeddings of the end-task model with the mapped multilingual embeddings.
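The linear mapping in EMBMAP can be sketched as an ordinary least-squares problem. This toy version uses random embeddings and recovers W by minimizing the L2 distance ||XW − Y||, in the spirit of the Mikolov et al. (2013) recipe the baseline follows; the mapping direction and the choice of anchor vocabulary are simplified assumptions here.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 16
X = rng.standard_normal((n, d))   # toy source-space word embeddings
W_true = rng.standard_normal((d, d))
Y = X @ W_true                    # toy target-space embeddings (noiseless)

# Closed-form L2 minimizer of ||X W - Y||: least squares per output column.
W, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(np.allclose(W, W_true))     # True on this noiseless toy data
```

In the real baseline, X and Y would be the RoBERTa and InfoXLM embeddings of shared anchor words, and the mapped embeddings then replace the end-task model's embedding layer.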
(2) EMBLEARN learns multilingual word embeddings for the monolingual end-task model. We replace the vocabulary of RoBERTa with a joint multilingual vocabulary covering the 14 XNLI target languages. Then, we build a new word embedding layer according to the new multilingual vocabulary. We learn the multilingual word embeddings by training the model on 14-language text from CCNet for 30K training steps with a batch size of 256. Following Liu et al. (2019), the training objective is masked language modeling on 512-token text sequences. During training, we freeze all the parameters except the multilingual word embeddings. Finally, we replace the word embeddings of the end-task model with the newly-learned multilingual word embeddings.
(3) FINETUNE directly finetunes the multilingual Transformers on the end tasks, which does not satisfy the PMID setting. We include the results as a reference.
Notice that our goal is to investigate whether PLUGIN-X can activate the cross-lingual transferability of multilingual Transformers, rather than to achieve state-of-the-art cross-lingual transfer results. Therefore, we do not compare our models with machine translation systems or state-of-the-art cross-lingual transfer methods.

Natural Language Inference
Natural language inference aims to recognize the textual entailment relation between input sentence pairs. We use the XNLI (Conneau et al., 2018) dataset, which provides sentence pairs in fifteen languages for validation and test. Given an input sentence pair, models are required to determine whether the pair should be labeled as 'entailment', 'neutral', or 'contradiction'. For both the baselines and PLUGIN-X, we provide the same monolingual NLI task model, which is a RoBERTa model finetuned on MNLI (Williams et al., 2018).
We present the XNLI accuracy scores in Table 1, which reports the average scores over three runs. Overall, PLUGIN-X outperforms the baseline methods on XNLI cross-lingual natural language inference in terms of average accuracy, achieving average accuracy scores of 60.2 and 62.9 with XLM-R and InfoXLM, respectively. The results demonstrate that PLUGIN-X successfully activates the cross-lingual transferability of XLM-R and InfoXLM on XNLI without accessing XNLI data. In addition to high-resource languages such as French, our models perform surprisingly well on low-resource languages such as Urdu. Besides, we see that the choice of the multilingual Transformer can affect the cross-lingual transfer results.

Question Answering
Our method is also evaluated on the extractive question answering task to validate cross-lingual transferability. Given an input passage and a question, the task aims to find a span in the passage that can answer the question. We use the XQuAD (Artetxe et al., 2020) dataset, which provides passages and question-answer pairs in ten languages.
The evaluation results are shown in Table 2, which reports averaged F1 scores of extracted answer spans from runs with three random seeds. Similar to the results on XNLI, PLUGIN-X obtains the best average F1 score compared with the baseline methods. The results demonstrate the effectiveness of our model on question answering under the PMID setting, which also indicates that PLUGIN-X successfully activates the cross-lingual transferability. Nonetheless, PLUGIN-X lags behind FINETUNE, showing that PMID is a challenging setting for cross-lingual transfer.

Ablation Studies
In the ablation studies, we train various PLUGIN-X models with different architectural or hyper-parameter configurations. Notice that the models are plugged into the same English end-task model for plug-and-play cross-lingual transfer, so the end-task performance directly indicates the cross-lingual transferability.

Key architectural components
We conduct experiments to validate the effects of key architectural components of PLUGIN-X. We train several models with a batch size of 64 for 30K steps. The models are described as follows: (1) the '− Middle-layer plugging' model plugs the connector into the bottom of the monolingual task model, replacing the embedding layer with the output of the connector; (2) the '− Deeper connector' model uses a shallower connector, reducing the number of connector layers from 6 to 2; (3) the '− Multilingual encoder' model discards the frozen multilingual encoder except for the word embeddings, and regards the whole Transformer body as a connector. The models are evaluated on XNLI and XQuAD under the PMID setting. The evaluation results are presented in Table 3. It can be observed that the model performs less well when removing any of the components. Discarding the frozen multilingual encoder leads to severe performance drops on both tasks, demonstrating the importance of the frozen multilingual encoder. Besides, using a shallower connector produces the worst results on XNLI, and '− Middle-layer plugging' performs worst on XQuAD.

As shown in Figure 4, PLUGIN-X activates better cross-lingual transferability when trained with more steps, indicating that the representation adaptation leads to better activation of cross-lingual transferability. Besides, PLUGIN-X also tends to activate better cross-lingual transferability when using larger batch sizes, and obtains the best performance with a batch size of 256.

Analysis
We present analyses on the cross-model representation alignment of the reassembled models, and investigate their cross-lingual transferability.
Cross-model alignment A key factor for our method to achieve cross-lingual transfer under the PMID setting is that PLUGIN-X performs representation adaptation. We conduct experiments to provide a quantitative analysis of the alignment property of the reassembled models. To this end, we use the parallel sentences provided by XNLI as input, and compute their sentence embeddings. Specifically, we first extract the sentence embeddings of the English sentences using the monolingual end-task model, where the embeddings are computed by average pooling over the hidden vectors from the sixth layer. Then, the sentence embeddings of the other languages are obtained from the connector of PLUGIN-X. For comparison, we also compute the sentence embeddings of the other languages using the hidden vectors from the sixth layer of InfoXLM. Finally, we measure the alignment of the representation spaces by the L2 distance and cosine similarity between the output sentence embeddings, comparing the original InfoXLM model with the reassembled model. Table 4 and Figure 5 show the quantitative results of representation alignment and the distance/similarity distributions on the XNLI validation sets, respectively. Compared with InfoXLM, our reassembled model achieves a notably lower L2 distance to the monolingual end-task model. Consistently, our model also obtains higher cosine similarity scores with low variance. The results show that, although InfoXLM provides well-aligned representations across languages, there is a mismatch between its representation space and that of the monolingual end-task model. In contrast, PLUGIN-X successfully maps the representation space without accessing the in-house end-task data.
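The alignment measurement can be sketched as follows: average-pool hidden vectors into sentence embeddings, then compare embeddings of parallel sentences with L2 distance and cosine similarity. The hidden states below are random toy tensors standing in for real layer-6 activations.

```python
import numpy as np

rng = np.random.default_rng(0)

def sentence_embedding(hidden):
    # hidden: (seq_len, d) hidden vectors from a middle layer.
    return hidden.mean(axis=0)  # average pooling over tokens

def l2_distance(a, b):
    return float(np.linalg.norm(a - b))

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

h_en = rng.standard_normal((12, 8))                # English side (toy)
h_xx = h_en + 0.1 * rng.standard_normal((12, 8))   # well-aligned target side

e_en, e_xx = sentence_embedding(h_en), sentence_embedding(h_xx)
print(round(l2_distance(e_en, e_xx), 3), round(cosine_similarity(e_en, e_xx), 3))
```

Lower L2 distance and higher cosine similarity between the two sides indicate better cross-model alignment, which is the quantity Table 4 reports.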
Transferability For a better understanding of how PLUGIN-X activates the cross-lingual transferability, we analyze the relation between transferability and cross-model representation alignment. We use the transfer gap metric (Hu et al., 2020) to measure cross-lingual transferability. Specifically, the transfer gap is computed by subtracting the XNLI accuracy score in the target language from the score in the source language, which measures how much performance is lost after transfer. When computing the transfer gap scores, we use the monolingual end-task model results for the source language, and our reassembled model results for the target languages. To measure the representation alignment, we follow the procedure mentioned above, using the L2 distance and cosine similarity metrics. We compute transfer gap, L2 distance, and cosine similarity scores with the reassembled models from various training steps on the XNLI validation sets in fourteen target languages.
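A minimal sketch of the transfer gap computation; the scores below are made-up illustrative numbers, not results from the paper.

```python
def transfer_gap(source_acc, target_accs):
    # Transfer gap (Hu et al., 2020): source-language score minus
    # target-language score, so a lower gap means better transfer.
    return {lang: round(source_acc - acc, 1) for lang, acc in target_accs.items()}

gaps = transfer_gap(87.5, {"fr": 62.0, "ur": 51.3})
print(gaps)  # {'fr': 25.5, 'ur': 36.2}
```

Because the source-language model is held fixed here, differences in the gap across target languages reflect transferability alone rather than differing source-language performance.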
In Figure 6, we plot the results. We see a clear trend that the transfer gap decreases as PLUGIN-X achieves a lower cross-model L2 distance. The trend is also confirmed when we switch the representation alignment metric to cosine similarity. This highlights the importance of cross-model representation alignment between the monolingual and multilingual models for the activation of cross-lingual transferability. More interestingly, the data points follow the same trend regardless of which language they belong to. Besides, we observe that the blue data points are high-resource languages, which typically have lower transfer gaps. Our findings indicate that cross-lingual transfer can be improved by encouraging cross-model alignment.

Discussion
Transferability activation To answer our research question, we have conducted experiments on cross-lingual transfer under the public-model-in-house-data (PMID) setting. Our experimental results in Section 4.2 and Section 4.3 show that PLUGIN-X successfully activates the cross-lingual transferability of multilingual Transformers without using the in-house end-task data. Notice that our goal is to answer the research question, rather than to develop a state-of-the-art algorithm for the common cross-lingual transfer setting.
Transferability quantification It is difficult to quantify cross-lingual transferability because the compared models typically have different performances in the source language, making their results non-comparable. We propose to transfer an already-trained end-task model to other languages. As the end-task model is fixed, the transfer gap depends only on cross-lingual transferability. Therefore, we recommend that the models to be evaluated transfer the same end-task model, so that the resulting transferability scores are comparable.

Model fusion
We show that two models with different capabilities, i.e., end-task ability and multilingual understanding ability, can be fused into a single end-to-end model with a new ability: performing the end task in multiple languages. We hope this finding can inspire research on the fusion of models across languages, modalities, and capabilities.

Conclusion
In this paper, we have investigated whether the cross-lingual transferability of multilingual Transformers can be activated without end-task data. We present a new problem setting for cross-lingual transfer, the public-model-in-house-data (PMID) setting. To achieve cross-lingual transfer under PMID, we propose PLUGIN-X, which reassembles the monolingual end-task model and multilingual models into a multilingual end-task model. Our results show that PLUGIN-X successfully activates the cross-lingual transferability of multilingual Transformers without accessing the in-house end-task data. For future work, we would like to study the research question on more types of models, such as large language models (Huang et al., 2023).

Limitations
Our study has limitations in two aspects. First, multilingual Transformers support a wide range of task types, and it is challenging to study our research question on all types of end tasks. We conduct experiments on two common types of end tasks, i.e., text classification and question answering, and leave the study of other task types to future work. Second, under PMID, we only consider the situation where the end-task models are obtained by finetuning public pretrained models. The cross-lingual transfer of black-box end-task models is also an interesting topic for future study. Besides, PLUGIN-X reassembles modules from publicly-available models rather than training from scratch, so it naturally inherits the risks of those models.

Figure 1: Model reassembling by PLUGIN-X. PLUGIN-X disassembles monolingual and multilingual models and then reassembles them into a new multilingual end-task model. The resulting model consists of three modules, namely the multilingual encoder, cross-model connector, and end-task module.

Figure 2: Heterogeneous masked language modeling with different input and output vocabularies for representation adaptation.

Figure 3 illustrates how the resulting reassembled model performs cross-lingual transfer in a plug-and-play manner. After the aforementioned cross-model representation adaptation procedure, we remove the current end-task module ω^en on the top, which is used for the HMLM task. Then, we plug the remaining part of the model into the end-task module ω_t^en, and the model can now directly perform the end task t in target languages.

Experiments

Setup

Data We perform PLUGIN-X representation adaptation training on unlabeled English text.

Figure 3: Illustration of how the reassembled model performs cross-lingual transfer in a plug-and-play manner.

Figure 4: The average XNLI-14 accuracy scores, where we perform PLUGIN-X representation adaptation with various batch sizes and training steps.
Figure 5: Cross-model representation distance/similarity distribution on XNLI validation sets.

Figure 6: Relation between cross-lingual transferability and cross-model representation alignment.

Table 2: Evaluation results on XQuAD extractive question answering under the PMID setting. We report averaged F1 scores of extracted answer spans with three random seeds for baselines and PLUGIN-X. Results of FINETUNE are from Pfeiffer et al. (2020). Notice that the results are not comparable between the two settings.

Table 3: Ablation studies on key components of PLUGIN-X.

Table 4: Quantitative analysis of cross-model representation alignment. We measure the L2 distance and cosine similarity between the sentence embeddings of the monolingual and multilingual models.