InteMATs: Integrating Granularity-Specific Multilingual Adapters for Cross-Lingual Transfer


Nevertheless, these approaches often require fine-tuning the entire backbone model, which is computationally expensive and can result in catastrophic forgetting (Kirkpatrick et al., 2017). Moreover, they mainly focus on enhancing sentence-level representations. As a result, their performance gains on document-level cross-lingual tasks are not as impressive as sentence-level ones (Hu et al., 2020), indicating a need for a granularity-aware perspective in enhancing cross-lingual alignment.
Motivated by recent advancements in parameter-efficient adaptation for large language models (LLMs) (Ding et al., 2022; Guo and Yu, 2022), we propose a novel approach named InteMATs, which stands for Integrating Granularity-specific Multilingual Adapters, to enhance the cross-lingual transfer performance of MLLMs. Adapter tuning works by training a set of adapter modules conditioned on a frozen LLM (Houlsby et al., 2019). They can equip MLLMs with new knowledge (Hou et al., 2022) and facilitate language-specific adaptation (Pfeiffer et al., 2020, 2022) without modifying pre-trained parameters. Different from them, we offer a new perspective for enhancing cross-lingual alignment by exploiting a set of multilingual adapters pre-trained on different levels of text granularity.
To obtain these adapters, we curate a multilingual parallel dataset covering 42 languages from Wikipedia and process it into a sentence-level corpus ($D_{ST}$) and a document-level corpus ($D_{DT}$). Recent findings (Park et al., 2023) reveal that contrastive learning (CL) tends to extract global information, whereas masked image modeling (MIM) focuses on capturing local features, which helps explain earlier results (Wang et al., 2022). Therefore, we employ CL to train adapters that capture global cross-lingual alignment information, augmenting MLLM representations that may initially focus on local information. Specifically, we first train a set of multilingual adapters on $D_{ST}$ and $D_{DT}$ respectively. Then, we learn to fuse them by incorporating layer-wise fusion modules (e.g., AdapterFusion (Pfeiffer et al., 2021)) and train them on downstream data to determine the fusion weights.
Extensive experiments on five kinds of cross-lingual transfer tasks, including sentence-level and document-level ones, demonstrate that InteMATs significantly improves the cross-lingual transfer performance of MLLMs (i.e., mBERT and XLMR). In particular, InteMATs surpasses the state-of-the-art baseline by 4% on BUCC (Zweigenbaum et al., 2017), 7% on Tatoeba (Artetxe and Schwenk, 2019a), and 3.5% on TydiQA (Clark et al., 2020). Notably, InteMATs brings a substantial 30% improvement over its backbone MLLMs on low-resource languages that are never seen during pre-training, demonstrating the high quality of the representations learned by InteMATs. We finally conduct a comprehensive analysis of InteMATs to unravel the contribution of each component and the layer-wise impact within InteMATs. We will make our data and model publicly available for future research.

Related Work
Cross-lingual Representation Learning Existing research mainly adopts the full-model fine-tuning approach with monolingual or parallel corpora to obtain cross-lingual representations. It employs pretext tasks such as masked language modeling (MLM) (Devlin et al., 2019; Chi et al., 2021b), causal language modeling (CLM) (Conneau and Lample, 2019), and translation language modeling (TLM) (Chi et al., 2021a,b) to train MLLMs. However, from the results reported on the XTREME benchmark (Hu et al., 2020), we observe a decrease in cross-lingual transfer performance for MLLMs as the input text length increases. Previous research mainly focuses on improving cross-lingual alignment for sentence representations (Wang et al., 2022; Feng et al., 2022; Artetxe and Schwenk, 2019b). There is a lack of research specifically addressing the enhancement of document-level cross-lingual representations.
Adapters for MLLMs Recently, adapter-based approaches (Houlsby et al., 2019; Pfeiffer et al., 2020; Artetxe et al., 2020b; Li and Liang, 2021; Pfeiffer et al., 2022) have gained popularity as a parameter-efficient alternative to traditional fine-tuning for large language models (LLMs). These adapters are inserted between the transformer layers and are learned from data conditioned on a frozen LLM. They learn language-specific transformations to facilitate quick adaptation to new languages (Artetxe et al., 2020b) or new tasks (Pfeiffer et al., 2020). Recent research shows that it is possible to inject new knowledge into MLLMs to enhance their cross-lingual representations (Hou et al., 2022). Different from previous research, this paper fills the gap of the granularity perspective in cross-lingual alignment via adapter tuning.

Contrastive Learning for Language Models
Contrastive learning (CL) (Gao et al., 2021; He et al., 2022; Chen et al., 2020a,b) has shown much promise in NLP for its capability of capturing discriminative information in an unsupervised manner. Many works have adopted this pretext task in MLLM pre-training, such as mSimCSE (Wang et al., 2022), LASER (Artetxe and Schwenk, 2019b), and InfoXLM (Chi et al., 2021a). However, these approaches require fine-tuning the entire backbone, resulting in non-trivial computational overhead. This paper circumvents this challenge by applying CL to pre-train multilingual adapters while leaving the MLLM frozen.

Preliminaries
We start by introducing some basic knowledge about adapter tuning and contrastive learning.
Adapter Tuning: Adapter tuning (Houlsby et al., 2019) is a parameter-efficient transfer learning technique for adapting large pre-trained models to downstream tasks based on adapters (Rebuffi et al., 2017). These adapters are small, task-specific modules inserted between the layers of a pre-trained model. Instead of fine-tuning the entire model, adapter tuning only trains the parameters of the adapter modules while keeping the pre-trained model fixed. This approach allows us to specialize a pre-trained model in different tasks while retaining the knowledge acquired during pre-training.
Following Houlsby et al. (2019), we use $\phi_w(x)$ to denote a pre-trained model with parameters $w$. For adapter tuning, a new function, $\psi_{w,v}(x)$, is composed, where the parameters $v$ denote all the adapter modules and $w$ is copied from the pre-trained weights. The architecture of an adapter module is shown in Figure 1 (a). It consists of two feed-forward layers and a non-linear activation function. For the hidden states $h_l$ at layer $l$, an adapter module works as follows:
$$h'_l = W^1_l\,\mathrm{ReLU}(W^2_l h_l) + r_l, \qquad (1)$$
where $W^2_l$ represents the down-projection matrix, $W^1_l$ represents the up-projection matrix, and ReLU is the activation function. $r_l$ represents the residual information from the original input, which bypasses the adapter's transformations.
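For concreteness, a minimal PyTorch sketch of such a bottleneck adapter is given below; the hidden size d and bottleneck size r are illustrative choices, not the configuration used in our experiments.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-projection, ReLU, up-projection, plus a residual."""
    def __init__(self, d=768, r=64):
        super().__init__()
        self.down = nn.Linear(d, r)  # plays the role of the down-projection W^2_l
        self.up = nn.Linear(r, d)    # plays the role of the up-projection W^1_l

    def forward(self, h_l):
        # h_l: (batch, seq_len, d) hidden states at layer l
        return self.up(torch.relu(self.down(h_l))) + h_l  # the residual bypasses the adapter
```

In adapter tuning, only these modules receive gradients; the surrounding transformer weights stay frozen.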
InfoNCE: InfoNCE (InfoMax with Noise Contrastive Estimation) (Oord et al., 2018) is a loss function commonly used in self-supervised learning, particularly in contrastive learning methods. The goal is to maximize the mutual information between related samples while minimizing it between unrelated samples, facilitating the discovery of discriminative features in an unsupervised manner.
Given a sample $x$ and a set of $N$ random samples $X = \{x_1, \ldots, x_N\}$, which contains one positive sample $x^+$ and $N-1$ negative samples $x^-$, we minimize the following negative log-likelihood:
$$\mathcal{L} = -\log \frac{e^{\mathrm{sim}(x, x^+)/\tau}}{\sum_{x_i \in X} e^{\mathrm{sim}(x, x_i)/\tau}}, \qquad (2)$$
where $\tau$ is a temperature hyperparameter. Following SimCSE (Gao et al., 2021), we employ cosine similarity, $\mathrm{sim}(h_1, h_2) = \frac{h_1 \cdot h_2}{\|h_1\| \|h_2\|}$, to compare the representations of positive and negative pairs. In this paper, we use pre-trained models such as mBERT (Devlin et al., 2019) and XLMR (Conneau et al., 2019) to encode the input texts and only train the adapters using the InfoNCE objective.
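For reference, the following is a hedged sketch of Eq. 2 with cosine similarity; the tensor layout (a single anchor compared against one positive and N−1 negatives stacked in one matrix) is an assumption made for illustration.

```python
import torch
import torch.nn.functional as F

def info_nce(anchor, candidates, pos_index=0, tau=0.05):
    # anchor: (d,); candidates: (N, d) holding one positive (row pos_index) and N-1 negatives
    sims = F.cosine_similarity(anchor.unsqueeze(0), candidates, dim=-1) / tau  # (N,)
    return -F.log_softmax(sims, dim=0)[pos_index]  # negative log-likelihood of Eq. 2
```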

The InteMATs Approach
We now introduce InteMATs, which integrates multilingual adapters to enhance the representations of a fixed MLLM. InteMATs involves two stages. In the first stage, we pre-train two multilingual adapters to be specialized in processing texts of different levels of granularity: sentence-level multilingual adapters (MATs-ST) (§4.1) and document-level multilingual adapters (MATs-DT) (§4.2). Inspired by the findings of Park et al. (2023), which revealed the complementary properties of CL and masked image modeling objectives, we use CL to train adapters that augment MLLMs initially trained with the MLM objective. The goal is to enhance the cross-lingual alignment of MLLMs with pluggable adapters while retaining their pre-learned knowledge. In the second stage, we show how to integrate these adapters for cross-lingual transfer tasks. In this paper, we employ AdapterFusion (Pfeiffer et al., 2021) for this purpose (§4.3). InteMATs, however, is not limited to this choice of MLLMs and fusion algorithms. We show the working mechanism of InteMATs and a concrete example of MATs-ST in Figure 1.

Sentence-level Multilingual Adapters
Notations. We designate English as the source language and all the other languages as target languages. To train sentence-level multilingual adapters (MATs-ST), we curate an entity-aligned parallel dataset consisting of $N$ sentences in $J$ languages: $D_{ST} = \{D^j\}_{j=1}^{J}$, where $D^j = \{x_i\}_{i=1}^{N}$. We employ a pre-trained MLLM as the text encoder and use the hidden states from the penultimate layer of the MLLM as text representations. For example, with mBERT, $x_i$ is encoded into a sequence of $m+1$ token representations: $h_i = \langle e([\mathrm{CLS}]), e(t_1), \ldots, e(t_{m-1}), e([\mathrm{SEP}]) \rangle$. Here, [CLS] and [SEP] are the classification and separator tokens used for learning positional and structural information.
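The representation extraction described above could be implemented as follows; this sketch assumes the HuggingFace transformers API and the public mBERT checkpoint, and is illustrative rather than our exact pipeline.

```python
import torch
from transformers import AutoModel, AutoTokenizer

name = "bert-base-multilingual-cased"  # mBERT; "xlm-roberta-base" for XLMR
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, output_hidden_states=True).eval()

with torch.no_grad():
    batch = tokenizer(["Paris is the capital of France."], return_tensors="pt")
    hidden_states = model(**batch).hidden_states  # tuple: embedding layer + one entry per transformer layer
    h_i = hidden_states[-2][0]                    # penultimate layer, (m + 1, d) token representations
```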
The pre-training objective. For every English sample $x^{en}$, we randomly select another English sample $x^{en+}$ and $N$ non-English samples $x^{j-}$ to create contrastive data pairs. We denote the average of the hidden states of the token sequence as $h_{st}$ and take it as the sentence representation of a given sample. Similarly, we use $h_{et}$ as the representation of the aligned entity.
We train MATs-ST on top of a fixed MLLM by minimizing a contrastive loss of the form of Eq. 2 on the sentence-level corpus $D_{ST}$, applied to both the sentence representations $h_{st}$ and the entity representations $h_{et}$, where the superscript $en$ denotes English, $j \in [1, J]$ indexes one of the $J$ languages, and $v_s$ denotes the parameters of MATs-ST. In this way, we encourage cross-lingual alignment at both the sentence level and the entity level.
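The full loss is not reproduced here; a plausible reading, consistent with the description above, is one InfoNCE term (Eq. 2) over the mean-pooled sentence representations and one over the entity representations. The sketch below, which reuses info_nce from the earlier sketch, reflects that assumption; the equal weighting of the two terms and the batching of positives and negatives are illustrative.

```python
def mats_st_loss(h_st_en, h_st_cands, h_et_en, h_et_cands, tau=0.05):
    # info_nce is defined in the sketch of Eq. 2 above
    # h_*_en: (d,) English anchor; h_*_cands: (N, d) with the positive sample in row 0
    sentence_term = info_nce(h_st_en, h_st_cands, pos_index=0, tau=tau)  # sentence-level alignment
    entity_term = info_nce(h_et_en, h_et_cands, pos_index=0, tau=tau)    # entity-level alignment (cf. Eq. 4)
    return sentence_term + entity_term  # equal weighting is an assumption
```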

Document-level Multilingual Adapters
As Hu et al. (2020) demonstrate, the cross-lingual transfer performance of MLLMs tends to degrade on longer texts, potentially due to their limited capability of understanding longer contexts. Motivated by this observation, we specially train document-level multilingual adapters (MATs-DT) for MLLMs. Similar to the pre-training of MATs-ST, we curate a document-level parallel dataset comprising $J$ languages: $D_{DT} = \{D^j\}_{j=1}^{J}$. For each document $d$ in $D^j$, we also average the hidden states from the penultimate layer of an MLLM to obtain the document-level representation $h_{dt}$. The pre-training objective. The pre-training setup of MATs-DT is similar to that of MATs-ST except for using longer input texts. We encourage adapters to capture cross-lingual alignment on the document-level corpus $D_{DT}$ by minimizing a contrastive loss of the same form as Eq. 2 over the document representations, where $v_d$ denotes the parameters of MATs-DT. Note that we do not explicitly include the loss for entity-level alignment as in Eq. 4. Our experiments show that incorporating such a constraint does not improve cross-lingual transfer performance and, in fact, leads to a decrease in performance. This suggests a conflict between entity-level alignment, which focuses on capturing local information, and document-level alignment, which emphasizes capturing global information.

Integrating Multilingual Adapters
We now introduce the second stage: knowledge composition. Inspired by AdapterFusion (Pfeiffer et al., 2021), which trains multiple task-specific adapters on top of a shared pre-trained model, we propose to train InteMATs in a similar way to fuse the multilingual adapters on cross-lingual tasks. By incorporating both MATs-ST and MATs-DT at each transformer layer, we seek a more comprehensive view for capturing transferable features from varying lengths of context. As illustrated in Figure 1, we append the AdapterFusion module right after the MATs-ST and MATs-DT modules, followed by a residual connection to the original transformer output, $h_l$, at layer $l$. The outputs of MATs-ST ($h_{st,l}$) and MATs-DT ($h_{dt,l}$) are used as inputs for both the Value and Key transformations, and the fused hidden states are computed with an attention over the two adapter outputs, where $d$ is the size of the hidden states, $[\cdot, \cdot]$ indicates the concatenation of vectors, and $W_q$, $W_k$, $W_v$ are $d \times d$ matrices for computing cross-attention (Vaswani et al., 2017). We train InteMATs on each cross-lingual task to determine the task-specific importance weights for MATs-ST and MATs-DT. In this process, the parameters of MATs-ST and MATs-DT are fixed while the parameters of the fusion module, $v_f$, are optimized by minimizing, e.g., a cross-entropy loss on each task-specific dataset $D_T$.
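A minimal sketch of the fusion step is given below, assuming an AdapterFusion-style attention in which $h_l$ provides the query and the two adapter outputs provide the keys and values; the scaling factor and the exact parameterization are assumptions rather than a faithful reproduction of the released implementation.

```python
import torch
import torch.nn as nn

class Fusion(nn.Module):
    def __init__(self, d=768):
        super().__init__()
        self.W_q = nn.Linear(d, d, bias=False)
        self.W_k = nn.Linear(d, d, bias=False)
        self.W_v = nn.Linear(d, d, bias=False)

    def forward(self, h_l, h_st, h_dt):
        # h_l, h_st, h_dt: (batch, seq_len, d) layer output and the two adapter outputs
        adapters = torch.stack([h_st, h_dt], dim=2)                    # (batch, seq_len, 2, d)
        q = self.W_q(h_l).unsqueeze(2)                                 # (batch, seq_len, 1, d)
        scores = (q * self.W_k(adapters)).sum(-1)                      # (batch, seq_len, 2)
        attn = torch.softmax(scores / h_l.size(-1) ** 0.5, dim=-1)     # weights over the two adapters
        fused = (attn.unsqueeze(-1) * self.W_v(adapters)).sum(dim=2)   # weighted sum of values
        return h_l + fused                                             # residual connection to h_l
```

Only the W_q, W_k, W_v matrices (the fusion parameters $v_f$) are updated during this stage.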

Experiments
In this section, we present the evaluation details and results of InteMATs across sentence-level and document-level cross-lingual tasks.

Experimental Setup
Pre-training Corpus. We collect a large, entity-aligned, multilingual dataset from Wikipedia. Each data sample is a summary text with a maximum length of 384 that describes an entity and spans multiple languages. The dataset covers 42 languages in total, including those extensively studied in the popular XTREME benchmark (Hu et al., 2020). We treat English as the source language and ensure each English sample has at least four parallel samples from other languages. We use the first sentence of each summary text to curate the sentence-level parallel dataset $D_{ST}$ for training MATs-ST and use the entire raw text as $D_{DT}$ for training MATs-DT. More details about the dataset can be found in Appendix A.1.
MLLM Backbones. We experiment with three representative MLLMs: the base version of mBERT (Devlin et al., 2019) and the base and large versions of XLMR (Conneau et al., 2019), to show the effects of different model types and model scales. The hyperparameter setup for each MLLM can be found in Appendix A.1.
Evaluation Setup. We divide the languages in each downstream task into two types: Sup. stands for supervised languages, which are used for fine-tuning, and ZS stands for zero-shot languages, which are only used for testing.

Cross-lingual Semantic Textual Similarity
We first evaluate the models on the Cross-lingual Semantic Textual Similarity (STS) task (Cer et al., 2017) to assess the quality of their representations in capturing universal semantics. Following mSimCSE (Wang et al., 2022), we average the embeddings from the first and last layers. Table 1 presents the semantic textual similarity among the Arabic (ar), Spanish (es), English (en), and Turkish (tr) languages. Comparing InteMATs (XLMR) with the SOTA model, mSimCSE, we observe that on monolingual groups (ar→ar, es→es), mSimCSE demonstrates the highest textual similarity, while on cross-lingual groups (ar→en, es→en, tr→en), InteMATs produces the highest textual similarity. Note that mSimCSE employs the large version of XLMR as the backbone during pre-training. This implies that InteMATs can enhance the representations of XLMR in capturing cross-lingual semantics. Moreover, mSimCSE requires fine-tuning the entire MLLM in the pre-training stage, while InteMATs only trains a set of adapters (see Appendix 5.11). However, when conditioned on mBERT, InteMATs does not bring an improvement, showing a limitation related to the choice of MLLM.
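As a reference point, the first-and-last-layer averaging can be sketched as follows, following the mSimCSE setup described above; whether the embedding layer counts as the "first layer" and the masking details are assumptions.

```python
import torch

def first_last_avg(hidden_states, attention_mask):
    # hidden_states: tuple of (batch, seq_len, d) tensors; attention_mask: (batch, seq_len)
    layers = (hidden_states[1] + hidden_states[-1]) / 2   # average of the first and last layers
    mask = attention_mask.unsqueeze(-1).float()
    return (layers * mask).sum(dim=1) / mask.sum(dim=1)   # mean over non-padding tokens
```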

Cross-lingual Sentence Retrieval
We use the BUCC (Zweigenbaum et al., 2017) and Tatoeba (Artetxe and Schwenk, 2019a) datasets to evaluate the cross-lingual sentence retrieval performance of InteMATs. Specifically, given a sample from the source language, e.g., English, the model should correctly retrieve all the similar samples from another language xx, and vice versa. On Tatoeba, we follow XLM-E (Chi et al., 2022) and mSimCSE (Wang et al., 2022) and report results on both the 14 common languages (Tatoeba-14) and all 36 languages (Tatoeba-36).
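The retrieval step itself can be as simple as a cosine-similarity nearest-neighbour search over sentence embeddings; the sketch below assumes pre-computed embeddings and a top-1 criterion, which matches the accuracy@1 metric reported for Tatoeba (BUCC additionally involves mining with a score threshold).

```python
import torch
import torch.nn.functional as F

def retrieve(src_emb, tgt_emb):
    # src_emb: (n, d) source-language embeddings; tgt_emb: (m, d) target-language embeddings
    sims = F.normalize(src_emb, dim=-1) @ F.normalize(tgt_emb, dim=-1).T  # (n, m) cosine similarities
    return sims.argmax(dim=-1)  # index of the top-1 retrieved target sentence for each source sentence
```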
Table 2 presents the overall performance comparison. InteMATs, when conditioned on XLMR-large, outperforms all the baselines by a large margin, establishing a new state of the art in the unsupervised settings of BUCC and Tatoeba. On all selected MLLMs, InteMATs outperforms its counterparts regardless of model scale, indicating significant potential for enhancing cross-lingual alignment in the representations provided by pre-trained MLLMs. InteMATs outperforms the second-best state-of-the-art model, mSimCSE, implying that pre-training adapters through CL aligns cross-lingual representations better than the traditional full-model fine-tuning approach. Moreover, InteMATs outperforms other competitive MLLMs, namely XLM-E and InfoXLM, which employ the same backbone but with distinct pre-training tasks.

Cross-lingual Sentence Classification
We conduct performance evaluation on RELX (Köksal and Özgür, 2020), a cross-lingual relation extraction dataset, and PAWS-X (Zhang et al., 2019), a paraphrase detection dataset. RELX contains five languages and PAWS-X contains seven languages. We follow the XTREME benchmark (Hu et al., 2020) and train adapters only on English data and test on non-English data. Following MLKG (Hou et al., 2022), we report the performance on the source language, Sup(en), the average performance on the other n zero-shot languages, ZS(n), and the average performance on all languages in the dataset.
Table 3 presents the performance comparison results. Overall, InteMATs still takes the top spot on both datasets, especially on the zero-shot languages (ZS). Interestingly, the average performance of InteMATs matches that of VECO on PAWS-X. However, VECO requires fine-tuning the entire XLMR model and is supported by a huge pre-training corpus covering 50 languages (Luo et al., 2021), while InteMATs only trains a few pluggable adapters. Compared with its backbone models, mBERT and XLMR, InteMATs consistently enhances their cross-lingual zero-shot performance by 2% ∼ 3.2%. However, the performance gains on the source language (en) are not consistent, indicating that InteMATs may focus on enhancing cross-lingual alignment for target languages.

Cross-lingual Syntactic Analysis
We conduct evaluations on the cross-lingual Part-of-Speech (POS) dataset (Nivre et al., 2020) from the XTREME benchmark (Hu et al., 2020) to assess a model's capability of capturing syntactic structures and grammatical properties across 33 languages. All models are trained on English data. As shown in Table 4, InteMATs achieves comparable performance to XLM-ALIGN and outperforms VECO and XLM-E on average. This suggests that, instead of fine-tuning entire MLLMs, transferring the syntactic knowledge acquired from English data to other languages can be achieved by tuning only a few adapters, regardless of the choice of MLLM. Moreover, InteMATs achieves larger performance gains on zero-shot languages, showing the advantage of adapter tuning for MLLMs.

Cross-lingual Question-Answering
Question answering (QA) requires a model to understand a given long context so as to correctly answer questions by extracting the text span of the true answer from the context. We conduct evaluations on two popular multilingual QA benchmarks: XQuAD (Artetxe et al., 2020a) and TydiQA (Clark et al., 2020). Table 5 presents the performance comparison between InteMATs and SOTA models. In general, InteMATs outperforms all the full-model fine-tuning baselines (VECO, XLM-E, XLM-ALIGN, and KMLM) and adapter-based baselines (MLKG and MAD-X). Conditioned on the same large version of XLMR, InteMATs surpasses the second-best baseline, KMLM, by 0.3% on XQuAD and 3.2% on TydiQA, at a low training cost. On both supervised and zero-shot benchmarks, InteMATs consistently outperforms fine-tuning the backbone models, mBERT and XLMR, indicating the advantage of incorporating multilingual adapters into pre-trained MLLMs.

Closing the Cross-Lingual Transfer Gap
We summarize the above cross-lingual performance results and conclude that InteMATs can effectively mitigate the issue of cross-lingual performance degradation in pre-trained MLLMs, as shown in Table 6. All models are trained on the source language, English, achieving a performance of $S_{en}$. They are subsequently tested on other languages, yielding an average performance of $S_{tgt}$. The performance gap, $\Delta = S_{en} - S_{tgt}$, serves as an evaluation metric to assess the degree of cross-lingual transfer degradation.
On all but the POS task, InteMATs based on XLMR-large demonstrates the lowest cross-lingual performance degradation, highlighting its advantage in enhancing cross-lingual knowledge transfer.

Scaling to Low-resource Languages
We further study how InteMATs performs on languages outside our pre-training corpus. We use eight low-resource languages from the Tatoeba dataset: Cantonese (yue), Vietnamese (vie), Tagalog (tgl), Irish (gle), Georgian (kat), Khmer (khm), Telugu (tel), and Serbian (srp). Table 7 presents the results. We find that directly applying MLLMs to low-resource languages yields poor performance, with an average accuracy of approximately 25% ∼ 35%. In contrast, when conditioned on XLMR-large, InteMATs significantly improves the performance to 62%. These findings confirm our hypothesis that pre-trained MLLMs exhibit poor cross-lingual alignment on low-resource languages due to the scarcity of their training data during pre-training. InteMATs effectively enhances MLLMs by capturing better cross-lingual alignment information, enabling generalization to unseen low-resource languages. Appendix A.2 provides more details.

Ablation Study and Analysis
We conduct ablation studies on six tasks to unravel the individual impact of each component in InteMATs. Specifically, we compare the backbone MLLMs, MATs-ST, MATs-DT, and InteMATs. The ablation results in Table 8 show that MATs-ST generally outperforms MATs-DT on sentence-level tasks, while MATs-DT performs better than MATs-ST on document-level tasks. Both MATs-ST and MATs-DT outperform the backbone MLLMs, particularly with substantial gains on BUCC, Tatoeba, and TydiQA. By effectively incorporating these two modules, InteMATs achieves the best performance across the benchmarks. We further study whether pre-training on corpora of different granularities boosts MLLMs' performance on cross-lingual understanding tasks. We compare with Initialized (adding a randomly initialized adapter without pre-training) and Mixed (adding an adapter trained on the pre-training corpus without distinguishing granularity), as shown in Table 9. We find that InteMATs consistently outperforms Initialized across all tasks, with an average improvement of 2% on mBERT and 4.2% on XLMR backbones. Meanwhile, InteMATs surpasses Mixed on all tasks except Tatoeba, achieving average gains of 2.1% on mBERT and 4.4% on XLMR backbones. These findings indicate that, by pre-training separate adapters on distinct text granularities, InteMATs precisely captures and integrates cross-lingual alignment at different granularities and transfers this alignment knowledge to understanding tasks, yielding performance improvements.

Analysis of Pre-training Tasks
We evaluate the advantage of CL as a pre-training task for capturing global cross-lingual alignment information. We compare with mBERT-FT, which is fully fine-tuned on the entire raw text $D_{DT}$, and InteMATs-MLM, which uses the MLM pre-training task instead of CL to train each adapter.
Table 10 shows the results. We observe that MLLMs with adapters achieve performance gains on all tasks but Tatoeba, which suggests that training external adapters is more effective than full-model fine-tuning for learning incremental cross-lingual knowledge. Meanwhile, InteMATs outperforms the other MLLMs on four tasks, showing average gains of 1.6% over InteMATs-MLM, 2% over Initialized, and 3% over mBERT-FT. These results underscore that using CL as the pre-training task for adapters enhances global cross-lingual alignment, enabling knowledge transfer to understanding tasks and boosting performance.

Analysis of Fusion Activation
InteMATs learns to fuse knowledge from MATs-ST and MATs-DT for different tasks. We retrieve the activation values of its fusion module and visualize the weight distributions for four tasks in Figure 2 (POS, RELX, XQuAD, and TydiQA), ordered by increasing input text length. We observe a consistent pattern: InteMATs favors MATs-ST for sentence-level tasks and MATs-DT for document-level tasks. Specifically, on the POS and RELX tasks, it relies more on the representations from MATs-ST, while on XQuAD and TydiQA, it relies more on MATs-DT. Moreover, as the network goes deeper, the degree of reliance on MATs-DT increases. This finding confirms our intuition that granularity-specific adapters are specialized in handling texts of varying lengths. As a result, InteMATs can effectively leverage these adapters to enhance cross-lingual alignment regardless of the specific task at hand.

Layer-wise Representation Analysis
We examine InteMATs layer by layer to determine at which layers it enhances cross-lingual transfer performance. Figure 3 compares InteMATs and XLMR in both the base and large versions. We report the sentence retrieval accuracy on the Tatoeba dataset (Artetxe and Schwenk, 2019a) using representations from each transformer layer. We find that InteMATs achieves similar performance to XLMR in the early layers. However, in later layers (layer 2 onwards for the base version and layer 9 onwards for the large version), InteMATs significantly outperforms XLMR. This comparison reveals that the cross-lingual transfer capability of InteMATs develops gradually in the later layers, which provide improved cross-lingual alignment. This finding is consistent with previous research showing that later layers of Transformers tend to extract high-level features (Clark et al., 2019).

Model Configuration
We compare the pre-training budget for enhancing cross-lingual alignment against existing MLLMs in Table 11. The results reveal that InteMATs offers better parameter efficiency during pre-training and performance improvements on cross-lingual understanding tasks across various text granularities, while requiring fewer trainable parameters and a smaller training corpus.

Limitations
We identify a few limitations of our current work.
• First, InteMATs demonstrates limited improvements on structure prediction tasks, i.e., the POS dataset. This is not surprising as syntactic structures are not universal across different languages. However, it is possible to share knowledge between languages from the same family, e.g., the Romance languages (es, pt, it, fr, ro). We encourage future researchers to pay more attention to syntactic cross-lingual alignment for MLLMs.
• Second, the publicly available benchmarks for cross-lingual transfer evaluation are dominated by sentence-level tasks. As a result, performance comparisons on existing benchmarks could be inadequate to demonstrate a model's capability of handling longer contexts and transferring that ability to different languages.

A.3 Detailed Experimental Results
We provide detailed results for each language on the cross-lingual tasks. Specifically, we present the results for the cross-lingual sentence retrieval benchmark in Table 13 (BUCC dataset), while the results for the cross-lingual relation extraction and classification benchmarks are displayed in Table 14 (RELX dataset) and Table 15 (PAWS-X dataset). The results for cross-lingual structure prediction are shown in Table 16. Finally, we present the results for the cross-lingual question answering tasks in Table 17 (XQuAD dataset) and Table 18 (TydiQA dataset).

Figure 1 :
Figure 1: (a): The structure of InteMATs in each transformer layer. It includes two multilingual adapters (i.e., MATs-ST and MATs-DT) and an adapter-fusion module. The fusion module learns the Key, Value, and Query matrices to fuse the two pre-trained adapters. MATs-ST and MATs-DT are fixed when updating the fusion module. MHA denotes the multi-head attention mechanism. Add&Norm denotes the addition and layer normalization operations. FFN denotes the feed-forward neural network. (b): An example of training MATs-ST on the sentence-level parallel corpus. The MLLM backbone is fixed during training.

Figure 2 :
Figure 2: InteMATs activation at different layers. We follow the setting from Pfeiffer et al. (2021) to calculate the softmax activation for the two adapters, MATs-ST and MATs-DT.

Figure 3 :
Figure 3: Average accuracy on 36 language pairs from the Tatoeba dataset in the xx→en direction.

Figure 4 :
Figure 4: The cosine similarities of aligned embeddings of two parallel sample sets.

Table 2 :
The cross-lingual sentence retrieval performance. We report the average F1 score over four languages for BUCC and the average accuracy@1 score for Tatoeba.

Table 3 :
The cross-lingual sentence classification performance. We report the F1 score for RELX and Accuracy (Acc.) for PAWS-X. Results of VECO, XLM-E, XLM-ALIGN, KMLM, MTMB, and MLKG are taken from their original papers.

Table 4 :
The cross-lingual structure prediction performance. Accuracy (Acc.) is used for POS evaluation.

Table 5 :
The cross-lingual question-answering performance. We report F1 and Exact Match (F1/EM) scores.

Table 6 :
The performance gap (∆) between the source language and target languages on cross-lingual transfer tasks. A lower score indicates better cross-lingual transferability.

Table 7 :
The cross-lingual retrieval accuracy (Acc.) on the low-resource languages from the Tatoeba dataset.

Table 8 :
The ablation results for both MLLMs on various cross-lingual language tasks.

Table 9 :
The performance on various cross-lingual tasks under different adapter pre-training corpora.

Table 10 :
The performance on various cross-lingual tasks under different pre-training tasks.

Table 11 :
The implementation details of the existing MLLMs and InteMATs.

Table 14 :
The detailed results of F1 score on the RELX dataset for each language.

Table 15 :
The detailed results of Acc. on the PAWS-X dataset for each language.

Table 16 :
The detailed results of Acc. on the POS dataset for each language.