Structural Contrastive Pretraining for Cross-Lingual Comprehension

Multilingual language models trained with various pre-training tasks, such as masked language modeling (MLM), have yielded encouraging results on a wide range of downstream tasks. Despite these promising performances, structural knowledge in cross-lingual corpora is less explored in current work, leading to semantic misalignment. In this paper, we propose a new pre-training task named Structural Contrastive Pretraining (SCP) to align the structural words in parallel sentences, improving the models' linguistic versatility and their capacity to understand representations in multiple languages. Concretely, SCP treats each structural word in the source and target languages as a positive pair. We further propose Cross-lingual Momentum Contrast (CL-MoCo) to better exploit negative pairs by maintaining a large queue. CL-MoCo extends the original MoCo approach to cross-lingual training and jointly optimizes source-to-target and target-to-source language representations in SCP, resulting in an encoder that is better suited for cross-lingual transfer learning. We conduct extensive experiments and prove the effectiveness of the resulting model, named XLM-SCP, on three cross-lingual tasks across five datasets, such as MLQA and WikiAnn. Our code is available at https://github.


Introduction
Following the promising results of the pre-training paradigm in the monolingual natural language domain, multilingual pre-trained language models (xPLMs) (Huang et al., 2019; Liang et al., 2020; Conneau et al., 2019; Chi et al., 2021a; Chen et al., 2022) have been proposed in rapid succession. In general, these xPLMs are trained on large-scale multilingual corpora using various pre-training language modeling tasks, such as MLM (Devlin et al., 2018; Lan et al., 2020), NSP (Pires et al., 2019), CLISM (Chen et al., 2022), and TRTD (Chi et al., 2021c). In this manner, xPLMs acquire robust contextually relevant representations and, as a result, excel at a variety of downstream tasks, like question answering (Hermann et al., 2015; He et al., 2018; Chen et al., 2021a) and named entity recognition (Liang et al., 2021). For instance, Chen et al. (2022) propose to train xPLMs with CLISM and MLM, achieving remarkable performances on multilingual sequence labeling tasks (Huang et al., 2019; Lewis et al., 2020; Artetxe et al., 2019a).

Figure 1: A visualization example from XLM-Roberta and ours. Here we present the same sentence in English and German. The words in the red and blue boxes refer to the aligned verb and object words, respectively.
Although these pre-training tasks help xPLMs learn promising multilingual contextualized representations at the token or sentence level (Li et al., 2022a), they do not take structural knowledge into consideration. One obvious limitation of the above approaches is the semantic misalignment between structural words from different languages, which biases the understanding of multilingual representations. Figure 1 showcases parallel sentences in English and German whose syntactic structures differ considerably. The main components of this sentence are "Ebydos AG" (subject), "founded" (verb), "subsidiary" (object) and "Wroclaw" (entity). Unfortunately, one of the current state-of-the-art xPLMs, XLM-Roberta (XLM-R) (Conneau et al., 2019), is incapable of capturing the alignment of these crucial words in German, leading to semantic deviation. Specifically, XLM-R pays less attention to the corresponding words of "founded" and "subsidiary" in German due to the sentence structure barrier between these two languages.
Going one step further, from the perspective of human behavior, when a language learner reads a sentence in another language, pointing out the structural words in the sentence, including the subject, verb, object and entities, helps him/her understand it quickly and accurately. This effect is more noticeable when the sentence is lengthy and complex. Similarly, by providing extra clues about aligned crucial/informative words in the parallel sentence, the model can benefit from a smaller gap between cross-lingual representations. Motivated by the above factors, we design a Structural Contrastive Pretraining (SCP) task to enhance xPLMs' comprehension ability via contrastive learning, bridging the misalignment between structural words in a parallel corpus. Considering that the subject, verb and object (S-V-O) are the backbone of a sentence and that aligned entities in cross-lingual parallel sentences convey coreference and information short-cuts (Chen et al., 2022), in this work we consider S-V-O and entities as the structural words of a sentence, all of which are insightful or crucial. Concretely, we divide the parallel corpus into a number of smaller groups. Each sub-group has two versions of the same sentence, one in the source language (high resource) and one in the target language (low resource). Each structural word in the source and target languages is considered a positive pair.
Due to the nature of contrastive learning, wherein comparisons are made between positive and negative pairs, increasing the number of negative pairs can improve the performance of the resulting model (Chen et al., 2020). Inspired by momentum contrast in computer vision (He et al., 2020), we keep a queue and employ the encoded embeddings from previous mini-batches to increase the quantity of negative pairs. In this method, momentum contrast employs a pair of fast and slow encoders to encode the source-language sentences and target-language sentences, respectively, and the fast encoder is saved for fine-tuning on downstream datasets. However, directly applying this approach to cross-lingual pre-training leads to another problem: as the fast encoder only sees the source language during pre-training, training becomes insensitive to the other target languages. As a consequence, the resulting model may underperform on cross-lingual transfer. To address this issue, we adapt the original momentum contrast to the cross-lingual setting, naming it Cross-lingual Momentum Contrast (CL-MoCo for short). Specifically, CL-MoCo utilizes two pairs of fast/slow encoders to jointly optimize source-to-target and target-to-source language representations, further bridging the cross-lingual gap. In light of the fact that almost all downstream cross-lingual understanding tasks only need one encoder, the two fast encoders share parameters in our pre-training.
Based on the above two strategies for building positive and negative pairs in SCP, our resulting model XLM-SCP can accurately capture the alignment of sentence structures across different languages, improving performance on cross-lingual understanding tasks. As seen in Figure 1 (b), our model successfully grasps the correspondence between sentence verbs ("founded"-"gegründet") and objects ("subsidiary"-"Ableger") in English and German. We conduct experiments with two different xPLM encoders on three multilingual tasks to test the effectiveness of our approach: Named Entity Recognition (NER) (Sang, 2002; Pan et al., 2017), Machine Reading Comprehension (MRC) (Lewis et al., 2020; Artetxe et al., 2019b) and Part-of-Speech Tagging (POS) (Zeman et al., 2019). Extensive results show that our method improves the baseline performances across 5 datasets in terms of all evaluated metrics. For example, our model initialized from XLM-R improves the baseline from 61.35% to 63.39% on the WikiAnn dataset (Pan et al., 2017).
In general, our contributions can be summarized as follows: • We observe that misalignment of informative and crucial structural words occurs in xPLMs, and design a new pre-training task called SCP to alleviate this problem.
• We propose CL-MoCo, which keeps a large queue to increase the number of negative pairs via momentum updating, pushing the model toward more nuanced cross-lingual learning.
• We conduct extensive experiments on different tasks, demonstrating the effectiveness of our approaches.

Related Work
Multilingual Pre-trained Language Models To date, transformer-based large-scale PLMs have become the standard in natural language processing and generation (Devlin et al., 2018; Liu et al., 2019; Lan et al., 2020; Sun et al., 2020). Currently, more and more communities are working to bring PLMs to a wide range of languages (xPLMs), and several efforts have been proposed, such as XLM-Roberta (Conneau et al., 2019) (short for XLM-R), Info-XLM (Chi et al., 2021a), and CLISM (Chen et al., 2022). These works are pre-trained on large multilingual corpora with token-level or sentence-level pre-training tasks. Despite their promising performances on multiple downstream tasks, none of them explicitly considers structural knowledge in the parallel corpus.
Contrastive Learning Owing to its potential to improve upon existing methods for learning effective representations, contrastive learning (Hadsell et al., 2006) has gained popularity in recent years. It works by pulling together representations that are semantically close (positives) in an embedding space and pushing apart others (negatives) that are not neighbors. The contrastive learning objective has been particularly successful in different contexts of natural language processing (Gao et al., 2021; Wu et al., 2020). Moreover, several efforts (Chen et al., 2021a, 2022; Gao et al., 2021; Chen et al., 2021b; You et al., 2021; Chen et al., 2023b,a) are well-designed for cross-lingual language understanding. For instance, Liang et al. (2022) proposed multi-level contrastive learning for cross-lingual spoken language understanding. Chen et al. (2022) employed contrastive learning to learn noise-invariant representations from multilingual corpora for downstream tasks. Different from previous works, we utilize contrastive learning to learn alignments of the structural words (Tang et al., 2023; Li et al., 2022b), leading to a more comprehensive and accurate understanding of cross-lingual sentences.
Momentum Contrast Recently, several works (Yang et al., 2021; Wu et al., 2022) have explored momentum contrast in natural language understanding tasks, such as sentence representation and passage retrieval. Specifically, Yang et al. (2021) propose xMoCo to learn a dual-encoder for query-passage matching via two pairs of fast/slow encoders. Although we share a similar topic on momentum contrast, our research questions, application areas, and methods differ. xMoCo is designed for query-matching tasks, while our proposed CL-MoCo is tailored for cross-lingual representation learning. Moreover, Yang et al. (2021) employ two different encoders for the query and the passage, respectively, whereas we share the parameters of the two fast encoders in our training. Finally, we focus on representation learning for cross-lingual transfer, while they only consider the monolingual setting.
Recent works Recently, several works (Schuster et al., 2019; Pan et al., 2021; Chi et al., 2021b; Ouyang et al., 2021) also focus on word alignment for multilingual tasks. For clarity, we list some key differences: all of them align each token in the parallel corpus in an "all-to-all" fashion, whereas we only consider structural words like S-V-O via contrastive learning. The motivations are: (1) In our pilot analysis and experiments, we compared two settings of the proposed SCP: a. training the model with only structural words; b. training the model with all tokens in the sentences. Experimentally, we observe that they achieve comparable performances on MRC tasks, but the latter achieves slightly worse results on NER tasks. This is because aligning words with no precise meaning, such as stopwords, can have visible side effects on token-level tasks like NER.
(2) Furthermore, the latter results in higher computation cost than our method.
(3) From a human perspective, structural words are the backbone of each sentence, and a solid grasp of them is sufficient for handling the majority of situations.

Methodology
In this section, we first illustrate our proposed Structural Contrastive Pretraining (SCP) in detail.
Then we introduce how to incorporate our method with momentum contrast. Since our proposed methods are flexible and can be built on top of any xPLM, we use E to denote a pre-trained language model, where E corresponds to the E_fast in Section 3.2. We aim at enhancing E's ability to capture the consistency between parallel structural representations via SCP. An overview of our approach is illustrated in Figure 2.

Structural Contrastive Pretraining
Definition To bridge the misalignment between structural words from different languages, we first introduce how to collect the structural words in the inputs. Given a source-language input sentence s^s and its target-language counterpart s^t, we start by using an off-the-shelf named entity recognition tool (e.g., spaCy) to select structural words in the source language, including the subject, verb, object, and entities of the sentence (if no structural word can be extracted from a sentence, we remove it). As some extracted words are illogical due to the performance limitations of commercially available NER tools, these uninformative words could lead to sub-optimal training during pre-training. Hence, we follow (Chen et al., 2022) to filter out uninformative spans: • Any span that consists solely of stop words is eliminated.
• Selected structural words should not include any punctuation.
• The maximum sequence length of an entity is limited to 6.
As the translation of the same phrase may vary when it is entered independently or combined with a full sentence, we utilize an off-the-shelf alignment tool, GIZA++ (Pei et al., 2020), to align the corresponding translations of the selected structural words in the target language. As a result, we obtain structural words W^s = {w^s_1, w^s_2, ..., w^s_k} in s^s and their counterparts W^t = {w^t_1, w^t_2, ..., w^t_k} in s^t. Notice that k can be larger than 4 when there are multiple entities in the sentence.
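To make the selection and filtering concrete, below is a minimal sketch of this step, assuming spaCy's English pipeline ("en_core_web_sm") as the off-the-shelf extraction tool; the GIZA++ alignment into the target language is omitted, and all function names are illustrative rather than taken from the released code.

```python
import string

import spacy

nlp = spacy.load("en_core_web_sm")
STOP_WORDS = nlp.Defaults.stop_words
MAX_SPAN_LEN = 6  # maximum number of tokens kept for an entity span


def keep_span(words):
    """Filtering rules: drop stop-word-only spans, spans containing
    punctuation, and spans longer than 6 tokens."""
    if not words or len(words) > MAX_SPAN_LEN:
        return False
    if all(w.lower() in STOP_WORDS for w in words):
        return False
    if any(ch in string.punctuation for w in words for ch in w):
        return False
    return True


def extract_structural_words(sentence: str):
    """Collect subject, verb, object and entity spans from an English sentence."""
    doc = nlp(sentence)
    spans = []
    for tok in doc:  # subject / verb / object via dependency labels
        if tok.dep_ in ("nsubj", "nsubjpass"):
            spans.append([tok.text])
        elif tok.dep_ == "ROOT" and tok.pos_ == "VERB":
            spans.append([tok.text])
        elif tok.dep_ in ("dobj", "obj", "pobj"):
            spans.append([tok.text])
    for ent in doc.ents:  # named entities
        spans.append([t.text for t in ent])
    return [s for s in spans if keep_span(s)]


print(extract_structural_words("Ebydos AG founded a subsidiary in Wroclaw."))
```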
Pre-training It is essential to obtain the representations of each word from W^s and W^t in SCP. Before going further, we first formulate the input sequences as:

$$X^s = [\mathrm{CLS}]\ s^s\ [\mathrm{SEP}], \qquad X^t = [\mathrm{CLS}]\ s^t\ [\mathrm{SEP}] \tag{1}$$

where [CLS] and [SEP] denote the special beginning and separator tokens, and X^s and X^t refer to the input sequences in the source and target languages, respectively.
Then we pass X^s and X^t into E, producing contextualized representations of each token in the sequences:

$$H^s = E(X^s), \qquad H^t = E(X^t)$$

where H^s, H^t ∈ R^{l×d}, and l and d represent the max sequence length and the hidden size, respectively. Subsequently, for each word w^s_i ∈ W^s, where i ∈ [1, k], we obtain its representation H^s_i from H^s. Similarly, we can get its positive-pair representation H^t_i from H^t. Notice that we cannot directly employ H^s_i and H^t_i in our SCP because w^s_i and w^t_i may produce multiple sub-tokens after tokenization. Therefore, we apply an extra aggregation function F on H^s_i and H^t_i to obtain the final representations:

$$r^s_i = F(H^s_i), \qquad r^t_i = F(H^t_i)$$

where F refers to the average pooling of the beginning and ending token representations of H^s_i and H^t_i, and r^s_i, r^t_i ∈ R^{1×d}. Intuitively, (r^s_i, r^t_i) are regarded as positive pairs in SCP.
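The aggregation function F can be sketched as follows; the span boundaries are assumed to be known from the tokenizer's offset mapping, and the indices used in the example are hypothetical.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
encoder = AutoModel.from_pretrained("xlm-roberta-base")


def span_representation(hidden_states: torch.Tensor, start: int, end: int) -> torch.Tensor:
    """Average-pool the first and last sub-token representations of a word span.
    hidden_states: [seq_len, hidden_size]; start/end: inclusive sub-token indices."""
    return (hidden_states[start] + hidden_states[end]) / 2.0


sentence = "Ebydos AG founded a subsidiary in Wroclaw."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state[0]  # [seq_len, 768]

# Suppose the word "founded" occupies sub-tokens 3..4 (hypothetical indices;
# in practice they come from the tokenizer's offset mapping).
r_founded = span_representation(hidden, start=3, end=4)
print(r_founded.shape)  # torch.Size([768])
```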

Cross-lingual Momentum Contrast
In this part, we first introduce how to apply momentum contrast to our method in a straightforward way. Then we illustrate our proposed CL-MoCo.
MoCo As opposed to merely collecting in-batch negatives, we use the momentum contrast approach to increase the number of negatives by maintaining a queue of constant size. In particular, the queued embeddings are gradually replaced: when the current mini-batch's sentence embeddings are enqueued, the "oldest" ones in the queue are removed if the queue is full. When directly applying momentum contrast to cross-lingual training, we can employ a pair of encoders E_fast and E_slow. In one training step, E_fast encodes s^s into H^s and E_slow maps s^t into H^t. To lessen the discrepancy between the two encoders, we apply a momentum update to E_slow, thereby turning E_slow into a slowly moving average of the encoder E_fast. Formally, we update E_slow in the following way:

$$\theta_{slow} \leftarrow \lambda\,\theta_{slow} + (1-\lambda)\,\theta_{fast}$$

where θ_fast and θ_slow denote the parameters of E_fast and E_slow, and λ determines how quickly the slow encoder updates its parameters; (1 − λ) is normally set to a small positive value. After pre-training, only E_fast (E_fast is equal to E) is kept for fine-tuning, and E_slow is discarded.
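A minimal PyTorch sketch of this momentum update, assuming the fast and slow encoders are torch.nn.Module instances with identical architectures:

```python
import torch


@torch.no_grad()
def momentum_update(fast_encoder: torch.nn.Module,
                    slow_encoder: torch.nn.Module,
                    lam: float = 0.99) -> None:
    """theta_slow <- lam * theta_slow + (1 - lam) * theta_fast."""
    for p_fast, p_slow in zip(fast_encoder.parameters(),
                              slow_encoder.parameters()):
        p_slow.data.mul_(lam).add_(p_fast.data, alpha=1.0 - lam)
```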
With the enqueued sentence embeddings, our optimization objective for (r^s_i, r^t_i) is formulated as L_i:

$$L_i = -\log \frac{\exp(\Psi(r^s_i, r^t_i)/\tau)}{\sum_{j=1}^{N}\exp(\Psi(r^s_i, r^t_j)/\tau) + \sum_{m=1}^{M}\exp(\Psi(r^s_i, r_m)/\tau)}$$

where N and M are the sizes of the mini-batch and the queue, respectively, r_m denotes a sentence embedding in the momentum-updated queue, τ represents the temperature, and Ψ refers to the cosine similarity function.
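The objective can be implemented as a standard InfoNCE loss over in-batch and queued negatives; the sketch below L2-normalizes the embeddings so that the dot product equals cosine similarity, and all variable names are illustrative.

```python
import torch
import torch.nn.functional as F


def info_nce_with_queue(r_src: torch.Tensor,
                        r_tgt: torch.Tensor,
                        queue: torch.Tensor,
                        tau: float = 0.05) -> torch.Tensor:
    """r_src, r_tgt: [N, d] positive pairs from the current mini-batch.
    queue: [M, d] momentum-encoded embeddings from previous mini-batches."""
    r_src = F.normalize(r_src, dim=-1)
    r_tgt = F.normalize(r_tgt, dim=-1)
    queue = F.normalize(queue, dim=-1)

    sim_batch = r_src @ r_tgt.t() / tau            # [N, N]; diagonal = positives
    sim_queue = r_src @ queue.t() / tau            # [N, M]; queued negatives
    logits = torch.cat([sim_batch, sim_queue], 1)  # [N, N + M]
    labels = torch.arange(r_src.size(0), device=r_src.device)
    return F.cross_entropy(logits, labels)         # -log softmax at the positive
```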

CL-MoCo
In the above method, target-language sentences are only encoded by the slow encoder, which is not directly updated by the gradients of the loss. Moreover, the fast encoder only encodes the source languages during pre-training, making it insensitive to input sequences in other low-resource languages. These two problems could leave the encoder sub-optimized and unable to learn reasonable cross-lingual representations. Therefore, we propose CL-MoCo to alleviate the above issues. In particular, CL-MoCo employs two sets of fast/slow encoders: E^s_fast and E^s_slow for source languages, and E^t_fast and E^t_slow for target languages. In addition, two separate queues Q^s and Q^t are used to store previously encoded sentence embeddings in the source and target languages, respectively. The vectors encoded by E^s_slow and E^t_slow are pushed into Q^s and Q^t, separately. In CL-MoCo, we jointly optimize the two sets of encoders to learn effective source-to-target and target-to-source language representations, and Eq. 5 can be extended as:

$$\theta^s_{slow} \leftarrow \lambda\,\theta^s_{slow} + (1-\lambda)\,\theta^s_{fast}, \qquad \theta^t_{slow} \leftarrow \lambda\,\theta^t_{slow} + (1-\lambda)\,\theta^t_{fast}$$

Hence, the optimization objective of the positive pair (r^s_i, r^t_i) in the source-to-target direction can be formulated as L_i(r^s_i, r^t_i):

$$L_i(r^s_i, r^t_i) = -\log \frac{\exp(\Psi(r^s_i, r^t_i)/\tau)}{\sum_{j=1}^{N}\exp(\Psi(r^s_i, r^t_j)/\tau) + \sum_{q^s \in Q^s}\exp(\Psi(r^s_i, r_{q^s})/\tau)}$$

Similarly, our CL-MoCo works in both directions, and the objective in the target-to-source direction L_i(r^t_i, r^s_i) is:

$$L_i(r^t_i, r^s_i) = -\log \frac{\exp(\Psi(r^t_i, r^s_i)/\tau)}{\sum_{j=1}^{N}\exp(\Psi(r^t_i, r^s_j)/\tau) + \sum_{q^t \in Q^t}\exp(\Psi(r^t_i, r_{q^t})/\tau)}$$

For all selected structural words in s^s and s^t, the overall objective of our SCP can be summarized as:

$$L_{SCP} = \sum_{i=1}^{k}\Big(L_i(r^s_i, r^t_i) + L_i(r^t_i, r^s_i)\Big)$$

where k is the number of structural words in the input sentence. We share the parameters of the two fast encoders and of the two slow encoders because of the following facts: 1) we focus on cross-lingual understanding tasks rather than passage retrieval, which mostly needs only one encoder; 2) two separate fast and slow encoders would result in more computation and training time.
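The following is a minimal sketch of one CL-MoCo step under the parameter-sharing setup described above (one shared fast encoder and one shared slow encoder are kept in practice); queue maintenance is simplified to concatenation and truncation, the direction-to-queue pairing follows the objectives above, and all names are illustrative.

```python
import copy

import torch
import torch.nn.functional as F


def nce(query, key, queue, tau=0.05):
    """InfoNCE over in-batch keys plus queued negatives (cf. the objectives above)."""
    query, key, queue = (F.normalize(x, dim=-1) for x in (query, key, queue))
    logits = torch.cat([query @ key.t(), query @ queue.t()], dim=1) / tau
    labels = torch.arange(query.size(0), device=query.device)
    return F.cross_entropy(logits, labels)


class CLMoCo(torch.nn.Module):
    def __init__(self, encoder, queue_size=20000, dim=768, lam=0.99, tau=0.05):
        super().__init__()
        self.fast = encoder                  # shared E_fast, saved for fine-tuning
        self.slow = copy.deepcopy(encoder)   # shared E_slow, updated by momentum only
        for p in self.slow.parameters():
            p.requires_grad = False
        self.lam, self.tau, self.queue_size = lam, tau, queue_size
        self.register_buffer("queue_s", torch.zeros(0, dim))
        self.register_buffer("queue_t", torch.zeros(0, dim))

    @torch.no_grad()
    def momentum_update(self):
        for pf, ps in zip(self.fast.parameters(), self.slow.parameters()):
            ps.data.mul_(self.lam).add_(pf.data, alpha=1.0 - self.lam)

    @torch.no_grad()
    def enqueue(self, name, emb):
        q = torch.cat([getattr(self, name), emb.detach()], dim=0)
        setattr(self, name, q[-self.queue_size:])  # drop the oldest entries

    def forward(self, r_src_fast, r_tgt_fast, r_src_slow, r_tgt_slow):
        """All inputs are [N, d] structural-word embeddings: *_fast produced by
        self.fast, *_slow produced by self.slow under torch.no_grad()."""
        loss_s2t = nce(r_src_fast, r_tgt_slow, self.queue_s, self.tau)
        loss_t2s = nce(r_tgt_fast, r_src_slow, self.queue_t, self.tau)
        self.momentum_update()
        self.enqueue("queue_s", r_src_slow)   # Q^s stores slow source embeddings
        self.enqueue("queue_t", r_tgt_slow)   # Q^t stores slow target embeddings
        return loss_s2t + loss_t2s
```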

Pre-training Strategy
Following the line of (Liu et al., 2019; Chi et al., 2021a), we also pre-train E with the masked language modeling (MLM) task. Concretely, we train the model in a multi-task manner. The total objective of our pre-training can be defined as:

$$L = L_{SCP} + L_{MLM}$$

Experiment
In this section, we first introduce how we collect the pre-training data for the proposed SCP. Then we illustrate the experimental settings for pre-training and fine-tuning. Finally, we present our experimental results on various cross-lingual datasets, including baseline descriptions and main results.

Pre-training Data
As aforementioned, our proposed task SCP requires a parallel corpus. We choose the MT dataset (Conneau and Lample, 2019) to construct our pre-training data. In contrast to earlier research (Chi et al., 2021a) that used billion-level corpora across about one hundred languages to generate the training corpus, we only use six languages from the MT dataset, including English (en), Spanish (es), Arabic (ar), German (de), Dutch (nl), and Vietnamese (vi), demonstrating that our approach also yields significant gains in languages for which we have no pre-training data.
Given the promising performance of off-the-shelf NER tools (e.g., spaCy) in English, we choose English as our source language, with the remaining five languages serving as target languages in turn. As a result, we obtain 3.9 million parallel pre-training sentences after applying the rules in Section 3.1. The distribution over languages is reported in Table 1.

Evaluation
We evaluate XLM-SCP on three cross-lingual tasks: cross-lingual machine reading comprehension (xMRC), cross-lingual named entity recognition (xNER) and cross-lingual Part-of-Speech tagging (xPOS). Concretely, we conduct experiments on five datasets: MLQA (Lewis et al., 2020), XQUAD (Artetxe et al., 2019b), CoNLL (Sang, 2002), WikiAnn (Pan et al., 2017) and UDPOS (Zeman et al., 2019). We introduce each dataset and its test languages in Appendix A.1. We use a zero-shot configuration to fine-tune our model for all datasets, which means that we only use the English training set to optimize the model and then test the final model on other target languages. Besides, we also test the cross-lingual transfer ability of XLM-SCP on these datasets; that is, we also validate model performance on target languages that are not included in our pre-training data.
We employ two evaluation measures for the xMRC task: Exact Match (EM) and span-level F1 score, which are commonly used to evaluate MRC model accuracy. Span-level F1 measures the span overlap between the ground-truth answer and the model prediction. The exact match (EM) score is 1 if the prediction is exactly the same as the ground truth, and 0 otherwise. For the xNER task, we employ entity-level F1 scores to evaluate our model, which require the boundary and type of the prediction to exactly match the ground-truth entity. Similarly, we also use the F1 score to validate model performance on UDPOS.
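For reference, a minimal sketch of the two xMRC metrics described above (exact match and span-level, token-overlap F1); answer normalization is simplified here to lower-casing and whitespace splitting.

```python
def exact_match(prediction: str, ground_truth: str) -> int:
    """1 if the prediction exactly matches the ground truth, else 0."""
    return int(prediction.strip().lower() == ground_truth.strip().lower())


def span_f1(prediction: str, ground_truth: str) -> float:
    """Token-overlap F1 between the predicted span and the gold answer span."""
    pred_tokens = prediction.lower().split()
    gold_tokens = ground_truth.lower().split()
    gold_counts = {}
    for t in gold_tokens:
        gold_counts[t] = gold_counts.get(t, 0) + 1
    common = 0
    for t in pred_tokens:
        if gold_counts.get(t, 0) > 0:
            common += 1
            gold_counts[t] -= 1
    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("Ebydos AG", "Ebydos AG"), round(span_f1("the Ebydos AG", "Ebydos AG"), 2))
```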

Training Details
Model Structure To show the generalization ability of our approach, we initialize our model from two commonly used xPLM encoders: XLM-R and Info-XLM. The resulting model is named XLM-SCP in our experiments. We use the base-version checkpoints of the above two models from Hugging Face Transformers. Our XLM-SCP contains 12 transformer layers, and the hidden vector dimension is set to 768.

Pre-training Details
Our training code is based on PyTorch 1.11 and Transformers 4.10.0. Along the line of the research of Devlin et al. (2018), we randomly mask 15% of the tokens in the input sequence to implement MLM. In pre-training, we optimize our model using the Adam optimizer with a batch size of 128 for a total of 4 epochs. Moreover, the learning rate is set to 1e-6 with 1.5K warmup steps. The maximum input sequence length is set to 128. Experimentally, τ in Eq. 10 is set to 0.05, the queue sizes of Q^s and Q^t are both 20k, and λ is set to 0.99. We pre-train our model using 8×V100-32G GPUs for about one day. Fine-tuning details can be found in Appendix A.2.
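For convenience, the pre-training hyperparameters listed above are summarized in the illustrative config below (key names are ours, not from the released code):

```python
pretrain_config = {
    "mlm_probability": 0.15,      # fraction of masked tokens for MLM
    "optimizer": "Adam",
    "batch_size": 128,
    "epochs": 4,
    "learning_rate": 1e-6,
    "warmup_steps": 1500,
    "max_seq_length": 128,
    "temperature_tau": 0.05,
    "queue_size": 20_000,         # for both Q^s and Q^t
    "momentum_lambda": 0.99,
    "hardware": "8 x V100-32G, ~1 day",
}
```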

Results
Baselines We compare our model with the following xPLM-based baselines: (1) M-BERT (Devlin et al., 2018), pre-trained with the MLM and NSP tasks on Wikipedia data covering 104 languages; (2) XLM (Conneau and Lample, 2019), jointly optimized with the MLM and TLM tasks in 100 languages during pre-training; (3) XLM-R (Conneau et al., 2019), a multilingual version of Roberta pre-trained with MLM on the large-scale CC-100 dataset; (4) Info-XLM (Chi et al., 2021a), another popular and effective xPLM, which initializes from XLM-R and is further pre-trained with the proposed XLCO task in 94 languages.

xMRC Results Table 2 compares our method with these typical systems on five datasets. On the two xMRC datasets, our models outperform the baselines by a notable margin. For instance, ours built on XLM-R achieves 65.14%/47.20% (vs. 63.24%/45.88%) in terms of F1/EM score on MLQA. Similarly, we also obtain 1.81%/1.65% gains on the XQUAD dataset. We can also draw another interesting conclusion: compared to Info-XLM, which is likewise built on top of XLM-R and continues to be pre-trained on 130 million examples across 94 languages, our model initialized from XLM-R performs comparably. Nevertheless, XLM-SCP only needs 3.9 million parallel sentences from six languages, demonstrating the efficacy of our proposed approaches (3.9M ≪ 130M). We further evaluate XLM-SCP on target languages that are not included during pre-training. From Table 3, we can observe that XLM-SCP also achieves about 1.5% improvements on three datasets under the zero-shot cross-lingual transfer setting. In general, the results in Table 2 and Table 3 prove that our approach not only improves performance in the languages included in our SCP pre-training, but also transfers better to other low-resource languages.

Analysis
Aside from the high performance achieved by our proposed approaches, we are still curious about the following questions. Q1: What are the effects of each key component in our XLM-SCP? Q2: Is CL-MoCo really superior to MoCo in cross-lingual understanding tasks? Q3: Does the size of the queue in CL-MoCo affect the performance of our model? Q4: What are the model performances with different τ in Eq. 10? (See Appendix C, Figure 5.) Q5: Within the chosen subjects, verbs, objects, and entities in the structural words, which part has the biggest effect on XLM-SCP's performance? (See Appendix C, Table 10.) In this section, we conduct extensive experiments to answer the above questions.
Answer to Q1: Experiments are carried out to confirm the independent contribution of each key component in our proposed pre-training scheme.

Answer to Q2: We further conduct analysis to verify the effectiveness of CL-MoCo vs. MoCo on cross-lingual understanding tasks. We conduct ablation experiments on three tasks across four datasets and show the results in Figure 3. We find that our proposed CL-MoCo achieves better results on all these datasets compared with the original MoCo. These results further prove that CL-MoCo has a stronger ability to learn effective cross-lingual representations.
Answer to Q3: The main assumption behind CL-MoCo is that the number of negative samples matters in contrastive learning. Here we empirically study this assumption on cross-lingual understanding tasks by varying the size of the queue that keeps negative pairs. As shown in Figure 4, we evaluate XLM-SCP with M ∈ {5k, 10k, 20k, 30k, 40k} on the WikiAnn and MLQA datasets. We can conclude that the model performs slightly better as the queue size increases initially, especially for the xMRC task. Interestingly, the model achieves the best results on WikiAnn when M is equal to 20k, and its performance slightly decreases when M passes 20k. One possible explanation is that a larger queue may introduce some "false negative samples", which have a more obvious side effect on xNER tasks. In light of the fact that the queue size has a negligible effect on training speed and memory use, we choose a queue size of 20k for all downstream datasets.

Conclusion
In this paper, we observe that misalignment of crucial structural words occurs in the parallel sentences of current xPLMs. We propose a new pre-training task called Structural Contrastive Pretraining (SCP) to alleviate this problem, enabling the model to comprehend cross-lingual representations more accurately. We further incorporate momentum contrast into cross-lingual pre-training, naming it CL-MoCo. In particular, CL-MoCo employs two sets of fast/slow encoders to jointly learn source-to-target and target-to-source cross-lingual representations. As a result, the resulting model is better suited for cross-lingual transfer. Extensive experiments and analyses across various datasets show the effectiveness and generalizability of our approach. As future work, we will apply our method to other natural language understanding tasks and seek a proper way to reduce data preprocessing costs.

Limitations
The main goal of this paper is to utilize structural knowledge for cross-lingual comprehension. We present a new pre-training task named SCP in the hope of bridging the misalignment of structural words in the parallel corpus. More generally, we expect the proposed method to facilitate research on cross-lingual understanding. Admittedly, the main limitation of this work is that we rely on off-the-shelf tools to extract and align words in different languages, which can introduce mistakes in some situations. For example, GIZA++ only achieves 80%-85% accuracy in aligning the corresponding words in another language, and no current tool can achieve this goal with 100% accuracy. As a result, some biased data enters pre-training, which calls for further research and consideration when utilizing this work to build xPLMs.

Answer to Q5: We further conduct analysis to find out which part of the chosen subjects, verbs, objects, and entities in the structural words has the biggest impact on how well our model works. Hence, we remove each S-V-O and entity word in turn and test the model's performance on xNER and xPOS tasks. As Table 10 shows, each component of the selected structural words has a different impact on our XLM-SCP. Interestingly, the model's performance drops significantly on the WikiAnn dataset without entities, but only slightly on the UDPOS dataset. One possible reason is that xNER tasks require the model to have stronger entity-level understanding, while xPOS tasks need more fine-grained token-level understanding.

Figure 4 :
Queue size sensitivity experiments across two datasets; the F1 score is used for evaluation. In the experiments, we initialize XLM-SCP from XLM-R.

Figure 5 :
Temperature sensitivity experiments across three datasets; the F1 score is used for evaluation. In the experiments, we initialize XLM-SCP from XLM-R.

Table 1 :
Total parallel sentences used in pre-training.

Table 2 :
Average evaluation results on five datasets. The results of our model are averaged over 5 runs. * denotes the model built on top of XLM-R. ♡ refers to the model based on Info-XLM. The results for each language are presented in Appendix B. We highlight the highest numbers among models with the same xPLM encoder. Here, we average the F1 scores on these datasets.

Table 3 :
Model performances under zero-shot cross-lingual transfer. In the experiments, we initialize XLM-SCP from XLM-R.

Table 4