Language Agnostic Multilingual Information Retrieval with Contrastive Learning

Multilingual information retrieval (IR) is challenging since annotated training data is costly to obtain in many languages. We present an effective method to train multilingual IR systems when only English IR training data and some parallel corpora between English and other languages are available. We leverage parallel and non-parallel corpora to improve the pretrained multilingual language models' cross-lingual transfer ability. We design a semantic contrastive loss to align representations of parallel sentences that share the same semantics in different languages, and a new language contrastive loss that leverages parallel sentence pairs to remove language-specific information from the sentence representations of non-parallel corpora. When trained on English IR data with these losses and evaluated zero-shot on non-English data, our model demonstrates significant improvement over prior work on retrieval performance, while requiring much less computational effort. We also demonstrate the value of our model in the practical setting where parallel corpora are available only for a few languages while many other low-resource languages lack them. Our model works well even with a small number of parallel sentences, and can be used as an add-on module to any backbone and other tasks.


Introduction
Information retrieval (IR) is an important natural language processing task that helps users efficiently gather information from a large corpus (representative downstream tasks include question answering, summarization, search, and recommendation), but developing effective IR systems for all languages is challenging due to the cost of, and therefore lack of, annotated training data in many languages. While this problem is not unique to IR research (Joshi et al., 2020), constructing IR data is often more costly due to the need to either translate a large text corpus or gather relevancy annotations, or both, which makes it difficult to generalize IR models to lower-resource languages.
One solution to this is to leverage pretrained multilingual language models to encode queries and corpora for multilingual IR tasks (Zhang et al., 2021; Sun and Duh, 2020). One line of work on multilingual representation learning trains a masked language model, sometimes with the next sentence prediction task, on monolingual corpora of many languages, such as mBERT and XLM-R (Conneau et al., 2020). These models generally do not explicitly learn the alignment across different languages and do not perform effectively in empirical IR experiments. Other works directly leverage multilingual parallel corpora or translation pairs to explicitly align sentences in two languages, such as InfoXLM (Chi et al., 2021) and LaBSE (Feng et al., 2022).
In this work, we propose to use a semantic contrastive loss and a language contrastive loss, trained jointly with the information retrieval objective, to learn cross-lingual representations with strong cross-lingual transfer on retrieval tasks. Our semantic contrastive loss aims to align the embeddings of sentences that have the same semantics. It is similar to the regular InfoNCE (Oord et al., 2018) loss, which forces the representations of parallel sentence pairs in two languages to be close to each other and away from other negative samples. Our language contrastive loss aims to leverage the non-parallel corpora of languages without any parallel data, which the semantic contrastive loss ignores. It addresses the practical scenario in which parallel corpora are easily accessible for a few languages, but such resources are lacking for many low-resource languages. The language contrastive loss encourages the distances from a sentence representation to the two embeddings of a paralleled pair to be the same. Figure 1 illustrates how the two losses improve language alignment. In experiments, we evaluate the zero-shot cross-lingual transfer ability of our model on monolingual information retrieval tasks for 10 different languages. Experimental results show that our proposed method obtains significant gains and can be used as an add-on module to any backbone. We also demonstrate that our method is much more computationally efficient than prior work, works well with only a small number of parallel sentence pairs, and works well on languages without any parallel corpora.
Background: Multilingual DPR

Dense Passage Retriever (DPR) (Karpukhin et al., 2020) uses a dual-encoder structure to encode the queries and passages separately for information retrieval. To generalize to multilingual scenarios, we replace DPR's original BERT encoders with the multilingual language model XLM-R (Conneau et al., 2020) to transfer English training knowledge to other languages.
Concretely, given a batch of N query-passage pairs (q_i, p_i), we consider all other passages p_j, j ≠ i in the batch irrelevant (negative) passages, and optimize the retrieval loss as the negative log-likelihood of the gold passage:

$$\mathcal{L}_{IR} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp(\mathrm{sim}(q_i, p_i))}{\sum_{j=1}^{N} \exp(\mathrm{sim}(q_i, p_j))}$$

where the similarity of two vectors is defined as the dot product of their encoded representations, $\mathrm{sim}(q, p) = E_Q(q)^{\top} E_P(p)$.

Contrastive Learning for Cross-Lingual Generalization

The multilingual dense passage retriever only uses English corpora for training. To improve the model's generalization ability to other languages, we leverage two contrastive losses: the semantic contrastive loss and the language contrastive loss. Figure 2 shows our model framework. Specifically, the semantic contrastive loss (Chen et al., 2020a) pushes the embedding vectors of a pair of parallel sentences close to each other, and at the same time away from other in-batch samples that have different semantics. The language contrastive loss addresses the scenario in which some languages have no parallel corpora: it encourages the distances from a sentence embedding to the two embeddings of a paralleled pair to be the same.
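The in-batch retrieval objective of Section 2 can be sketched in pure Python. This is a minimal illustration, assuming dot-product similarity as in DPR; the function and variable names are ours, not from the paper.

```python
import math

def retrieval_loss(queries, passages):
    """In-batch negative log-likelihood, as in DPR (a sketch).

    queries, passages: lists of equal-length embedding vectors; the
    gold passage for query i is passages[i], and every other passage
    in the batch serves as a negative. Similarity is the dot product.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    total = 0.0
    for i, q in enumerate(queries):
        scores = [dot(q, p) for p in passages]
        log_denom = math.log(sum(math.exp(s) for s in scores))
        total += -(scores[i] - log_denom)  # -log softmax of the gold passage
    return total / len(queries)
```

The loss shrinks as the gold query-passage pairs become more separable from the in-batch negatives.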

Semantic Contrastive Loss
To learn a language-agnostic IR model, we wish the encoder to map sentences with the same semantics but from different languages to the same embeddings. For each parallel corpora batch, we do not limit our sample to one specific language pair; we randomly sample different language pairs for a batch. For example, a sampled batch could contain multiple language pairs such as En-Ar, En-Ru, and En-Zh. This strategy increases the difficulty of our contrastive learning and makes the training more stable.
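The mixed-language batch sampling can be sketched as follows. The corpus layout (a dict from a language-pair tag to a list of sentence pairs) and all names here are hypothetical, chosen only for illustration.

```python
import random

def sample_parallel_batch(corpora, n_pairs, seed=None):
    """Sample a mixed-language batch of parallel pairs.

    corpora: assumed to map a language-pair tag such as "En-Ar" to a
    list of (english_sentence, other_sentence) tuples.
    """
    rng = random.Random(seed)
    batch = []
    for _ in range(n_pairs):
        tag = rng.choice(sorted(corpora))       # pick a language pair
        batch.append(rng.choice(corpora[tag]))  # pick a parallel pair from it
    return batch
```

A single batch can thus mix En-Ar, En-Ru, and En-Zh pairs, which is what makes the in-batch negatives harder.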
Concretely, we randomly sample a mini-batch of 2N data points (N here does not have to be the same value as the N in Section 2). The batch contains N pairs of parallel sentences from multiple different languages. Given a positive pair (z_i, z_j), the embedding vectors of a pair of parallel sentences from two languages, the remaining 2(N − 1) samples are used as negative samples. The semantic contrastive loss for a batch is

$$\mathcal{L}_{semaCL} = \frac{1}{2N} \sum_{(i,j)} -\log \frac{\exp(\mathrm{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp(\mathrm{sim}(z_i, z_k)/\tau)}$$

where the sum runs over all 2N ordered positive pairs (i, j) and τ is a temperature hyperparameter.

Figure 2: Our model framework contains two parts: the main task (IR) and the parallel corpora task. For the main task, we use a dual-encoder dense passage retrieval module for information retrieval. For the parallel corpora task, we adopt the semantic contrastive loss to improve cross-lingual adaptation with parallel corpora. We also use the language contrastive loss, which leverages parallel and non-parallel corpora together.
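A minimal sketch of this NT-Xent-style semantic contrastive loss follows. The batch layout (rows 2m and 2m+1 holding a parallel pair) and the function names are assumptions made only for illustration.

```python
import math

def semantic_contrastive_loss(z, tau=0.1):
    """NT-Xent-style contrastive loss over a batch of 2N embeddings.

    z: list of embedding vectors; rows 2m and 2m+1 are assumed to be
    a parallel sentence pair. Each row's positive is its translation,
    and the other 2(N-1) rows serve as in-batch negatives.
    """
    def cos(a, b):
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return sum(x * y for x, y in zip(a, b)) / (na * nb)

    n = len(z)
    total = 0.0
    for i in range(n):
        j = i ^ 1  # index of i's translation under the assumed layout
        denom = sum(math.exp(cos(z[i], z[k]) / tau)
                    for k in range(n) if k != i)
        total += -math.log(math.exp(cos(z[i], z[j]) / tau) / denom)
    return total / n
```

When each parallel pair is already aligned and away from the other samples, the loss is near zero; misaligned pairs drive it up sharply.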

Language Contrastive Loss
When training multilingual IR systems, we might not always have parallel corpora for all languages of interest. In a realistic scenario, we have easy access to parallel corpora for a few high-resource languages, but no such availability for many low-resource languages. We propose a language contrastive loss to generalize the model's ability to languages that do not have any parallel corpora. For a batch B consisting of both parallel corpora P and non-parallel corpora Q, we denote z_i and z_j as the embeddings of a pair of parallel sentences (i, j) from two languages. We wish the cosine similarity from any other sentence embedding z_k to the two embeddings of a parallel pair to be the same. Therefore, we minimize the following loss:

$$\mathcal{L}_{langCL} = \sum_{(i,j) \in P} \sum_{\substack{k \in B \\ k \neq i,j}} \left( \mathrm{sim}(z_i, z_k) - \mathrm{sim}(z_j, z_k) \right)^2$$
The optimum is reached when sim(z_i, z_k) = sim(z_j, z_k) for all i, j, k. Note that the parallel corpus involved is not the target language's parallel corpus. In the language contrastive loss, i and j are two languages that are parallel with each other, and k is a third (target) language that does not have any parallel corpus with other languages.
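The language contrastive loss can be sketched as follows. The squared-difference penalty is an assumption consistent with the stated optimum sim(z_i, z_k) = sim(z_j, z_k), not necessarily the exact published form, and the names are ours.

```python
import math

def language_contrastive_loss(z_i, z_j, others):
    """Sketch of the language contrastive loss for one parallel pair.

    z_i, z_j: embeddings of a parallel sentence pair.
    others: embeddings z_k from the rest of the batch (parallel or
    non-parallel corpora). Any gap between z_k's cosine similarity to
    the two sides of the pair is penalised, pushing language-specific
    information out of the representations.
    """
    def cos(a, b):
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return sum(x * y for x, y in zip(a, b)) / (na * nb)

    gaps = [(cos(z_i, z_k) - cos(z_j, z_k)) ** 2 for z_k in others]
    return sum(gaps) / len(gaps)
```

Because z_k needs no translation of its own, the loss applies to sentences from languages with no parallel data at all.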

Semantic vs Language Contrastive Losses
While both the semantic contrastive loss and the language contrastive loss serve to align the representations of parallel sentences and remove language bias, they achieve this goal differently: one contrasts against in-batch negative samples, while the other uses in-batch parallel examples to constrain the target language embeddings. Moreover, a key property of the language contrastive loss is that as long as some parallel corpus exists, we can use this loss to remove language bias from the representations of sentences for which no parallel data exists, which makes it more broadly applicable.

Training
The two contrastive losses are applied to the passage encoder only. Experiments show that applying them to both the passage encoder and the query encoder results in unstable optimization, with erratic jumps in the training loss curves.
The joint loss combining the information retrieval loss, the semantic contrastive loss, and the language contrastive loss is

$$\mathcal{L} = \mathcal{L}_{IR} + w_s \mathcal{L}_{semaCL} + w_l \mathcal{L}_{langCL}$$

where w_s and w_l are hyperparameters weighting the semantic and language contrastive losses, which need to be tuned for different tasks. We train our model using 8 Nvidia Tesla V100 32GB GPUs with a batch size of 48. We use the AdamW optimizer with β1 = 0.9, β2 = 0.999 and a learning rate of 10^-5. For the three losses L_IR, L_semaCL, L_langCL, we sequentially calculate the loss and the gradients. We use w_s = 0.01 and w_l = 0.001; the hyperparameters are determined through a simple grid search.
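The joint objective can be sketched as a weighted sum with the reported weights. This is a trivial illustration with names of our own choosing; in practice the three losses' gradients are computed sequentially rather than from a single summed scalar.

```python
def joint_loss(l_ir, l_sema, l_lang, w_s=0.01, w_l=0.001):
    """Weighted sum of the retrieval loss and the two contrastive
    losses, using the weights reported in the paper (w_s = 0.01,
    w_l = 0.001, found by grid search)."""
    return l_ir + w_s * l_sema + w_l * l_lang
```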

Datasets
Our IR experiments involve two types of datasets: IR datasets and the parallel corpora.

Information Retrieval
In our experiments, we only use English information retrieval corpora (Natural Questions), and we evaluate the model's zero-shot transfer ability on other target languages (Mr.TyDi).
• Natural Questions (Kwiatkowski et al., 2019) is an English QA dataset. Following Zhang et al. (2021), we use the NQ dataset to train IR.

Parallel Corpora
The WikiMatrix parallel corpora contain parallel sentences extracted from Wikipedia articles in 85 different languages (Schwenk et al., 2021). For the languages involved in the Mr.TyDi dataset, the number of parallel pairs between each of them and English in WikiMatrix ranges from 51,000 to 1,019,000. During training, we sample the same number of parallel pairs (50K) for each language.

Baseline Models
We apply our contrastive loss functions to three multilingual pretrained language models:
• XLM-R (Conneau et al., 2020) is a pretrained transformer-based multilingual language model. It is trained on a corpus covering 100 languages with only the Masked Language Model (MLM) objective, in the RoBERTa fashion.
• LaBSE (Feng et al., 2022) pre-trains BERT with the Masked Language Model and Translation Language Model objectives on monolingual data and bilingual translation pairs. They train the model for 1.8M steps using a batch size of 8192.
In Table 1, we compare the computational effort needed by each model to improve language transfer ability. Both InfoXLM and LaBSE require large-scale pre-training with a larger batch size and a larger number of training steps than ours. Our model only requires "co-training" on the parallel corpora along with the main task. In Table 1, we list our model's training steps on the information retrieval task. This comparison indicates that for the retrieval task, our model does not need the costly pre-training that InfoXLM and LaBSE do.

Information Retrieval -All languages have parallel corpora with English
For the information retrieval training, we follow the previous literature (Zhang et al., 2021; Wu et al., 2022) and use an English QA dataset, the Natural Questions dataset (Kwiatkowski et al., 2019), for both training and validation. We evaluate our model on the Mr.TyDi dataset (Zhang et al., 2021) for monolingual query-passage retrieval in eleven languages. We follow Zhang et al. (2021) in using MRR@100 and Recall@100 as metrics.
In this section, we experiment with the setting in which we have parallel corpora from English to all other target languages. We test three variants of our model using XLM-R as the backbone:
1. only the semantic contrastive loss on the parallel corpora: L_IR + w_s L_semaCL;
2. only the language contrastive loss on the parallel corpora: L_IR + w_l L_langCL;
3. both the semantic and the language contrastive losses: L_IR + w_s L_semaCL + w_l L_langCL.
Table 2 shows the results of our model and the baseline XLM-R model. We also report the results of Wu et al. (2022), who propose a model called contrastive context prediction (CCP) that learns multilingual representations by leveraging sentence-level contextual relations as self-supervision signals. For our analysis, we mainly focus on MRR, since MRR is more aligned with our retrieval loss function, which aims to rank relevant passages higher. We also report Recall@100 in Table 7 in Appendix A. We find that overall our model performs significantly better than the basic XLM-R.

Table 3: MRR@100 on the monolingual information retrieval task of the Mr.TyDi dataset.
For our different model variants, we find that: (1) using only the semantic contrastive loss on the parallel corpora achieves the best average performance; (2) using only the language contrastive loss also achieves a significant performance improvement, lower than but close to using only the semantic contrastive loss; (3) using both losses only helps a few languages such as Ar, and does not improve the overall performance. Our assumption is that the semantic contrastive loss has already efficiently removed the language embedding shifts by leveraging the parallel pairs, so the additional language contrastive loss is not helpful when we have parallel corpora for all the languages. In Section 5.4, we experiment with a more practical scenario in which we only have parallel corpora for some of the target languages and non-parallel corpora for the rest, and we find that our language contrastive loss brings significant performance gains in that case.
We then further compare the performance of our best model, XLM-R + semantic contrastive loss, with that of other strong baselines, i.e. InfoXLM and LaBSE. We also examine whether the semantic contrastive loss can be used as an add-on module to InfoXLM and LaBSE to further boost their performance. Table 3 shows the MRR@100 results of XLM-R, InfoXLM, and LaBSE themselves, all trained with the IR loss, and the results trained jointly with the semantic contrastive loss. We find that our best model, XLM-R with only the semantic contrastive loss, significantly outperforms these strong baselines. Note that both InfoXLM and LaBSE involve large-scale pre-training to improve lingual transfer ability, which our method does not require: our model only requires joint training with the contrastive loss, which needs much less computational effort, as shown in Table 1. We also find that the semantic contrastive loss can be used as an add-on module to effectively boost the performance of InfoXLM and LaBSE, though the improvements on InfoXLM and LaBSE are not as large as on XLM-R. We speculate that this is because InfoXLM and LaBSE have already been pre-trained on other datasets, which have some distribution shift from the WikiMatrix dataset we used for the semantic contrastive loss add-on module. We also report the Recall@100 results in Table 8 of Appendix A.
In addition to the results from our own runs, we also list the results reported by Wu et al. (2022) in Table 3 as a reference. The difference in the baseline model performances may be due to randomness during model training. We also present the performance of the traditional BM25 method; its average MRR@100 is significantly lower than that of our method.

Effect of the Size of Parallel Dataset
We further investigate the effect of the size of the parallel dataset on multilingual retrieval performance. We train our model by varying the parallel dataset size, using XLM-R with only the semantic contrastive loss. Figure 3 shows the results. We find that: (1) using parallel corpora significantly boosts retrieval performance, compared with the dashed horizontal line for the case without any parallel corpora (the basic XLM-R); (2) even with a small parallel corpus of 500 pairs per language, we already achieve a good performance of MRR@100 = 0.38. When we gradually increase the parallel corpora to 50,000 pairs, the MRR@100 grows to 0.396, but the increase is not large. This suggests that our framework works well even with a small parallel dataset, which makes our method promising for low-resource languages that lack parallel corpora with English.

Effect of Language Pair Connection
In order to understand how different language pair connections affect performance, we conduct experiments using different language pairs on En, Fi, Ja, Ko. We experimented with six settings:
1. Basic setting: train XLM-R without using any parallel corpora, the same as the first row in Table 2;
2. Setting 1: train XLM-R with parallel corpora between English and all other languages, i.e. En-Fi, En-Ja, En-Ko;
3. Setting 2: train XLM-R with parallel corpora between English and Korean, and between Korean and the remaining languages, i.e. En-Ko, Ko-Fi, Ko-Ja;
4. Setting 3: train XLM-R with parallel corpora between English and Korean, and between Japanese and Finnish, i.e. En-Ko, Ja-Fi;
5. Setting 4: train XLM-R with parallel corpora between English and Korean, i.e. En-Ko;
6. Setting 5: train XLM-R with parallel corpora between Japanese and Finnish, i.e. Ja-Fi.
Table 4 shows the MRR@100 results. We find that settings 1 to 5 all significantly surpass the basic setting, echoing our previous finding that leveraging parallel corpora is helpful. Among settings 1 to 5, the differences are not large: the minimum MRR among them is 0.325 and the maximum is 0.343. This suggests that the connection among language pairs is not a decisive factor for our method. We also report the Recall@100 in Table 9 of Appendix A.

Information Retrieval -Some languages do not have parallel data
In this section, we investigate the scenario in which we have parallel corpora only for some of the target languages, but not for the remaining ones. This scenario reflects a realistic constraint: parallel corpora are lacking for many low-resource languages.
To test this, we leave Ru, Sw, Te, Th as languages without parallel corpora, and keep the parallel corpora for all other languages, i.e. Ar, Bn, Fi, Id, Ja, Ko. We experimented with three settings:
1. XLM-R + Semantic CL: we only use the semantic contrastive loss on the languages that have parallel corpora (Ar, Bn, Fi, Id, Ja, Ko);
2. XLM-R + Semantic CL + Language CL (WikiMatrix): we use the semantic contrastive loss on the languages that have parallel corpora, and the language contrastive loss on these parallel corpora along with the non-parallel WikiMatrix corpora;
3. XLM-R + Semantic CL + Language CL (Mr.TyDi): we use the semantic contrastive loss on the languages that have parallel corpora, and the language contrastive loss on these parallel corpora along with the non-parallel Mr.TyDi corpora.
Table 5 shows the MRR@100 results of our experiments. The language contrastive loss can effectively leverage the non-parallel corpora to improve information retrieval performance. For the XLM-R + Semantic CL + Language CL (Mr.TyDi) setting, the language contrastive loss boosts the average MRR@100 from 0.358 to 0.385. We also calculate the average performance on the languages with parallel corpora (Ar, Bn, Fi, Id, Ja, Ko) and the languages without parallel corpora (Ru, Sw, Te, Th); the Avg (withParallel) and Avg (noParallel) columns in Table 5 report these results. We find that the language contrastive loss improves performance on both types of languages: for languages with parallel corpora, the MRR@100 increases from 0.365 to 0.391; for languages without parallel corpora, it increases from 0.360 to 0.389. This suggests our model can be effectively deployed when no parallel corpora exist for low-resource languages. Table 10 in Appendix A reports the Recall@100 results.
Since using the Mr.TyDi corpora brings in target-domain information, we also examine the XLM-R + Semantic CL + Language CL (WikiMatrix) setting. This setting uses the WikiMatrix non-parallel corpora for Ru, Sw, Te, Th; it does not introduce the target-domain information, and thus reflects the clean gain from the language contrastive loss. We find that using the WikiMatrix non-parallel corpora achieves slightly lower but close performance to using the Mr.TyDi corpora. This suggests that the contribution of the target-domain information to IR performance is minor.

Figure 4: The effect of the size of the non-parallel dataset for each language, with the 95% CI in shadow.

Effect of the Size of Non-Parallel Dataset
We further investigate the effect of the size of the non-parallel dataset on multilingual retrieval performance. We train our model by varying the non-parallel dataset size, using XLM-R with both the semantic and the language contrastive losses, and keep the size of the parallel dataset fixed at 50,000. Figure 4 shows the results. The dashed horizontal line is the setting using only parallel corpora, i.e. the first row in Table 5. We find that: (1) using non-parallel corpora significantly boosts retrieval performance, compared with the leftmost point where no non-parallel corpora are used; (2) as the non-parallel dataset size increases from 0 to 10,000, the MRR@100 improves quickly; (3) from 10,000 to 50,000, the MRR@100 changes little, but its variance decreases.

BUCC: Bitext Retrieval
The information retrieval task above is not commonly seen in multilingual NLP papers. A closely related task they often study is the BUCC task (Zweigenbaum et al., 2018). The BUCC task has been tested with the LaBSE benchmark model used in the previous section (Feng et al., 2022), and in many other multilingual NLP works (Artetxe and Schwenk, 2019; Yang et al., 2019; Schwenk, 2018; Reimers and Gurevych, 2020). Therefore, following these works, we also investigate our model's performance on the BUCC task.
For the BUCC bitext mining task, we follow previous work (Artetxe and Schwenk, 2019; Reimers and Gurevych, 2020) to first encode texts and then score a pair of sentence embeddings u, v with the margin-based criterion:

$$\mathrm{score}(u, v) = \frac{\mathrm{sim}(u, v)}{\sum_{z \in NN_k(u)} \frac{\mathrm{sim}(u, z)}{2k} + \sum_{z \in NN_k(v)} \frac{\mathrm{sim}(v, z)}{2k}}$$

where NN_k(u) denotes u's k nearest neighbors in the other language. The training set is used to find a threshold on the score; pairs scoring above this threshold are predicted as parallel sentences. We use F1 to measure model performance on BUCC.
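The margin-based scoring above can be sketched in pure Python. This is a small illustration with names of our own choosing; real bitext mining would use approximate nearest-neighbor search over large pools.

```python
import math

def margin_score(u, v, src_pool, tgt_pool, k=2):
    """Ratio-margin score of Artetxe and Schwenk (2019): the cosine
    similarity of u and v, normalised by the average similarity of
    each embedding to its k nearest neighbours in the other language.

    src_pool, tgt_pool: candidate embeddings for the two languages.
    """
    def cos(a, b):
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return sum(x * y for x, y in zip(a, b)) / (na * nb)

    def knn_term(x, pool):
        # sum of the k highest cosine similarities, scaled by 2k
        sims = sorted((cos(x, z) for z in pool), reverse=True)[:k]
        return sum(sims) / (2 * k)

    return cos(u, v) / (knn_term(u, tgt_pool) + knn_term(v, src_pool))
```

A true translation pair scores well above an unrelated pair drawn from the same pools, which is what makes a single threshold on the score workable.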
Table 6 shows the F1 scores of our model based on XLM-R, InfoXLM, and LaBSE. We first examine the vanilla XLM-R, InfoXLM, and LaBSE as text encoders. LaBSE and InfoXLM outperform XLM-R by a large margin due to their large-scale pre-training for lingual adaptation on parallel datasets. When we add our semantic contrastive loss to XLM-R, we get a large improvement across all four languages; the resulting model (XLM-R + Semantic CL) outperforms XLM-R but underperforms InfoXLM and LaBSE. We attribute LaBSE's strong performance to its much larger pre-training than ours, and to the fact that LaBSE's training involves the Translation Language Model (Conneau and Lample, 2019) with translation corpora, which is exactly the same type of corpora as BUCC's translation pairs. When we add our semantic contrastive loss to InfoXLM, we obtain a performance gain for all languages; the gain is smaller than that on XLM-R because InfoXLM has already been trained on parallel corpora. When we add our semantic contrastive loss module to LaBSE, we obtain a small increase in the average performance, with a significant increase on Zh.
One important insight from comparing Table 6 and Table 3 is that a model's better performance on NLP tasks like BUCC does not necessarily mean better performance on information retrieval. Most existing multilingual NLP papers only examine the BUCC bitext retrieval task, and we highlight the inconsistency between model performances on the two types of retrieval tasks.

Related Work
Dense monolingual and multilingual information retrieval has recently attracted great attention, benefiting mainly from (1) supervised finetuning of large pre-trained language models and (2) self-supervised contrastive learning.
Dense Passage Retrieval (Karpukhin et al., 2020) is the framework first proposed for monolingual supervised finetuning on information retrieval. It uses a BERT-based dual-encoder structure to encode the query and the candidate passages into embeddings. Similar to monolingual IR, supervised finetuning can also be applied to multilingual pretrained language models (LMs) for multilingual IR. Commonly used multilingual pretrained LMs include multilingual BERT (mBERT, Devlin et al., 2019) and XLM-R (Conneau et al., 2020), both trained on large corpora covering about 100 languages, primarily with the masked language modeling task. These models do not use any explicit objective to improve alignment between sentences across languages. Recent efforts in the NLP field have provided easy access to parallel corpora, e.g. Schwenk et al. (2021), and many multilingual language models use such parallel data to improve lingual transfer ability. InfoXLM (Chi et al., 2021) uses parallel corpora to pre-train XLM-R by maximizing mutual information between multilingual multi-granularity texts. LaBSE (Feng et al., 2022) pre-trains BERT with the Masked Language Model and Translation Language Model objectives on monolingual data and bilingual translation pairs.
Self-supervised contrastive learning is another way to improve cross-lingual alignment. Contrastive learning maximizes the agreement between positive samples and minimizes the similarity between positive and negative ones (He et al., 2020; Chen et al., 2020a,b,c). For language representation learning, Clark et al. (2019) apply contrastive learning to train a discriminative model that learns language representations. For multilingual representation learning, contrastive learning has been used to improve cross-lingual transfer by using additional parallel data (Hu et al., 2021) or by leveraging other self-supervision signals (Wu et al., 2022).

Conclusion
In this paper, we present a model framework for multilingual information retrieval that improves lingual adaptation through contrastive learning. Our experiments demonstrate the effectiveness of our methods in learning better cross-lingual representations for information retrieval tasks. The two contrastive losses can be used as an add-on module to any backbone and many other tasks besides information retrieval.

Limitations
In this work, we did not conduct a detailed analysis of how language-specific characteristics contribute to our model's cross-lingual generalization capabilities. Future work may address this question through extensive matrix experiments: traversing training on each possible language-pair combination and evaluating on all languages.

A Appendix
Table 7 and Table 8 show the Recall@100 results of experiments in Section 5.3. Table 9 shows the Recall@100 results of experiments in Section 5.3.2. Table 10 shows the Recall@100 results of experiments in Section 5.4.

Figure 1: (a) The semantic contrastive loss encourages the embeddings of parallel pairs, i.e. sentences that have the same semantics but come from different languages, to be close to each other and away from the remaining negative samples (sentences with different semantics). (b) The language contrastive loss incorporates the non-parallel corpora in addition to the parallel ones. It encourages the distances from a sentence representation, which can be sampled from either the parallel or the non-parallel corpora, to the two embeddings of a paralleled pair to be the same.

Figure 3: The effect of the size of the parallel dataset for each language, with the 95% CI in shadow.


Table 1: A comparison of our model's and the baseline models' pre-training for lingual adaptation. Ours uses a "co-training" mode rather than pre-training, so our training steps are the same as for the main task.

Table 2: MRR@100 on the monolingual information retrieval task of the Mr.TyDi dataset.

Table 6: F1 scores on the BUCC task. The LaBSE (Feng et al., 2022) row shows the results reported in Feng et al. (2022); all other rows are the output of our own implementation.