Boosting Zero-shot Cross-lingual Retrieval by Training on Artificially Code-Switched Data

Transferring information retrieval (IR) models from a high-resource language (typically English) to other languages in a zero-shot fashion has become a widely adopted approach. In this work, we show that the effectiveness of zero-shot rankers diminishes when queries and documents are present in different languages. Motivated by this, we propose to train ranking models on artificially code-switched data instead, which we generate by utilizing bilingual lexicons. To this end, we experiment with lexicons induced from (1) cross-lingual word embeddings and (2) parallel Wikipedia page titles. We use the mMARCO dataset to extensively evaluate reranking models on 36 language pairs spanning Monolingual IR (MoIR), Cross-lingual IR (CLIR), and Multilingual IR (MLIR). Our results show that code-switching can yield consistent and substantial gains of 5.1 MRR@10 in CLIR and 3.9 MRR@10 in MLIR, while maintaining stable performance in MoIR. Encouragingly, the gains are especially pronounced for distant languages (up to 2x absolute gain). We further show that our approach is robust towards the ratio of code-switched tokens and also extends to unseen languages. Our results demonstrate that training on code-switched data is a cheap and effective way of generalizing zero-shot rankers for cross-lingual and multilingual retrieval.


Introduction
Cross-lingual Information Retrieval (CLIR) is the task of retrieving relevant documents written in a language different from the query language. The large number of languages and the limited amounts of training data pose a serious challenge for training ranking models. Previous work addresses this issue by using machine translation (MT), effectively casting CLIR into a noisy variant of monolingual retrieval (Li and Cheng, 2018; Shi et al., 2020, 2021; Moraes et al., 2021). MT systems are used either to train ranking models on translated training data (translate train) or to translate queries into the document language at retrieval time (translate test). However, CLIR approaches relying on MT systems are limited by their language coverage. Because training MT models is bounded by the availability of parallel data, it does not scale well to a large number of languages. Furthermore, using MT for IR has been shown to be prone to the propagation of unwanted translation artifacts such as topic shifts, repetition, hallucinations and lexical ambiguity (Artetxe et al., 2020; Litschko et al., 2022a; Li et al., 2022). In this work, we propose a resource-lean alternative to MT for bridging the language gap: training on artificially code-switched data.
We focus on zero-shot cross-encoder (CE) models for reranking (MacAvaney et al., 2020; Jiang et al., 2020). Our study is motivated by the observation that the performance of CEs diminishes when they are transferred into CLIR and MLIR as opposed to MoIR. We hypothesize that training on queries and documents from the same language leads to monolingual overfitting, where the ranker learns features, such as exact keyword matches, which are useful in MoIR but do not transfer well to CLIR and MLIR setups due to the lack of lexical overlap (Litschko et al., 2022b). In fact, as shown by Roy et al. (2020) on bi-encoders, representations from zero-shot models are weakly aligned between languages, and models prefer non-relevant documents in the same language over relevant documents in a different language. To address this problem, we propose to use code-switching as an inductive bias to regularize monolingual overfitting in CEs.
Generation of synthetic code-switched data has served as a way to augment data in cross-lingual setups in a number of NLP tasks (Singh et al., 2019; Einolghozati et al., 2021; Tan and Joty, 2021). These approaches utilize substitution techniques ranging from simplistic re-writing in the target script (Gautam et al., 2021) and looking up bilingual lexicons (Tan and Joty, 2021) to MT (Tarunesh et al., 2021). Previous work on improving zero-shot transfer for IR includes weak supervision (Shi et al., 2021), tuning the pivot language (Turc et al., 2021), multilingual query expansion (Blloshmi et al., 2021) and cross-lingual pre-training (Yang et al., 2020; Yu et al., 2021; Yang et al., 2022; Lee et al., 2023). To this end, code-switching is complementary to existing approaches. Our work is most similar to Shi et al. (2020), who use bilingual lexicons for full term-by-term translation to improve MoIR. Concurrent to our work, Huang et al. (2023) show that code-switching improves retrieval performance on low-resource languages; however, their focus lies on CLIR with English documents. To the best of our knowledge, we are the first to systematically investigate (1) artificial code-switching to train CEs and (2) the interaction between MoIR, CLIR and MLIR.
Our contributions are as follows: (i) We show that training on artificially code-switched data improves zero-shot cross-lingual and multilingual rankers. (ii) We demonstrate its robustness towards the ratio of code-switched tokens and its effectiveness in generalizing to unseen languages. (iii) We release our code and resources.

Methodology
Reranking with Cross-Encoders. We follow the standard cross-encoder (CE) reranking approach proposed by Nogueira and Cho (2019), which formulates relevance prediction as a sequence-pair (query-document pair) classification task. CEs are composed of an encoder model and a relevance prediction model. The encoder, a pre-trained language model (Devlin et al., 2019), maps each query-document pair into a joint feature representation, from which the classification head predicts relevance. Finally, documents are reranked according to their predicted relevance. We argue that fine-tuning CEs on monolingual data biases the encoder towards encoding features that are only useful when the target setup is MoIR. To mitigate this bias, we propose to perturb the training data with code-switching, as described next.
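For illustration, the following is a minimal sketch of cross-encoder reranking using the sentence-transformers CrossEncoder class; the checkpoint name and the rerank helper are assumptions for illustration, not the exact training and inference setup used in our experiments.

# Minimal cross-encoder reranking sketch (illustrative; checkpoint name is an assumption).
from sentence_transformers import CrossEncoder

def rerank(query, passages, model_name="cross-encoder/mmarco-mMiniLMv2-L12-H384-v1"):
    """Jointly encode each (query, passage) pair and sort passages by predicted relevance."""
    model = CrossEncoder(model_name, max_length=512)
    scores = model.predict([(query, passage) for passage in passages])
    ranked = sorted(zip(passages, scores), key=lambda pair: pair[1], reverse=True)
    return [passage for passage, _ in ranked]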
Artificial Code-Switching. While previous work has studied code-switching (CS) as a natural phenomenon where speakers borrow words from other languages (e.g., anglicisms) (Ganguly et al., 2016; Wang and Komlodi, 2018), we here refer to code-switching as a method to artificially modify monolingual training data. In the following we assume the availability of English (EN-EN) training data. The goal is to improve the zero-shot transfer of ranking models to cross-lingual language pairs X-Y by instead training on code-switched data EN_X-EN_Y, which we obtain by exploiting bilingual lexicons, similar to Tan and Joty (2021). We now describe two CS approaches based on lexicons: one derived from word embeddings and one from Wikipedia page titles (cf. Appendix A for examples).
Code-Switching with Word Embeddings. We rely on bilingual dictionaries D induced from cross-lingual word embeddings (Mikolov et al., 2013; Heyman et al., 2017) and compute for each EN term its nearest (cosine) cross-lingual neighbor. In order to generate EN_X-EN_Y we then use D_EN→X and D_EN→Y to code-switch query and document terms from EN into the languages X and Y, each with probability p. This approach, dubbed Bilingual CS (BL-CS), allows a ranker to learn interlingual semantics between EN, X and Y. In our second approach, Multilingual CS (ML-CS), we additionally sample for each term a different target language into which it gets translated; we refer to the pool of available languages as seen languages.
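As a rough illustration, the sketch below induces a lexicon via nearest cosine neighbors between pre-aligned embedding spaces (e.g., MUSE) and then code-switches tokens with probability p; the helper names (build_lexicon, code_switch) and the data layout are assumptions, not our released implementation.

# Illustrative sketch of lexicon induction and BL-CS / ML-CS (assumed helper names).
import random
import numpy as np

def build_lexicon(src_words, src_vecs, tgt_words, tgt_vecs):
    """Map each EN word to its nearest (cosine) neighbor in an aligned target space."""
    src = src_vecs / np.linalg.norm(src_vecs, axis=1, keepdims=True)
    tgt = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
    nearest = (src @ tgt.T).argmax(axis=1)  # index of the most similar target word
    return {w: tgt_words[i] for w, i in zip(src_words, nearest)}

def code_switch(text, lexicons, p=0.5):
    """Translate each token with probability p. With a single-language lexicon dict this
    corresponds to BL-CS; with several languages, a target language is sampled per token (ML-CS)."""
    tokens = []
    for token in text.split():
        lexicon = lexicons[random.choice(list(lexicons))]
        translation = lexicon.get(token.lower())
        tokens.append(translation if translation and random.random() < p else token)
    return " ".join(tokens)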
Code-Switching with Wikipedia Titles. Our third approach, Wiki-CS, follows Lan et al. (2020) and Fetahu et al. (2021) and uses bilingual lexicons derived from parallel Wikipedia page titles obtained from inter-language links. We first extract word n-grams from queries and documents with sliding windows of sizes n ∈ {1, 2, 3}. Longer n-grams are favored over shorter ones in order to account for multi-term expressions, which are commonly observed in named entities. In Wiki-CS we create a single multilingual dataset where queries and documents from different training instances are code-switched into different languages.
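A greedy, longest-match-first sketch of this n-gram replacement is shown below; the title_lexicon structure and the wiki_code_switch helper are illustrative assumptions, and the exact matching and tokenization details may differ.

# Illustrative sketch of Wiki-CS n-gram replacement (assumed helper names).
import random

def wiki_code_switch(text, title_lexicon, p=0.5, max_n=3):
    """Replace n-grams found in a Wikipedia-title lexicon, preferring longer n-grams."""
    tokens = text.split()
    out, i = [], 0
    while i < len(tokens):
        replaced = False
        for n in range(max_n, 0, -1):  # try trigrams first, then bigrams, then unigrams
            ngram = " ".join(tokens[i:i + n]).lower()
            target = title_lexicon.get(ngram)
            if target and random.random() < p:
                out.append(target)
                i += n
                replaced = True
                break
        if not replaced:
            out.append(tokens[i])
            i += 1
    return " ".join(out)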

Experimental Setup
Models and Dictionaries. We follow Bonifacio et al. (2021) and initialize rankers with the multilingual encoder mMiniLM provided by Reimers and Gurevych (2020). We report hyperparameters in Appendix C. For BL-CS and ML-CS we use multilingual MUSE embeddings to induce bilingual lexicons (Lample et al., 2018), which have been aligned with initial seed dictionaries of 5k word translation pairs. We set the translation probability p = 0.5. For Wiki-CS, we use the lexicons provided by the linguatools project (https://linguatools.org/wikipedia-parallel-titles).
Baselines. To compare whether training on code-switched data EN_X-EN_Y improves the transfer into CLIR setups, we include the zero-shot ranker trained on EN-EN as our main baseline (henceforth, Zero-shot). Our upper-bound reference, dubbed Fine-tuning, refers to ranking models that are directly trained on the target language pair X-Y, i.e., without zero-shot transfer. Following Roy et al. (2020), we adopt the Translate Test baseline and translate any test data into EN using our bilingual lexicons induced from word embeddings. On this data we evaluate both the Zero-shot baseline (Zero-shot Translate Test) and our ML-CS model (ML-CS Translate Test).
Datasets and Evaluation. We use the publicly available multilingual mMARCO dataset (Bonifacio et al., 2021), which includes fourteen different languages. We group these into six seen languages (EN, DE, RU, AR, NL, IT) and eight unseen languages (ZH, HI, ID, JA, PT, ES, VI, FR) and construct a total of 36 language pairs. Out of those, we construct setups where we have documents in different languages (EN-X), queries in different languages (X-EN), and both in different languages (X-X). Specifically, for each document ID (query ID) we sample the content from one of the available languages. For evaluation, we use the official evaluation metric MRR@10. All models re-rank the top 1,000 passages provided for the passage re-ranking task. We report all results as averages over three random seeds.
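For concreteness, a minimal sketch of how MRR@10 can be computed over reranked runs is given below; the run/qrels dictionary layout and function names are assumptions, not the official evaluation script.

# Hedged sketch of MRR@10 evaluation (data layout is an assumption).
def mrr_at_10(ranked_passage_ids, relevant_ids):
    """Reciprocal rank of the first relevant passage within the top 10, else 0."""
    for rank, pid in enumerate(ranked_passage_ids[:10], start=1):
        if pid in relevant_ids:
            return 1.0 / rank
    return 0.0

def evaluate(run, qrels):
    """run: {query_id: reranked passage ids}; qrels: {query_id: set of relevant ids}."""
    scores = [mrr_at_10(run[qid], qrels.get(qid, set())) for qid in run]
    return sum(scores) / len(scores)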

Results and Discussion
We observe that code-switching improves cross-lingual and multilingual re-ranking, while not impeding monolingual setups, as shown next.
Transfer into MoIR vs. CLIR. We first quantify the performance drop when transferring models trained on EN-EN to MoIR as opposed to CLIR and MLIR. Comparing Zero-shot results between different settings, we find that the average MoIR performance of 25.5 MRR@10 (Table 1) is substantially higher than CLIR with 15.7 MRR@10 (Table 2) and MLIR with 16.6 MRR@10 (Table 3).
The transfer performance varies greatly with language proximity: in CLIR the drop is larger for setups involving typologically distant languages (AR-IT, AR-RU); to a lesser extent the same observation holds for MoIR (AR-AR, RU-RU). This is consistent with previous findings in other syntactic and semantic NLP tasks (He et al., 2019; Lauscher et al., 2020). The performance gap to Fine-tuning on translated data is much smaller in MoIR (+4 MRR@10) than in CLIR (+11.1 MRR@10) and MLIR (+8.3 MRR@10). Our aim is to close this gap between zero-shot transfer and full fine-tuning in a resource-lean way by training on code-switched queries and documents.
Code-Switching Results. Training on code-switched data consistently outperforms zero-shot models in CLIR and MLIR (Table 2 and Table 3). In AR-IT and AR-RU we see improvements from 7.7 and 7.1 MRR@10 up to 15.6 and 14.1 MRR@10, rendering our approach particularly effective for distant languages. Encouragingly, Table 1 shows that the differences between both of our CS approaches (BL-CS and ML-CS) and Zero-shot are not statistically significant, showing that gains can be obtained without impairing MoIR performance. Table 2 shows that specializing one zero-shot model for multiple CLIR language pairs (ML-CS, Wiki-CS) performs almost on par with specializing one model for each language pair (BL-CS).
The results of Wiki-CS are slightly worse in MoIR and on par with ML-CS on MLIR and CLIR.
Translate Test vs. Code-Switch Train. In MoIR (Table 1) both Zero-shot Translate Test and ML-CS Translate Test underperform compared to other approaches. This shows that zero-shot rankers work better on clean monolingual data in the target language than on noisy monolingual data in English. In CLIR, where Translate Test bridges the language gap between X and Y, we observe slight improvements of +0.2 and +2.2 MRR@10 (Table 2). However, in both MoIR and CLIR, Translate Test consistently falls behind code-switching at training time.
Multilingual Retrieval and Unseen Languages. Here we compare how code-switching fares against Zero-shot on languages to which neither model has been exposed at training time. Table 3 shows that the gains remain virtually unchanged when moving from six seen languages (+4.1 MRR@10 / +3.8 MRR@10) to fourteen languages including eight unseen languages (+3.9 MRR@10 / +4.0 MRR@10). Results in Appendix B confirm that this holds for unseen languages on the query side, the document side, and both sides, suggesting that the best pivot language for zero-shot transfer (Turc et al., 2021) may not be monolingual but a code-switched language. On seen languages ML-CS is close to MT (Fine-tuning).
Ablation: Translation Probability. The translation probability p allows us to control the ratio of code-switched tokens to original tokens: with p = 0.0 we default back to the Zero-shot baseline, with p = 1.0 we attempt to code-switch every token. Due to out-of-vocabulary tokens, the percentage of actually translated tokens is slightly lower: 23% for p = 0.25, 45% for p = 0.5, 68% for p = 0.75 and 92% for p = 1.0; in Wiki-CS, 90% of queries and documents contain at least one translated n-gram, leading to 20% of translated tokens overall. Figure 1 (top) shows that code-switching a smaller portion of tokens is already beneficial for the zero-shot transfer into CLIR. The gains are robust towards different values of p. The best results are achieved with p = 0.5 and p = 0.75 for BL-CS and ML-CS, respectively. Figure 1 (bottom) shows that the absolute differences to Zero-shot are much smaller in MoIR.

Table 4: MLIR results on seen languages (MRR@10), broken down into queries that share no common tokens (no overlap), between one and three tokens (some overlap), and more than three tokens (significant overlap) with their relevant documents. Gains of ML-CS are shown in brackets. EN-X has 3,116 queries with no overlap, 3,095 with some overlap and 769 with significant overlap. X-EN has 3,147 queries with no overlap, 2,972 with some overlap and 861 with significant overlap. X-X has 3,671 queries with no overlap, 2,502 with some overlap and 807 with significant overlap.
Monolingual Overfitting. Exact matches between query and document keywords are a strong relevance signal in MoIR, but they do not transfer well to CLIR and MLIR due to mismatching vocabularies. Training zero-shot rankers on monolingual data therefore biases rankers towards learning features that cannot be exploited at test time. Code-switching reduces this bias by replacing exact matches with translation pairs, steering model training towards learning interlingual semantics instead: on a sample of 1M positive training instances we found a total of 4,409,974 overlapping tokens before and 3,039,750 overlapping tokens after code-switching (ML-CS, p = 0.5), a reduction of roughly 31%. To investigate this, we group queries by their average token overlap with their relevant documents and evaluate each group separately on MLIR. The results are shown in Table 4. Unsurprisingly, rankers work best when there is significant overlap between query and document tokens. However, the performance gains resulting from training on code-switched data (ML-CS) are most pronounced for queries with some token overlap (up to +5.4 MRR@10) and no token overlap (up to +6.8 MRR@10). On the other hand, the gains are much lower for queries with more than three overlapping tokens and range from -0.5 to +1.4 MRR@10. This supports our hypothesis that code-switching indeed regularizes monolingual overfitting.
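A simple sketch of how queries might be bucketed by token overlap for this analysis is given below; the exact tokenization and the handling of multiple relevant documents are assumptions, and the function name is hypothetical.

# Illustrative bucketing of queries by token overlap with a relevant document.
def overlap_bucket(query, relevant_document):
    """Assign a query to an overlap bucket, mirroring the grouping used for Table 4."""
    shared = set(query.lower().split()) & set(relevant_document.lower().split())
    if not shared:
        return "no overlap"
    return "some overlap" if len(shared) <= 3 else "significant overlap"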

Conclusion
We propose a simple and effective method to improve zero-shot rankers: training on artificially code-switched data. We empirically test our approach on 36 language pairs, spanning monolingual, cross-lingual, and multilingual setups. Our method outperforms zero-shot models trained only monolingually and provides a resource-lean alternative to MT for CLIR. In MLIR our approach can match MT performance while relying only on bilingual dictionaries. To the best of our knowledge, this work is the first to propose artificial code-switched training data for cross-lingual and multilingual IR.

Limitations
This paper does not utilize any major linguistic theories of code-switching, such as those of Belazi et al. (1994), Myers-Scotton (1997), and Poplack (2013). Our approach to generating code-switched texts replaces words with their synonyms in target languages, looked up in a bilingual lexicon. Furthermore, we do not make any special efforts to resolve word sense or part-of-speech ambiguity. To this end, the resulting sentences may appear implausible and incoherent.

Figure 1: Retrieval performance in terms of mean average precision (MAP) for different translation probabilities, averaged across all language pairs.

Table 1: MoIR: Monolingual results on mMARCO languages (columns: EN-EN, DE-DE, RU-RU, AR-AR, NL-NL, IT-IT, AVG, ∆ZS), averaged over all languages (excluding EN-EN), in terms of MRR@10. Bold: best zero-shot performance for each language. ∆ZS: absolute difference to Zero-shot. Results significantly different from Zero-shot are marked with * (paired t-test, Bonferroni correction, p < 0.05).
Table 3: MLIR: Multilingual results on mMARCO in terms of MRR@10 (columns: X-EN, EN-X, X-X, AVG_seen, ∆_seen; X-EN, EN-X, X-X, AVG_all, ∆_all). Left: six seen languages for which we used bilingual lexicons to code-switch training data. Right: all fourteen languages included in mMARCO.

Table 10: Size of bilingual lexicons. Two lexicons are used to substitute English words with their respective cross-lingual synonyms: (i) multilingual word embeddings provided by MUSE (Lample et al., 2018), and (ii) Wikipedia page titles obtained from inter-language links, provided by the linguatools project. The Wikipedia-based lexicons are several times larger than the MUSE vocabulary.