Empowering Dual-Encoder with Query Generator for Cross-Lingual Dense Retrieval

In monolingual dense retrieval, many works focus on distilling knowledge from a cross-encoder re-ranker into a dual-encoder retriever, and these methods achieve better performance thanks to the effectiveness of the cross-encoder re-ranker. However, we find that the performance of the cross-encoder re-ranker is heavily influenced by the number of training samples and the quality of negative samples, both of which are hard to obtain in the cross-lingual setting. In this paper, we propose to use a query generator as the teacher in the cross-lingual setting, which is less dependent on abundant training samples and high-quality negative samples. In addition to traditional knowledge distillation, we further propose a novel enhancement method that uses the query generator to help the dual-encoder align queries from different languages, without requiring any additional parallel sentences. Experimental results show that our method outperforms state-of-the-art methods on two benchmark datasets.


Introduction
Information Retrieval (IR) aims to retrieve pieces of evidence for a given query. Traditional methods mainly use sparse retrieval systems such as BM25 (Robertson and Zaragoza, 2009), which depend on keyword matching between queries and passages. With the development of large-scale pre-trained language models (PLMs) (Vaswani et al., 2017; Devlin et al., 2019) such as BERT, dense retrieval methods (Lee et al., 2019; Karpukhin et al., 2020) show quite effective performance. These methods usually employ a dual-encoder architecture to encode both queries and passages into dense embeddings and then perform approximate nearest neighbor search (Johnson et al., 2021).
Recently, leveraging a cross-encoder re-ranker as the teacher model to distill knowledge into a dual-encoder has proven quite effective in boosting the dual-encoder's performance. Specifically, these methods first train a warm-up dual-encoder and a warm-up cross-encoder. Then, they perform knowledge distillation from the cross-encoder to the dual-encoder by KL-Divergence or specially designed methods. For example, RocketQAv2 (Qu et al., 2021) proposed dynamic distillation, and AR2 (Zhang et al., 2021) proposed adversarial training.

Figure 1: The performance of the cross-encoder and the query generator when varying the number of training samples and retrievers. We use BM25 and DPR as retrievers, respectively. For the cross-encoder (BERT-Large), we use retrieved top-100 passages that do not contain the answer as negatives and a contrastive loss for training. For the query generator (T5-Base), we first train it on the query generation task and then fine-tune the model with the same setting as BERT-Large. The reported performance is the top-5 score of the re-ranked top-500 passages on the NQ test set.
However, there are two major problems when scaling this method to the cross-lingual dense retrieval setting. Firstly, the cross-encoder typically requires large amounts of training data and high-quality negative samples due to the gap between pre-training (a token-level task) and fine-tuning (a sentence-level task), which are usually not available in the cross-lingual setting (Asai et al., 2021a). Due to expensive labeling and a lack of annotators for many of the world's languages, especially low-resource languages, cross-lingual training data are quite limited. With such limited training data, the dual-encoder is not good enough to provide high-quality negative samples to facilitate the cross-encoder. Secondly, the gaps between different languages have a detrimental effect on the performance of cross-lingual models. Although some cross-lingual pre-training methods such as InfoXLM (Chi et al., 2021) and LaBSE (Feng et al., 2022) have put a lot of effort into this aspect by leveraging parallel corpora for better alignment between languages, such parallel data are usually expensive to obtain, and the language alignment can be damaged in the fine-tuning stage without any constraint.
To solve these problems, we propose to employ a query generator in the cross-lingual setting, which uses the likelihood of a query given a passage to measure relevance. On the one hand, the query generator can exploit pre-training knowledge even with small fine-tuning data, because its pre-training and fine-tuning share a consistent generative objective. On the other hand, the query generation task is defined over all tokens of the query rather than just the [CLS] token as in the cross-encoder, which has been demonstrated to be a more efficient training paradigm (Clark et al., 2020). As shown in Figure 1, as the number of training samples drops, the performance of BERT-Large drops more sharply than that of T5-Base. Besides, the query generator is less sensitive to high-quality negative samples: when using BM25 as the retriever to mine negative samples for re-ranker training, the gap between the cross-encoder and the query generator is smaller than when using DPR as the retriever. Finally, the query generator can provide more training data through generation, which is precious in the cross-lingual setting. To sum up, the query generator is more effective than the cross-encoder in the cross-lingual setting.
Based on these findings, we propose a novel method, namely QuiCK, which stands for Query generator improved dual-encoder by Cross-lingual Knowledge distillation. Firstly, at the passage level, we employ a query generator as the teacher to distill the relevance score between a query and a passage into the dual-encoder. Secondly, at the language level, we use the query generator to generate synonymous queries in other languages for each training sample and align their retrieved results via KL-Divergence. Considering the noise in the generated queries, we further propose a scheduled sampling method to achieve better performance.
The contributions of this paper are as follows:
• We propose a cross-lingual query generator as a teacher model to empower the cross-lingual dense retrieval model, and a novel iterative training approach for the joint optimization of these two models.
• On top of the cross-lingual query generator, a novel cost-effective alignment method is further designed to boost the dense retrieval performance in low-resource languages, which does not require any additional expensive parallel corpus.
• Extensive experiments on two public cross-lingual retrieval datasets demonstrate the effectiveness of the proposed method.
Related Work

Retrieval. Retrieval aims to search for relevant passages in a large corpus for a given query. Traditionally, researchers use bag-of-words (BOW) methods such as TF-IDF and BM25 (Robertson and Zaragoza, 2009). These methods use a sparse vector to represent the text, so we call them sparse retrievers. Recently, some studies use neural networks to improve the sparse retriever, such as DocTQuery (Nogueira et al., 2019) and DeepCT (Dai and Callan, 2020).
Recently, with the development of cross-lingual pre-trained models (Conneau et al., 2020), researchers have paid more attention to cross-lingual dense retrieval.

Figure 2: Overview of different model architectures designed for retrieval or re-ranking.
Re-ranking. Re-ranking aims to reorder the retrieved passages according to their relevance scores. Due to the small number of retrieved passages, re-ranking usually employs high-latency methods to obtain better performance, e.g., a cross-encoder. Traditionally, the re-ranking task was heavily driven by manual feature engineering (Guo et al., 2016; Hui et al., 2018).
With the development of pre-trained language models (e.g., BERT), researchers use them to perform re-ranking (Nogueira and Cho, 2019; Li et al., 2020). In addition to cross-encoders, researchers have also tried to apply generators to re-ranking. For example, monoT5 (Nogueira et al., 2020) proposes a prompt-based method to re-rank passages with T5 (Raffel et al., 2020), and other studies (dos Santos et al., 2020; Zhuang et al., 2021; Lesota et al., 2021) propose to use the log-likelihood of the query given the passage as the relevance score for re-ranking.
Recently, as pre-trained models have scaled up, generative models show competitive zero-shot and few-shot ability, and researchers have started to apply large generative models to zero-shot and few-shot re-ranking. For example, SGPT (Muennighoff, 2022) and UPR (Sachan et al., 2022) propose to use generative models for zero-shot re-ranking, and P3 Ranker (Hu et al., 2022) demonstrates that generative models achieve better performance in the few-shot setting. Note that all of these works are concurrent with ours. Instead of using a query generator only as a re-ranker, we propose to leverage the query generator as a teacher model to enhance the performance of the cross-lingual dual-encoder. In addition to traditional knowledge distillation, we further propose a novel cost-effective alignment method to boost dense retrieval performance in low-resource languages.

Preliminaries
In this section, we give a brief review of dense retrieval and re-ranking. Overviews of all methods are presented in Figure 2.
Dual-Encoder. Given a query q and a large corpus C, the retrieval task aims to find the relevant passages for the query in the corpus. Usually, a dense retrieval model employs two dense encoders (e.g., BERT), E_Q(·) and E_P(·), which encode queries and passages into dense embeddings, respectively. Then, the model uses a similarity function, often the dot product, to perform retrieval:

f_de(q, p) = E_Q(q)^T E_P(p),   (1)

where q and p denote the query and the passage, respectively. During the inference stage, we apply the passage encoder E_P(·) to all passages and index them with FAISS (Johnson et al., 2021), an extremely efficient open-source library for similarity search. Given a query q, we then derive its embedding v_q = E_Q(q) and retrieve the top-k passages whose embeddings are closest to v_q.
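As a minimal sketch of dot-product retrieval with toy vectors (in the actual system, E_Q and E_P are learned encoders and FAISS handles the nearest-neighbor search over millions of passages):

```python
import numpy as np

def dot_product_scores(query_emb, passage_embs):
    """Relevance of one query embedding against a matrix of passage embeddings."""
    return passage_embs @ query_emb

def retrieve_top_k(query_emb, passage_embs, k=2):
    """Return indices of the k passages with the highest dot-product score."""
    scores = dot_product_scores(query_emb, passage_embs)
    return np.argsort(-scores)[:k].tolist()

# Toy embeddings standing in for E_Q(q) and E_P(p).
q = np.array([1.0, 0.0])
P = np.array([[0.9, 0.1],   # passage 0: close to the query
              [0.0, 1.0],   # passage 1: orthogonal to the query
              [0.5, 0.5]])  # passage 2: in between
print(retrieve_top_k(q, P, k=2))  # -> [0, 2]
```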
Cross-Encoder Re-ranker. Given a query q and the top-k retrieved passages C, the re-ranking task aims to reorder the passages by relevance score. Due to the limited size of the candidate set, re-ranking usually employs a cross-encoder, which lets words across the query and passage interact with each other at the same time. These methods introduce a special token [SEP] to separate q and p, and the hidden state of the [CLS] token from the cross-encoder is fed into a fully-connected layer to output the relevance score:

f_ce(q, p) = FC(E_C(q || p)_[CLS]),   (2)

where "||" denotes concatenation with the [SEP] token. During the inference stage, we apply the cross-encoder to every <q, p> pair and reorder the passages by their scores.
Query Generator Re-ranker. Similar to the cross-encoder re-ranker, the query generator re-ranker also aims to reorder the passages by relevance score.
For the query generator re-ranker, we use the log-likelihood of the query given the passage to measure relevance:

f_qg(q, p) = Σ_t log P(q_t | p, q_<t),   (3)

where q_<t denotes the tokens before q_t. The remaining settings are the same as for the cross-encoder re-ranker and are omitted here.
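The scoring in Eq. (3) can be sketched as follows; the per-token log-probabilities are hypothetical values standing in for what teacher-forcing the query through the generator's decoder would produce:

```python
import math

def qg_relevance(token_log_probs):
    """f_qg(q, p): sum of log P(q_t | p, q_<t) over the query tokens.
    In practice these log-probs come from decoding the query with the generator
    conditioned on the passage (teacher forcing)."""
    return sum(token_log_probs)

# Hypothetical per-token log-probabilities of the same query under two passages:
# a relevant passage makes the query tokens likely, an irrelevant one does not.
relevant_passage = [math.log(0.5), math.log(0.4), math.log(0.6)]
irrelevant_passage = [math.log(0.1), math.log(0.05), math.log(0.2)]
assert qg_relevance(relevant_passage) > qg_relevance(irrelevant_passage)
```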
Training. The goal of retrieval and re-ranking is to enlarge the relevance score between the query and relevant passages (a.k.a. positive passages) and lessen the relevance score between the query and irrelevant passages (a.k.a. negative passages). A training sample consists of a query, a positive passage p^+, and n negative passages {p_i^-}. We can then employ the contrastive loss function InfoNCE (van den Oord et al., 2018) to optimize the model:

L = -log [ exp(f(q, p^+)) / ( exp(f(q, p^+)) + Σ_{i=1}^{n} exp(f(q, p_i^-)) ) ],   (4)

where f denotes the similarity function, e.g., f_de in Eq. (1), f_ce in Eq. (2), or f_qg in Eq. (3).
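The InfoNCE objective can be written as a short function over raw relevance scores; the numbers below are toy scores, not from any trained model:

```python
import math

def info_nce_loss(pos_score, neg_scores):
    """InfoNCE: -log( exp(f(q,p+)) / (exp(f(q,p+)) + sum_i exp(f(q,p_i^-))) )."""
    denom = math.exp(pos_score) + sum(math.exp(s) for s in neg_scores)
    return -math.log(math.exp(pos_score) / denom)

# The loss shrinks as the positive is scored further above the negatives.
loose = info_nce_loss(1.0, [0.9, 0.8])
tight = info_nce_loss(5.0, [0.9, 0.8])
assert tight < loose
```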
Cross-lingual Retrieval. In the cross-lingual information retrieval task, passages and queries are in different languages. In this paper, we consider passages in English and queries in non-English languages. A sample consists of three components: a query in a non-English language, a positive passage in English, and a span answer in English. Given a non-English query, the task aims to retrieve relevant English passages that answer the query. If a retrieved passage contains the given span answer, it is regarded as a positive passage; otherwise, it is a negative passage.

Methodology
In this section, we present the proposed QuiCK.
The overview of the proposed method is presented in Figure 3. We start with the training of the query generator, then present how to perform distillation and alignment training for the dual-encoder, and finally discuss the entire training process.

Training of Query Generator
In our method, we employ mT5 (Xue et al., 2021) as the query generator. The query generator has two roles: teacher and generator. As a teacher, it aims to better re-rank the candidate passages by relevance and distill this knowledge into the dual-encoder.
As a generator, it aims to generate synonymous queries in different languages.
Input Format. As we employ mT5, we design a prompt template for the input. Considering that most passages are long, we introduce the span answer as input to encourage the generator to focus on the same segment and generate parallel queries in different languages. As a result, we use "generate [language] query: answer: [span answer] content: [content]" as the template.
For a specific sample, we fill the three placeholders with the language of the target query, the span answer, and the passage content, respectively.
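Filling the template is a one-line string operation; the helper name and example values below are illustrative, not from the released implementation:

```python
def build_qg_input(language, span_answer, content):
    """Fill the paper's template:
    'generate [language] query: answer: [span answer] content: [content]'."""
    return f"generate {language} query: answer: {span_answer} content: {content}"

# Hypothetical sample: target language, span answer, and passage content.
prompt = build_qg_input(
    "Finnish",
    "Peter Higgs",
    "The Higgs boson is named after physicist Peter Higgs ...",
)
assert prompt.startswith("generate Finnish query: answer: Peter Higgs content:")
```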
Training. Considering the two roles of the query generator, the entire training process contains two stages: query generation training and re-ranking training. Firstly, we train the generator on the generation task, which takes the positive passage as input and aims to generate the query. The task can be formulated as maximizing the conditional probability:

q = arg max_q P(q | p, a) = arg max_q Π_t P(q_t | p, a, q_<t),   (5)

where q_t is the t-th token of the generated query, a denotes the span answer, and q_<t represents the previously decoded tokens. We can then employ the cross-entropy loss to optimize the model:

L_gen = -Σ_{t=1}^{T} log P(q_t | p, a, q_<t),   (6)

where T denotes the number of query tokens. Secondly, we train the generator on the re-ranking task, which takes a query and a passage as input and outputs their relevance score. The detailed training process was introduced in Section 3 and is omitted here.

Distillation for Dual-Encoder
We now present how to distill knowledge from the query generator into the dual-encoder. Similar to previous methods (Ren et al., 2021b; Zhang et al., 2021), we employ KL-Divergence for distillation. Formally, given a query q and a candidate passage set C_q = {p_i}_{1≤i≤n} retrieved by the dual-encoder, we compute relevance scores with the query generator and the dual-encoder, respectively. We then normalize the scores by softmax and compute the KL-Divergence as the loss:

L_D = Σ_{p∈C_q} s_qg(q, p) log( s_qg(q, p) / s_de(q, p) ),
s_qg(q, p) = exp(f_qg(q, p)) / Σ_{p'∈C_q} exp(f_qg(q, p')),
s_de(q, p) = exp(f_de(q, p)) / Σ_{p'∈C_q} exp(f_de(q, p')),   (7)

where f_qg and f_de denote the relevance scores given by the query generator and the dual-encoder, as presented in Section 3.
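The distillation loss of Eq. (7) reduces to a softmax over each model's candidate scores followed by a KL term; the scores below are toy values:

```python
import math

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def kl_distillation_loss(teacher_scores, student_scores):
    """KL(s_qg || s_de) over the same candidate set C_q."""
    t = softmax(teacher_scores)
    s = softmax(student_scores)
    return sum(ti * math.log(ti / si) for ti, si in zip(t, s))

# Toy scores over three candidate passages.
teacher = [2.0, 0.5, -1.0]    # query-generator scores f_qg
aligned = [2.1, 0.4, -0.9]    # student close to the teacher -> small loss
mismatch = [-1.0, 2.0, 0.5]   # student ranks the passages differently -> large loss
assert kl_distillation_loss(teacher, aligned) < kl_distillation_loss(teacher, mismatch)
```

Minimizing this loss pushes the dual-encoder's score distribution over the retrieved candidates toward the query generator's.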

Alignment for Dual-Encoder
Alignment is a common topic in the cross-lingual setting, as it helps the model better handle sentences in different languages. Previous works (Zheng et al., 2021; Yang et al., 2022) usually use parallel or translated data to perform alignment training across languages.
Here, we propose a novel method to align queries in different languages for cross-lingual retrieval that does not need any parallel data. The core idea of our method is to leverage the query generator to generate synonymous queries in other languages to form parallel cases.
Generation. For each sample in the training set, we generate a query in each target language (i.e., if there are seven target languages, we generate seven queries per sample). We then use the confidence of the generator to filter the generated queries; specifically, we set filter thresholds to accept 50% of the generated queries.
Scheduled Sampling. In this work, we select a generated query to form a pair-wise case with the source query. Considering the varying semantic quality of the generated queries, we carefully design a scheduled sampling method to replace random sampling. For a generated query q′, we first use the dual-encoder to retrieve passages for the source query q and the generated query q′, obtaining C_q and C′_q, respectively. We then calculate a coefficient for the generated query q′ as

c′ = |C_q ∩ C′_q| / |C_q ∪ C′_q|  if |C_q ∩ C′_q| / |C_q ∪ C′_q| ≥ T, otherwise 0,   (8)

where the threshold T is a hyper-parameter and |·| denotes the size of a set. The basic idea is that the larger the overlap of the retrieved passages relative to their union, the more likely the two queries are to be synonymous. When sampling a generated query, we first calculate coefficients {c′_1, ..., c′_m} for all generated queries {q′_1, ..., q′_m}, then normalize them into the final sampling probabilities:

p_i = c′_i / Σ_{j=1}^{m} c′_j,   (9)

where m denotes the number of generated queries.
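A sketch of the coefficient and the sampling probabilities; the exact form of Eq. (8) here (retrieval-set overlap relative to the union, thresholded by T) is our reading of the description and should be treated as an assumption:

```python
def overlap_coefficient(retrieved_src, retrieved_gen, threshold=0.3):
    """c': overlap of the two retrieved sets relative to their union,
    zeroed out when it falls below the threshold T."""
    src, gen = set(retrieved_src), set(retrieved_gen)
    overlap = len(src & gen) / len(src | gen)
    return overlap if overlap >= threshold else 0.0

def sampling_probs(coeffs):
    """Normalize coefficients into sampling probabilities (Eq. (9))."""
    z = sum(coeffs)
    return [c / z for c in coeffs] if z > 0 else [1.0 / len(coeffs)] * len(coeffs)

# Hypothetical retrieved passage ids for one source query and two generated queries.
c1 = overlap_coefficient(["p1", "p2", "p3", "p4"], ["p1", "p2", "p9", "p4"])  # 3/5 = 0.6
c2 = overlap_coefficient(["p1", "p2", "p3", "p4"], ["p7", "p8", "p9", "p2"])  # 1/7 -> 0.0
assert sampling_probs([c1, c2]) == [1.0, 0.0]
```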
During the training stage, for each training case, we sample a generated query according to these probabilities to form a pair with the source query q.
Alignment Training. After sampling a generated query, we present how to align the source query and the generated query. Different from previous works (Zheng et al., 2021), we employ an asymmetric KL-Divergence rather than a symmetric one, due to the different quality of the source query and the generated query:

L_A = c′ Σ_{p ∈ C_q ∪ C′_q} s_de(q, p) log( s_de(q, p) / s_de(q′, p) ),   (10)

where q denotes the query, C_q denotes the set of retrieved passages, the superscript "′" denotes the generated case, and c′ is the coefficient of the generated query. Note that s_de in Eq. (10) is normalized across C_q ∪ C′_q instead of C_q or C′_q as in Eq. (7).
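The asymmetric, confidence-weighted alignment loss can be sketched as below; the score lists are assumed to be the dual-encoder's scores over the union of the two retrieved sets, and the numbers are toy values:

```python
import math

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def alignment_loss(src_scores, gen_scores, coeff):
    """Asymmetric KL from the source query's distribution to the generated
    query's, weighted by the coefficient c'. Both score lists are over the
    same candidate union C_q ∪ C'_q."""
    p = softmax(src_scores)   # source query: treated as the reference
    q = softmax(gen_scores)   # generated query: pulled toward the source
    return coeff * sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

src = [1.5, 0.2, -0.8]
gen_close = [1.4, 0.3, -0.7]   # nearly agrees with the source -> small loss
gen_far = [-0.8, 1.5, 0.2]     # ranks the passages differently -> large loss
assert alignment_loss(src, gen_close, 1.0) < alignment_loss(src, gen_far, 1.0)
assert alignment_loss(src, gen_far, 0.0) == 0.0  # low-confidence pairs contribute nothing
```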

Training of Dual-Encoder
As shown in Figure 3, we combine the distillation loss and the alignment loss into the final loss:

L = L_D + L′_D + α L_A,   (11)

where L_D denotes the distillation loss for the source queries, L′_D denotes the distillation loss for the generated queries, L_A denotes the alignment loss, and α is a hyper-parameter that balances the losses.
Based on the training methods of the dual-encoder and the query generator, we conduct an iterative procedure to improve performance. The entire training procedure is presented in Algorithm 1.

Experiments
In this section, we conduct experiments to demonstrate the effectiveness of our method.

Experimental Setup
Datasets. We evaluate the proposed method on two public cross-lingual retrieval datasets: XOR-Retrieve (Asai et al., 2021a) and MKQA (Longpre et al., 2020). Detailed descriptions of the two datasets are given in Appendix A.
Evaluation Metrics. Following previous works (Asai et al., 2021a; Sorokin et al., 2022), we use R@2kt and R@5kt as evaluation metrics for XOR-Retrieve and R@2kt for MKQA. These metrics measure the proportion of queries for which the top-k retrieved tokens contain the span answer, which is fairer across different passage sizes.
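A per-query sketch of the R@kt idea (answer found within the first k retrieved tokens, consuming passages in rank order); whitespace tokenization and substring matching are simplifying assumptions, as the official metric uses its own tokenization:

```python
def recall_at_k_tokens(ranked_passages, answer, k_tokens=2000):
    """Return 1 if the answer appears within the first k_tokens retrieved tokens
    (passages consumed in rank order), else 0."""
    budget = k_tokens
    for passage in ranked_passages:
        tokens = passage.split()
        window = " ".join(tokens[:budget])
        if answer in window:
            return 1
        budget -= len(tokens)
        if budget <= 0:
            break
    return 0

docs = ["alpha beta gamma", "delta answer_span epsilon"]
assert recall_at_k_tokens(docs, "answer_span", k_tokens=5) == 1  # budget reaches the answer
assert recall_at_k_tokens(docs, "answer_span", k_tokens=4) == 0  # budget exhausted first
```

The corpus-level R@2kt / R@5kt score is then the mean of this indicator over all queries.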
Implementation Details. For the warm-up training stage, we follow XOR-Retrieve to first train the model on NQ (Kwiatkowski et al., 2019) data and then fine-tune it with XOR-Retrieve data. For the iterative training stage, we generate seven queries for each sample (the XOR-Retrieve data covers seven languages). We set the number of retrieved passages to 100, the number of iterations to 5, the threshold T in Eq. (8) to 0.3, and the coefficient α in Eq. (11) to 0.5. The detailed hyper-parameters are shown in Appendix C, and we analyze parameter sensitivity in Appendix D. All experiments run on 8 NVIDIA Tesla A100 GPUs. The implementation is based on HuggingFace Transformers (Wolf et al., 2020). For the dual-encoder, we use XLM-R Base (Conneau et al., 2020) as the pre-trained model and use the average hidden states of all tokens to represent a sentence. For the query generator, we leverage mT5 Base (Xue et al., 2021), which has almost the same number of parameters as a large cross-encoder.

Results
Baselines. We compare the proposed QuiCK with previous state-of-the-art methods, including mDPR, DPR+MT (Asai et al., 2021a), Sentri (Sorokin et al., 2022), and DR.DECR (Li et al., 2021). Note that Sentri introduces a large-size shared encoder and DR.DECR introduces parallel queries and a parallel corpus, while our method only uses a base-size encoder and the XOR-Retrieve and NQ training data. For a fairer comparison, we also report their ablation results. Here, "Bi-Encoder" denotes two unshared base-size encoders, "KD_XOR" denotes a distillation method that introduces synonymous English queries, and "KD_PC" denotes a distillation method that introduces a parallel corpus.
In addition, we employ LaBSE Base (Feng et al., 2022), a state-of-the-art model pre-trained with parallel corpora, to evaluate the proposed QuiCK with a parallel corpus.
XOR-Retrieve. Table 1 reports the comparison results on the dev set, and Table 2 reports the test-set results from the leaderboard on June 15, 2022. As we can see, our method achieves the top position on the XOR-Retrieve leaderboard.
MKQA. Furthermore, we evaluate the zero-shot performance of our method on the MKQA test set. Following previous works (Sorokin et al., 2022), we directly evaluate the dual-encoder trained on XOR-Retrieve data and report the performance on unseen languages in MKQA. As shown in Table 3, our method outperforms all baselines and even performs better than Sentri, which uses a large-size shared encoder; the comparison between Sentri and Sentri w/ Bi-Encoder shows that the large encoder has better transfer ability. Finally, the proposed QuiCK w/ LaBSE outperforms all baselines by a clear margin, showing the better transfer ability of our method.

Methods Analysis
Ablation Study. Here, we check how each component contributes to the final performance, using ablation experiments on XOR-Retrieve data. We prepare four variants of our method: (1) w/o Sampling denotes removing the scheduled sampling while keeping the threshold T for c′ (i.e., if c′ ≥ T then c′ = 1, otherwise c′ = 0); (2) w/o Alignment denotes removing L_A in Eq. (11); (3) w/o Generation denotes removing L′_D and L_A in Eq. (11); (4) w/o All denotes removing the enhanced training entirely, i.e., the warm-up dual-encoder.
Table 4 presents the results of the four variants. The R@5kt ranking is: w/o All < w/o Generation < w/o Alignment < w/o Sampling < QuiCK. These results indicate that every component is essential to the final performance. Moreover, the small margin between w/o Alignment and w/o Sampling suggests that the generated queries are noisy, which demonstrates the effectiveness of our scheduled sampling strategy.
Effect of Alignment. As mentioned in Section 1, the alignment established in the pre-training stage may be damaged during fine-tuning if there is no constraint. Here, we run experiments on both XLM-R and LaBSE to analyze the effectiveness of the proposed alignment training. As shown in Table 5, the alignment training is effective for both models, indicating that an alignment constraint in the fine-tuning stage helps even models pre-trained with parallel corpora. We also find that the gains from alignment training are larger for XLM-R than for LaBSE, which shows that the alignment constraint is more effective for models that were not pre-trained with parallel corpora.
Cross-Encoder versus Query Generator. We now analyze the re-ranking ability of the cross-encoder and the query generator. We use the warm-up dual-encoder to retrieve passages, vary the number of candidate passages, and evaluate the re-ranked results. As shown in Figure 4, with the top-100 candidate passages, the performance of the cross-encoder and the generator is almost the same. But as the number of candidate passages increases, especially beyond 500, the gap between them gradually widens, showing the low generalization of the cross-encoder when there are not enough training samples.
Visualization of the Training Procedure. We visualize the changes in R@2kt during the training of both the dual-encoder and the query generator re-ranker, which re-ranks the retrieved top-100 passages. We also include a cross-encoder (initialized with XLM-R Large) for distillation and re-ranking as a comparison. As shown in Figure 5, the R@2kt of all models gradually increases over the iterations; as training approaches convergence, the improvement slows down. In the end, the performance of the dual-encoder improves by approximately 17% and that of the query generator by approximately 20%. Finally, comparing the cross-encoder and the query generator, we find gaps of approximately 6% for both teachers and students, which shows the effectiveness of our method.

Conclusion
In this paper, we showed that the cross-encoder performs poorly without sufficient training samples, which are hard to obtain in the cross-lingual setting. We then proposed a novel method that utilizes a query generator to improve the dual-encoder: we first proposed to use the query generator as the teacher, and then proposed a novel alignment method for cross-lingual retrieval that does not need any parallel corpus. Extensive experimental results show that the proposed method outperforms the baselines and significantly improves the state-of-the-art performance. Currently, our method depends on training data in all target languages; as future work, we will investigate how to apply the proposed method to zero-shot cross-lingual dense retrieval.

Limitations
The limitations are summarized as follows.
• The method depends on training data in all target languages. Intuitively, the method could be applied directly to zero-shot cross-lingual dense retrieval by taking only the passage as input to the query generator, but the query generator performs poorly in the zero-shot setting. As future work, novel pre-training tasks for cross-lingual generation could be considered.
• The method does not investigate how to effectively train the query generator for the re-ranking task; it directly applies the training method of the cross-encoder re-ranker. We believe the potential of query generators for re-ranking is strong, and designing a dedicated re-ranking training method for query generators, such as token-level supervision, may be interesting future work.
• The method requires large GPU resources. Training the final model costs approximately 12 hours on 8 NVIDIA Tesla A100 GPUs. Researchers without enough GPU resources can use gradient accumulation to reduce GPU memory consumption, at the cost of more training time.
• This work does not consider inconsistencies between different countries (e.g., in law and religion), which lead to inconsistent positive passages for synonymous queries in different languages (e.g., the legal age of marriage varies from country to country). We leave this aside because we find that most queries in XOR-Retrieve specify the target country, such as "Mikä on yleisin kissa laji Suomessa?" (translation: What is the most common cat breed in Finland?).
In addition, we find that the optimal R@2kt and R@5kt are achieved at different thresholds T. The two metrics have different sensitivities to data quality: low-quality data is helpful for R@5kt but harmful to R@2kt. As a result, a small threshold T admits more low-quality alignment training data, leading to higher R@5kt but lower R@2kt. In contrast, the optimal R@2kt and R@5kt are achieved at the same coefficient α.
Overall, our model is relatively stable when varying the two parameters, and is consistently better than Sentri and DR.DECR w/o KD_PC.

D.2 Effect of The Number of Candidates
Here, we investigate the effect of the number of candidates, which is demonstrated to have a significant effect on the final performance. As shown in Figure 7, a larger number of candidates leads to better performance, and when the number surpasses 32, the improvement gradually slows down. The results indicate that 32 candidates can already represent the whole corpus well.

D.3 Effect of Model Size
Following Sentri (Sorokin et al., 2022), we employ a large-size shared encoder (i.e., the query encoder and the passage encoder share parameters) as the dual-encoder and evaluate our method. As shown in Table 9, QuiCK with XLM-R Large achieves a significant performance improvement, which further demonstrates the effectiveness of the proposed QuiCK.

D.4 Effect of Scheduled Sampling
Previously, we demonstrated the effectiveness of the scheduled sampling via the ablation study. Here, we present two generated samples to qualitatively analyze it. As shown in Table 10, for the first case, the generated queries in other languages have the same semantics as the source-language query; such a sample is effective in alignment training and helps achieve better performance. For the second case, the generated query in Finnish is relevant to the source-language query but not synonymous; such a sample is harmful to model training. These samples indicate that scheduled sampling is necessary for alignment training: it reduces the impact of cases that do not share the same semantics and thus achieves better performance.

D.5 Effect of Span Answer
In our method, we employ the span answer to encourage the query generator to generate synonymous queries. Here, we conduct experiments to evaluate its effect. Specifically, we use another template, "generate [language] query: [content]", where we only fill two placeholders with the language of the target query and the passage content. We also include the cross-encoder for comparison. We use the re-rankers to re-rank the retrieved results of the warm-up dual-encoder initialized with XLM-R. Note that introducing the span answer into the cross-encoder makes the re-ranking task trivial, because the cross-encoder only needs to check whether the passage contains the span answer; the scores of such a cross-encoder almost degenerate into hard labels, and it is difficult to effectively train the dual-encoder by distilling from it.
The results are shown in Table 12. On the one hand, the query generator trained with span answers outperforms the one without, showing that taking span answers as input leads to better re-ranking performance for the query generator. On the other hand, both query generators outperform the cross-encoder when re-ranking the top-1000 retrieved passages, which shows the effectiveness of the query generator in the cross-lingual setting.
In addition, we show queries generated by the two query generators in Table 11. For the query generator that does not take the span answer as input, the generated queries can be answered by the passage, but they focus on different segments of the passage and are not synonymous. On the contrary, for the query generator that takes the span answer as input, the generated queries can be answered by the passage and are synonymous. This shows that taking the span answer as input effectively encourages the generator to generate synonymous queries.

E Detailed Results
Due to limited space, we only presented average performance for some experiments in Section 5. Here, we present the detailed performance in all languages for these experiments. Firstly, we present the detailed performance of all methods on the MKQA test set in Table 13. Secondly, we present the detailed ablation results in Table 14. Finally, we present the detailed performance for evaluating the effect of alignment in Table 15.

Figure 3: Overview of the proposed QuiCK.

Algorithm 1: The training algorithm.
Input: Dual-Encoder R, Query Generator G, Corpus C, and Training Set D.
1 Initialize R and G with pre-trained models;
2 Train the warm-up R with Eq. (4) on D;
3 Train the warm-up G with Eq. (6) on D;
4 Generate queries for each sample in D;
5 Build the ANN index for R;
6 Retrieve relevant passages from corpus C;
7 Fine-tune G with Eq. (4) on D and retrieved negative passages;
8 while the models have not converged do
9   Fine-tune R with Eq. (11) on D and retrieved passages;
10  Refresh the ANN index for R;
11  Retrieve relevant passages from corpus C;
12  Fine-tune G with Eq. (4) on D and retrieved negative passages;
13 end

Figure 4: Re-ranking performance of cross-encoder and query generator on XOR-Retrieve dev set with different numbers of candidate passages.

Figure 5: The changes of R@2kt during the iterative training on XOR-Retrieve dev set. Here, "QG" denotes Query Generator and "CE" denotes Cross-Encoder.

Figure 7: Effect of the number of candidate passages.
Source Query (Ko): [Korean text garbled in extraction] Translation: How long was the reign of Charles V of the Holy Roman Empire?
Generated Query (Ru): Сколько лет правил Карл V? (Translation: How many years did Charles V rule?)
Generated Query (Fi): Minä vuonna Charles V hallitsi Rooman valtakuntaa? (Translation: In what year did Charles V rule the Roman Empire?)

Passage:
The Higgs boson is an elementary particle in the Standard Model of particle physics, produced by the quantum excitation of the Higgs field, one of the fields in particle physics theory. It is named after physicist Peter Higgs, who in 1964, along with five other scientists, proposed the mechanism which suggested the existence of such a particle. Its existence was confirmed in 2012 by the ATLAS and CMS collaborations based on collisions in the LHC at CERN.
Generated Query by QG w/ span answer (Fi): Kuka on kehittänyt Higgs-boson? (Translation: Who developed the Higgs boson?)
Generated Query by QG w/ span answer (Ko): [Korean text garbled in extraction] (Translation: Who first discovered the Higgs boson?)
Generated Query by QG w/o span answer (Fi): Milloin Higgs on löydetty? (Translation: When was Higgs found?)
Generated Query by QG w/o span answer (Ko): Кто был первым исследователем физики Higgs? (Translation: Who was Higgs' first physics researcher?)

Table 1: Comparison results on XOR-Retrieve dev set. The best results are in bold. "*" denotes that the results are copied from the source paper. Unavailable results are left blank.

Table 2: Comparison results on XOR-Retrieve test set.

Table 3: Average performance on 20 unseen languages in the MKQA test set. "*" denotes that the results are copied from the Sentri paper.

Table 4: Ablation results on XOR-Retrieve dev set. (1) w/o Sampling denotes without the scheduled sampling but keeping the threshold T for c′, i.e., if c′ ≥ T then c′ = 1, otherwise c′ = 0; (2) w/o Alignment denotes without L_A in Eq. (11).

Table 9: Effect of model size.

Table 10: Two generated examples. The span answers are in bold.
Passage: Charles V (24 February 1500 - 21 September 1558) was ruler of both the Holy Roman Empire from 1519 and the Spanish Empire (as Charles V of Spain) from 1516, as well as of the lands of the former Duchy of Burgundy from 1506. He stepped down from these and other positions through a series of abdications between 1554 and 1556. Through inheritance, he brought together under his rule extensive territories in western, central, and southern Europe, and the Spanish viceroyalties in the Americas and Asia.

Table 11: A generated example with different input templates. The span answer is in bold. Here, "QG" denotes query generator.