Noisy Self-Training with Synthetic Queries for Dense Retrieval

Although existing neural retrieval models achieve promising results when training data is abundant, and their performance keeps improving as training data increases, collecting high-quality annotated data is prohibitively costly. To this end, we introduce a novel noisy self-training framework combined with synthetic queries, showing that neural retrievers can be improved in a self-evolution manner with no reliance on any external models. Experimental results show that our method improves consistently over existing methods on both general-domain (e.g., MS-MARCO) and out-of-domain (i.e., BEIR) retrieval benchmarks. Additional analysis in low-resource settings reveals that our method is data efficient and outperforms competitive baselines with as little as 30% of labelled training data. Further extending the framework to reranker training demonstrates that the proposed method is general and yields additional gains on tasks of diverse domains.\footnote{Source code is available at \url{https://github.com/Fantabulous-J/Self-Training-DPR}}


Introduction
Information retrieval (IR) is the task of finding relevant texts from a large collection of passages or documents to satisfy the specific information needs of users. The information need is usually expressed as a short textual query, and the task is formulated as retrieving the texts that are most relevant to the given query.
Recently, impressive achievements have been made in neural retrieval models through adopting large-scale pre-trained models (Devlin et al., 2019). Dual-encoders typically serve as the backbone architecture, which enables retrieving relevant knowledge from collections with millions or billions of passages in a fraction of time (Karpukhin et al., 2020). Unlike traditional term-matching-based lexical retrievers, such as TF-IDF (Manning et al., 2008) or BM25 (Robertson and Zaragoza, 2009), which can be applied without any training, neural retrieval methods normally require training on a sufficient number of human-labelled query-passage pairs to work well. Nevertheless, due to the high cost of human annotation, the number of available query-passage pairs is far smaller than the size of passage collections (at most 500k pairs (Bajaj et al., 2016) vs. 21M passages (Kwiatkowski et al., 2019)). Moreover, the situation becomes even worse in out-of-domain applications, where only a few or no training examples are available. Directly applying neural retrieval models trained on high-resource datasets typically achieves low out-of-domain performance, lagging behind their lexical counterparts (Thakur et al., 2021).
In this work, we aim to improve the performance of state-of-the-art neural retrieval models on both general-domain and out-of-domain datasets by using automatically synthesised queries. For this purpose, we use a query generator to automatically synthesise queries for each passage in the target dataset, which we then use for pre-training, based on a self-training objective (Scudder, 1965). More specifically, instead of directly training the model on synthetic query-passage pairs (Ma et al., 2021), we use a neural retrieval model that was trained on the labelled dataset as the teacher to generate soft labels for each synthetic query, providing a supervision signal which is more robust to noise in the generated data. This ameliorates issues with the synthetic queries, which are often generic or ambiguous, meaning that the query is only weakly related to its originating passage; accordingly, other passages may match the synthetic query equally well or better. Table 1 shows a motivating example, where the synthetic query is spuriously related to its originating passage, with only the keyword achievement matched. However, the top-2 sampled "negatives" are highly related to the query in topics and semantics, according to both human judgment and model predictions.

[Table 1: a synthetic query with its originating passage (on the Manhattan Project as a scientific achievement, score 1.0) and two sampled "negatives": a passage on the wartime Manhattan Project that built the first nuclear weapons (score 0.92) and one on the 1986 film The Manhattan Project (score 0.4).]

By contrast, simply taking the originating passage as positive and treating all negatives equally with hard labels, regardless of their relevance to the query, will completely mislead the model, leading to poor semantic matching patterns between query and passages.
Furthermore, to prevent the student from blindly imitating the teacher's behaviour, we pollute its inputs, either in the query or the passage, by injecting noise. This ensures that the student has a different view of the inputs compared to the teacher, encouraging it to learn generalised signals from the teacher. Moreover, by polluting the inputs (e.g., shuffling), the student is encouraged to capture salient phrases in addition to semantic matching, making it more robust to perturbed inputs (Figure 4). After completing pre-training on synthetic queries, we further finetune the model on labelled data. By iterating the pre-training and finetuning steps, using the latest model to relabel the synthetic data, the resulting model can significantly outperform one trained only on labelled data. In summary, our contributions are as follows:

1. We propose a novel self-training framework that makes wiser use of automatically generated data for neural retrieval models (§3).

2. Our experimental results on general-domain benchmarks show that the proposed framework not only significantly boosts the performance of state-of-the-art neural retrievers, but also yields superior results in low-resource settings. We further show that our framework is general and can be extended to improve more powerful cross-encoder-based rerankers (§4).

3. Further experiments are conducted on the out-of-domain BEIR benchmark (Thakur et al., 2021), where both our neural retrievers and rerankers surpass a series of strong models (§4.6).

Task Definition
We focus on the task of passage retrieval in this work. Given a query in the form of a short text, the task requires retrieving a small set of passages that satisfy the information need, from a collection of passages at the million or billion scale.
Formally, given a passage collection P = {p_1, p_2, ..., p_n}, the retriever is required to fetch the top-k passages P_q = {p_1, p_2, ..., p_k} from P that are most relevant to a specific query q.

Dense Passage Retrieval
In contrast to traditional IR methods, such as BM25 (Robertson and Zaragoza, 2009), which represent texts as high-dimensional sparse vectors with an inverted index, dense retrieval methods instead adopt neural models to encode texts (queries or passages) into dense latent vectors of much smaller dimension. A dense passage retrieval model (Karpukhin et al., 2020) typically adopts the dual-encoder architecture, where neural models encode the query and passage into dense vectors separately. The relevance is measured by the dot product between their embeddings:

sim(q, p) = E_Q(q; θ_Q)^T E_P(p; θ_P),    (1)

where E_•(·; θ) is an encoder parameterised by θ.
The adoption of this form of 'dual-encoder' architecture decouples the encoding of query and passage. At inference, all passages in P can be encoded offline. When a query q comes in, efficient nearest-neighbour search (Johnson et al., 2021) can be performed to fetch the top-k passages. Contrastive learning is applied to train the dual-encoder. Given a query q, we have a positive passage p+ and a set of n negative passages P−_q = {p−_i}, i = 1..n. The model is optimised by minimising the negative log-likelihood of the positive passage:

L_CL = −log [ exp(sim(q, p+)) / ( exp(sim(q, p+)) + Σ_{p− ∈ P−_q} exp(sim(q, p−)) ) ],    (2)

where P−_q is the set of irrelevant passages constructed from in-batch negatives (Chen et al., 2020) (i.e., positive passages of other queries in the same mini-batch) and hard negatives mined from existing retrievers (Karpukhin et al., 2020; Xiong et al., 2021).
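As a minimal illustration of the dual-encoder scoring and the contrastive objective above, consider the following NumPy sketch. Random vectors stand in for encoder outputs; this is not the authors' implementation:

```python
import numpy as np

def sim(q_emb, p_emb):
    """Relevance as the dot product of query and passage embeddings."""
    return float(np.dot(q_emb, p_emb))

def contrastive_nll(q_emb, pos_emb, neg_embs):
    """Negative log-likelihood of the positive passage against negatives
    (the Eq. 2-style contrastive objective)."""
    scores = np.array([sim(q_emb, pos_emb)] + [sim(q_emb, n) for n in neg_embs])
    scores -= scores.max()                      # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return -float(np.log(probs[0]))             # index 0 is the positive

rng = np.random.default_rng(0)
q = rng.normal(size=8)                          # query embedding
pos = q + 0.1 * rng.normal(size=8)              # positive lies close to the query
negs = rng.normal(size=(4, 8))                  # in-batch / mined negatives
loss = contrastive_nll(q, pos, negs)
```

Minimising this loss pushes the positive's score above the negatives' while keeping query and passage encoders decoupled.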

Self-Training with Synthetic Queries
Self-training (Yarowsky, 1995) has been shown to be effective in improving model performance through the use of unlabelled or synthetic data. It first uses labelled data to train a good teacher; the teacher is then used to generate pseudo labels on unlabelled data. Finally, an identical student model is first pre-trained on the unlabelled data with soft labels and then finetuned on the labelled data. Recently, self-training has been shown to work well for a variety of tasks, including image classification (Xie et al., 2020) and neural machine translation (He et al., 2020). However, its effectiveness has not yet been evaluated for dense passage retrieval, especially with automatically synthesised queries.
Suppose we have a well-trained query generator, which is used to generate a set of synthetic query-passage pairs T′ = {(q′, p′+)}. Moreover, we assume that a teacher dense retrieval model that is fully trained on labelled data is available. The student retriever has the same architecture as the teacher, but it only accesses the soft labels produced by the teacher during training. This ensures that the involved negatives are not treated equally but receive different soft labels, so that more relevant passages are likely to be penalised less. Moreover, we expect teacher supervision to make the supervision signal for problematic queries (e.g., general or ambiguous queries) much more diverse, as the teacher will have little clue and produce a correspondingly high-entropy predictive distribution. Formally, the learning process minimises the KL divergence between the teacher and student distributions:

L_ST = KL( T(·|q′, P′_{q′}; θ_T) ‖ S(·|q′, P′_{q′}; θ_S) ),    (3)

where P′_{q′} = {p′+} ∪ P−_{q′} is the union of the pseudo positive passage of the synthetic query q′ and the sampled negatives as in §2.2, and T(·|q′, P′_{q′}; θ_T) and S(·|q′, P′_{q′}; θ_S) are the distributions produced by the teacher and student, respectively.
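The self-training objective above reduces to a KL divergence between two softmax distributions over the candidate set (pseudo positive plus sampled negatives). A toy NumPy sketch with made-up relevance scores, not scores from real retrievers:

```python
import numpy as np

def softmax(scores):
    z = scores - scores.max()
    e = np.exp(z)
    return e / e.sum()

def self_training_kl(teacher_scores, student_scores):
    """KL(teacher || student) over the candidate set for one synthetic
    query: the pseudo positive passage plus its sampled negatives."""
    t, s = softmax(teacher_scores), softmax(student_scores)
    return float(np.sum(t * (np.log(t) - np.log(s))))

# toy relevance scores over 5 candidate passages: the teacher spreads
# probability mass across several plausible passages (soft labels)
teacher = np.array([2.0, 1.8, 0.3, -0.5, -1.0])
student = np.array([1.0, 0.2, 0.1, 0.0, -0.2])
loss = self_training_kl(teacher, student)
```

Note that, unlike hard labels, a high-entropy teacher distribution (e.g., two near-equal top scores) penalises the student only mildly for preferring either plausible passage.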
We further inject noise into the student's inputs, with the aim that it learns to generate more robust embeddings. Through noise injection, the student needs to imitate the behaviour of the teacher under perturbed inputs (i.e., different views), encouraging it to learn generalised signals from the teacher (He et al., 2020). Inspired by Wu et al. (2019), we use the following strategies for noise injection:

1. Shuffle: randomly choose some words in the query or passage as candidates for shuffling, then randomly shuffle these candidates.

2. Delete: randomly delete some words in the query or passage.

3. Mask: randomly mask some words in the query or passage with a [MASK] token.

Empirically, we found that applying them sequentially to both queries and passages with a probability of 0.1 achieves the best results. Thus, the self-training loss of the noised version becomes:

L_NST = KL( T(·|q′, P′_{q′}; θ_T) ‖ S(·|q̃′, P̃′_{q′}; θ_S) ),    (4)

where q̃′ and P̃′_{q′} are the noised query and passages.
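The three noise strategies, applied sequentially with probability 0.1 per token as described, might look like the following sketch. Whitespace tokenisation and function names are simplifying assumptions, not the authors' code:

```python
import random

def shuffle_words(tokens, p, rng):
    """Pick each position with probability p, then shuffle the picked tokens."""
    idx = [i for i in range(len(tokens)) if rng.random() < p]
    picked = [tokens[i] for i in idx]
    rng.shuffle(picked)
    out = list(tokens)
    for i, tok in zip(idx, picked):
        out[i] = tok
    return out

def delete_words(tokens, p, rng):
    """Drop each token with probability p (but never drop everything)."""
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else list(tokens)

def mask_words(tokens, p, rng):
    """Replace each token with [MASK] with probability p."""
    return ["[MASK]" if rng.random() < p else t for t in tokens]

def inject_noise(text, p=0.1, seed=0):
    """Apply shuffle, delete, and mask sequentially, each with probability p."""
    rng = random.Random(seed)
    tokens = text.split()
    for noise in (shuffle_words, delete_words, mask_words):
        tokens = noise(tokens, p, rng)
    return " ".join(tokens)

noised = inject_noise("who led the manhattan project during the war", p=0.1)
```

Only the student's inputs are noised; the teacher scores the clean query and passages, so the student must reproduce the teacher's distribution from a degraded view.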

Training Pipeline
Algorithm 1 summarises the training pipeline, with an overview shown in Figure 1. More specifically:

Teacher Preparation (line 3 in Alg. 1) We first train a teacher model on labelled data using a two-stage training procedure similar to Gao and Callan (2022). In the first stage, the retriever is trained with hard negatives sampled from a BM25 retriever. Then, the retriever trained in the first stage is used to discover hard negatives, which are later used to train a second-stage retriever. The resulting retriever serves as the teacher θ_T in our algorithm.
Index Building & Hard Negative Mining (lines 4-5 in Alg. 1) The teacher model θ_T encodes all passages in P into dense vectors, which are later used to build the ANN search index with FAISS. Hard negatives for both the labelled data {(q, p+)} and the synthetic data {(q′, p′+)} are retrieved using the built index, by taking the k-nearest neighbours while excluding the gold passage.

Algorithm 1 Noisy Self-Training with Synthetic Queries
Require: Gold query-passage pairs T = {(q, p+)}
Require: Passage collection P
1: Train a query generator G on gold pairs T.
2: Generate queries from P and construct synthetic query-passage pairs T′ = {(q′, p′+)}.
3: Train a dual-encoder retriever θ_T on gold pairs T.
4: Use θ_T to build an ANN index on P.
5: Retrieve negatives P−_q for each (q, p+) ∈ T and P−_{q′} for each (q′, p′+) ∈ T′.
6: Use θ_T as the teacher to generate soft labels on synthetic queries T′ = {(q′, p′+, P−_{q′})}.
7: Pre-train a student retriever θ_S on the soft-labelled synthetic queries T′, with noise injected into θ_S's inputs.
8: Finetune θ_S on gold pairs T = {(q, p+, P−_q)}.
9: Take θ_S as the new teacher: θ_T ← θ_S, and go back to line 6.
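Lines 4-5 of the algorithm (hard negative mining with the gold passage excluded) can be sketched as follows; brute-force dot-product search stands in for the FAISS ANN index used in the paper, and all embeddings are random placeholders:

```python
import numpy as np

def mine_hard_negatives(query_embs, passage_embs, gold_ids, k=5):
    """Score every passage by dot product against each query and keep the
    k nearest neighbours, excluding the query's gold (positive) passage."""
    scores = query_embs @ passage_embs.T        # (n_queries, n_passages)
    hard_negatives = []
    for qi, gold in enumerate(gold_ids):
        ranked = np.argsort(-scores[qi])        # passages by descending score
        hard_negatives.append([int(p) for p in ranked if p != gold][:k])
    return hard_negatives

rng = np.random.default_rng(0)
q_embs = rng.normal(size=(2, 16))               # 2 queries
p_embs = rng.normal(size=(10, 16))              # 10-passage toy collection
negs = mine_hard_negatives(q_embs, p_embs, gold_ids=[3, 7], k=4)
```

At real scale, the exhaustive score matrix is replaced by an approximate index so mining stays tractable over millions of passages.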

Noisy Self-Training & Fine-tuning (lines 6-8 in Alg. 1) A student model θ_S is first pre-trained on the synthetic queries with the noised self-training objective (Eq. 4), after which it is fine-tuned on labelled data according to Eq. 2.
Iterative Training (line 9 in Alg. 1) The teacher model can be replaced by the student to generate new pseudo labels and do noisy self-training and fine-tuning all over again.

Datasets
We evaluate our method on two general-domain datasets: MS-MARCO passage ranking (Bajaj et al., 2016) and Natural Questions (NQ) (Kwiatkowski et al., 2019). MRR@10, Recall@50, and Recall@1000 on the MS-MARCO dev set are reported, where MRR denotes the Mean Reciprocal Rank, calculated as the mean over queries of the reciprocal rank of the first retrieved relevant passage, and Recall@k is the proportion of relevant passages that appear in the top-k retrievals. Recall@k (k=5, 20, 100) is reported on Natural Questions, defined as the proportion of queries for which at least one of the top-k retrieved passages contains the answer string. The evaluation script provided by Pyserini (Lin et al., 2021) is used for all experiments.
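For concreteness, per-query versions of these metrics can be computed as below (a simple sketch, not the Pyserini evaluation script used in the experiments):

```python
def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant passage in the top-k (0 if absent);
    MRR@k is this value averaged over all queries."""
    for rank, pid in enumerate(ranked_ids[:k], start=1):
        if pid in relevant_ids:
            return 1.0 / rank
    return 0.0

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of a query's relevant passages found in its top-k retrievals."""
    return len(set(ranked_ids[:k]) & set(relevant_ids)) / len(relevant_ids)

# one query: the only relevant passage (id 7) is ranked third
ranking = [3, 5, 7, 1, 9]
```

For this toy ranking, MRR@10 is 1/3 and Recall@3 is 1.0, since the single relevant passage sits at rank 3.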

Experimental Settings
We replicate the coCondenser model (Gao and Callan, 2022) using PyTorch (Paszke et al., 2019) and treat it as our baseline.
For self-training, the coCondenser model that was fully trained on labelled data serves as the teacher, which we use to train the student retriever on synthetic queries. After self-training, the student is further finetuned on labelled data, with hyperparameters following Gao and Callan (2022). The last checkpoint is selected for evaluation on the test set in all experiments. More details are in Appendix A.1.

Main Results
Table 2 shows the results of our model compared with a range of lexical and neural retrievers. We observe that the performance of our replicated coCondenser is competitive with that of Gao and Callan (2022), achieving slightly better results on Natural Questions but slightly worse on MS-MARCO. By applying our proposed noisy self-training framework, the student coCondenser surpasses the state-of-the-art results on both datasets, yielding significant improvements over coCondenser (1.2% MRR@10 on MS-MARCO and 1.2% R@5 on Natural Questions). This also shows that although coCondenser has already been pre-trained on the target corpus, continuing to pre-train it on synthetic queries with noisy self-training can still lead to a further performance boost.
To test if the proposed method is general, we also combine it with various pre-trained models, including BERT (Devlin et al., 2019), Contriever (Izacard et al., 2021), Condenser (Gao and Callan, 2021), and RetroMAE (Xiao et al., 2022). Note that all models share the same architecture and only differ in their initial weights. Figure 2 shows the results on MS-MARCO. We observe that the student model achieves consistent improvements over all pre-trained models. Moreover, for weaker teachers, the benefits gained by adopting the proposed method are generally more significant (e.g., 2.3 MRR@10 on BERT vs. 1.3 MRR@10 on RetroMAE).

We also adapt the proposed framework to see if it can be applied to improve an expensive reranker. Similarly, a teacher reranker that was fully trained on labelled data is used to generate soft labels on synthetic data, and an identical student reranker is pre-trained. A cross-encoder model is used as the backbone to jointly encode the query and passage. As shown in Table 3, by reranking the student coCondenser's top-k predictions, the student reranker can outperform the teacher, boosting performance by another 0.7% and 0.8% on MRR@10 and R@1, respectively.
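The reranking step itself can be sketched as below; a toy word-overlap scorer stands in for the trained cross-encoder, which would instead jointly encode each (query, passage) pair:

```python
def rerank(query, passages, score_fn, top_k=10):
    """Re-order a retriever's candidate passages with a cross-encoder-style
    scorer that reads each (query, passage) pair jointly."""
    return sorted(passages, key=lambda p: score_fn(query, p), reverse=True)[:top_k]

def overlap_score(query, passage):
    """Toy stand-in scorer: word overlap between query and passage."""
    return len(set(query.lower().split()) & set(passage.lower().split()))

candidates = [
    "The Manhattan Project is an American film released in 1986.",
    "The Manhattan Project built the first atomic bomb during the war.",
]
top = rerank("manhattan project atomic bomb", candidates, overlap_score, top_k=1)
```

Because the reranker only scores the retriever's top-k candidates, its quadratic query-passage interaction cost stays affordable at inference time.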

Ablation Studies
We conduct ablation studies to further understand our methods, with results reported in Table 4.

Noise Injection
We remove noise injection during pre-training, so the pre-training objective changes from Eq. 4 to Eq. 3. It is observed that the noise injection strategies have a positive impact in helping the model rank correct answers higher.
Pseudo Labels We skip the soft-label generation step (line 6 in Algorithm 1) and directly use the synthetic labels {(q′, p′+)} together with hard negatives for pre-training (line 7). The results show that directly pre-training on synthetic query-passage pairs leads to inferior performance on both datasets, with significant degradation on all metrics, especially on Natural Questions, where the R@20 score is even worse than the coCondenser baseline. This indicates that synthetic query-passage pairs are substantially noisy, and that adopting our proposed method effectively alleviates the negative impact of this noise, which validates our hypothesis.
Consistency Filtering We use the teacher as a consistency filter (Alberti et al., 2019) to remove noise contained in the synthetic data. More specifically, for a given synthetic query-passage pair (q′, p′+), if p′+ is retrieved by the teacher in the top-1 position, the pair is kept; otherwise, it is discarded. Although this strategy can effectively improve the quality of the synthetic data and also yields competitive performance, using the teacher as a pseudo-label generator leads to better results.
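Consistency filtering reduces to a one-line check, sketched below. Here `teacher_top1` is a hypothetical helper standing in for a retrieval call against the teacher's index:

```python
def consistency_filter(pairs, teacher_top1):
    """Keep a synthetic (query, passage) pair only if the teacher retrieves
    the originating passage at rank 1 (Alberti et al., 2019)."""
    return [(q, p) for q, p in pairs if teacher_top1(q) == p]

# toy setup: the teacher's top-1 prediction for each synthetic query
predictions = {"q1": "p1", "q2": "p9"}
synthetic = [("q1", "p1"), ("q2", "p2")]
kept = consistency_filter(synthetic, predictions.get)
```

Filtering discards noisy pairs outright, whereas soft labelling retains them but down-weights the spurious positive, which is why the latter performs better here.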

Joint Training
We jointly train the student model on both synthetic and labelled data, where the loss becomes L = L_ST + L_CL. Different batch sizes are used for the synthetic and labelled data to ensure the overall number of update steps is roughly the same as in pre-train + finetune. We observe that joint training leads to significantly worse results on MS-MARCO but comparable performance on Natural Questions.

Analysis
Data Efficiency We further analyse how the amount of labelled query-passage pairs used for training affects the retrieval performance of our method. Smaller datasets are randomly sampled from the full MS-MARCO training data at different sizes, ranging from 1% to 70%. Note that in each data-size setting, all involved models are restricted to that specific amount of labelled data. We find that the student's advantage over the teacher grows as the number of labelled data samples decreases. Moreover, the student is superior in terms of data efficiency, achieving performance comparable to the teacher model trained on the full data (i.e., our coCondenser in Table 2) with as little as 30% of the labelled data, and the performance continues to improve as more data becomes available.
Robustness to Shuffled Queries Figure 4 compares performance when different proportions of tokens in each test query are randomly shuffled. We observe that the student model is significantly more robust to queries with tokens in random order. When noise injection is removed during training, the performance drops become increasingly large as more tokens are shuffled, indicating that the noise injection strategy not only leads to performance gains in normal settings but also improves robustness to perturbed inputs. This also suggests that our model learns lexical matching to some extent, capturing salient phrases, a property typically held by sparse retrievers (e.g., BM25).

Out-of-Domain Experiments
We further test the domain adaptation ability of our proposed method by reporting the performance on the BEIR benchmark (Thakur et al., 2021), which contains 18 datasets from retrieval tasks of diverse formats and domains.The average nDCG@10 score over all datasets is used for evaluation.
Implementation Details We use the coCondenser model that was fine-tuned on MS-MARCO as the teacher model in our algorithm, which we call Teacher henceforth. For synthetic queries, we directly use the publicly available ones released by Wang et al. (2022). Since labelled data is not available, the fine-tuning step is eliminated. Unlike previous methods that train specialised retrievers for each task, we train a single universal retriever on the union of synthetic queries from all tasks. We directly use the last checkpoint for evaluation on the test set of each task. To ensure a fair comparison, we also re-implement GPL following the same configurations as in Wang et al. (2022), except that we employ the Teacher retriever for initialisation and train a single model, denoted GPL-S.
We also investigate whether the proposed method can improve the reranker's out-of-domain performance. Similarly, the reranker trained on MS-MARCO serves as the teacher (Teacher Reranker), guided by which an identical Student Reranker is trained. At inference, both rerankers are applied to rerank the top-100 predictions of the Student Retriever. The official evaluation script is used for all experiments. More details are in Appendix A.2.

Main Results
We compare our proposed model with a variety of models, including BM25 (Robertson and Zaragoza, 2009), DocT5query (Nogueira and Lin, 2019), DeepCT (Dai and Callan, 2020), GenQ (Thakur et al., 2021), GPL (Wang et al., 2022) and PTR (Dai et al., 2023); see Appendix B for more baseline details. Table 5 shows the experimental results. We first notice that the Teacher model already achieves strong results compared to other baseline models. GenQ, which further finetunes Teacher on synthetic data, does not show positive effects, indicating that synthetic queries are extremely noisy and directly using them for training is not beneficial. By contrast, the student retriever trained with our noisy self-training framework significantly improves over Teacher, increasing the average nDCG@10 by 3.0%. Even when compared to GPL and GPL-S, models that also use query generation for training-data augmentation but rely on a prohibitively expensive reranker as a pseudo-label generator, and to PTR, which prompts a 137B instruction-tuned large language model for query generation, our model still achieves competitive average performance and beats these task-specialised retrievers on 6 out of 18 tasks while being comparable on the rest. Note that our model does not rely on any external models and is improved in a self-evolution manner, exhibiting a high degree of efficiency. More specifically, compared to GPL, our method reduces training time by roughly a factor of three (204h → 74h) and achieves a 25× speed-up in relevance labelling (20ms/q → 0.8ms/q). Figure 2 shows that our method achieves consistent improvements and beats GenQ across different pre-trained models on SCIDOCS, a dataset that differs significantly from MS-MARCO in retrieval needs and domain. See Table 10 for full details.
When adopting the same framework for rerankers, the student reranker further boosts performance by another 1.1 average points over the teacher reranker. The results again confirm that our method is general and can be extended to improve a powerful reranker in out-of-domain settings.

Discussion
The adaptation of our proposed framework to out-of-domain tasks can be interpreted as a specific type of unsupervised domain adaptation (UDA) algorithm (Wang and Deng, 2018), where a model aims to maximise its performance on target domains when only labelled data from in-domain sources and unlabelled data from target domains are available. In our case, the query generator trained on in-domain data (i.e., MS-MARCO) is employed to generate synthetic queries on unlabelled target-domain data (i.e., the corpora in the BEIR datasets). Meanwhile, the retriever trained on in-domain data (i.e., Teacher) generates soft labels for these synthetic queries. As a result, the query generator fabricates distributions of potential queries that could plausibly be asked in the target domain, while the teacher retriever captures prevalent patterns of correspondence between queries and passages within the target domain, leveraging the knowledge it has acquired from the labelled in-domain data. When exposed to such silver target-domain data during training, a new student retriever is forced to acquire the knowledge required to complete retrieval tasks in target domains. Consequently, the exposure to pseudo target-domain data drives the improvement of domain-specific aptitude within the student retriever, allowing it to effectively retrieve information within target domains.

Self-Training Iterations
Table 6 compares the models trained with varying numbers of iterations. We observe that employing a single iteration yields optimal results, while performance diminishes with more iterations. We conjecture that errors produced in relevance labelling may accumulate over successive iterations, potentially reinforcing the model's bias towards particular error types as training continues.

Quality of Pseudo Labels
To assess the quality of the generated soft labels, we randomly sampled 100 synthetic queries and manually verified the relatedness of the originating passage to each synthetic query. We observed that 49 of these queries were indeed related to their source passages. Among this subset of 49 queries, we noted that, 93% of the time, the soft labels effectively identified alternative passages capable of accurately responding to the query, as indicated by the assignment of a high probability mass to these alternative passages. In contrast, for the remaining 51 query-passage pairs, where the association was found to be incorrect, the soft labels identified better-matched passages for 42 of them. This observation provides strong evidence for the effectiveness of our proposed self-training methodology.


Related Work

Dense Retrieval Dense retrieval models encode texts into low-dimensional embeddings and show superiority over lexical retrievers when trained on sufficient data. Karpukhin et al. (2020) adopt a dual-encoder structure, using two independent neural encoders to encode queries and passages into fixed-size vectors separately, with their dot product as the relevance score. The model is trained to discriminate positive passages from randomly sampled irrelevant ones or more informative negatives (Xiong et al., 2021). Other works adopt the poly-encoder architecture (Humeau et al., 2020), where queries and documents are represented as multiple vectors to allow token-level interactions (Khattab and Zaharia, 2020). Although effective, these benefits come at the cost of increased index size and a more complex scoring function. In this work, we adopt the dual-encoder structure for its simplicity. However, our proposed method is orthogonal to model architectures, and we believe its combination with poly-encoder retrievers is worth further exploration.
Synthetic Queries for Information Retrieval Synthetic queries have been widely used in information retrieval. Early work expands passages with synthetic queries for lexical retrievers (Nogueira and Lin, 2019). Recent neural models take synthetic queries and their originating passages as positive pairs for model pre-training, resulting in boosted performance (Lu et al., 2021), better domain adaptation ability (Ma et al., 2021; Gangi Reddy et al., 2022) and promising zero-shot results (Dai et al., 2023). To reduce the noise in synthetic data, Wang et al. (2022) exploit a cross-encoder reranker to generate pseudo labels, which incurs expensive costs in generating soft labels and training task-customised retrievers, resulting in slow adaptation. Our work follows this direction but differs significantly in that we neither rely on any external model for synthetic-data labelling nor create dataset-specialised retrievers. Instead, we empirically show that by using the retriever itself as a more efficient pseudo-label generator, it can be improved in a self-evolution manner with our noisy self-training framework. Moreover, we show this framework is general and can be extended to boost reranking performance.
Self-Training Self-training (Scudder, 1965) refers to a class of approaches that learn from unlabelled data with pseudo labels. A good teacher model is first trained on labelled data and later used to label unlabelled data. Another student model is then pre-trained on the unlabelled data and further finetuned on labelled data. More advanced methods train multiple teachers using features of disjoint partitions of the labelled data, and a student learns from their ensemble (Blum and Mitchell, 1998). Recently, the effectiveness of self-training has been verified in a wide range of tasks, including machine translation with back-translation (Wu et al., 2019) and language generation (He et al., 2020). These approaches are termed noisy self-training, as the inputs of the student are perturbed. In this work, we follow this direction to show, for the first time, that noisy self-training can be adopted to make better use of synthetic queries to improve neural retrievers in both general-domain and out-of-domain settings.

Conclusion
In this paper, we present a novel noisy self-training framework for neural retrieval models. We show that, when combined with automatically generated queries, neural retrievers can be improved in a self-evolution manner without relying on any external models. Empirical results on both general-domain and out-of-domain benchmarks confirm the superiority of our proposed method, which significantly outperforms a wide range of competitive existing models. We further adapt our method to show it can be applied to improve an expensive reranker.

Limitations
Although our proposed method does not change model architectures and the resulting models are as efficient as previous models at inference, the introduced training framework does incur additional training costs, including query generation and the associated hard negative mining. Despite these extra costs, training our method has a fairly modest footprint by modern standards, taking about 3 days on a server with 4×A100 GPUs and 450GB of CPU RAM. We mainly experiment with dual-encoder-based neural dense retrieval models and additionally adapt the method to reranker training in this work. Alternative retrieval methods such as ColBERT (Khattab and Zaharia, 2020) and SPLADE (Formal et al., 2021) are compatible with our approach, and their incorporation should lead to further gains. However, they introduce extra complexity compared to the dual-encoder, e.g., requiring an index over tokens rather than complete passages. We leave their efficient integration as future work.
For out-of-domain experiments, the current method still relies on general-domain datasets to obtain the teacher model and the query generator. How to achieve this in an unsupervised setting needs further exploration. Moreover, on tasks whose retrieval needs differ from general passage retrieval, synthetic queries are normally quite different from the gold standard. For instance, some gold queries in DBPedia are general and may have multiple matching passages (e.g., Give me all people that were born in Vienna and died in Berlin.), whereas synthetic queries are usually only related to their originating passages (e.g., Who was Johannes Mayer). How to generate synthetic queries of high quality and with properties similar to gold-standard queries is the key to better retrievers.
As this work uses pre-trained language models for generating synthetic queries, it is possible that undesirable biases (e.g., gender and cultural biases) from the language models (Wei et al., 2022) are propagated to downstream models. This is in addition to existing biases in training and evaluation datasets (Bigdeli et al., 2021). Evaluating the extent to which biases affect the synthetic data and the resulting model is an inherently complex problem and remains an open question for future work.

A Implementation Details
Statistics of the datasets used in our experiments are reported in Table 7, and Tables 8 and 9 list the hyperparameters.
A.1 General Domain Datasets
co-condenser-marco8 is used for model initialisation on MS-MARCO and co-condenser-wiki9 on Natural Questions. For retrievers, the weights of the dual encoders are tied. The teacher is trained with the two-stage training, following the hyperparameters used in Gao and Callan (2022) and Karpukhin et al. (2020). The student is pre-trained for 1 epoch on all datasets, with a learning rate of 1 × 10−5 and a batch size of 32. The top-100 passages retrieved by the teacher are used as hard negatives, from which 7 negatives are sampled for each query in a minibatch. We empirically find that a single training iteration is enough to achieve the best result. Training takes about 48 hours on both datasets, with up to 4 A100 GPUs.
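The hard negative sampling step described above can be sketched as follows. This is a minimal illustration, not the released training code; the function name and toy passage ids are our own.

```python
import random

def sample_hard_negatives(teacher_ranking, positives, num_negatives=7, pool_size=100):
    """Sample hard negatives for one query from the teacher's top-ranked passages.

    teacher_ranking: passage ids ordered by teacher retriever score (best first).
    positives: ids of (pseudo-)positive passages, excluded from the pool.
    """
    banned = set(positives)
    pool = [pid for pid in teacher_ranking[:pool_size] if pid not in banned]
    return random.sample(pool, min(num_negatives, len(pool)))

# Toy usage: the teacher retrieved passages 0..99; passage 3 is the positive.
random.seed(0)
negs = sample_hard_negatives(list(range(100)), positives=[3])
```

At training time, each query in a minibatch would be paired with its positive plus the 7 passages sampled this way.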
For generating synthetic queries, we use the publicly available model10 to generate queries on MS-MARCO with top-k sampling. For Natural Questions, we first finetune a T5 model on labelled query-passage pairs for 200 epochs with a learning rate of 2 × 10−5, which takes about 10 hours on 1 A100 GPU. One query per passage is generated for all datasets, which requires approximately 7 hours on MS-MARCO and 16 hours on Natural Questions using 2 A100 GPUs.
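The top-k sampling used for query generation restricts each decoding step to the k highest-scoring tokens before sampling. A minimal pure-Python sketch of one such step (the actual generation uses a T5 model; the logit values here are toy numbers):

```python
import math
import random

def top_k_sample(logits, k, rng=random):
    """Draw one token id from the k highest-scoring logits (renormalised softmax)."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)
    # Subtracting the max before exponentiating keeps the weights numerically stable.
    weights = [math.exp(logits[i] - m) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]

# Toy vocabulary of 5 tokens; only the top-3 ids (0, 2, 1) can ever be drawn.
logits = [2.0, 0.5, 1.5, -1.0, 0.0]
random.seed(1)
token = top_k_sample(logits, k=3)
```

Repeating this step autoregressively, conditioned on the passage, yields a sampled synthetic query; sampling (rather than greedy decoding) produces the query diversity the method relies on.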
As for the reranker described in §3, ERNIE-base11 is used for model initialisation. The teacher reranker is trained with the hard negatives mined by the teacher retriever (Figure 1). It is trained for 2 epochs on MS-MARCO, with the learning rate set to 1 × 10−5, a batch size of 12, and weight decay of 0.1. Each query is paired with 40 sampled negatives; for Natural Questions, we train it for 10 epochs with 15 sampled negatives. For pre-training the student reranker, the batch size is set to 48 and 128 on MS-MARCO and Natural Questions, respectively, and the other hyperparameters remain the same. For the finetuning stage, the settings used for training the teacher reranker are adopted. Completing the whole training pipeline requires approximately 80 hours on MS-MARCO and 67 hours on Natural Questions with up to 4 A100 GPUs.
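Pairing each query with one positive and N sampled negatives suggests a listwise softmax cross-entropy objective, which is standard for cross-encoder reranker training; the exact loss formulation below is our assumption, sketched in pure Python:

```python
import math

def listwise_loss(scores, positive_index=0):
    """-log softmax(scores)[positive_index]: cross-entropy of the positive
    passage's score against the scores of the sampled negatives."""
    m = max(scores)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[positive_index]

# One query: reranker scores for its positive (3.0) and three sampled negatives.
loss = listwise_loss([3.0, 1.0, 0.5, -0.2])
```

Minimising this loss pushes the reranker to score the positive passage above the 40 (or 15) negatives sampled for the query.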

A.2 Out-of-Domain Datasets
On the BEIR benchmark, the teacher retriever is used to mine hard negatives for the synthetic queries on each dataset. The top-100 passages retrieved by the teacher retriever are regarded as the hard negative pool. The student retriever is initialised from the co-condenser-marco checkpoint. On each dataset, we train a student retriever for 10 epochs with a learning rate of 1 × 10−5. The batch size is set to 32, and each query is paired with 7 sampled negatives. Training a single model on each dataset takes about 8 hours using 2 A100 GPUs, and two iterations are used to achieve the best average performance. For GenQ, we follow Thakur et al. (2021) to further finetune the teacher retriever on the synthetic queries of each dataset for 1 epoch with a batch size of 64 and only in-batch negatives.
The reranker is initialised from ERNIE-base and finetuned for 5 epochs with a learning rate of 1 × 10−5 and a batch size of 12. Each query is paired with 15 negatives sampled from the same hard negative pool as above. Training a single model takes about 18 hours using 2 A100 GPUs.
B Baselines on BEIR
BM25 (Robertson and Zaragoza, 2009) is a lexical retriever based on token matching. DocT5query (Nogueira and Lin, 2019) uses a query generator to synthesise queries and appends them to passages as an expansion. DeepCT (Dai and Callan, 2020) uses a BERT model trained on MS-MARCO to compute the weight of each term in the vocabulary, and each passage is represented by keywords multiplied by their term weights. GenQ (Thakur et al., 2021) first trains a dense retrieval model on MS-MARCO and continues to finetune it on synthetic queries with in-batch negatives through contrastive learning. GPL (Wang et al., 2022) uses knowledge distillation to train a dense retriever by learning from a reranker trained on MS-MARCO. PTR (Dai et al., 2023) generates a large number of synthetic queries by prompting an instruction-tuned large language model and trains task-specialised retrievers.
Synthetic query: what was the major achievement of the manhattan project? Originating passage: ...The only cloud hanging over the impressive achievement of the atomic researchers and engineers is what their success truly meant; hundreds of thousands of innocent lives obliterated.

Figure 1: The overview of the noisy self-training algorithm. Line numbers in Algorithm 1 are shown above each arrow type.

Figure 3: Impacts of labelled training data size on the MS-MARCO dev set. Teacher refers to coCondenser.

Figure 5: The overview of adapting the self-training framework to the reranker.

Table 1: A motivating example of the weak relation between a synthetic query and its originating passage, where the top-2 sampled 'negatives' are far more relevant to the query than the originating passage. Negatives and associated logits are from a dense model fine-tuned on MS-MARCO and are normalised for clarity.

Table 2: Results on the MS-MARCO dev set and Natural Questions test set. * indicates results directly copied from Gao and Callan (2022) and Khattab and Zaharia (2020). † indicates our implementation. The best results are marked bold and unavailable results are left blank.

Table 3: Comparison between the teacher and student rerankers. The top-50 and top-100 predictions from the student coCondenser are re-ranked on MS-MARCO and Natural Questions, respectively.

Table 4: Results of different variants compared to the student model on MS-MARCO and Natural Questions.

Table 5: Results on the BEIR benchmark (nDCG@10). Best results are marked bold and second-best results are underlined. * indicates results copied from Thakur et al. (2021). † indicates our implementation. § marks methods using synthetic queries. ¶ marks methods learning from cross-encoder rerankers, which are thus not directly comparable to ours. ×n means these methods train specialised models for each dataset.

Table 10: Results on the BEIR benchmark (nDCG@10) when adopting different pre-trained models. Teacher refers to the model finetuned on MS-MARCO.