Boot and Switch: Alternating Distillation for Zero-Shot Dense Retrieval

Neural 'dense' retrieval models are state of the art for many datasets; however, these models often exhibit limited domain-transfer ability. Existing approaches to adaptation are unwieldy, requiring explicit supervision, complex model architectures, or massive external models. We present $\texttt{ABEL}$, a simple but effective unsupervised method to enhance passage retrieval in zero-shot settings. Our technique follows a straightforward loop: a dense retriever learns from supervision signals provided by a reranker, and subsequently, the reranker is updated based on feedback from the improved retriever. By iterating this loop, the two components mutually enhance one another's performance. Experimental results demonstrate that our unsupervised $\texttt{ABEL}$ model outperforms both leading supervised and unsupervised retrievers on the BEIR benchmark. Meanwhile, it exhibits strong adaptation abilities to tasks and domains that were unseen during training. By either fine-tuning $\texttt{ABEL}$ on labelled data or integrating it with existing supervised dense retrievers, we achieve state-of-the-art results.\footnote{Source code is available at \url{https://github.com/Fantabulous-J/BootSwitch}.}


Introduction
Remarkable progress has been achieved in neural information retrieval through the adoption of the dual-encoder paradigm (Gillick et al., 2018), which enables efficient search over vast collections of passages by factorising the model such that the encoding of queries and passages is decoupled, and calculating the query-passage similarity using a dot product. However, the efficacy of training dual-encoders heavily relies on the quality of labelled data, and these models struggle to maintain competitive performance on retrieval tasks where dedicated training data is scarce (Thakur et al., 2021).
Various approaches have been proposed to enhance dense retrievers (Karpukhin et al., 2020) in zero-shot settings while maintaining the factorised dual-encoder structure, such as pre-training models on web-scale corpora (Izacard et al., 2022) and learning from cross-encoders through distillation (Qu et al., 2021). Other alternatives seek to trade efficiency for performance by using complex model architectures, such as fine-grained token interaction for more expressive representations (Santhanam et al., 2022) and scaling up the model size for better model capacity (Ni et al., 2022). Another line of work trains customised dense retrievers on target domains through query generation (Wang et al., 2022; Dai et al., 2023). This training paradigm is generally slow and expensive, as it employs large language models to synthesise a substantial number of high-quality queries.
In this paper, we present ABEL, an Alternating Bootstrapping training framework for unsupervised dense rEtrievaL. Our method alternates the distillation process between a dense retriever and a reranker by switching their roles as teacher and student across iterations. On the one hand, the dense retriever allows for efficient retrieval due to its factorised encoding, at the cost of compromised model performance. On the other hand, a reranker has no factorisation constraint, allowing for more fine-grained and accurate scoring, but at the cost of intractable search. Our work aims to combine the best of both worlds by equipping the dense retriever with the accurate scoring of the reranker while maintaining search efficiency. Specifically, i) the more powerful but slower reranker is used to assist in the training of the less capable but more efficient retriever; ii) the dense retriever is employed to improve the performance of the reranker by providing refined training signals in later iterations. This alternating learning process is repeated to iteratively enhance both modules.
Compared with conventional bootstrapping approaches (Alberti et al., 2019; Zelikman et al., 2022), wherein the well-trained model itself is used to discover additional solutions for subsequent training iterations, our method treats one model (i.e., the teacher) as a training-data generator to supervise another model (i.e., the student), and their roles as teacher and student are switched at each subsequent step. This mechanism naturally creates a mutual-learning paradigm that enables iterative bidirectional knowledge flow between the retriever and the reranker, in contrast to typical single-step and unidirectional distillation, where a student learns from a fixed teacher (Miech et al., 2021).
Through extensive experiments on various datasets, we observe that ABEL demonstrates outstanding performance in zero-shot settings using only the basic BM25 model for initialisation. Additionally, both the retriever and reranker components involved in our approach can be progressively enhanced through the bootstrapping learning process, with the converged model outperforming more than ten prominent supervised retrievers and achieving state-of-the-art performance. Meanwhile, ABEL is efficient in its training process, exclusively employing sentences from raw texts as queries rather than generating queries with large language models. The use of the simple dual-encoder architecture further contributes to its efficiency. In summary, our contributions are: 1. We propose an iterative approach to bootstrap the abilities of a dense retriever and a reranker without relying on manually created data. 2. The empirical results on the BEIR benchmark show that the unsupervised ABEL outperforms a variety of prominent sparse and supervised dense retrievers. After fine-tuning ABEL using supervised data or integrating it with off-the-shelf supervised dense retrievers, our model achieves new state-of-the-art performance. 3. When applying ABEL to tasks that are unseen in training, we observe remarkable generalisation capabilities in comparison to other, more sophisticated unsupervised dense retrieval methods. 4. To the best of our knowledge, we are the first to show that both a dense retriever and a cross-encoder reranker can be mutually improved in a closed learning loop, without the need for human-annotated labels.

Preliminary
Given a short text as a query $q$, the passage-retrieval task aims at retrieving, from a collection of passages $P$ as the corpus, a set of passages that are most relevant to $q$ and include the necessary information. Optionally, a reranker $R$ is also employed to refine the relevance scores of these retrieved passages.

Dense Retrieval Model
The dense retrieval model (retriever) encodes both queries and passages into dense vectors using a dual-encoder architecture (Karpukhin et al., 2020). Two distinct encoders are applied to transform queries and passages separately; a relevance score is then calculated by a dot product,
$$s(q, p) = E(q; \theta_q)^\top E(p; \theta_p),$$
where $E(\cdot\,; \theta)$ are encoders parameterised by $\theta_p$ for passages and $\theta_q$ for queries. The asymmetric dual-encoder works better than the shared-encoder architecture in our preliminary study. For efficiency, all passages in $P$ are encoded offline, and an efficient nearest-neighbour search (Johnson et al., 2021) is employed to fetch the top-$k$ relevant passages.
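The factorised scoring above can be sketched in plain NumPy. The random vectors below stand in for the encoder outputs $E(q; \theta_q)$ and $E(p; \theta_p)$, and the brute-force argsort stands in for the approximate nearest-neighbour index (e.g., FAISS) used in practice:

```python
import numpy as np

def retrieve_topk(query_vecs, passage_vecs, k):
    """Score every (query, passage) pair with a dot product and
    return the indices of the k highest-scoring passages per query."""
    scores = query_vecs @ passage_vecs.T        # (n_queries, n_passages)
    topk = np.argsort(-scores, axis=1)[:, :k]   # best first
    return topk, scores

# Toy stand-ins for encoder outputs (the real model uses Contriever).
rng = np.random.default_rng(0)
q_vecs = rng.normal(size=(2, 8))                # 2 queries, dim 8
p_vecs = rng.normal(size=(10, 8))               # 10 passages, dim 8
topk, scores = retrieve_topk(q_vecs, p_vecs, k=3)
```

Because passages are encoded independently of queries, `passage_vecs` can be computed once offline and only the query side is encoded at search time.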

Reranking Model
The reranking model (reranker) adopts a cross-encoder architecture, which computes the relevance score between a query and a passage by jointly encoding them with cross-attention. The joint-encoding mechanism is prohibitively expensive to deploy for large-scale retrieval applications. In practice, the joint-encoding model is therefore applied as a reranker to refine the relevance scores of the results returned by the retriever. The relevance score given by the reranker is formalised as
$$s(q, p) = \mathrm{FFN}\big(E(q, p; \phi)\big),$$
where $E(\cdot, \cdot\,; \phi)$ is a pre-trained language model parameterised by $\phi$. In this work, we adopt the encoder of T5 (EncT5) (Liu et al., 2021) as $E$. A start-of-sequence token is appended to each sequence, with its embedding fed to a randomly initialised single-layer feed-forward network (FFN) to calculate the relevance score.
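A minimal sketch of this scoring head: the joint encoder here is a deterministic stub (the real model is EncT5 with cross-attention over the concatenated input), but the shape of the computation, one jointly encoded vector fed to a randomly initialised linear scoring layer, mirrors the description above:

```python
import numpy as np

DIM = 16
rng = np.random.default_rng(0)
W, b = rng.normal(size=(DIM,)), 0.0   # randomly initialised single-layer FFN

def joint_encode(query, passage):
    """Stub for E(q, p; phi): derives a deterministic vector per pair.
    The real reranker cross-attends over '[SOS] query passage'."""
    seed = (sum(map(ord, query)) * 31 + sum(map(ord, passage))) % (2**32)
    return np.random.default_rng(seed).normal(size=DIM)

def rerank_score(query, passage):
    # start-of-sequence embedding -> FFN -> scalar relevance score
    return float(joint_encode(query, passage) @ W + b)

candidates = ["passage a", "passage b", "passage c"]
reranked = sorted(candidates, key=lambda p: rerank_score("some query", p),
                  reverse=True)
```

Note that scoring requires one forward pass per (query, passage) pair, which is why the cross-encoder is only applied to a small retrieved candidate set rather than the whole corpus.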

Alternating Distillation
We propose an unsupervised alternating distillation approach that iteratively boosts the abilities of a retriever and a reranker, as depicted in Fig. 1. Alg. 1 outlines the proposed method with three major steps. Our approach starts with warming up a retriever by imitating BM25. Subsequently, a recursive learning paradigm consisting of two steps is conducted: (1) training a reranker based on the labels extracted from the retriever of the last iteration; (2) refining the dense retriever using training signals derived from the reranker in the last step.

Retriever Warm-up
The training starts with constructing queries from raw texts (line 1) and training a warm-up dense retriever $D^0_\theta$ by imitating an unsupervised BM25 model (lines 3-5).

Query Construction
We use a sentence splitter to chunk all passages into multiple sentences, then consider these sentences as cropping-sentence queries (Chen et al., 2022). Compared to queries synthesised by fine-tuned query generators (Wang et al., 2022) or large language models (Dai et al., 2023), cropping-sentence queries i) can be cheaply scaled up without relying on supervised data or language models and ii) have been shown to be effective in pre-training dense retrievers (Gao and Callan, 2022; Izacard et al., 2022).
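A rough sketch of query construction; the one-line regex splitter is a stand-in for the off-the-shelf sentence splitter, and the minimum-length filter is our own illustrative assumption:

```python
import re

def cropping_sentence_queries(passage, min_words=3):
    """Chunk a passage into sentences and treat each one as a pseudo-query."""
    sentences = re.split(r"(?<=[.!?])\s+", passage.strip())
    return [s for s in sentences if len(s.split()) >= min_words]

passage = ("Dense retrievers encode queries and passages separately. "
           "They rely heavily on labelled data. "
           "BM25 needs no supervision at all.")
queries = cropping_sentence_queries(passage)
```

Each sentence is paired with passages from the corpus later in the pipeline, so no generator model or labelled data is involved at this stage.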
Training Data Initialisation The first challenge to our unsupervised method is how to extract effective supervision signals for each cropping-sentence query to initiate training. BM25 (Robertson and Zaragoza, 2009) is an unsupervised sparse retrieval model that has demonstrated outstanding performance in low-resource and out-of-domain settings (Thakur et al., 2021). Specifically, for a given query $q \in Q$, we use BM25 to retrieve the top-$k$ predictions $P^k_{\mathrm{BM25}}(q)$, among which the highest-ranked $k^+ \in \mathbb{Z}^+$ passages are considered positives $P^+(q)$, while the bottom $k^- \in \mathbb{Z}^+$ passages are treated as hard negatives $P^-(q)$, following Chen et al. (2022),
$$P^+(q) = \{p \mid p \in P^k_{D}(q),\; r(p) \le k^+\}, \qquad P^-(q) = \{p \mid p \in P^k_{D}(q),\; r(p) > k - k^-\},$$
where the initial retriever $D$ is BM25 and $r(p)$ denotes the rank of a passage $p$. Then, we train a warm-up retriever on these extracted training examples.
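Under the cutoffs reported in Appendix A ($k = 50$, top-10 as positives, ranks 46-50 as negatives), the label-extraction rule above reduces to slicing a ranked list; a small sketch:

```python
def extract_training_pairs(ranked_passages, k_pos=10, k_neg=5):
    """Top-ranked passages become positives P+(q); the bottom of the
    retrieved list becomes hard negatives P-(q)."""
    positives = ranked_passages[:k_pos]
    negatives = ranked_passages[-k_neg:]
    return positives, negatives

ranked = [f"p{r}" for r in range(1, 51)]   # BM25 top-50, best first
pos, neg = extract_training_pairs(ranked)
```

The bottom-of-list negatives are "hard" in the sense that BM25 still retrieved them into the top-$k$, so they share surface features with the query while being less relevant than the top-ranked passages.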

Iterative Bootstrapping Training
We iteratively improve the capability of the retriever and the reranker by alternating their roles as teacher and student in each iteration.
In the $t$-th iteration, we first use the most recent retriever $D^{t-1}_\theta$ to retrieve the top-$k$ passages $P^k_{D^{t-1}_\theta}(q)$ that are most relevant to each query $q \in Q$, and generate soft labels $D(q, p; \theta^{t-1})$ for each $(q, p)$ pair accordingly. Then we use all such soft labels to train a reranker $R^t_\phi$, as described in Alg. 1 lines 8-10 and §3.3.1. The second step is to train the $t$-th retriever. We employ $R^t_\phi$ to rerank $P^k_{D^{t-1}_\theta}(q)$ to obtain a refined ranking list, from which updated supervision signals are derived as in Alg. 1 lines 12-13. We train a new retriever $D^t_\theta$ on these examples, as discussed in Alg. 1 line 14 and §3.3.2. Training iterations are repeated until no performance improvement is observed. Note that, in order to mitigate the risk of overfitting the retriever, in every iteration we refine the warm-up retriever $D^0_\theta$ using the newest training examples, while refreshing the top-$k$ predictions and updating the soft labels for reranker training with the improved retriever. Similarly, to avoid the accumulation of errors in label generation, we reinitialise the reranker from pre-trained language models at the start of each iteration, rather than fine-tuning the model obtained from the last iteration. Please refer to Table 3 for the empirical ablations.
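The control flow above can be summarised in a runnable skeleton; `train_reranker`, `rerank`, and `train_retriever` are stubs standing in for the actual training routines, but the loop structure (a fresh reranker each iteration, the retriever always refined from the warm-up checkpoint) follows Alg. 1:

```python
def train_reranker(soft_labels):
    """Stub: the real step minimises KL divergence to the retriever's soft
    labels, re-initialising the reranker from a PLM every iteration."""
    return {"trained_on": len(soft_labels)}

def rerank(reranker, soft_labels):
    """Stub: re-score each query's candidates to produce refined labels."""
    return {q: sorted(scores, reverse=True) for q, scores in soft_labels.items()}

def train_retriever(warmup_retriever, refined_labels):
    """Stub: contrastive training on refined labels, from the warm-up model."""
    return dict(warmup_retriever, trained_on=len(refined_labels))

def alternating_distillation(queries, warmup_retriever, iterations=3):
    retriever = warmup_retriever
    for t in range(1, iterations + 1):
        soft = {q: retriever["label"](q) for q in queries}      # retriever teaches
        reranker = train_reranker(soft)                         # ... the reranker
        refined = rerank(reranker, soft)                        # reranker teaches
        retriever = train_retriever(warmup_retriever, refined)  # ... the retriever
    return retriever, reranker

warmup = {"label": lambda q: [0.2, 0.9, 0.5]}
retriever, reranker = alternating_distillation(["q1", "q2"], warmup)
```

The two re-initialisation choices visible in the skeleton (the retriever restarting from `warmup_retriever`, the reranker rebuilt each iteration) are the ablations reported in Table 3.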

Retriever and Reranker Training
In each iteration, we fine-tune the reranker and then the retriever as follows.
Reranker Training In our preliminary experiments, we observed that using hard labels to train the reranker yields poor results. The reason is that hard labels use binary targets, taking one query-passage pair as positive and the rest as negative, failing to provide informative signals for discerning the subtle distinctions among passages. To address this problem, we use the soft labels generated by a dense retriever $D_\theta$ to guide the reranker training. These labels effectively capture the nuanced semantic relatedness among multiple relevant passages. Specifically, we employ the KL divergence loss,
$$\mathcal{L}_{\mathrm{KL}} = D_{\mathrm{KL}}\big(T(\cdot \mid q, P_q, \theta) \,\|\, S(\cdot \mid q, P_q, \phi)\big),$$
where $P_q$ is a set of passages for query $q$ sampled from the top-$k$ retrievals $P^k_{D_\theta}(q)$, and $T(\cdot \mid q, P_q, \theta)$ and $S(\cdot \mid q, P_q, \phi)$ are the score distributions from the teacher retriever and the student reranker, respectively. Our preliminary experiments show that adding noise to the reranker's inputs (e.g., word deletion) can further enhance performance. For more details, please refer to Table 4 in Appendix B.
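A NumPy sketch of the distillation objective, assuming raw (unnormalised) scores from teacher and student over the same sampled passage set $P_q$:

```python
import numpy as np

def softmax(scores):
    scores = np.asarray(scores, dtype=float)
    scores -= scores.max()                 # numerical stability
    exp = np.exp(scores)
    return exp / exp.sum()

def kl_distill_loss(teacher_scores, student_scores):
    """KL(T || S): pull the student reranker's distribution over P_q
    toward the teacher retriever's soft labels."""
    t, s = softmax(teacher_scores), softmax(student_scores)
    return float(np.sum(t * (np.log(t) - np.log(s))))

loss = kl_distill_loss([2.0, 1.0, 0.1], [0.5, 0.4, 0.3])
```

Unlike a binary cross-entropy on hard labels, the target distribution assigns graded probability mass to every passage in $P_q$, which is what lets the reranker learn the subtle distinctions mentioned above.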

Retriever Training
For each query $q$, we randomly sample one positive $p^+$ and one hard negative $p^-$ from $P^+(q)$ (Eq. 3) and $P^-(q)$ (Eq. 4), respectively. In practice, we use in-batch negatives (Karpukhin et al., 2020) for efficient training. The dense retriever $D_\theta$ is trained by minimising the negative log-likelihood of the positive passages using contrastive learning (CL) (Hadsell et al., 2006),
$$\mathcal{L}_{\mathrm{CL}} = -\frac{1}{|B|}\sum_{i=1}^{|B|} \log \frac{\exp\big(s(q_i, p_i^+)\big)}{\sum_{j=1}^{|B|} \Big(\exp\big(s(q_i, p_j^+)\big) + \exp\big(s(q_i, p_j^-)\big)\Big)},$$
where $|B|$ is the size of a batch $B$. Similarly, we find that injecting noise into the inputs of the retriever during training results in improved performance, as demonstrated in Table 4 in Appendix B.
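The loss with in-batch negatives can be sketched as follows: each query's candidate pool is its own positive and hard negative plus every other query's positive and negative (a DPR-style setup; the temperature-free dot-product scoring is a simplifying assumption here):

```python
import numpy as np

def contrastive_loss(q_vecs, pos_vecs, neg_vecs):
    """Mean negative log-likelihood of each query's own positive, against a
    shared candidate pool of all in-batch positives and hard negatives."""
    cands = np.concatenate([pos_vecs, neg_vecs], axis=0)   # (2|B|, d)
    scores = q_vecs @ cands.T                              # (|B|, 2|B|)
    scores -= scores.max(axis=1, keepdims=True)            # stability
    log_z = np.log(np.exp(scores).sum(axis=1))
    B = len(q_vecs)
    # query i's own positive sits at column i of the candidate pool
    return float(-(scores[np.arange(B), np.arange(B)] - log_z).mean())

# Queries aligned with their own positives give a low loss ...
q = np.array([[10.0, 0.0], [0.0, 10.0]])
pos = np.array([[1.0, 0.0], [0.0, 1.0]])
neg = np.array([[-1.0, 0.0], [0.0, -1.0]])
aligned = contrastive_loss(q, pos, neg)
# ... while mismatched positives give a high loss.
mismatched = contrastive_loss(q[::-1], pos, neg)
```

Sharing the candidate pool across the batch is what makes in-batch negatives cheap: the $2|B|$ passage encodings are computed once and reused as negatives for every query.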
We believe that manipulating words in queries and passages can be seen as a smoothing method that facilitates the learning of generalisable matching signals (He et al., 2020), thereby preventing the retriever from simply relying on examining the text overlap between the cropping-sentence query and passages. Additionally, adding noise tweaks the semantics of texts, which encourages the dense retriever to acquire matching patterns based on semantic similarities in addition to lexical overlaps.

Discussion
The novelty of our approach lies in two aspects: i) We introduce a separate reranker model for training-label refinement. This allows the retriever training to benefit from an advanced reranker, which features a more sophisticated cross-encoding architecture. This design is effective compared to training the retriever on labels extracted from its own predictions (see the blue line in Figure 5(a)); ii) We create a mutual-learning paradigm through iterative alternating distillation. In our case, we consider the retriever as a proposal-distribution generator, which selects relatively good candidates, and the reranker as a reward-scoring model, which measures the quality of an example as a good answer. At each iteration, the improved reranker can be used to correct the predictions from the retriever, therefore reducing the number of inaccurate labels in the retriever's training data. Meanwhile, the improved retriever is more likely to include more relevant passages within its top-$k$ predictions and to provide more nuanced soft labels, which enhances the reranker training and thus increases the chance of the reranker finding correct labels in the next iteration. As the training continues, (1) the candidate set of correct answers can be narrowed down by the retriever (i.e., better proposal distributions), as shown in Fig. 4, and (2) the accuracy of finding correct answers from such sets can be increased by the reranker, as shown in Fig. 2(b). Overall, our framework (i.e., ABEL) facilitates synergistic advancements in both components, ultimately enhancing the overall effectiveness of the retrieval system.

Dataset
Our method is evaluated on the BEIR benchmark (Thakur et al., 2021), which contains 18 datasets spanning multiple domains, such as Wikipedia and biomedical text, and diverse task formats, including question answering, fact verification, and paraphrase retrieval. We use nDCG@10 (Järvelin and Kekäläinen, 2002) as our primary metric, and the average nDCG@10 score over all 18 datasets is used for comprehensive comparison between models. BEIR-13 (Formal et al., 2021) and PTR-11 (Dai et al., 2023) are two subsets of tasks: BEIR-13 excludes CQADupStack, Robust04, Signal-1M, TREC-NEWS, and BioASQ from the calculation of the average score, and PTR-11 further removes NQ and Quora. We also partition the datasets into query-search (e.g., natural questions) and semantic-relatedness (e.g., verification) tasks, in accordance with Santhanam et al. (2022).
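For reference, nDCG@10 for a single query can be computed as follows (a linear-gain variant with a $\log_2$ discount; BEIR's official evaluation uses pytrec_eval, so treat this as an illustrative sketch):

```python
import math

def ndcg_at_10(ranked_ids, relevance):
    """ranked_ids: system ranking, best first.
    relevance: passage id -> graded relevance (absent ids count as 0)."""
    def dcg(gains):
        return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))
    gains = [relevance.get(pid, 0) for pid in ranked_ids[:10]]
    ideal_dcg = dcg(sorted(relevance.values(), reverse=True)[:10])
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

rels = {"d1": 3, "d2": 2, "d3": 1}
perfect = ndcg_at_10(["d1", "d2", "d3"], rels)   # ideal ordering -> 1.0
worse = ndcg_at_10(["d3", "d2", "d1"], rels)
```

The per-dataset score is the mean over that dataset's queries, and the benchmark-level number reported here is the mean over the 18 (or 13, or 11) datasets.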

Experimental Settings
We use Contriever to initialise the dense retriever.
The passages from all tasks in BEIR are chunked into cropping-sentence queries, on which a single retriever, ABEL, is trained. We initialise the reranker from t5-base-lm-adapt (Raffel et al., 2020) and train a single reranker, ABEL-Rerank, similarly. We conduct the alternating distillation loop for three iterations, observing that the performance of both the retriever and the reranker converges, and take ABEL from iteration 2 and ABEL-Rerank from iteration 3 for evaluation and model comparison. We further fine-tune ABEL on MS-MARCO (Bajaj et al., 2016) to obtain ABEL-FT, following the same training recipe as Izacard et al. (2022). In addition, we also evaluate ABEL and ABEL-Rerank on various datasets that are unseen during training to test generalisation ability. More details are in Appendix A.

Experimental Results
Retriever Results Both unsupervised and supervised versions of our model are compared with a range of corresponding baseline models in Table 1.
The unsupervised ABEL outperforms various leading dense models, including models supervised with MS-MARCO training queries. The supervised model, ABEL-FT, achieves 1.0% better overall performance than ABEL. ABEL-FT also surpasses models using the target corpora for pre-training (COCO-DR), employing fine-grained token interaction (ColBERTv2), using sparse representations (SPLADE++), with larger sizes (GTR-XXL), and using sophisticated training recipes with diverse supervision signals (DRAGON+).
On semantic-relatedness tasks, such as Signal-1M, Climate-FEVER, and SciFact, ABEL generally achieves results superior to other supervised dense retrievers. For query-search tasks, particularly question-answering tasks like NQ and HotpotQA, ABEL underperforms many dense retrievers. We attribute such outcomes to differences in query style. Semantic-relatedness tasks typically use short sentences as queries, which aligns with the cropping-sentence query format employed in our work. However, query-search tasks often involve natural questions that deviate significantly from cropping-sentence queries, and this format mismatch leads to the inferior performance of ABEL. For lexical-matching tasks, such as Touché-2020, ABEL surpasses the majority of dense retrievers by a considerable margin. We attribute this success to the model's ability to capture salient phrases, which is facilitated by learning supervision signals from BM25 in the retriever warm-up training and is well preserved in the following training iterations. Finally, ABEL outperforms GPL and PTR, even though they incorporate high-quality synthetic queries and cross-encoder distillation in training. This observation demonstrates that a retriever can attain promising results in zero-shot settings without relying on synthetic queries.
In the supervised setting, the performance of ABEL on natural questions can be improved by fine-tuning it on supervised data. The major performance gains come from tasks involving human-like queries, such as NQ (42.0 to 50.2) and DBPedia (37.5 to 41.4). On other datasets with short sentences as queries (e.g., claim verification), the performance of ABEL-FT degrades but remains comparable to other supervised retrievers. This limitation can be alleviated by combining ABEL and ABEL-FT, thereby achieving performance improvements on both types of tasks, as illustrated in the last two bars of Figure 3.

Reranker Results
Figure 2 shows the averaged reranking performance on 9 subsets of BEIR, excluding FEVER and HotpotQA from PTR-11. As shown in Figure 2(b), the reranker achieves improvements as the bootstrapping iterations progress, providing strong evidence for the effectiveness of our iterative alternating distillation approach. The final reranker model, ABEL-Rerank (i.e., $t = 3$), as illustrated in Figure 2(a), enhances the performance of ABEL by 1.6%, surpassing supervised models of similar parameter size. It is noteworthy that ABEL-Rerank outperforms unsupervised SGPT (Muennighoff, 2022) and zero-shot UPR (Sachan et al., 2022), despite having far fewer parameters, and is comparable to PTR, which creates dedicated models for each task and employs high-quality synthetic queries. As model size increases, we consistently observe improvements with ABEL-Rerank, and it outperforms TART (Asai et al., 2022) and MonoT5 (Nogueira et al., 2020) using only a fraction of their parameters. This finding demonstrates the potential of incorporating more capable rerankers to train better retrievers; we leave this exploration to future work. Please refer to Tables 5 and 7 in Appendix D for further details.

Cross-Task Results
As shown in Table 2, when directly evaluating ABEL on various tasks unseen during training, it consistently achieves significant improvements over Contriever (+7.8%) and outperforms BM25 and other advanced unsupervised retrievers. These results show that ABEL captures matching patterns between queries and passages that generalise effectively to unseen tasks, rather than memorising the training corpus to achieve good performance. Please refer to Appendix E for more results.

Analysis
Pre-trained Models Table 3 #1 compares our approach with DRAGON when using different pre-trained models for initialisation. ABEL outperforms DRAGON consistently using aligned pre-trained models, with up to +0.6% gains. Moreover, our method exhibits continued improvement as we use more advanced pre-trained checkpoints. This demonstrates that our approach is orthogonal to existing unsupervised pre-training methods, and further gains are expected when more sophisticated pre-trained models become available.
Training Corpus We compare models trained using cropped sentences from different corpora as training data. As shown in Table 3 #2, the model trained on the MS-MARCO corpus is significantly better than the vanilla Contriever (+5.1%) but inferior to the one using the diverse corpora from BEIR; we believe the diversity of the corpora used in training is one factor in our success.

Model Re-initialisation
We use re-initialisation to avoid the accumulation of biased errors from early iterations. At the start of each iteration, we re-initialise the retriever with the warm-up retriever and the reranker with pre-trained language models, rather than continuously fine-tuning the models obtained from the last iteration. Table 3 #3 shows that overall performance is increased by 0.8% with this re-initialisation technique.
Combination with Supervised Models We investigate whether ABEL can advance supervised retrievers. We merge the embeddings from ABEL with those of different supervised models through concatenation. Specifically, for a given query $q$, a passage $p$, and dense retrievers with encoders $E^i_q$ and $E^i_p$, we compute the relevance score as
$$s(q, p) = \sum_i E^i_q(q)^\top E^i_p(p).$$
The results shown in Figure 3 indicate that ABEL can be easily integrated with other models to achieve significant performance improvements, with an average increase of up to 4.5% on RetroMAE and even a 1% gain on ABEL-FT. Besides, the benefit is also remarkable (+10%) when combining ABEL with weak retrievers (i.e., ANCE). Overall, we observe that the ensembling can result in performance gains on both query-search and semantic-relatedness tasks, as demonstrated in Table 11 in Appendix F.

Figure 4 presents the comparison of ABEL's performance throughout all bootstrapping iterations against BM25 and the supervised Contriever. We observe that the accuracy of the retriever consistently improves as the training iteration $t$ progresses. Specifically, ABEL matches the performance of the supervised Contriever at the first iteration, and further gains are achieved with more iterations $t \le 2$. The performance converges at iteration 3, where the results on six tasks are inferior to those achieved at iteration 2.
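As a sanity check on the ensembling method described above: concatenating each model's query and passage embeddings and taking a single dot product is exactly the sum of the per-model dot-product scores (in practice each model's embeddings may be L2-normalised first; that detail is omitted here):

```python
import numpy as np

rng = np.random.default_rng(0)
# Per-model embeddings for one (query, passage) pair; dimensions may differ.
q1, p1 = rng.normal(size=4), rng.normal(size=4)   # e.g., from ABEL
q2, p2 = rng.normal(size=6), rng.normal(size=6)   # e.g., from a supervised model

s_concat = np.concatenate([q1, q2]) @ np.concatenate([p1, p2])
s_sum = q1 @ p1 + q2 @ p2                         # sum of per-model scores
```

This equivalence is what allows the combined model to keep the dual-encoder factorisation: the concatenated passage vectors can still be indexed offline and searched with a single inner-product lookup.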
Please refer to Figure 6 in Appendix G for results on each individual task.

Self Supervision We investigate the necessity of using the reranker as an expert in ABEL. Specifically, we use the top-$k$ predictions of the latest retriever to extract training data at each iteration, instead of employing a separate reranker (i.e., without lines 8-10 in Alg. 1). The blue line in Figure 5(a) indicates that the retriever struggles to improve when using itself as the supervisor. By investigating a small set of generated labels, we notice that the extracted positive passages for most queries quickly converge to a stable set, failing to offer new signals in subsequent training rounds. This highlights the essential role of the expert reranker, which iteratively provides more advanced supervision signals.

Synthetic Queries
We assess the utility of synthetic queries in our approach by replacing cropping-sentence queries with synthetic queries from a query generator fine-tuned on MS-MARCO. The results in Figure 5(a) show that using synthetic queries is less effective, while exhibiting similar trends, with performance improving consistently as the iterative alternating distillation progresses. Splitting the task groups, we observe in Figure 5(b) that synthetic queries yield a larger performance drop on semantic-relatedness tasks than on query-search tasks. We attribute this disparity to the stylistic differences between the training and test queries. Synthetic queries are similar in style to natural questions, akin to those found in question-answering datasets. In contrast, semantic-relatedness tasks usually involve short-sentence queries (e.g., claims) that are closer to cropping-sentence queries. This finding emphasises the importance of aligning the formats of training queries with test queries in zero-shot settings. Please refer to Figure 7 in Appendix H for per-task comparisons.

Related Work
Neural Information Retrieval Neural retrievers adopt pre-trained language models and follow a dual-encoder architecture (Karpukhin et al., 2020) to generate semantic representations of queries and passages and then calculate their semantic similarities. Several effective techniques have been proposed to advance neural retrieval models, such as hard-negative mining (Xiong et al., 2021), retrieval-oriented pre-training objectives (Izacard et al., 2022), and multi-vector representations (Khattab and Zaharia, 2020). All of these approaches require supervised training data and suffer from performance degradation on out-of-domain datasets. Our work demonstrates that an unsupervised dense retriever can outperform a diverse range of state-of-the-art supervised methods in zero-shot settings.

Zero-shot Dense Retrieval
Recent research has demonstrated that the performance of dense retrievers in out-of-domain settings can be improved through the use of synthetic queries (Ma et al., 2021; Gangi Reddy et al., 2022; Dai et al., 2023). Integrating distillation from cross-encoder rerankers further advances the current state-of-the-art models (Wang et al., 2022). However, all of these methods rely on synthetic queries, which generally implies expensive inference costs from the use of large language models. Moreover, the quality of synthetic queries remains a concern, although it can be improved with further effort, such as fine-tuning the language model on high-quality supervised data (Wei et al., 2022). In contrast, our work has no such reliance and thus offers higher training efficiency. We show that effective large-scale training examples can be derived from raw texts in the form of cropping-sentence queries (Chen et al., 2022), and that their labels can be iteratively refined by the retriever and the reranker to enhance the training of the other model.
Iterated Learning Iterated learning refers to a family of algorithms that iteratively use previously learned models to update training labels for subsequent rounds of model training. These algorithms may also involve filtering out low-quality examples by assessing whether a solution aligns with the desired objective (Alberti et al., 2019; Zelikman et al., 2022; Dai et al., 2023). This concept can also be extended to multiple models. One example is the iterated expert introduced by Anthony et al. (2017), wherein an apprentice model learns to imitate an expert and the expert builds on the improved apprentice to find better solutions. Our work adapts such a paradigm to retrieval tasks in a zero-shot manner, where a notable distinction from previous work is the novel iterative alternating distillation process: in each iteration, the roles of the retriever and the reranker as apprentice and expert are alternated, enabling bidirectional knowledge transfer to encourage mutual learning.

Conclusion
In this paper, we introduced ABEL, an unsupervised training framework that iteratively improves both retrievers and rerankers. Our method enhances a dense retriever and a cross-encoder reranker in a closed learning loop by alternating their roles as teacher and student. The empirical results on various tasks demonstrate that this simple technique can significantly improve the capability of dense retrievers without relying on any human-annotated data, surpassing a wide range of competitive sparse and supervised dense retrievers. We believe that ABEL is a generic framework that can be easily combined with other retrieval-augmentation techniques and can benefit a range of downstream tasks.

Limitations
The approach proposed in this work incurs additional training costs for the refinement of training labels and the iterative distillation process, compared to standard supervised dense-retriever training. The entire training pipeline requires approximately one week to complete on a server with 8 A100 GPUs, a relatively modest configuration by typical academic standards.
We focused on the standard dual-encoder paradigm and have not explored more advanced architectures, such as ColBERT, which offer more expressive representations. We are interested in investigating whether incorporating these techniques would yield additional benefits for our approach. Furthermore, existing research (Ni et al., 2022) has demonstrated that increasing the model size can enhance the performance and generalisability of dense retrievers. Our analysis also shows that scaling up the model size improves reranking performance. Therefore, we would like to see whether applying the scaling effect on the retriever side can yield further performance improvements.
Moreover, we mainly examined our method on the BEIR benchmark. Although BEIR covers various tasks and domains, there is still a gap to industrial scenarios regarding the diversity of the retrieval corpora, such as those involving web-scale documents. We plan to explore scaling up the corpus size in future work. Additionally, BEIR is a monolingual English benchmark, and we are interested in validating the feasibility of our method in multilingual settings.

A.1 Implementation Details
For training queries, we divide the passages in the corpus of each task in BEIR into individual sentences, treating them as cropping-sentence queries. For datasets with a large corpus (e.g., Climate-FEVER), we randomly select 2 million sentences for training; for datasets with fewer passages (~5k-50k), we use all of them. For each query, we follow Chen et al. (2022) to extract its positive and negative passages from the top-$k$ predictions of a specific retriever, where the top-10 are considered positives while passages ranked between 46 and 50 are regarded as negatives. For the reranking process in line 12 of Algorithm 1, we always rerank the top-100 retrievals returned by a specific retriever, from which refined labels are extracted using the above rules.
We use Contriever to initialise the dense retriever and train it for 3 epochs on 8 A100 GPUs, with a per-GPU batch size of 128 and a learning rate of 3 × 10^-5. Each query is paired with one positive and one negative passage, together with in-batch negatives for efficient training. A single retriever, ABEL, is trained on the union of queries from all tasks in BEIR. We initialise the reranker from the t5-base-lm-adapt (Raffel et al., 2020) checkpoint. For each query, we sample one positive and 7 negative passages. Similarly, we train a single reranker, ABEL-Rerank, with a batch size of 64 and a learning rate of 3 × 10^-5 for 20k steps, roughly 1.2k steps per task. We conduct the iterative alternating distillation for three iterations and take ABEL from iteration 2 and ABEL-Rerank from iteration 3 for evaluation and result comparison. We set the maximum query and passage lengths to 128 and 256 for training, and both input lengths to 512 for evaluation.
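Training with in-batch negatives amounts to an InfoNCE-style objective over the batch's similarity matrix. The sketch below (a NumPy illustration, not the paper's code) shows the in-batch part only; the explicit sampled negative mentioned above would simply extend the score matrix with additional columns:

```python
import numpy as np

def in_batch_contrastive_loss(q, p, temperature=1.0):
    """InfoNCE loss with in-batch negatives.

    q: (B, d) query embeddings; p: (B, d) passage embeddings, where
    p[i] is the positive for q[i] and every p[j], j != i, serves as
    a negative for q[i].
    """
    scores = q @ p.T / temperature                       # (B, B) dot-product similarities
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    log_norm = np.log(np.exp(scores).sum(axis=1, keepdims=True))
    log_probs = scores - log_norm                        # row-wise log-softmax
    return float(-np.mean(np.diag(log_probs)))           # NLL of the aligned pairs
```

When queries match their aligned passages far better than the other passages in the batch, the loss approaches zero; when they match a non-aligned passage better, it grows large.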
We further fine-tune ABEL on MS-MARCO (Bajaj et al., 2016) using in-batch negatives for 10k steps, with a batch size of 1024 and a learning rate of 1 × 10^-5. The maximum query and passage lengths for training are set to 32 and 128, respectively. Following Izacard et al. (2022), we first train an initial model with each query paired with one gold positive and a randomly sampled negative. We then mine hard negatives with this model and retrain a second model, ABEL-FT, in the same manner, but sampling a hard negative 10% of the time.
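The "hard negative 10% of the time" schedule can be sketched as a simple mixture over two negative pools; the function below is an illustrative assumption about how such sampling is typically implemented, not the paper's actual code:

```python
import random

def sample_negative(random_pool, hard_pool, hard_prob=0.1, rng=random):
    """Draw a negative passage id: a mined hard negative with probability
    `hard_prob`, otherwise a randomly sampled negative."""
    pool = hard_pool if rng.random() < hard_prob else random_pool
    return rng.choice(pool)
```

Mixing in only a small fraction of hard negatives is a common way to gain their training signal while avoiding the instability that an all-hard-negative diet can cause.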

A.2 Baseline Retrievers
We compare our method with a wide range of unsupervised and supervised models. Unsupervised models include: (1) BM25 (Robertson and Zaragoza, 2009); (2) Contriever (Izacard et al., 2022), which is pre-trained on unlabelled text with contrastive learning; (3) SimCSE (Gao et al., 2021), which learns unsupervised sentence representations contrastively by taking the encodings of a single sentence under different dropout masks as positive pairs; (4) REALM (Guu et al., 2020), which trains unsupervised dense retrievers using masked language modelling signals from a separate reader component. Supervised models include: (1) ANCE (Xiong et al., 2021), trained on MS-MARCO with self-mined dynamic hard negatives; (2) Contriever (MS) and COCO-DR (Yu et al., 2022), which are first pre-trained on unlabelled corpora with contrastive learning and then fine-tuned on MS-MARCO; (3) RetroMAE (Xiao et al., 2022), which uses masked auto-encoding for pre-training and MS-MARCO for fine-tuning; (4) GTR-XXL (Ni et al., 2022), ColBERTv2 (Santhanam et al., 2022) and SPLADE++ (Formal et al., 2022), which use a significantly larger model size, multi-vector representations, and sparse representations, respectively, along with distillation from a cross-encoder on MS-MARCO; (5) DRAGON+ (Lin et al., 2023), which learns progressively from the diverse supervision provided by the above models on MS-MARCO with both cropping-sentence and synthetic queries; (6) QGen: GPL (Wang et al., 2022) and PTR (Dai et al., 2023), which create customised models for each target task using synthetic queries and pseudo relevance labels.

more than one iteration (i.e., t = 2, 3) improves the performance significantly. ABEL-Rerank (t = 2) performs best on more tasks (10/18), while ABEL-Rerank (t = 3) achieves a marginally higher overall score.

F Combination with Supervised Models
Table 11 shows the results on each individual task when combining ABEL with supervised dense retrievers. The findings indicate that on both types of tasks (i.e., query-search and semantic-relatedness), ensembling with ABEL improves the performance of all supervised retrievers, with the gains generally being more significant on semantic-relatedness tasks. This demonstrates the complementarity between ABEL and existing supervised dense retrievers, which are commonly trained on labelled data in the form of natural questions.
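The exact fusion formula is not spelled out in this section; a common recipe for ensembling two retrievers, shown here purely as an illustrative sketch, is min-max normalisation of each run's scores followed by linear interpolation (the function names and the 0.5 weight are our own assumptions):

```python
def min_max_normalise(scores):
    """Rescale a {passage_id: score} run to the [0, 1] range."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # guard against a constant run
    return {pid: (s - lo) / span for pid, s in scores.items()}

def ensemble(run_a, run_b, weight=0.5):
    """Fuse two retrievers' runs by interpolating normalised scores
    over the union of retrieved passages; missing passages score 0."""
    a, b = min_max_normalise(run_a), min_max_normalise(run_b)
    pids = set(a) | set(b)
    fused = {pid: weight * a.get(pid, 0.0) + (1 - weight) * b.get(pid, 0.0)
             for pid in pids}
    return sorted(fused, key=fused.get, reverse=True)  # best first
```

Because the two retrievers need not return the same passages, fusing over the union lets a passage found by only one model still surface in the final ranking.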

G Effects of Bootstrapping
Figure 6 shows the detailed results on each individual task throughout the iterative alternating distillation process. We observe that ABEL-1 outperforms ABEL-0 on almost all datasets, and that the performance consistently improves from ABEL-1 to ABEL-2. For

Figure 1: The overview of the alternating bootstrapping training approach for zero-shot dense retrieval.

Figure 2: Reranking results on BEIR, with (a) the comparison of models with various sizes and (b) the accuracy of the reranker in iteration t using the base model.

Figure 3: The comparison of combining ABEL and supervised dense retrievers (Model & ABEL) on BEIR.
Figure 4:

Figure 5: Results on BEIR when removing the reranker component or using synthetic queries for training, where QS=Query-Search and SR=Semantic-Relatedness.

Figure 7: The comparison of cropping-sentence queries and synthetic queries on each task.

Table 1: Zero-shot retrieval results on BEIR (nDCG@10). The best and second-best results are marked in bold and underlined. Methods that train dedicated models for each dataset are noted with †. Highlighted rows represent semantic-relatedness tasks. QGen. means query generator. Please refer to Appendix A.2 for baseline details.

Table 2: Zero-shot cross-task retrieval results of unsupervised models based on nDCG@10. The best and second-best results are marked in bold and underlined.

Table 4: Study of the effects of injecting input noise and using soft labels on model training. Average performance on the BEIR benchmark is reported.

Table 7: The comparison of bootstrapping on the reranking performance of ABEL-Rerank on individual tasks. Columns are the per-dataset abbreviations (FQ, SF, AA, CF, DB, CQ, QU, SD, FE, NF, TC, T2, HP, NQ, RB, TN, SG, BA) and the average (Avg.).

Table 8: Domains and task formats of cross-task evaluation datasets.

Table 10: Zero-shot cross-dataset retrieval results of unsupervised models on open-domain question-answering datasets. Top-20 and top-100 retrieval accuracy on the test set of each dataset is reported. The best and second-best results are marked in bold and underlined. Unavailable results are denoted with -. Results reported by Ram et al. (2022) are denoted with †.
We further evaluate ABEL on five open-domain question-answering datasets that were not encountered during training, to test its cross-dataset generalisation ability. We compare ABEL with a wide