Relevance-assisted Generation for Robust Zero-shot Retrieval



Introduction
Despite strong in-domain performance, dense retrievers have shown poor generalization to out-of-domain (OOD) zero-shot tasks where no training queries are available (Thakur et al., 2021). To enable training, pseudo-query generation (PQG) (Ma et al., 2021; Liang et al., 2020) has shown promising results, by generating in-domain pseudo queries Q̂ from a target corpus D.
However, we show that Q̂ are often irrelevant to the documents for which they were generated, and that a single document vector from the document encoder fine-tuned on Q̂ is often insufficient. Figure 1(a) illustrates the two limitations of the standard PQG approach, and Figure 1(b) our solutions, discussed as follows.
To tackle the two limitations, we propose Relevance-assisted Multi-query Domain Adaptation, or RaMDA. First, for relevance-guided generation, we generate relevance explanations Z_d (e.g., keywords explaining the relevance of the given document to the queries to be generated). Second, guided by Z_d, we generate multiple queries that form a more relevant and comprehensive Q̂. To address the second issue, we augment the single vector from d with varying numbers of vectors from Q̂, denoted by v_d and v_q̂i, respectively. This enables the document to be matched to diverse relevant queries at test time.
We conduct experiments on BEIR benchmarks, which include diverse out-of-domain retrieval tasks.
The results show that, compared to the baseline PQG, our proposed RaMDA increases nDCG@10 and Recall@100 by 2.4 pt and 4.6 pt on average, respectively. Further analyses show that our generated queries better approximate gold queries and capture diverse queries.
Our contributions are threefold: 1. We analyze existing PQG methods and identify their term frequency bias and diversity bias (§3.1).

Related Work
To address retrieval across diverse domains, dense retrievers have been trained on large-scale open-domain corpora, in a supervised manner (Karpukhin et al., 2020; Xiong et al., 2021; Hofstätter et al., 2021) when relevance annotations are available (e.g., MS MARCO (Nguyen et al., 2016)), or with self-supervised learning where such annotations are absent (Lee et al., 2019; Izacard et al., 2022). However, dense retrievers have shown poor performance when tested on specialized out-of-domain datasets, due to distribution shifts (Thakur et al., 2021; Yu et al., 2022). Toward improved generalization, we discuss two approaches that tackle the challenge of distribution shifts: 1) improving training and 2) robustifying inference.

Improving Training for Better Adaptation
For improved training, existing work can be categorized into those pursuing domain invariant and domain-tailored learning.
The former aims to reduce the representation gap between source and target domains by training a domain classifier that distinguishes source from target, based on which the encoder adversarially learns domain-independent features (Xin et al., 2022). Recently, COntinuous COntrastive pretraining (COCO) (Yu et al., 2022) of a language model on the target corpus, followed by implicit Distributionally Robust Optimization (iDRO), achieved state-of-the-art results in this direction. However, as universal features from COCO-DR may not be effective for some target corpora, we adopt COCO-DR but add domain-specific adaptation by combining it with domain-tailored learning.
In contrast, domain-tailored learning aims to produce a domain-specific encoder by fine-tuning the encoder to better fit each target domain. To enable fine-tuning, relevant query-document pairs should be constructed to build a training dataset, by devising pseudo-queries for each document in the corpus. To this end, pseudo-queries have been generated either by heuristic rules or by a trained generator. For the former, TSDAE (Wang et al., 2021a) randomly injects noise into the document, while for the latter, GenQ (Ma et al., 2021) and GPL (Wang et al., 2021b) leverage a pseudo-query generator trained on MS MARCO, resulting in better adaptation performance. While also employing a trained generator, our distinction is ensuring the relevance of pseudo-queries.

Robustifying Inference by Increasing Model Capacity
In another dimension, domain shifts can be tackled by increasing the model capacity, through enriching query-document interactions or ensembling multiple retrievers, discussed as follows.
Beyond the similarity between a pair of single vectors, which has limited capacity (Luan et al., 2021), query-document matching can be extended to term-level interaction. A cross-encoder (Guo et al., 2016) can capture full interactions between query and document terms, though it does not scale to our target tasks since documents cannot be indexed. ColBERT (Khattab and Zaharia, 2020), with late interaction, is an indexable alternative with comparable performance, which we adopt as a baseline. Ours shares the same motivation of enriching interaction but distinguishes itself by making the interaction more concise via Q̂, showing better performance with little index overhead.
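As a reference point, the late-interaction matching described above can be sketched as follows. This is a minimal version over plain Python lists with an unnormalized dot-product similarity; the function name is ours, and ColBERT itself operates on contextualized BERT term embeddings with normalized similarities.

```python
def late_interaction_score(query_term_vecs, doc_term_vecs):
    """ColBERT-style late interaction: each query term vector is matched to
    its most similar document term vector (MaxSim), and the per-term maxima
    are summed into a single query-document relevance score."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    return sum(max(dot(q, d) for d in doc_term_vecs) for q in query_term_vecs)
```

Because the document side is just a bag of term vectors, documents remain indexable, at the cost of storing one vector per term.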
While term-level interaction enriches relevance signals via multiple terms, such signals can also be captured from multiple retrievers by ensembling (Gao et al., 2021). From this view, ours can be seen as introducing another retriever, gaining the benefit of such signals from two complementary text representations, Q̂ and D. While showing performance comparable to the standard ensemble, ours, when combined with it, further enhances state-of-the-art performance.

Method
Given a document d in a target corpus D, PQG aims to generate pseudo-queries Q̂_d = {q̂_i}_{i=1}^{|Q̂_d|} as alternatives to the gold queries Q_d. Following previous work, we employ T5 (Raffel et al., 2020) as the backbone generator.

Motivation: Distribution Shift on PQG
Desirably, a robust PQG method should model p(Q̂_d | D), such that the sampled Q̂_d closely approximates Q_d. However, as we will show, PQG often fails to generalize to OOD settings. We hypothesize that this failure is driven by two biases from the source domain. First, term frequency bias: PQG can be biased toward generating terms that occur frequently in the source domain, and thus fail to generate rarely observed terms. Second, diversity bias: the source domain may have short passages, where the topics of queries naturally coincide. When target domains have long documents covering a diverse set of topics, PQG trained on the source domain would generate a homogeneous set of queries covering only a single main topic.
We conduct a preliminary analysis of existing PQG approaches with respect to these biases. We first quantify the two biases and categorize OOD datasets in terms of them. Similar to Wang et al. (2022), the term frequency bias is measured by max_{t∈q} 1/(1 + log(1 + DF_t)), where q denotes a query (or a pseudo-query) in the target domain and DF_t denotes the document frequency of t, i.e., how many documents in the source domain contain t in their relevant queries. For diversity bias, we measure the maximum cosine distance between the embeddings of any two relevant queries (or pseudo-queries) for the same document. Figure 2(a) visualizes datasets with respect to the two bias metrics. We can observe that, while some OOD datasets share similar distributions to MS MARCO, others deviate significantly from it in terms of bias characteristics, namely Climate-FEVER, TREC-COVID, SCIDOCS, and NFCorpus. With the goal of debiasing PQG, we adopt these four datasets, which demonstrate clear distribution shifts, denoted as "BEIR-BiasShift", in our experiments.
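The two bias metrics above can be computed as follows. This is a sketch with hypothetical names, assuming natural log and plain-list embeddings; `source_df` is assumed to map a term to its document frequency DF_t in source-domain relevant queries.

```python
import math

def term_frequency_bias(query_terms, source_df):
    """Rarity score of a query: max over its terms of 1 / (1 + log(1 + DF_t)).
    DF_t counts source-domain documents whose relevant queries contain t;
    higher scores mean the query relies on terms rarely seen in the source."""
    return max(1.0 / (1.0 + math.log(1.0 + source_df.get(t, 0)))
               for t in query_terms)

def diversity_bias(query_embeddings):
    """Maximum cosine distance between any two relevant (pseudo-)queries
    of the same document; higher means a more diverse query set."""
    def cos_dist(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return 1.0 - dot / (nu * nv)
    return max(cos_dist(u, v)
               for i, u in enumerate(query_embeddings)
               for v in query_embeddings[i + 1:])
```

A query containing a term never seen in the source domain scores the maximum rarity of 1.0, while queries built from frequent source-domain terms score near 0.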
For an efficient preliminary analysis, we focus on the small-corpus datasets among the four, NFCorpus and TREC-COVID, denoted as "BEIR-BiasShift-Small". Figure 2(b) compares terms in gold and synthesized queries in the target domain in terms of term frequency bias, denoted as Q (x-axis) and Q̂ (y-axis) in the figure, respectively. Desirably, the two distributions should be identical (as in the dotted diagonal line y = x). The figure shows that gold query terms in Q are rarely observed in the source distribution, but Q̂ from the baseline PQG model (shown in red, skewing below the optimal line) fails to generate the rare terms. In terms of diversity bias, Figure 2(c) compares the semantic diversity of Q and Q̂, where Q̂ should be as diverse as Q. Results show that the baseline PQG suffers from the bias, showing significantly lower diversity than the gold queries.
Our hypothesis is that biased queries, as observed above, negatively affect generalization and should be pruned, to allow the retriever to learn from an unbiased set Q̂.

Relevance-assisted Multi-query Generation
To this end, our distinction is to decompose the generation of Q̂_d into relevance explanation and relevance-guided generation. First, we generate an explanation of the relevance between d and the query to be generated, as the set of terms Z_d shared by the relevant d-Q_d pairs. Next, we leverage Z_d to guide the generator to sample an improved Q̂_d that includes relevant terms for d, thereby enhancing generalization. Alternatively, one may over-generate then filter (i.e., post-filtering), which we denote as GenQ + RTF in Figure 2. RTF refers to round-trip filtering (Dai et al., 2022), which approximates the relevance of a generated q̂_i by checking whether a dense retriever ranks d at top-1 when using q̂_i as a query. However, this is not only expensive, requiring repeated decoding and ranking, but also aggravates the biases by filtering out rarely observed query terms and diverse query terms, as shown by the blue lines in Figure 2(b) and (c).
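The RTF baseline reduces to a simple loop. This is a sketch; `retrieve` is an assumed interface that returns a ranked list of document ids for a query.

```python
def round_trip_filter(doc_id, pseudo_queries, retrieve):
    """Post-filtering baseline (round-trip filtering): keep a generated
    pseudo-query only if the dense retriever ranks its source document
    at top-1 when the pseudo-query is issued as a query."""
    kept = []
    for q in pseudo_queries:
        ranking = retrieve(q)  # one full retrieval pass per pseudo-query
        if ranking and ranking[0] == doc_id:
            kept.append(q)
    return kept
```

The per-pseudo-query retrieval pass is what makes this approach expensive, since every candidate must be decoded and then ranked against the whole corpus.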
In contrast, we propose to filter preemptively, by decoding q̂_i guided by Z_d. Among the many relevance explanations surveyed in Anand et al. (2022), we employ SPLADE (Formal et al., 2021), which generates Z_d with weights λ_{Z_d} on terms, based on its strong empirical results. Given Z_d, our pseudo-query generator decodes q̂_i ∈ Q̂ guided by Z_d, via argmax_{q′} p(q′ | Z_d, d). For p(q′ | Z_d, d), we add λ_{Z_d} to the output logits of the decoder, followed by softmax normalization, such that terms with high SPLADE scores are more likely to be generated in a pseudo-query. As a result, ours better alleviates the distribution shifts, as shown by the green lines in Figure 2(b) and (c). Given Q̂, standard PQG approaches fine-tune the document encoder for each domain adaptively, to enable it to represent the dense vector v_d of d, yet are often limited to a single vector representation. In contrast, as observed in §3.1, relevant queries for d in target domains are often diverse, so the capacity of the fixed-size v_d becomes the bottleneck. Our distinction is to increase the representation capacity by appending varying numbers of vectors from Q̂_d to v_d. Specifically, we first partition the tokens in d into S segments {s^l_d}_{l=1}^S, where each segment s^l_d has a fixed number of tokens and covers a sub-topic of d. We then generate pseudo-queries {q̂^l_d}_{l=1}^S for each segment, such that pseudo-queries from all segments can cover the diverse topics in d. Finally, to augment v_d, we encode each q̂^l_d into a dense vector v_{q̂^l_d}. Given v_d and {v_{q̂^l_d}}_{l=1}^S, the relevance to the test-time query q is measured by max-pooling the inner products v_q^⊤ v over v ∈ {v_d} ∪ {v_{q̂^l_d}}_{l=1}^S, where v_q denotes the dense vector of q and ⊤ denotes the inner product. We employ max-pooling over the varying numbers of pseudo-query embeddings to capture the sub-topic most relevant to q (Khattab and Zaharia, 2020).
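The two components described above can be sketched as follows. Names are hypothetical; `logits` and `splade_weights` stand for token-to-score dictionaries at one decoding step, and the max-pooled scoring over v_d and the appended pseudo-query vectors reflects one plausible reading of the scoring function.

```python
import math

def relevance_biased_distribution(logits, splade_weights, alpha=1.0):
    """Relevance-guided decoding step: add SPLADE term weights lambda_{Z_d}
    to the decoder's output logits, then softmax-normalize, so terms SPLADE
    marks as relevant to d are more likely to be emitted as pseudo-query
    tokens. `alpha` (a scaling knob we introduce) is not in the paper text."""
    biased = {t: v + alpha * splade_weights.get(t, 0.0) for t, v in logits.items()}
    z = sum(math.exp(v) for v in biased.values())
    return {t: math.exp(v) / z for t, v in biased.items()}

def multi_vector_score(query_vec, doc_vec, pseudo_query_vecs):
    """Test-time relevance of query q to document d represented by v_d plus
    per-segment pseudo-query vectors: max over inner products, so the
    sub-topic most relevant to q dominates (max-pooling)."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    return max(dot(query_vec, v) for v in [doc_vec, *pseudo_query_vecs])
```

In a real decoder the biasing would be applied to the full vocabulary logits at every generation step (e.g., via a logits processor), rather than to a small dictionary as in this sketch.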

Experimental Setup
Dataset and Evaluation Metrics To evaluate the effectiveness of our method, we conduct experiments on BEIR, a benchmark designed to evaluate the zero-shot generalization of retrieval systems across an array of information retrieval tasks in various domains. Among the BEIR datasets, we adopt the BEIR-BiasShift datasets showing the largest distribution shifts from MS MARCO in terms of the two biases (Figure 2(a)): NFCorpus, SCIDOCS, TREC-COVID, and Climate-FEVER. For evaluation metrics, following Thakur et al. (2021), we adopt nDCG@10 and Recall@100, which measure the overall quality of the predicted top-10 ranking and the completeness of the top-100 documents with respect to relevant documents, respectively.
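For reference, the two metrics can be sketched as follows. This uses a common linear-gain variant of nDCG; BEIR's official evaluation relies on trec_eval-style implementations, which may differ in details.

```python
import math

def ndcg_at_k(ranked_ids, relevance, k=10):
    """nDCG@k over a ranked list of doc ids, with graded relevance labels
    (dict doc id -> gain): DCG of the top-k divided by the ideal DCG."""
    gains = [relevance.get(d, 0) for d in ranked_ids[:k]]
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

def recall_at_k(ranked_ids, relevant_ids, k=100):
    """Fraction of relevant documents retrieved within the top-k results."""
    return len(set(ranked_ids[:k]) & set(relevant_ids)) / len(relevant_ids)
```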
Baselines We compare RaMDA to both domain-invariant and domain-adaptive retrievers. For the former, we compare against COCO-DR and SPLADE, the state-of-the-art dense and sparse retrievers, respectively. While employed to guide PQG in RaMDA, SPLADE can also serve as a retriever by assessing the relevance of d via the sum of λ_{Z_d}. In addition, we compare Contriever, the state-of-the-art retriever trained with self-supervised learning.
As domain-adaptive retrievers, we compare GenQ and GPL. GenQ utilizes a pseudo-query generator, initially trained on MS MARCO and subsequently adapted to each target domain to produce domain-specific pseudo-queries. GPL is a similar query generation approach, but additionally utilizes an expensive cross-encoder to label the generated pseudo-queries for better adaptation.
Implementation Details For the pseudo-query generator, we fine-tune T5 (Base) on MS MARCO for 50k steps with 1k warm-up steps, using the AdamW (Loshchilov and Hutter, 2019) optimizer with learning rate 1e-5 and batch size 32.

Results
We first validate the effectiveness of RaMDA in retrieval performance on BEIR, by comparing RaMDA with domain-adaptive retrievers as well as the state-of-the-art domain-invariant retriever. Following previous work, we adopt nDCG@10 as the evaluation metric. Results are shown in Table 1.

Analysis on pseudo-query generation
In this section, we study how PQG affects the adaptation to out-of-domain tasks.
Poor PQG does not help domain adaptation. Both existing domain-adaptive retrievers (GenQ and GPL) exhibit lower average performance than the domain-invariant retriever, COCO-DR. This is because Q̂ often differs from the gold queries in target domains, as observed in Figure 2.
RaMDA's preemptive filtering helps, while post-filtering is detrimental. We compare RaMDA with the post-filtering approach, denoted as "GenQ + RTF†". While RTF produces queries similar to, or even worse than, blind generation, our preemptive filtering consistently outperforms both "GenQ" and "GenQ + RTF", as well as domain-invariant baselines such as COCO-DR and SPLADE.
Biases on PQG negatively affect retrieval performance. We conduct ablation studies in which we remove the half of the pseudo-queries that accounts for each of the biases, to compare with RaMDA (with all pseudo-queries) in bias-amplified settings. For frequency bias, the pseudo-queries with the most rarely observed terms in MS MARCO are removed. For diversity bias, we repeatedly remove the pair of pseudo-queries whose distance is the farthest among the remaining pseudo-queries, until only half of the pseudo-queries remain. To demonstrate the significance of alleviating the two biases, we also compare the performance of randomly removing pseudo-queries. The results are reported in Table 2.
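The diversity-bias ablation (greedy removal of the farthest pair) can be sketched as follows, with hypothetical names and cosine distance over plain-list embeddings.

```python
import math

def remove_most_diverse(embeddings, keep_ratio=0.5):
    """Bias-amplifying ablation: repeatedly drop the pair of remaining
    pseudo-queries with the largest cosine distance until only
    `keep_ratio` of them remain, leaving a homogeneous set."""
    def cos_dist(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = lambda x: math.sqrt(sum(a * a for a in x)) or 1.0
        return 1.0 - dot / (norm(u) * norm(v))
    remaining = list(range(len(embeddings)))
    target = max(1, int(len(embeddings) * keep_ratio))
    while len(remaining) > target:
        # find the farthest pair among the remaining pseudo-queries
        i, j = max(
            ((a, b) for k, a in enumerate(remaining) for b in remaining[k + 1:]),
            key=lambda p: cos_dist(embeddings[p[0]], embeddings[p[1]]),
        )
        remaining.remove(i)
        if len(remaining) > target:
            remaining.remove(j)
    return [embeddings[k] for k in remaining]
```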
Removing randomly sampled pseudo-queries shows the least degradation, indicating that alleviating the two biases contributes significantly to performance. The contribution of each bias varies with dataset characteristics. As shown in Figure 2(a), between NFCorpus and TREC-COVID, NFCorpus exhibits a more pronounced distribution shift in terms of diversity bias, with diversity scores of 0.83 and 0.42 for NFCorpus and TREC-COVID, respectively. Conversely, TREC-COVID demonstrates a more significant shift in term frequency bias, with rarity scores of 0.25 and 0.43 for NFCorpus and TREC-COVID, respectively. As expected, on NFCorpus, diversifying pseudo-queries contributes more to retrieval performance, as evidenced by the large performance gap. Similarly, on TREC-COVID, generating rarely observed terms is more important, as these are often key domain-specific terms.
Efficiency-Effectiveness trade-offs. Figure 3 compares the efficiency-effectiveness trade-offs of preemptive and post-filtering, where efficiency is measured by the number of pseudo-query samples (x-axis) and effectiveness (y-axis) by retrieval performance.
Ours shows consistently high effectiveness over all sample numbers, and tends to improve as more pseudo-queries are sampled. In contrast, with GenQ, sampling more pseudo-queries rather decreases retrieval performance on TREC-COVID, indicating that biased PQG negatively contributes to generalization. GenQ + RTF shows a similar trend, indicating that RTF fails to filter out such harmful pseudo-queries.
Compared to the domain-invariant retriever COCO-DR, ours requires only 10 to 15 pseudo-query samples to outperform it, showing better efficiency. In contrast, both GenQ and GenQ + RTF consistently perform worse than COCO-DR, indicating that poor PQG makes domain adaptation ineffective.

Analysis on document representation
We now examine whether Q̂ can enhance document representations by complementing D, in Figure 4. Q̂ complements D. Since documents are much longer than pseudo-queries, D alone shows better recall than Q̂ alone. However, even terms from relevant queries often do not appear in documents. Q̂ adds such missing terms to complement D, further increasing recall on gold query terms when combined with D. Regarding precision, on the other hand, Q̂ is consistently better than D. This indicates that Q̂ can serve as a summary of D, rectifying its noisy vocabulary.
We further show that Q̂ alleviates a well-known problem of dense vector representations, called token amnesia (Ram et al., 2023), where the single dense vector of a document often fails to capture its salient terms due to occlusion by noisy terms. Specifically, to see whether the gold query terms in d can be retained by the dense vector of d (or Q̂_d), we project the vector into BERT tokens, as follows. We first compute the conditional probability of each BERT token w from the dense vector, by using the vector as the input to the masked language modeling head of the BERT encoder. We then take the top 100 tokens under this probability as an interpretation of the vector's semantics (for further details, refer to Ram et al. (2023)). Finally, to measure the semantic relevance between the dense vector and gold queries, we compute the recall of gold query tokens among the top 100 tokens. Results are reported in Figure 5. Compared to D, Q̂ shows better recall, and combining both further increases recall. This indicates that Q̂ can semantically complement D.
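The probing procedure can be sketched as follows, with hypothetical names; `token_probs` stands for the precomputed MLM-head distribution over BERT tokens given the dense vector.

```python
def projection_recall(token_probs, gold_query_tokens, k=100):
    """Token-amnesia probe: interpret a dense vector by the top-k tokens
    under the MLM head's distribution (token_probs, assumed precomputed),
    then measure recall of gold query tokens among those top-k."""
    top_k = {t for t, _ in sorted(token_probs.items(), key=lambda kv: -kv[1])[:k]}
    gold = set(gold_query_tokens)
    return len(gold & top_k) / len(gold) if gold else 0.0
```

A low score means the dense vector has "forgotten" salient query terms of the document, which is exactly the failure mode Q̂-augmented representations are meant to mitigate.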
Q̂ improves the document representation. Table 3 compares ours with baselines of higher model capacity, which enrich query-document interactions or ensemble multiple retrievers. Beyond single-vector representation, ColBERT (Khattab and Zaharia, 2020) gives the document representation adaptive capacity by indexing all terms in d, with a memory overhead on index size. On the other hand, ensemble methods increase capacity by introducing another dense retriever. We compare with an ensemble of COCO-DR and GTR (Ni et al., 2022).
Surprisingly, despite increasing capacity, ColBERT underperforms COCO-DR on all datasets except SCIDOCS, and the ensemble often shows performance comparable to or worse than the individual retrievers. This is because the capacity increase in both methods is constrained by the quality of d, which often yields noisy lexicons and semantics, as observed in Figures 4 and 5. While sharing the same goal, we leverage Q̂ to complement d. As a result, with only minimal index overhead compared to dense retrievers, our method outperforms all compared baselines on all tested datasets.

Conclusion
We investigated PQG for overcoming domain shifts in zero-shot retrieval, motivated by the observation that generated pseudo-queries often work against this goal. We identify term frequency and diversity bias as causes, and propose a novel PQG method that preempts negative pseudo-queries. We validate, through extensive experiments on the BEIR benchmark, that with relevance guidance and multi-query generation, our proposed model effectively addresses the challenges of domain shifts in zero-shot retrieval.

Figure 1 :
Figure 1: Contrast between (a) the standard PQG approach and (b) our proposed RaMDA, with respect to pseudo-query generation and document representation.

Figure 2 :
Figure 2: (a) 2D visualization of the distribution shifts of all BEIR datasets in the two bias metrics. The brighter the contour lines, the more severe the shifts. (b) and (c) demonstrate that baseline PQG methods suffer from the two distribution shifts in terms of (b) term frequency and (c) diversity bias. Vertical dotted lines in (b) and (c) denote the corresponding bias metrics of MS MARCO validation queries.

In our experiments, to maximize coverage, we sample 50 pseudo-queries per segment and mean-pool their embeddings to obtain the single vector v_{q̂^l_d}, which is appended to v_d.

Figure 4 :
Figure 4: Recall and precision of tokens from different fields, on gold query terms.

Figure 5 :
Figure 5: Recall of projected tokens from dense vectors from different fields, on gold query terms.

Table 1 :
nDCG@10 on BEIR-BiasShift datasets. † denotes retrievers that employ Q̂_d-augmented document representations (i.e., {v_{q̂^l_d}}_{l=1}^S in addition to v_d) with different generators. The best and second-best results are denoted in boldface and underlined, respectively.

Table 3 :
Recall@100 on BEIR-BiasShift datasets. The best results are denoted in bold.