Incorporating Relevance Feedback for Information-Seeking Retrieval using Few-Shot Document Re-Ranking

Pairing a lexical retriever with a neural re-ranking model has set state-of-the-art performance on large-scale information retrieval datasets. This pipeline covers scenarios like question answering or navigational queries. For information-seeking scenarios, however, users often provide information on whether a document is relevant to their query in the form of clicks or explicit feedback. In this work, we therefore explore how relevance feedback can be directly integrated into neural re-ranking models by adopting few-shot and parameter-efficient learning techniques. Specifically, we introduce a kNN approach that re-ranks documents based on their similarity with the query and the documents the user considers relevant. Further, we explore Cross-Encoder models that we pre-train using meta-learning and subsequently fine-tune for each query, training only on the feedback documents. To evaluate our different integration strategies, we transform four existing information retrieval datasets into the relevance feedback scenario. Extensive experiments demonstrate that integrating relevance feedback directly into neural re-ranking models improves their performance, and fusing lexical ranking with our best-performing neural re-ranker outperforms all other methods by 5.2% nDCG@20.


Introduction
User queries can be categorized as navigational (retrieving a specific document), transactional (retrieving a website to perform a particular action) or informational (Broder, 2002). For information-seeking queries, users might want to learn about a new topic or might be unfamiliar with the search domain. Therefore, they potentially do not use common keywords of the domain, which decreases performance (Furnas et al., 1987). Furthermore, they might want to find complementary information from diverse sources or consider different aspects of a topic (Clarke et al., 2008). Lastly, information-seeking queries can also be used to keep up with the latest developments on a topic.
Concretely, these queries are encountered during scientific literature review (Voorhees et al., 2021; Dasigi et al., 2021), when looking for news and background information (Soboroff et al., 2018), during argument retrieval (Bondarenko et al., 2021), or in the legal context for case law retrieval (Locke and Zuccon, 2018).
Formulating effective queries to satisfy the complex information need in these scenarios is difficult. On the contrary, a user can easily judge whether a document is relevant to their query. Therefore, information obtained from the user when interacting with the search results, known as relevance feedback, can be used in the search. This can be obtained implicitly from click logs (Joachims, 2002) or explicitly by asking users whether a document is relevant (Rocchio, 1971). We focus on explicit feedback because it is less noisy than implicit feedback and because existing information retrieval datasets can be transformed into this scenario. In both settings, the amount of feedback is limited, since users will provide feedback only on a few documents.
Incorporating relevance feedback in information retrieval (IR) systems is well-established for lexical retrieval (Rocchio, 1971; Lavrenko and Croft, 2001; Zhai and Lafferty, 2001). These systems incorporate the feedback by expanding the query with terms extracted from relevance feedback documents. While these approaches can alleviate the lexical gap, they inherently struggle with semantics because they represent text as a bag of words. Additionally, lexical query expansion methods have the disadvantage that their latency increases with the number of query terms (Wu and Fang, 2013).
To mitigate these issues, neural retrieval and re-ranking methods have been proposed and recently outperformed lexical retrieval (Gillick et al., 2019; Nogueira and Cho, 2019; Karpukhin et al., 2020; Khattab and Zaharia, 2020). State-of-the-art retrieval results are obtained in a two-stage setup: First, an efficient and recall-optimized retrieval method (e.g., dense or lexical retrieval) retrieves an initial set of documents. Subsequently, a neural re-ranker optimizes the rank of the documents. However, there exists no neural re-ranking model that directly incorporates relevance feedback.
To this end, we explore how relevance feedback can directly be integrated into neural re-ranking models. This is difficult because state-of-the-art models have millions of parameters and require a large amount of training data, while only a limited amount of relevance feedback per query is available. We make use of recent advances in parameter-efficient fine-tuning (Houlsby et al., 2019; Ben Zaken et al., 2022) and few-shot learning (Snell et al., 2017; Finn et al., 2017) to address the challenges of model re-usability and learning from limited data. Concretely, we present a kNN approach that re-ranks documents based on their similarity to the feedback documents. We further propose to fine-tune a re-ranking model from only the relevance feedback for each query. We explore the effectiveness of our approach with a varying number of feedback documents and evaluate its computational efficiency. To evaluate our models, we transform four existing IR datasets into the re-ranking with relevance feedback setup. Our final model combines the strengths of lexical and neural re-ranking using reciprocal rank fusion (Cormack et al., 2009).
In summary, our contributions are as follows:
• We propose a few-shot learning task for information retrieval, specifically adopting the two-stage retrieve and re-rank setting to incorporate relevance feedback, both in retrieval and in re-ranking.
• We outline retrieval scenarios for the task and how to transform existing IR datasets into the few-shot retrieve and re-rank setup.
• We present novel re-ranking methods that directly incorporate relevance feedback, leveraging few-shot learning and parameter-efficient techniques. We evaluate their efficiency and demonstrate their effectiveness through extensive experiments across different datasets.
Related Work

Information Retrieval Approaches
Traditionally, lexical approaches have been used for IR, such as TF-IDF and BM25 (Robertson and Zaragoza, 2009). However, these systems cannot model lexical-semantic relations between query and document (both are treated as bags of words) and suffer from the lexical gap (Berger et al., 2000), e.g., when synonyms are used.
Recently, dense retrieval methods have shown promising results, outperforming lexical approaches (Gillick et al., 2019; Karpukhin et al., 2020; Khattab and Zaharia, 2020). Contrary to lexical systems, they can discover semantic matches between a query and a document, thereby overcoming the lexical gap. Dense retrieval methods learn query and document representations in a shared, high-dimensional space. This is enabled by large-scale pre-training (Devlin et al., 2019) and training on IR datasets of considerable size (Nguyen et al., 2016; Kwiatkowski et al., 2019). After training, the model computes a document index holding a representation for each document in the corpus. At inference, a query representation is compared to each document vector using maximum inner product search (Johnson et al., 2021).
However, applying dense retrieval to our setup is not practical. We aim to fine-tune the model for every query; therefore, the precomputed document index would become out of sync with the model and might not yield optimal results (Guu et al., 2020). Since the document index is very large, re-encoding it would create an unreasonable computational overhead. Thus, we do not experiment with dense retrieval models that rely on a precomputed document index.
Similar to dense retrieval, neural re-ranking models have profited from pre-training and training on large datasets. The predominant approach is to use a Cross-Encoder (CE) model that takes both query and document as input to directly compute a relevance score. Contrary to dense retrieval models, this enables direct query-document interactions. Since this approach does not allow pre-computing representations and is compute-intensive, it is generally paired with a more efficient first-stage retrieval method (dense or lexical) and subsequently applied to the top retrieved documents. Particularly combined with lexical retrieval methods, neural re-ranking yields state-of-the-art performance (Thakur et al., 2021).

Relevance Feedback
Relevance feedback has mostly been integrated into IR systems by modifying the query using the feedback documents and subsequently performing a second round of retrieval. Rocchio (1971) proposes a linear combination of the vectors of the query and the relevant and non-relevant feedback documents to obtain a new query vector, which is more similar to the relevant documents. Another approach is to use language models of the query and documents to obtain new terms (Lavrenko and Croft, 2001; Zhai and Lafferty, 2001). Recently, Naseri et al. (2021) use the similarity between contextualized query and document word embeddings to extract terms for query expansion. Similarly, Zheng et al. (2020) use BERT to obtain document chunks for expansion and subsequently compute the relevancy by summing over chunk-document relevance. While these works leverage advances in pre-trained language modeling for selecting query terms, they eventually rely only on lexical retrieval, potentially missing semantic matches in the document collection. While we also use lexical retrieval with query expansion for the second stage, we additionally update a re-ranking model based on the relevance feedback and employ it on the second-stage retrieval results.

Other works directly incorporate relevance feedback into neural retrieval. Ai et al. (2018) train a model that sequentially encodes the top document representations from the first-stage retrieval. The documents are subsequently re-ranked using an attention mechanism between the final and intermediate representations of the model. Yu et al. (2021) further fine-tune the query encoder of a dense retrieval model to additionally take the top documents from a first retrieval stage as input. While these works directly incorporate first-stage retrieval documents into their model, they require large annotated datasets for training. Furthermore, adding the feedback documents to the input is sub-optimal due to the large memory requirements of transformer models with growing input size. Our approach overcomes this by using the relevance feedback to update the model parameters instead of providing it as input.
Most similar to our work, Lin (2019) proposes to learn a re-ranker using machine learning classifiers (logistic regression and support vector machines) based on lexical features from the top and bottom retrieved documents. They show that this simple approach improves over query expansion and neural approaches like NPRF (Li et al., 2018). In contrast to our work, they use pseudo-relevance feedback and simple classification approaches as a re-ranking model. Moreover, we use explicit relevance feedback since the automatic selection of non-relevant documents is challenging. Depending on the query and document collection, the number of relevant documents varies significantly. For one query, there may be only a few relevant documents, in which case irrelevant documents could be selected from higher ranks. Another query might return a large set of relevant documents, in which case pseudo-irrelevant documents would actually need to be selected from lower ranks. User feedback, on the other hand, does not have this disadvantage. However, since users will only give feedback on a limited number of documents, the models used by Lin (2019) cannot be trained from explicit feedback. Therefore, we opt for few-shot learning combined with pre-trained re-ranking models. Furthermore, the user-selected documents also provide a form of interpretability to the re-ranking model.

In summary, using pseudo and explicit relevance feedback in lexical models via query expansion has been shown to improve retrieval performance. Furthermore, neural retrieval and re-ranking models have shown promising results, outperforming lexical methods. While there exists related work that combines neural models with query expansion, it is applied to pseudo-relevance feedback and uses state-of-the-art models only for determining query expansion terms. Other methods are limited in the amount of feedback and require large training datasets for fine-tuning. In this work, we leverage few-shot learning techniques to directly update a re-ranking model based on explicit feedback.

Datasets
Large-scale IR datasets mostly target use cases where a user has a less complex information need, e.g., looking for a factoid answer (Nguyen et al., 2016; Kwiatkowski et al., 2019; Zhang et al., 2021). These are usually sparsely annotated, i.e., there is only a single (or a few) judged relevant document per query. However, for our information-seeking use case, we are interested in queries for which many relevant documents exist. Therefore, we select datasets where a large set of relevant documents per query is judged. Further, the datasets should target suitable use cases containing queries that have a diverse set of relevant documents. For example, the query "What is the origin of Covid-19" from TREC-Covid has relevant documents about the geographical location of the first cases, the genetic origins of the virus, and animals that likely have transmitted the disease to humans.
Specifically, we consider Robust04 (Voorhees, 2004), TREC-Covid (Voorhees et al., 2021), TREC-News (Soboroff et al., 2018), and Webis-Touché (Bondarenko et al., 2021). An overview of all datasets with their statistics is provided in Table 1; more details on the datasets are given in Appendix A. We transform these datasets into the few-shot re-ranking setup by including only queries with at least 32 judged relevant and non-relevant documents in the BM25 top-1000 results for the query. Queries with fewer judged documents are discarded because they provide little evaluation power once the feedback documents are removed from the evaluation. Moreover, this filter ensures that enough judged documents remain for a robust evaluation.
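As an illustration, the filtering step described above can be sketched as follows. The helper name and data layout (`qrels` as a query-to-judgments mapping, a BM25 run as a ranked list) are our own assumptions, as is applying the threshold to the relevant and non-relevant sets separately.

```python
def eligible_queries(qrels, bm25_runs, min_judged=32, depth=1000):
    """Keep queries with at least `min_judged` judged relevant and
    `min_judged` judged non-relevant documents in the BM25 top results.

    qrels:     {query_id: {doc_id: relevance grade}}, grade > 0 = relevant
    bm25_runs: {query_id: [doc_id, ...]} ranked by BM25 score
    """
    keep = []
    for qid, judgments in qrels.items():
        top = set(bm25_runs.get(qid, [])[:depth])
        rel = sum(1 for d, g in judgments.items() if g > 0 and d in top)
        nonrel = sum(1 for d, g in judgments.items() if g <= 0 and d in top)
        if rel >= min_judged and nonrel >= min_judged:
            keep.append(qid)
    return keep
```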
For our experiments, we create training, validation and test splits in a 3:1:1 ratio by randomly assigning each query to one set. We further conduct three random shuffles over the assignment of a query into the training, validation, and test set. We report the averaged results over the shuffles.

Task Setup
To incorporate relevance feedback in any retrieval process, a multi-stage approach is required. We propose a multi-phase task setup, which is visualized in Figure 1.

Figure 1: The three phases of our proposed few-shot retrieve and re-rank setup. Phase 1: Documents are retrieved using the query q, and relevance feedback is obtained from a user. Phase 2: The query q and feedback documents R are used for query expansion and the second round of retrieval. Further, a re-ranking model is fine-tuned using the user-selected feedback documents. Phase 3: The documents are re-ranked using the fine-tuned re-ranker, obtaining the final document ordering. To improve performance, the rankings from the re-ranker and the second phase can be fused.

In Phase 1, the relevance feedback is collected from the user after a first retrieval. The selected documents refine the information need and provide additional insight into what is relevant to the query. In Phase 2, the feedback is processed and a second retrieval is conducted, while the re-ranking model is trained on the selected feedback documents. This phase returns documents that are more relevant to the user's information need. Ultimately, in Phase 3, the documents obtained previously are re-ranked by the tailored re-ranker.

Specifically, in Phase 1, an initial retrieval is conducted with the query q against the document collection. For lexical retrieval, we use BM25, as it is robust in a zero-shot setting on a diverse set of domains (Thakur et al., 2021). Next, we select the top k ∈ {2, 4, 8} relevant and non-relevant documents from the first-stage retrieval according to the judgments in the dataset, i.e., 2k documents are selected per query.
We refer to these documents as feedback documents R. By selecting the top judged and retrieved documents, this process simulates a user providing relevance feedback. For our evaluation, we remove the feedback documents from the relevance judgments (i.e., we use the residual collection; Salton and Buckley, 1990) in order to evaluate the ability of the models to rank non-selected documents higher.
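The simulated feedback selection can be sketched as below; the function name and data layout are hypothetical, but the logic follows the description above: walk down the first-stage ranking and take the top-k judged-relevant and top-k judged-non-relevant documents, skipping unjudged ones.

```python
def select_feedback(ranking, qrels, k):
    """Simulate explicit feedback from existing judgments.

    ranking: doc ids ordered by the first-stage retriever
    qrels:   {doc_id: relevance grade} for the query (grade > 0 = relevant)
    Returns (relevant, non_relevant), each with at most k documents.
    """
    rel, nonrel = [], []
    for doc_id in ranking:
        if doc_id not in qrels:
            continue  # the simulated user only reacts to judged documents
        if qrels[doc_id] > 0 and len(rel) < k:
            rel.append(doc_id)
        elif qrels[doc_id] <= 0 and len(nonrel) < k:
            nonrel.append(doc_id)
        if len(rel) == k and len(nonrel) == k:
            break
    return rel, nonrel
```

For evaluation with the residual collection, the returned documents would additionally be removed from the relevance judgments.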
In Phase 2, we use the 2k feedback documents for query expansion and a second retrieval step.
We extract e terms per relevant feedback document and append them to the query, resulting in the expanded query for second-stage retrieval. Furthermore, the feedback documents are used to fine-tune a re-ranking model. Starting from a common base model, a new model is fine-tuned for every query. To exploit the small number of feedback documents most effectively, we employ few-shot learning when fine-tuning the re-ranker.
Finally, in Phase 3, the documents are scored using the query-specific re-ranker from the second phase. Additionally, the ranking from BM25-QE can be incorporated in the final document ranking. We experiment with different models and settings; details are described in §5.
The re-ranking could also be performed on the documents from the first-stage retrieval. However, since the feedback documents are available and query expansion generally improves recall (which is important for the re-ranking performance), we chose not to experiment with re-ranking the first-stage retrieval documents. This also improves the evaluation, because all models re-rank the same set of documents.
Evaluation Metrics. To measure the ranking performance, we use nDCG@20 (Järvelin and Kekäläinen, 2000) as implemented by PYTREC_EVAL (Gysel and de Rijke, 2018). This metric considers graded relevance labels. We chose the cut-off at 20 to take the large number of relevant documents per query into account. Beyond ranking performance, we also focus on retrieval/re-ranking latency and parameter efficiency. The response time of IR systems is generally crucial for user satisfaction (Schurman and Brutlag, 2009). Therefore, we evaluate the time for retrieval, query expansion, fine-tuning, and re-ranking. Since we fine-tune a model per query, the memory footprint of the model should be small. This allows keeping many models in memory at the same time or quickly reloading a model whenever a user revisits a query.
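For intuition, nDCG@k can be sketched as follows. This uses linear gains with a log2 discount (one common formulation) and is a simplified illustration, not a replica of the PYTREC_EVAL implementation.

```python
import math

def ndcg_at_k(ranking, grades, k=20):
    """nDCG@k with graded relevance labels.

    ranking: doc ids in ranked order
    grades:  {doc_id: graded relevance}; unjudged documents count as 0
    """
    def dcg(gains):
        # position i (0-based) is discounted by log2(i + 2)
        return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

    gains = [grades.get(d, 0) for d in ranking[:k]]
    ideal = sorted(grades.values(), reverse=True)[:k]
    return dcg(gains) / dcg(ideal) if dcg(ideal) > 0 else 0.0
```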

BM25 Query Expansion
For the second-stage retrieval, we expand the original query q with terms obtained from the relevant feedback documents. We experiment with a varying number of expansion terms e ∈ {4, 8, 16, 32, 64} and also use all terms in the document for expansion, which we refer to as all. We do so by using Elasticsearch's MoreLikeThis feature, which extracts terms according to their TF-IDF score. For retrieval, the query and the extracted terms are combined, and the documents are scored according to BM25. This setup follows the query expansion technique described in Rocchio (1971). The ranking produced by BM25 query expansion (BM25-QE) serves as the lexical baseline in our experiments.
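Schematically, the expansion step looks as follows. This is a rough, self-contained stand-in for Elasticsearch's MoreLikeThis term selection (which has its own tokenization and scoring details); the function names and the simple TF-IDF formula are our own assumptions.

```python
import math
from collections import Counter

def expansion_terms(doc_tokens, doc_freq, num_docs, e=16):
    """Pick the top-e terms of a feedback document by TF-IDF.

    doc_freq: {term: number of documents in the collection containing it}
    """
    tf = Counter(doc_tokens)

    def tfidf(term):
        idf = math.log(num_docs / (1 + doc_freq.get(term, 0)))
        return tf[term] * idf

    return sorted(tf, key=tfidf, reverse=True)[:e]

def expand_query(query_tokens, feedback_docs, doc_freq, num_docs, e=16):
    """BM25-QE: append e terms per relevant feedback document to the query."""
    expanded = list(query_tokens)
    for doc in feedback_docs:
        expanded += expansion_terms(doc, doc_freq, num_docs, e)
    return expanded
```

The expanded token list would then be issued as a BM25 query against the collection.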

Re-ranker
In this section, we detail the different approaches employed for document re-ranking: kNN, Cross-Encoder, and Rank Fusion.

kNN
The kNN approach is based on a dense retrieval model that computes a high-dimensional, semantic text representation. Specifically, we use the transformer-based MiniLM (Wang et al., 2020b) model that was fine-tuned on a diverse set of training datasets. We use the 6-layer model since its counterpart with 12 layers provides only marginally better performance albeit requiring twice the compute.
To obtain a document score s_i, we compute the similarity between the query q and the document d_i and add the sum of similarities between d_i and the relevant feedback documents d_j ∈ R+. We use cosine similarity as the similarity function f. This is expressed in Equation 1:

s_i = f(q, d_i) + Σ_{d_j ∈ R+} f(d_j, d_i)    (1)
The kNN setup resembles Prototypical Networks (Snell et al., 2017); however, instead of having a single, averaged point in vector space representing a class, we have k + 1 points (all relevant feedback documents and the query). In this setting, the model weights are not updated; instead, we use the document and query encodings for finding similar documents.
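A minimal sketch of the kNN scoring in Equation 1, assuming the query, candidate, and feedback documents have already been encoded into vectors (e.g., by a MiniLM bi-encoder); the function names are our own.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as float lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def knn_score(query_vec, doc_vec, feedback_vecs):
    """Equation 1: query-document similarity plus the summed
    similarities to the relevant feedback documents."""
    return cosine(query_vec, doc_vec) + sum(
        cosine(fb, doc_vec) for fb in feedback_vecs)

def rerank(query_vec, candidates, feedback_vecs):
    """Sort (doc_id, vector) candidates by descending kNN score."""
    return sorted(candidates,
                  key=lambda c: knn_score(query_vec, c[1], feedback_vecs),
                  reverse=True)
```

Since all document encodings can be precomputed, re-ranking reduces to a handful of vector comparisons per candidate.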

Cross-Encoder (CE)
For re-ranking with a Cross-Encoder, we employ the 6-layer MiniLM model fine-tuned on MS MARCO (Hofstätter et al., 2020). We experiment with zero-shot, query fine-tuning, and meta-learning approaches.
Zero-Shot. As a baseline, we do not perform any fine-tuning and re-rank the documents with the pre-trained model. Zero-shot refers only to not fine-tuning the re-ranking model; for comparability, we still re-rank the documents obtained with query expansion.
Query Fine-Tuning. We update the re-ranker using few-shot supervised learning with the 2k feedback documents. We optimize the Binary Cross-Entropy and use the validation set to determine the learning rate and the number of training steps that perform best on average according to the nDCG@20 score. We refer to this as CE Query-FT.
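Conceptually, Query-FT is a short supervised loop over the 2k feedback documents. The toy sketch below substitutes a one-layer logistic scorer for the MiniLM Cross-Encoder, with feature vectors standing in for (query, document) inputs; it only illustrates the BCE objective and the per-query update loop, not the actual transformer training.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def query_finetune(feats, labels, steps=100, lr=0.5):
    """Few-shot BCE fine-tuning on the 2k feedback examples.

    feats:  list of feature vectors (stand-ins for Cross-Encoder inputs)
    labels: 1 for relevant feedback documents, 0 for non-relevant
    """
    w = [0.0] * len(feats[0])
    b = 0.0
    for _ in range(steps):
        for x, y in zip(feats, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            g = p - y  # gradient of BCE w.r.t. the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def score(w, b, x):
    """Relevance score of a candidate document's feature vector."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
```

In the actual setup, `steps` and `lr` are chosen on the validation set, and one such model is trained per query.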
Meta-Learning. In order to optimize the model for quick adaptation to new queries, we also explore model-agnostic meta-learning (MAML) (Finn et al., 2017). Meta-learning is generally defined over a set of tasks (as opposed to a set of training samples). Therefore, we treat each query with its respective feedback documents as its own task. This is reasonable since we model the relevance of a document in the context of the query. The training process consists of two stages: (1) First, the model g is optimized on the training dataset. Each batch consists of two tasks T_1 and T_2, each comprising a query and the respective 2k feedback documents. The model parameters θ are updated using T_1, optimizing the Binary Cross-Entropy on the feedback documents with learning rate α. We obtain new parameters θ′ from this step. We show this formally in Equation 2 for a single step:

θ′ = θ − α ∇_θ L_{T_1}(g_θ)    (2)
Subsequently, the new parameters are evaluated on their ability to adapt to the second task T_2 by computing the loss of the predictions made by the model g_{θ′}. By backpropagating through this entire process (i.e., computing the gradients w.r.t. θ), the original parameters of the model are optimized with meta-learning rate β:

θ ← θ − β ∇_θ L_{T_2}(g_{θ′})    (3)

Intuitively, the loss in Equation 3 will be low if the parameters θ′ can quickly adapt to T_2. We refer the reader to Finn et al. (2017) for a more detailed overview of the training process using MAML. In our training process, we use only a single task, i.e., one query with its respective feedback documents, per step due to the limited amount of training data. We find the hyperparameters according to the zero-shot performance on the validation dataset. (2) Once the MAML training concludes, the model is updated per query as detailed in the Query Fine-Tuning paragraph. We call this method CE MAML + Query FT.
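The meta-update of Equations 2 and 3 can be sketched as follows. For brevity, this uses the first-order approximation (the gradient at θ′ is applied to θ directly) rather than backpropagating through the inner step as full MAML does, and a bias-free logistic scorer stands in for the Cross-Encoder g; all names are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad(theta, task):
    """BCE gradient for a logistic scorer s(x) = sigmoid(theta . x).
    A task is (feature vectors, 0/1 labels) from one query's feedback."""
    feats, labels = task
    g = [0.0] * len(theta)
    for x, y in zip(feats, labels):
        err = sigmoid(sum(t * xi for t, xi in zip(theta, x))) - y
        for i, xi in enumerate(x):
            g[i] += err * xi / len(feats)
    return g

def maml_step(theta, task1, task2, alpha=0.1, beta=0.1):
    """Inner update on task1 (Eq. 2), then meta-update of the original
    parameters using task2's loss at the adapted parameters (Eq. 3)."""
    inner = [t - alpha * g for t, g in zip(theta, grad(theta, task1))]
    return [t - beta * g for t, g in zip(theta, grad(inner, task2))]
```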
Parameter Efficiency. For all Cross-Encoder methods, we only update the bias layers, as proposed by Ben Zaken et al. (2022). This keeps the number of tunable parameters and the memory footprint of the models very small: only 0.11% of the parameters are updated. Compared to adapters (Houlsby et al., 2019), tuning the biases is advantageous because these parameters are already trained and not randomly initialized.
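A framework-agnostic sketch of the idea: parameters are stored by name, and the update rule simply skips everything that is not a bias (in PyTorch, the equivalent is enabling `requires_grad` only for parameters whose name ends in "bias"). Names and layout here are illustrative.

```python
def bitfit_update(params, grads, lr=1e-3):
    """Apply a gradient step only to bias parameters (BitFit-style);
    weight matrices stay frozen.

    params/grads: {parameter name: list of floats}
    """
    return {
        name: ([v - lr * g for v, g in zip(vals, grads[name])]
               if name.endswith(".bias") else vals)
        for name, vals in params.items()
    }

def trainable_fraction(params):
    """Share of parameters that the bias-only update actually touches."""
    bias = sum(len(v) for n, v in params.items() if n.endswith(".bias"))
    return bias / sum(len(v) for v in params.values())
```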

Rank Fusion
We also investigate merging the rankings produced by BM25-QE and the neural re-ranking model using Reciprocal Rank Fusion (RRF) (Cormack et al., 2009). The final ranking is computed according to Equation 4, where s_i is the fused score of document d_i, h ∈ H ranges over the rankings to be fused (each h returns the rank of a document), and c is a constant decreasing the impact of the top-scored documents:

s_i = Σ_{h ∈ H} 1 / (c + h(d_i))    (4)

This approach has the advantage of being agnostic to the relevance scores assigned to the documents by the models, because it only uses their rankings. Using the relevance scores directly is problematic when the scores of the models are in different ranges: documents that are strongly preferred by one ranking model would be ranked higher than documents that are weakly preferred by multiple models.


Results

2nd Stage Retrieval: Query Expansion
We report recall@1000 results of the second-stage retrieval with a varying number of expansion terms e in Table 2. Note that increasing e improves performance, reaching a maximum at e = 8. However, when further increasing e, the recall drops. We observe qualitatively that extracting more terms per document also includes more non-specific terms or even stop words, which hurts performance. Based on the recall@1000 performance on the validation set (see Appendix I), we use the documents obtained by extracting e = 16 terms for the final re-ranking step. In this work, we do not focus on the first-stage retrieval. For completeness, we report its results in Appendix B.

Re-Ranking Performance
We report the nDCG@20 ranking performance in Table 3 and additional zero-shot baselines in Appendix C. We first note that increasing the amount of relevance feedback k generally improves performance. Furthermore, we observe that BM25-QE already performs well: neither the kNN approach, nor the Cross-Encoder zero-shot and Query-FT variants, nor the wide variety of zero-shot models are able to outperform it, except on TREC-Covid. We note a particularly strong performance of BM25-QE on Webis-Touché, which is the most challenging task for neural models in our test suite. This agrees with related work indicating that BM25 beats all other methods on this task (Thakur et al., 2021).

For the Cross-Encoder, performance increases when the relevance feedback is integrated. CE zero-shot is outperformed by query fine-tuning, which is in turn outperformed when MAML training is added. This shows that our proposed direct integration of relevance feedback in the model is effective and that the parameters obtained by MAML training are better able to adapt to new queries given the relevance feedback. This method also slightly outperforms BM25-QE. Finally, combining the rankings of the lexical retrieval and the neural re-ranker is particularly effective. While different methods excel at each dataset (e.g., BM25 on Webis-Touché or neural models on TREC-Covid), the rank fusion is able to successfully mitigate the weaknesses of one model. Moreover, combining two rankings often outperforms either single ranking, showing that query expansion and neural re-ranking are highly complementary. We analyze the intersection of the top documents between BM25-QE and the two re-rankers and find that for more than 50% of the queries in the test set, BM25-QE and the re-ranking model agree on only 5 or fewer documents in the top 20.

Re-Ranking Ablations
To gain further insights into where our performance improvements are coming from, we conduct a series of ablation studies, reported in Table 4.
First, we ablate the influence of query expansion and the feedback documents on lexical retrieval. We retrieve using only the query and remove the feedback documents from the retrieval and evaluation, i.e., we still use the residual collection even though the feedback documents are not used. From the first section of Table 4, we observe a large performance drop. This shows that BM25-QE is able to successfully exploit the feedback documents and retrieve more relevant documents.
For the kNN approach, we compare the performance when using only the query-document similarity to obtain the relevance score (i.e., dropping the second term in Equation 1). On average, this results in a drop of 5.6 percentage points, demonstrating the effectiveness of injecting feedback documents into the kNN re-ranking approach.
For the CE experiments, we ablate whether optimizing only the bias layers, compared to fully fine-tuning the model, affects the performance. We therefore repeat our query fine-tuning experiment but optimize all parameters of the model. On average, optimizing only the biases results in a 0.8% performance drop. However, the biases account for only 26k parameters, which is 0.11% of the entire model. This result is in line with other research showing that optimizing only a small subset of parameters results in comparable performance (Houlsby et al., 2019; Pfeiffer et al., 2020; Ben Zaken et al., 2022). This finding supports the applicability of query fine-tuning from a memory perspective: while there might be many queries in a deployed system, and therefore many fine-tuned models, the required memory would not grow significantly. Furthermore, the memory requirements could be further reduced by only fine-tuning biases of certain components (Ben Zaken et al., 2022) or transformer layers (Rücklé et al., 2021).

Finally, we investigate the impact of meta-learning by comparing it with supervised training. We follow the same setup as in MAML but replace meta-learning with standard supervised learning. We find that MAML training results in a 0.5% improvement. We also note that supervised training is less stable than MAML: when increasing k, the performance intermittently drops (e.g., in TREC-Covid from k = 2 → 4 and TREC-News from k = 4 → 8), while MAML does not experience performance decreases.

Retrieval and Re-Ranking Latency
Results for the speed performance are reported in Figure 2. First, we note that performing query expansion significantly increases retrieval latency: depending on the number of feedback documents, this is a 2.8-fold (k = 2) to 7.6-fold (k = 8) increase over BM25 without query expansion. For the re-ranking methods, we notice that the kNN approach is extremely fast, since all heavy computations can be precomputed. This is promising because combining kNN and BM25-QE with rank fusion results in a 2.6% performance improvement over BM25-QE alone while not adding significant latency. In contrast, the Cross-Encoder model takes the longest time. However, the time for fine-tuning the model is only a fraction of the total time (≈ 22% on average). The retrieval latency can generally be traded off against ranking performance by retrieving and re-ranking fewer documents.

Conclusion
In this work, we introduced a few-shot learning task for incorporating relevance feedback in neural re-ranking models. We further transformed existing IR datasets into the few-shot setting. Most importantly, we introduced different methods for incorporating relevance feedback directly into neural re-ranking models. The proposed kNN approach is particularly computationally efficient; however, by itself, it cannot outperform BM25 with query expansion. Since the kNN method does not add significant latency to the re-ranking, it can be combined with BM25 query expansion via rank fusion, outperforming BM25-QE alone by 2.6% nDCG@20. Our second re-ranking method, based on a Cross-Encoder model, performs on par with BM25 with query expansion. Regarding its latency, we show that fine-tuning on a per-query basis is feasible, since the majority of the time is spent on re-ranking and not on fine-tuning. Similar to kNN, performing rank fusion of the two approaches yields a high performance gain of 5.2% nDCG@20. Advantageously, reciprocal rank fusion is very stable in our setting, mitigating weaknesses of individual model-task combinations.

Limitations
In this work, we investigate how relevance feedback can be directly incorporated into neural re-ranking models. While our best-performing approach improves the ranking performance by a large margin, it is inherently more computationally expensive than models that do not use any relevance feedback. We quantify this by reporting the latency of our approaches. The latency can be further reduced by re-ranking fewer documents, thereby trading off speed and performance. Further, we propose a kNN model that is computationally efficient and does not add significant latency to query expansion models. Lastly, we recommend using our approach foremost in the information-seeking scenario, where users are particularly concerned about having accurate results rather than low latency.
For our methods, relevance feedback has to be explicitly collected from a user. While we believe that in an information-seeking scenario users are more willing to provide explicit feedback, in this work, we did not explore using implicit or pseudo-relevance feedback. While this type of feedback is noisier, it is available in larger amounts.
In this work, we make use of simulated relevance feedback from existing relevance judgments. The re-ranked documents in the second stage will be biased toward the selected feedback documents. We leave the integration of more diverse search results and the investigation of position bias to future work. However, we note that in preliminary experiments, we found that selecting random feedback documents from the first stage leads to worse performance.
To keep the degrees of freedom in our experiments reasonable and to facilitate evaluation, we do not experiment with an iterative relevance feedback setting. We instead focus on a single round of relevance feedback but vary the number of feedback documents. While related work has shown that iterative relevance feedback can further improve retrieval, the gains diminish with every round (Bi et al., 2019).
Our best-performing approach requires a training dataset. Albeit small (depending on the task, the training dataset contains 22–90 queries), the model cannot be created without it. Since the model is targeted to a specific domain, we hypothesize that employing it on a different domain will result in worse performance than using the pre-trained model.

A Dataset Details
Robust04 (Voorhees, 2004) is a dataset initially created to investigate the performance of poorly performing queries. As a result, a collection with many judgments per query was created, which has since been used to test the robustness of IR models. We use the description field of the queries, which is a question or a single-sentence description of the search intent.
The document collection contains news articles.
TREC-Covid (Voorhees et al., 2021) is an IR dataset in the biomedical domain consisting of questions about the Coronavirus, with scientific articles as the document collection. It was collected in five iterative rounds. We use the questions from the query set along with the documents from the COVID-19 Open Research Dataset (Wang et al., 2020a). Documents are constructed by concatenating the title and the abstract. Further, we remove exact duplicates from the feedback documents. In TREC-Covid, documents are judged as relevant, partially relevant, or non-relevant. For the feedback documents, we consider only the relevant and non-relevant ones, but we also include partially relevant ones for evaluation.
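The document construction and judgment filtering described above could be sketched as follows (field names and the numeric relevance grades 0/1/2 are illustrative assumptions about the raw data, not the paper's code):

```python
def build_document(record):
    """Construct a passage by concatenating title and abstract
    (field names are illustrative)."""
    return f"{record['title']} {record['abstract']}".strip()

def split_judgments(judgments):
    """Keep only clearly relevant (2) and non-relevant (0) documents as
    feedback; partially relevant (1) documents enter evaluation only."""
    feedback = {d: r for d, r in judgments.items() if r in (0, 2)}
    evaluation = dict(judgments)  # all grades, including partial (1)
    return feedback, evaluation
```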
TREC-News (Soboroff et al., 2018) is an information retrieval task based on a corpus provided by the Washington Post. We use the 2019 background-linking task. In this setup, the goal is to find other relevant news articles that provide background information or further reading on a subject and help the user contextualize the current article. To have concise queries, we use the article titles as queries.
Webis-Touché 2020 (Bondarenko et al., 2021) is an argument retrieval dataset based on the args.me corpus, containing arguments scraped from debate websites. Queries are formulated as questions. The dataset contains fine-grained annotations of documents on a scale from 0–7. We select documents with a relevance of at least 3 as relevant feedback documents. Since Webis-Touché contains only very few non-relevant documents (i.e., relevance of 0), we augment them using BM25 negatives (Karpukhin et al., 2020), selecting non-judged documents after rank 100 as non-relevant feedback documents.
Footnotes: (13) We use the snapshot from 16-JULY-2020. (14) args.me/api-en.html (15) debate.org, debatepedia.org, debatewise.org, idebate.org
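The BM25-negative sampling for Webis-Touché could be sketched as follows (the function name and `n_negatives` parameter are illustrative; only the rank-100 cutoff follows the text above):

```python
def bm25_negatives(ranked_doc_ids, judged_doc_ids, cutoff=100, n_negatives=5):
    """Select non-relevant feedback documents as BM25 negatives:
    take unjudged documents ranked after `cutoff` in the BM25 ranking,
    which are very unlikely to be relevant."""
    candidates = ranked_doc_ids[cutoff:]
    negatives = [d for d in candidates if d not in judged_doc_ids]
    return negatives[:n_negatives]
```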

B 1st Stage Retrieval
Table 5 shows BM25 retrieval performance with the relevance feedback documents either removed from (F = ✗) or included in (F = ✓) the retrieval results and evaluation.

D BM25-QE and Neural Re-Ranker Top 20 Analysis
Figure 3 presents the number of documents that BM25-QE and the respective neural re-ranking method have in common in the top 20 retrieval results. It shows that, while there is some overlap, the methods rank different documents at the top.

E Retrieval Speed with Query Expansion
Figure 1: The three phases of our proposed few-shot retrieve and re-rank setup. Phase 1: Documents are retrieved using the query q, and relevance feedback is obtained from a user. Phase 2: The query q and feedback documents R are used for query expansion and the second round of retrieval. Further, a re-ranking model is fine-tuned using the user-selected feedback documents. Phase 3: The documents are re-ranked using the fine-tuned re-ranker, obtaining the final document ordering. To improve performance, the rankings from the re-ranker and the second phase are fused.

Figure 2: Average time in milliseconds (log scale) for retrieving (BM25 and BM25-QE) and re-ranking (kNN and CE) 1000 documents, averaged over all queries in the test sets. For the Cross-Encoder, we separate the time for fine-tuning and re-ranking.
Figure 3: Overlap of documents in the top 20 between BM25-QE and two neural re-ranking methods.

Figure 4 presents the average time duration per query when varying the number of expansion terms used with BM25 query expansion.

Figure 4: Average duration per query when varying the number of expansion terms.

Table 4: nDCG@20 results of ablation studies on the test set. The first experiment shows BM25 without query expansion and with the feedback documents removed from the evaluation. The next experiment ablates the performance of kNN by removing the influence of the relevant feedback documents. The third row shows results for fully fine-tuning the Cross-Encoder model, ablating fine-tuning only the bias layers. The last experiment ablates the MAML training, comparing it to standard supervised learning.

Table 5: Retrieval results using BM25 on the test set with the query only. For F = ✗, the feedback documents have been removed from the retrieved documents and from the ground truth used for computing the evaluation metrics.