Optimizing Test-Time Query Representations for Dense Retrieval

Recent advances in dense retrieval rely on high-quality representations of queries and contexts from pre-trained query and context encoders. In this paper, we introduce TOUR (Test-Time Optimization of Query Representations), which further optimizes instance-level query representations guided by signals from test-time retrieval results. We leverage a cross-encoder re-ranker to provide fine-grained pseudo labels over retrieval results and iteratively optimize query representations with gradient descent. Our theoretical analysis reveals that TOUR can be viewed as a generalization of the classical Rocchio algorithm for pseudo relevance feedback, and we present two variants that leverage pseudo labels as hard binary or soft continuous labels. We first apply TOUR to phrase retrieval with our proposed phrase re-ranker, and also evaluate its effectiveness on passage retrieval with an off-the-shelf re-ranker. TOUR greatly improves end-to-end open-domain question answering accuracy, as well as passage retrieval performance. TOUR also consistently improves over direct re-ranking by up to 2.0% while running 1.3-2.4× faster with an efficient implementation.


Introduction
Recent progress in pre-trained language models gave birth to dense retrieval, which typically learns dense representations of queries and contexts in a contrastive learning framework. By overcoming the term-mismatch problem, dense retrieval has been shown to be more effective than sparse retrieval in open-domain question answering (QA) (Karpukhin et al., 2020; Lee et al., 2021a) and information retrieval (Khattab and Zaharia, 2020; Xiong et al., 2020).

* Work partly done while visiting Princeton University.
1 Our code is available at https://github.com/dmis-lab/TouR.

Figure 1: Overview of test-time optimization of query representations (TOUR). Given the initial representation of a test query q_0, TOUR iteratively optimizes its representation (e.g., q_0 → q_1 → q_2 → q_3) based on its top-k retrieval results. The figure shows how each query vector retrieves new context vectors and updates its representation to find the gold answer (e.g., 1983). Our cross-encoder re-ranker provides a relevance score for each top retrieval result, moving the query representation closer to the final answer.
Dense retrieval often uses a dual encoder architecture, which enables the pre-computation of context representations while the query representations are directly computed from the trained encoder during inference. However, directly using trained query encoders often fails to retrieve relevant context (Thakur et al., 2021;Sciavolino et al., 2021) since many test queries are unseen during training.
In this paper, we introduce TOUR, which further optimizes instance-level query representations at test time for dense retrieval. Specifically, we treat each test query as a single data point and iteratively optimize its representation. This resembles the query-side fine-tuning proposed for phrase retrieval (Lee et al., 2021a), which fine-tunes the query encoder over training queries in a new domain. Instead, we fine-tune query representations for each test query. Since we do not have target labels for test queries, we leverage cross-encoder re-rankers (Nogueira and Cho, 2019; Fajcik et al., 2021) to provide pseudo relevance labels on intermediate retrieval results and then iteratively optimize query representations using gradient descent. For phrase retrieval, we also develop a cross-encoder phrase re-ranker, which has not been explored in previous studies.
We theoretically show that our framework can be viewed as a generalized version of Rocchio's algorithm for pseudo relevance feedback (PRF; Rocchio, 1971), which is commonly used in information retrieval to improve query representations. While most PRF techniques assume that the top-ranked results are equally pseudo-relevant, our method dynamically labels the top results and updates the query representations accordingly. We leverage the pseudo labels as either hard binary labels or soft continuous labels, introducing two versions of our method. Lastly, to reduce computational overhead, we present an efficient implementation of TOUR, which significantly improves its runtime efficiency.
We apply TOUR to phrase (Lee et al., 2021a) and passage retrieval (Karpukhin et al., 2020) for open-domain QA. Experiments show that TOUR consistently improves performance on both retrieval tasks, even when the query distribution changes greatly. Specifically, TOUR improves end-to-end open-domain QA accuracy by up to 10.7% and top-20 passage retrieval accuracy by up to 8.3% compared to the baseline retrievers. TOUR requires only a handful of top-k candidates to perform well, which enables it to run up to 4× faster than a re-ranker with our efficient implementation while consistently outperforming the re-ranker by up to 2.6%. The ablation study further shows the effectiveness of each component, highlighting the importance of fine-grained relevance signals.

Dense Retrieval
Dense retrieval typically uses query and context encoders, E_q(·) and E_c(·), for representing queries and contexts, respectively (Karpukhin et al., 2020). In this work, we focus on improving phrase or passage retrievers for open-domain QA. The similarity of a query q and a context c is computed as the inner product of their dense representations:

sim(q, c) = E_q(q)^⊤ E_c(c) = q^⊤ c.    (1)

Dense retrievers often use the contrastive learning framework to train the encoders E_q and E_c. After training the encoders, the top-k results are retrieved from a set of contexts C:

C^q_{1:k} = top-k_{c ∈ C} sim(q, c),    (2)

where the top-k operator returns a list of contexts sorted by their similarity score sim(q, c) in descending order (i.e., sim(q, c_1) ≥ ··· ≥ sim(q, c_k)).
Dense retrievers aim to maximize the probability that a relevant context c * exists (or is highly ranked) in the top results.
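The retrieval step above can be sketched in a few lines of numpy; the toy corpus and function names are our own, not from the paper:

```python
import numpy as np

def retrieve_topk(q, C, k):
    """Return indices and scores of the top-k contexts by inner product,
    i.e., sim(q, c) = q^T c, sorted in descending order."""
    scores = C @ q                   # similarity against every context vector
    topk = np.argsort(-scores)[:k]   # indices of the k largest scores
    return topk, scores[topk]

# toy corpus of four 2-d context vectors
C = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [-1.0, 0.0]])
q = np.array([1.0, 0.2])
idx, s = retrieve_topk(q, C, k=2)
```

In practice the context matrix is pre-computed once, and the top-k search is served by an approximate nearest-neighbor index (e.g., FAISS) rather than a dense matrix product.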

Query-side Fine-tuning
After training the query and context encoders, the context representations {c | c ∈ C} are typically pre-computed for efficient retrieval while the query representations q are directly computed from the query encoder during inference. However, using the dense representations of queries as is often fails to retrieve relevant contexts, especially when the test query distribution is different from the one seen during training.
To mitigate the problem, Lee et al. (2021a) propose to fine-tune the query encoder on the retrieval results of training queries {q | q ∈ Q_train} over the entire corpus C. For phrase retrieval (i.e., c denotes a phrase), they maximize the marginal likelihood of relevant phrases in the top-k results:

L_qsft(q) = −log Σ_{c ∈ C^q_{1:k}} 1[c = c*] P(c|q),    (3)

where P(c|q) = exp(sim(q, c)) / Σ_{c′ ∈ C^q_{1:k}} exp(sim(q, c′)) and the indicator 1[c = c*] checks whether each context matches the gold context c* or not. Note that c* is always given for training queries. Query-side fine-tuning significantly improves performance and provides a means of efficient transfer learning when there is a query distribution shift.

In this work, instead of training on the entire set of training queries as in Eq. (3), we treat each test query q ∈ Q_test as a single data point to train on and optimize instance-level query representations at test time. This is in contrast to distillation-based passage retrievers (Izacard and Grave, 2020; Ren et al., 2021), which fine-tune the parameters of the retrievers directly on all training data by leveraging signals from cross-encoders.
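As a concrete sketch, the marginal likelihood objective of Eq. (3) can be written as follows (a simplified numpy version with our own function name; the indicator over gold contexts is passed in as a boolean mask):

```python
import numpy as np

def marginal_nll(q, C_topk, is_gold):
    """Negative log marginal likelihood of the gold contexts within the
    top-k results (the query-side fine-tuning objective)."""
    probs = np.exp(C_topk @ q)
    probs = probs / probs.sum()           # P(c|q) via softmax over the top-k
    return -np.log(probs[is_gold].sum())  # marginalize over gold matches
```

Query-side fine-tuning backpropagates this loss into the query encoder's parameters; TOUR instead differentiates it with respect to the query vector itself.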

Pseudo Relevance Feedback
Pseudo relevance feedback (PRF) techniques in information retrieval (Rocchio, 1971; Lavrenko and Croft, 2001) share a similar motivation with ours in that they refine query representations for a single test query. Unlike true relevance feedback provided by users (Baumgärtner et al., 2022), PRF relies on heuristic or model-based relevance feedback, which can be easily automated. While most previous work used PRF for sparse retrieval (Croft et al., 2010; Zamani et al., 2018; Li et al., 2018; Mao et al., 2021), recent work has started applying PRF to dense retrieval (Yu et al., 2021; Li et al., 2021). PRF aims to improve the quality of the retrieval by updating the initial query representation obtained from the query encoder, i.e.,

q_0 = E_q(q),    (4)
q_{t+1} = g(q_t, C^{q_t}_{1:k}),    (5)

where g is an update function and q_t denotes the query representation after the t-th update over q_0. The classical Rocchio's algorithm for PRF (Rocchio, 1971) updates the query representation as:

q_{t+1} = α q_t + β (1/|C_r|) Σ_{c_r ∈ C_r} c_r − γ (1/|C_nr|) Σ_{c_nr ∈ C_nr} c_nr,    (6)

where C_r and C_nr denote the relevant and non-relevant sets of contexts, respectively, and α, β, and γ determine the relative contributions of the current query representation q_t, the relevant context representations c_r, and the non-relevant context representations c_nr when updating to q_{t+1}. A common practice is to choose the top-k′ contexts among the top-k (k′ < k) as pseudo-relevant, i.e., C_r = C^{q_t}_{1:k′}.
In this work, we theoretically show that our test-time query optimization is a generalization of Rocchio's algorithm. While Eq. (6) treats the positive (or negative) contexts equally, we use cross-encoder re-rankers (Nogueira and Cho, 2019) to provide fine-grained pseudo labels and optimize the query representations with gradient descent.
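For reference, the classical Rocchio update above can be sketched as follows (the coefficient defaults are illustrative values of our own, not from the paper):

```python
import numpy as np

def rocchio_update(q_t, C_rel, C_nonrel, alpha=1.0, beta=0.75, gamma=0.15):
    """Classical Rocchio PRF update: move the query toward the centroid of
    (pseudo-)relevant contexts and away from non-relevant ones.

    q_t: (d,) current query vector; C_rel / C_nonrel: (m, d) / (n, d)
    matrices of relevant and non-relevant context vectors.
    """
    return (alpha * q_t
            + beta * C_rel.mean(axis=0)
            - gamma * C_nonrel.mean(axis=0))
```

Note that every row of `C_rel` contributes with the same weight β/|C_r|; the gradient-based update introduced below instead weights each pseudo-positive context individually.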

Methodology
In this section, we provide an overview of our method, TOUR (§3.1), and its two instantiations (§3.2, §3.3). We also introduce a relevance labeler for phrase retrieval (§3.4) and simple techniques to improve the efficiency of TOUR (§3.5).

Optimizing Test-time Query Representations
We propose test-time optimization of query representations (TOUR), which optimizes query representations at the instance level. In our setting, the query and context encoders are fixed after training, and we optimize the query representations solely based on their retrieval results. Figure 1 illustrates an overview of TOUR. First, given a single test query q ∈ Q_test, we use the cross-encoder φ(·) to provide a score of how relevant each of the top-k contexts c_i ∈ C^q_{1:k} is with respect to the query:

s_i = φ(q, c_i),    (7)

where φ(·) is often parameterized with a pre-trained language model, which we detail in §3.4. Compared to simply treating the top-k′ results as pseudo-positive as in PRF, using cross-encoders enables more fine-grained relevance judgments over the top results. In addition, it allows us to label results for test queries without access to the gold label c*. Below, we introduce two versions of TOUR.

TOUR with Hard Labels: TOUR_hard
First, we explore using the scores from the cross-encoder labeler φ to select a set of pseudo-positive contexts C^q_hard ⊂ C^q_{1:k}, defined as the smallest set such that

Σ_{c̃ ∈ C^q_hard} exp(φ(q, c̃)/τ) / Σ_{c ∈ C^q_{1:k}} exp(φ(q, c)/τ) ≥ p,    (8)

where τ is a temperature parameter and c̃ ∈ C^q_hard denotes a pseudo-positive context selected by φ.
Intuitively, we choose the smallest set of contexts as C^q_hard whose marginal relevance with respect to the query under φ is larger than the threshold p. This is similar to nucleus sampling for stochastic decoding (Holtzman et al., 2020).
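The smallest-set selection can be sketched as follows (the function name and default threshold are ours; it mirrors nucleus sampling over the softmax of the re-ranker's scores):

```python
import numpy as np

def select_hard_positives(phi_scores, p=0.5, tau=1.0):
    """Smallest set of top-k contexts whose cumulative relevance mass under
    softmax(phi / tau) is at least the threshold p (cf. nucleus sampling)."""
    probs = np.exp(phi_scores / tau)
    probs = probs / probs.sum()
    order = np.argsort(-probs)             # most relevant first
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, p) + 1   # smallest prefix with mass >= p
    return order[:cutoff]
```

A low-entropy score distribution thus yields a small pseudo-positive set, while an uncertain labeler spreads the mass over more contexts.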
Then, TOUR optimizes the query representation with a gradient descent algorithm based on the relevance judgment C^q_hard made by φ:

L_hard(q, C^q_{1:k}) = −log Σ_{c̃ ∈ C^q_hard} P_k(c̃|q),    (9)

where P_k(c|q) = exp(sim(q, c)) / Σ_{c′ ∈ C^q_{1:k}} exp(sim(q, c′)). Similar to the query-side fine-tuning in Eq. (3), we maximize the marginal likelihood of the (pseudo-)positive contexts C^q_hard. We dub this version TOUR_hard. Unlike query-side fine-tuning, which updates the model parameters of E_q(·), we directly optimize the query representation q itself. Also, TOUR_hard is an instance-level optimization over a single test query q ∈ Q_test without access to the gold label c*.
For optimization, we use gradient descent:

q_{t+1} = q_t − η ∇_{q_t} L_hard(q_t, C^{q_t}_{1:k}),    (10)
where η denotes the learning rate and the initial query representation is used as q_0. Applying gradient descent over test queries shares its motivation with dynamic evaluation for language modeling (Krause et al., 2019), but we treat each test query independently, unlike the sequence of tokens in a language modeling evaluation corpus. At each iteration, we perform a single step of gradient descent followed by another retrieval with q_{t+1}, which updates C^{q_t}_{1:k} into C^{q_{t+1}}_{1:k}.

Relation to Rocchio's algorithm Eq. (10) can be viewed as performing PRF with the update function g(q_t, C^{q_t}_{1:k}) = q_t − η ∇_{q_t} L_hard(q_t, C^{q_t}_{1:k}). In fact, our update rule in Eq. (10) is a generalized version of Rocchio's algorithm, as shown below:

q_{t+1} = q_t + η Σ_{c̃ ∈ C^{q_t}_hard} P(c̃|q_t) [ (1 − P_k(c̃|q_t)) c̃ − Σ_{c ∈ C^{q_t}_{1:k}, c ≠ c̃} P_k(c|q_t) c ],    (11)

where c̃ ∈ C^{q_t}_hard and P(c̃|q_t) = exp(sim(q_t, c̃)) / Σ_{c̃′ ∈ C^{q_t}_hard} exp(sim(q_t, c̃′)) (proof in Appendix A). While our update rule seems to fix α in Rocchio's algorithm to 1, α can be changed dynamically by applying weight decay during gradient descent, which multiplies q_t by α = 1 − ηλ_decay. The equality between Eq. (6) and Eq. (11) then holds when C^{q_t}_hard = C^{q_t}_{1:k′} with P_k(c|q_t) being equal for all c ∈ C^{q_t}_{1:k}, namely P_k(c|q_t) = 1/k; this reflects that Rocchio's algorithm treats all top-k results equally. In this case, β = γ = η(k − k′)/k holds (Appendix C). In practice, C^{q_t}_hard differs from C^{q_t}_{1:k′} whenever φ re-ranks the results. Moreover, each pseudo-positive context vector c̃ in the second term on the RHS of Eq. (11) has a different weight: the contribution of c̃ is maximized when it has a large probability mass P(c̃|q_t) among the pseudo-positive contexts but a small probability mass P_k(c̃|q_t) among the top-k contexts. This is desirable since we want to update q_t substantially when the initial ranking of a pseudo-positive context within the top-k is low.
For instance, if there is a single pseudo-positive context c̃ (i.e., P(c̃|q_t) = 1) ranked at the bottom of the top-k with a large margin from the top-1 (i.e., P_k(c̃|q_t) ≈ 0), then its weight P(c̃|q_t)(1 − P_k(c̃|q_t)) ≈ 1 is maximized.
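A single TOUR_hard iteration can be sketched as one gradient step on the marginal negative log-likelihood of the pseudo-positive contexts (Eqs. (9)-(10)); this is a minimal numpy version with our own function names:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def tour_hard_step(q_t, C_topk, hard_idx, eta=0.1):
    """One gradient-descent step on L_hard = -log sum_{c in C_hard} P_k(c|q_t),
    where C_topk is the (k, d) matrix of top-k context vectors and hard_idx
    lists the positions of the pseudo-positive contexts."""
    P_k = softmax(C_topk @ q_t)                   # retrieval distribution over top-k
    P_hard = P_k[hard_idx] / P_k[hard_idx].sum()  # renormalized over C_hard
    # grad of L_hard wrt q_t: E_{P_k}[c] - E_{P_hard}[c~]
    grad = P_k @ C_topk - P_hard @ C_topk[hard_idx]
    return q_t - eta * grad
```

For a small enough η, one step increases the marginal probability of the pseudo-positive contexts under the retrieval distribution, pulling the query toward them.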

TOUR with Soft Labels: TOUR_soft
From Eq. (11), we observe that the update uses the pseudo-positive contexts C^{q_t}_hard sampled by the cross-encoder labeler φ, but the contribution of c̃ (the second term on the RHS) does not directly depend on the scores from φ; the scores are only used to make a hard decision on pseudo-positive contexts. Another version of TOUR therefore uses the normalized scores of the cross-encoder over the retrieved results as soft labels. We simply change the maximum marginal likelihood objective in Eq. (9) to a Kullback-Leibler (KL) divergence loss:

L_soft(q_t, C^{q_t}_{1:k}) = KL( P(·|q_t, φ) ∥ P_k(·|q_t) ),    (12)

where P(c_i|q_t, φ) = exp(φ(q, c_i)/τ) / Σ_{c_j ∈ C^{q_t}_{1:k}} exp(φ(q, c_j)/τ). We call this version TOUR_soft. The update rule g for TOUR_soft becomes:

q_{t+1} = q_t + η Σ_{c_i ∈ C^{q_t}_{1:k}} ( P(c_i|q_t, φ) − P_k(c_i|q_t) ) c_i.    (13)

Eq. (13) shows that q_{t+1} adds each c_i weighted by the cross-encoder distribution (i.e., P(c_i|q_t, φ)) while removing c_i weighted by the current retrieval distribution (i.e., P_k(c_i|q_t)) (proof in Appendix B).
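A minimal sketch of the TOUR_soft step (Eqs. (12)-(13)), again with our own function names:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def tour_soft_step(q_t, C_topk, phi_scores, eta=0.1, tau=1.0):
    """One step minimizing KL(P_phi || P_k): pull q_t toward contexts the
    cross-encoder weights highly and away from the current retrieval
    distribution over the top-k context matrix C_topk."""
    P_phi = softmax(phi_scores / tau)   # soft labels from the re-ranker
    P_k = softmax(C_topk @ q_t)         # current retrieval distribution
    return q_t + eta * (P_phi - P_k) @ C_topk
```

Because the update is the negative KL gradient scaled by η, a small step moves the retrieval distribution toward the re-ranker's distribution.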

Relevance Labeler for Phrase Retrieval
In the previous section, we used the cross-encoder φ to provide a relevance score s_i for a pair of a query q and a context c. While it is possible to use an off-the-shelf re-ranker (Fajcik et al., 2021) for passage retrieval, no prior work has introduced a re-ranker for phrase retrieval (Lee et al., 2021b).
In this section, we introduce a simple and accurate phrase re-ranker for TOUR.
Inputs for re-rankers For phrase retrieval, sentences containing each retrieved phrase are considered as contexts, following Lee et al. (2021b). For each context, we also prepend the title of its document and use the result as our context for re-rankers. To train our re-rankers, we first construct a training set from the retrieved contexts of the phrase retriever given a set of training queries Q_train. Specifically, from the top retrieved contexts C_{1:k} for every q ∈ Q_train, we sample one positive context c+_q and one negative context c−_q. In open-domain question answering, a context that contains a correct answer to q is assumed to be relevant (positive). Our re-ranker is trained on the dataset D_train = {(q, c+_q, c−_q) | q ∈ Q_train}. We use RoBERTa-large (Liu et al., 2019) as the base model for our re-ranker. Given a pre-trained LM M, the cross-encoder re-ranker φ outputs a score of a context being relevant:

φ(q, c) = w^⊤ M(q ⊕ c),    (14)

where {M, w} are the trainable parameters and ⊕ denotes the concatenation of q and c using a [SEP] token. Since phrase retrievers return both phrases and their contexts, we use special tokens [S] and [E] to mark the retrieved phrases within the contexts. Re-rankers are trained to maximize the probability of a positive context c+_q for every (q, c+_q, c−_q) ∈ D_train, using the binary cross-entropy loss defined over the probability P+ = σ(φ(q, c)). We found that pre-training φ on reading comprehension datasets (Rajpurkar et al., 2016; Joshi et al., 2017) before training the re-rankers is helpful. For an ablation study of our phrase re-rankers, see Appendix F.
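The training signal can be sketched as follows; since the exact form of P+ was garbled in our source, we assume a sigmoid over the cross-encoder logit, which is one standard choice for binary cross-entropy:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def reranker_bce_loss(phi_pos, phi_neg):
    """Binary cross-entropy for one (q, c+, c-) training triple, assuming
    P+ = sigmoid(phi(q, c)): push the positive logit up, the negative down."""
    return -np.log(sigmoid(phi_pos)) - np.log(1.0 - sigmoid(phi_neg))
```

The loss is small when the re-ranker scores the positive context well above zero and the negative well below, and large when the two are swapped.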
Score aggregation While the re-ranker is used to provide pseudo relevance labels for TOUR, it also serves as a strong baseline for dense retrieval.
We found that adding a final re-ranking step at the end of TOUR provides consistent improvement. Specifically, we linearly interpolate the similarity score sim(q, c i ) with s i for the final re-ranking: λs i + (1 − λ)sim(q, c i ).

Efficient Implementation of TOUR
TOUR potentially improves the recall of gold candidates by iteratively searching with updated query representations. However, it incurs a high computational cost since it has to label the top-k retrieval results with a cross-encoder and perform additional retrieval at every iteration. To minimize the additional time complexity, we perform up to t = 3 iterations of TOUR_hard with an early-stop condition that the top-1 retrieval result is pseudo-positive, i.e., c_1 ∈ C^{q_t}_hard. When using TOUR_soft, we stop early when the top-1 retrieval result has the highest relevance score. We also cache φ(q, c_i) for each query to skip redundant computation. As we will show in Figure 2, this enables TOUR to consistently outperform re-ranking models with up to 4× faster inference.
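The loop with caching and early stopping can be sketched as follows (`retrieve`, `label`, and `step` stand in for the retriever, the cross-encoder labeler, and one gradient update; the 0.5 threshold is illustrative, not from the paper):

```python
import numpy as np

def tour_loop(q0, retrieve, label, step, n_iter=3):
    """Iterate retrieve -> pseudo-label -> update, caching cross-encoder
    scores per context id and stopping early once the top-1 result is
    pseudo-positive."""
    cache = {}
    q = q0
    for _ in range(n_iter):
        ids, C_topk = retrieve(q)          # sorted top-k ids and their vectors
        scores = []
        for i in ids:
            if i not in cache:             # one labeler call per unique context
                cache[i] = label(i)
            scores.append(cache[i])
        hard = [pos for pos, s in enumerate(scores) if s >= 0.5]
        if not hard or 0 in hard:          # early stop: top-1 already positive
            break
        q = step(q, C_topk, hard)
    return q
```

Because labeled contexts tend to reappear across iterations, the cache removes most repeated cross-encoder calls, which is where the bulk of the latency lies.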

Experiments
We test TOUR on multiple open-domain QA datasets. Specifically, we evaluate its performance on phrase retrieval and passage retrieval.

Open-domain Question Answering
For end-to-end open-domain QA, we use phrase retrieval (Seo et al., 2019; Lee et al., 2021a) for TOUR, which directly retrieves phrases from the entire Wikipedia using a phrase index. Since single-stage retrieval is the only component in phrase retrieval, it is easy to show how its open-domain QA performance can be directly improved with TOUR. We use DensePhrases (Lee et al., 2021a,b) as our base phrase retrieval model and train a cross-encoder labeler as described in §3.4. We report exact match (EM) for end-to-end open-domain QA.

Baselines Many open-domain QA models take the retriever-reader approach (Chen et al., 2017; Izacard and Grave, 2021; Singh et al., 2021). As our baselines, we report extractive open-domain QA models, which is a fair comparison with retriever-only (+ re-ranker) models whose answers are always extractive. For re-ranking baselines of retriever-reader models, we report ReConsider (Iyer et al., 2021), which re-ranks the outputs of DPR + BERT. As a PRF baseline, we report GAR (Mao et al., 2021), which uses context generation models to augment queries in BM25. We also include RePAQ (Lewis et al., 2021b) as a retriever-only model.
Results Table 1 shows the results on the five open-domain QA datasets in the in-domain evaluation setting, where all models use the training sets of the datasets they are evaluated on. First, we observe that using our phrase re-ranker largely improves the performance of DensePhrases_multi. Compared to adding a re-ranker on top of a retriever-reader model (DPR_multi + Re-ranker by Iyer et al., 2021), the gain is much larger, possibly due to the high top-k accuracy (k > 1) that we observed. Unlike the simple Rocchio's algorithm, using TOUR_hard or TOUR_soft largely improves the performance.

Compared to DensePhrases_multi in Table 1, which was trained on all five open-domain QA datasets, we observe huge performance drops on unseen query distributions when using DPR_NQ and DensePhrases_NQ. DPR_NQ seems to suffer more (e.g., 0.1 EM on SQuAD) since both its retriever and reader were trained on NQ, which exacerbates the problem when combined. On the other hand, using TOUR largely improves the performance of DensePhrases_NQ on many unseen query distributions even though all of its components were trained only on NQ. Specifically, TOUR_hard gives a 5.9% average improvement across different query distributions by further advancing our phrase re-ranker. Interestingly, TOUR_hard consistently performs better than TOUR_soft in this setting, which requires further investigation in the future.

Passage Retrieval
We test TOUR on the passage retrieval task for open-domain QA. We use DPR as a passage retriever and DensePhrases as a phrase-based passage retriever (Lee et al., 2021b). In this experiment, we use an off-the-shelf passage re-ranker (Fajcik et al., 2021) to show how existing re-rankers can serve as pseudo labelers for TOUR. We report top-k retrieval accuracy, which is 1 if an answer exists in the top-k retrieval results. Table 3 shows the results of passage retrieval for open-domain QA. We find that using TOUR consistently improves passage retrieval accuracy. Under a query distribution shift similar to Table 2, DPR_multi + TOUR_soft improves the original DPR by 8.3% and advances the off-the-shelf re-ranker by 1.8% on EntityQuestions (Acc@20). Notably, Acc@100 always improves with TOUR, which is not possible for re-rankers since they do not update the top retrieval results. Unlike in the phrase retrieval task, we observe that TOUR_soft is a slightly better option than TOUR_hard on this task.

Test-Train Overlap Analysis
Open-domain QA datasets often contain semantically overlapping queries and answers between training and test sets (Lewis et al., 2021a), which leads to overestimating the generalizability of QA models. Hence, we test our models on the test-train overlap splits provided by Lewis et al. (2021a). Table 4 shows that TOUR consistently improves the performance on test queries that do not overlap with the training data (i.e., None). Notably, on WebQuestions, while the performance on the no-overlap split improves by 1.5% over the re-ranker, the performance on the query-overlap split is worse than the re-ranker, since unnecessary exploration is often performed on overlapping queries.

Figure 2 compares the query latency and performance of TOUR and other baselines. We vary the top-k value from 10 to 50 in steps of 10 (left to right) to visualize the trade-off between latency and performance. The results show that TOUR with only top-10 is both better and faster than the re-ranker with its best top-k. Specifically, TOUR_hard (k = 10) outperforms the re-ranker (k = 40) by 1.0% while being 2.5× faster. We can further reduce the top-k of TOUR to 5, which gives 4× faster inference than the re-ranker (k = 40) with comparable performance. This demonstrates that TOUR requires fewer retrieval results to perform well, compared to a re-ranker model that typically requires a larger k. Simple techniques introduced in §3.5, such as early stopping and caching, further reduce the runtime of TOUR at the cost of a slight drop in performance. See Appendix D for details.

Table 5 shows an ablation study of TOUR_hard on end-to-end open-domain QA. We observe that using the fine-grained relevance signals generated by our phrase re-ranker (i.e., C^q_hard) is significantly more effective than simply choosing the top-k′ results as relevance signals (i.e., C_{1:k′}). Using SGD or aggregating the final scores between the retriever and the re-ranker gives additional improvement.
Figure 3 shows the effect of multiple iterations in TOUR_hard compared to Rocchio's algorithm. While Rocchio's algorithm with t = 1 achieves slightly better performance than DensePhrases, it shows diminishing gains with a larger number of iterations. In contrast, the performance of TOUR_hard benefits from multiple iterations up to t = 3. Skipping the final re-ranking stage (i.e., λ = 0) causes a performance drop, but it quickly recovers with a larger t.

Figure 4: A sample prediction of TOUR from Natural Questions. For every t-th iteration of TOUR, we show the top 5 phrases (denoted in bold) retrieved from DensePhrases along with their passages. The score s_i from the cross-encoder labeler φ is shown in each parenthesis. t = 0 denotes the initial retrieval results. At t = 1, TOUR obtains three new results, and the correct answer "Sound" becomes the top-1 prediction at t = 3.

Figure 4 shows a sample prediction of TOUR. We use DensePhrases_multi + TOUR_hard with k = 10, from which the top-5 results are shown. While the initial retrieval at t = 0 failed to include a correct answer in the top-10, the next round of TOUR yields new results, including the correct answer, which were not retrieved before. As the iterations continue, the correct answer moves up the retrieval results and becomes the top-1 prediction at t = 3.

Conclusion
In this paper, we propose TOUR, which iteratively optimizes test query representations for dense retrieval. Specifically, we optimize instance-level query representations at test time using a gradient-based optimization method over the top retrieval results. We use cross-encoder re-rankers to provide pseudo labels, where either our simple re-ranker or off-the-shelf re-rankers can be used. We theoretically show that the gradient-based optimization provides a generalized version of Rocchio's algorithm for pseudo relevance feedback, which leads us to develop different variants of TOUR. Experiments show that our test-time query optimization largely improves the retrieval accuracy on multiple open-domain QA datasets in various settings while being more efficient than traditional re-ranking methods.

Limitations
In this paper, we focus on end-to-end accuracy and passage retrieval accuracy for open-domain QA. We have also experimented on the BEIR benchmark (Thakur et al., 2021) to evaluate our method on the zero-shot document retrieval task. For some tasks, TOUR obtains significant improvements with a pre-trained document retriever (Hofstätter et al., 2021). For example, TOUR improves the baseline retriever by 11.6% and 23.8% NDCG@10 on BioASQ and TREC-COVID, respectively, while also outperforming the re-ranker by 2.1% and 2.4% NDCG@10. However, for the other tasks, the improvements over the re-ranking method were marginal: overall, we obtained 48.1% macro-averaged NDCG@10 compared to 47.8% for the re-ranking method. We suspect that this is largely due to the re-ranker being mostly trained on QA datasets or MS-MARCO, which we plan to address in the future.
Another limitation of TOUR is that it requires a set of validation examples for hyperparameter selection. While we only used in-domain validation examples for TOUR, which were equally adopted when training re-rankers, we have observed some performance variances depending on the hyperparameters. We hope to tackle this issue with a better optimization technique in the future.

A Derivation of the Gradient for TOUR hard
Proof. We compute the gradient of L_hard(q_t, C^{q_t}_{1:k}) in Eq. (9) with respect to the query representation q_t. Denoting Σ_{c ∈ C^{q_t}_{1:k}} exp(sim(q_t, c)) as Z, we have P_k(c|q_t) = exp(sim(q_t, c))/Z, and the gradient is:

∇_{q_t} L_hard = −∇_{q_t} log Σ_{c̃ ∈ C^{q_t}_hard} P_k(c̃|q_t)
= −Σ_{c̃ ∈ C^{q_t}_hard} [ P_k(c̃|q_t) / Σ_{c̃′ ∈ C^{q_t}_hard} P_k(c̃′|q_t) ] ∇_{q_t} log P_k(c̃|q_t)
= −Σ_{c̃ ∈ C^{q_t}_hard} P(c̃|q_t) [ c̃ − Σ_{c ∈ C^{q_t}_{1:k}} P_k(c|q_t) c ]
= −Σ_{c̃ ∈ C^{q_t}_hard} P(c̃|q_t) [ (1 − P_k(c̃|q_t)) c̃ − Σ_{c ∈ C^{q_t}_{1:k}, c ≠ c̃} P_k(c|q_t) c ].

Then, we have:

q_{t+1} = q_t − η ∇_{q_t} L_hard = q_t + η Σ_{c̃ ∈ C^{q_t}_hard} P(c̃|q_t) [ (1 − P_k(c̃|q_t)) c̃ − Σ_{c ∈ C^{q_t}_{1:k}, c ≠ c̃} P_k(c|q_t) c ],

which is Eq. (11).

B Derivation of the Gradient for TOUR soft
Proof. We compute the gradient of L_soft(q_t, C^{q_t}_{1:k}) in Eq. (12) with respect to q_t. Denoting P(c_i|q_t, φ) as P_i, we expand the loss term as:

L_soft = Σ_i P_i log P_i − Σ_i P_i log P_k(c_i|q_t),

where only the second term depends on q_t. Then, the gradient is:

∇_{q_t} L_soft = −Σ_i P_i ∇_{q_t} log P_k(c_i|q_t) = −Σ_i P_i ( c_i − Σ_j P_k(c_j|q_t) c_j ) = −Σ_i ( P_i − P_k(c_i|q_t) ) c_i.

Putting it all together:

q_{t+1} = q_t − η ∇_{q_t} L_soft = q_t + η Σ_i ( P(c_i|q_t, φ) − P_k(c_i|q_t) ) c_i,

which is Eq. (13).
C Equivalence to Rocchio's Algorithm

Under the conditions in §3.2 (C^{q_t}_hard = C^{q_t}_{1:k′} and P_k(c|q_t) = 1/k for all c ∈ C^{q_t}_{1:k}), the equality between Eq. (6) and Eq. (11) holds when α = 1, β = η(k − k′)/k, and γ = η(k − k′)/k.

D Efficient Implementation of TOUR

Figure 5 summarizes the effect of the optimization techniques used to improve the efficiency of TOUR. Without these techniques, the latency increases linearly with the number of iterations. Adding the caching mechanism for φ and the stop condition c_1 ∈ C^{q_t}_hard greatly reduces the latency.

E Datasets

Table 6 shows the statistics of the datasets used for the end-to-end open-domain QA and passage retrieval tasks. For EntityQuestions, we only use its test set for the query distribution shift evaluation.

F Implementation Details
Phrase re-ranker To train a cross-encoder re-ranker for phrase retrieval (§3.4), we first annotate the top 100 retrieved results from DensePhrases. We use three sentences as our context: one that contains a retrieved phrase and the two that surround it. This leads to faster inference than using the whole paragraph as input while preserving the performance. During the 20 epochs of training, we sample positive and negative contexts for every epoch while selecting the best re-ranker based on its validation accuracy. We modified the code provided by the Transformers library (Wolf et al., 2020) and used the same hyperparameters as specified in their documentation, except for the number of training epochs. The ablation study shows that we can achieve stronger performance by prepending titles to inputs, using larger language models, using three sentences as our context, and pre-training on reading comprehension datasets (Table 7). Using entire paragraphs as input contexts only slightly increases performance compared to using three sentences, but it doubles the query latency of re-ranking.
Dense retriever We modified the official code of DensePhrases (Lee et al., 2021a) and DPR (Karpukhin et al., 2020) to implement TOUR on dense retrievers. While the pre-trained models and indexes of DensePhrases_multi and DPR_NQ are publicly available, the indexes of DensePhrases_NQ and DPR_multi had not been released as of May 25th, 2022. When necessary, we reimplemented them to experiment with open-domain QA and passage retrieval in the query distribution shift setting.