Efficient k-NN Search with Cross-Encoders using Adaptive Multi-Round CUR Decomposition

Cross-encoder models, which jointly encode and score a query-item pair, are prohibitively expensive for direct k-nearest neighbor (k-NN) search. Consequently, k-NN search typically employs fast approximate retrieval (e.g., using BM25 or dual-encoder vectors) followed by reranking with a cross-encoder; however, the retrieval approximation often has detrimental recall regret. This problem is tackled by ANNCUR (Yadav et al., 2022), a recent work that employs a cross-encoder only, making search efficient using a relatively small number of anchor items and a CUR matrix factorization. While ANNCUR's one-time selection of anchors tends to approximate the cross-encoder distances on average, doing so forfeits the capacity to accurately estimate distances to items near the query, leading to regret in the crucial end-task: recall of top-k items. In this paper, we propose ADACUR, a method that adaptively, iteratively, and efficiently minimizes the approximation error for the practically important top-k neighbors. It does so by iteratively performing k-NN search using the anchors available so far, and then adding the retrieved nearest neighbors to the anchor set for the next round. Empirically, on multiple datasets, compared to both traditional and state-of-the-art methods such as ANNCUR and dual-encoder-based retrieve-and-rerank, our proposed approach ADACUR consistently reduces recall error (by up to 70% in the important k = 1 setting) while using no more compute than its competitors.

1 Introduction

Figure 1: Exact versus approximate cross-encoder scores (computed using ANNCUR) of all items for a test query in domain=YuGiOh. ANNCUR incurs high approximation error on k-NN items w.r.t. exact scores when using 50 anchor items sampled uniformly at random (Fig. 1a). In contrast, sampling 50 anchor items with probability proportional to exact cross-encoder scores (Fig. 1b) significantly improves the approximation of top-scoring items.

k-nearest neighbor (k-NN) search is a core subroutine of a variety of tasks in NLP such as entity linking (Wu et al., 2020), passage retrieval for QA (Karpukhin et al., 2020), and, more generally, retrieval-augmented machine learning models (Guu et al., 2020; Izacard et al., 2023). For many of these applications, the state-of-the-art similarity function is a cross-encoder that directly outputs a scalar similarity score after jointly encoding a
given query-item pair. However, computing a single query-item score with a cross-encoder requires a forward pass of the model, which can be computationally expensive as cross-encoders are typically parameterized using deep neural models such as transformers (Vaswani et al., 2017). For this reason, k-NN search with a cross-encoder typically involves retrieving candidate items using additional models such as dual-encoders or BM25 (Robertson et al., 2009), followed by re-ranking items using the cross-encoder (Logeswaran et al., 2019; Zhang and Stratos, 2021; Qu et al., 2021). However, the accuracy of such retrieve-and-rerank approaches is upper-bounded by the recall of the first-stage retrieval, and improving that recall may require computationally expensive distillation-based training of dual-encoders.
Recent work by Yadav et al. (2022) proposed ANNCUR, a method based on CUR factorization (Mahoney and Drineas, 2009) that approximates cross-encoder scores using the dot-product of latent query and item embeddings, performs k-NN retrieval using the approximate scores, and optionally re-ranks the retrieved items using exact cross-encoder scores. The latent item embeddings are computed by comparing each item against a set of anchor queries, and the latent query embedding is computed using the query's cross-encoder scores against a fixed set of anchor items. As shown in Figure 1, when ANNCUR selects the anchor items uniformly at random (Fig. 1a), it incurs higher approximation error on the top-scoring items than on the rest, resulting in poor k-NN recall; including some k-NN items among the anchor items (Fig. 1b) significantly improves the approximation error for top-scoring items.
In this work, we propose ADACUR, a search strategy that improves k-NN search recall by improving the approximation of top-scoring items. ADACUR performs retrieval over multiple rounds, retrieving the first batch of items either uniformly at random or using heuristic or auxiliary models such as a dual-encoder or BM25 to obtain a first coarse approximation of item scores for the test query. In subsequent rounds, it alternates between (a) performing retrieval using approximate scores and scoring the retrieved items with the cross-encoder, and (b) using all retrieved items as anchor items to improve the approximation, and hence the retrieval of relevant items, in subsequent rounds. Our proposed approach provides significant improvements in k-NN search recall over ANNCUR and dual-encoder-based retrieve-and-rerank approaches when performing k-NN search with cross-encoder models trained for entity linking and information retrieval.
2 Proposed Method: ADACUR

Task Given a scoring function f_θ : Q × I → R that maps a query-item pair to a scalar score, and a query q ∈ Q, the k-nearest neighbor search task is to retrieve the top-k scoring items from a fixed item set I according to f_θ.
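For reference, the exact version of this task simply scores every item with f_θ, which is precisely the cost ADACUR is designed to avoid. A minimal sketch (the toy scorer below merely stands in for a cross-encoder):

```python
import numpy as np

def knn_exact(score_fn, query, items, k):
    """Brute-force top-k: score every item with the given scoring function."""
    scores = np.array([score_fn(query, it) for it in items])
    return np.argsort(-scores)[:k]          # indices of the k highest-scoring items

# Toy scorer standing in for f_theta: closer numbers score higher.
f = lambda q, i: -abs(q - i)
print(knn_exact(f, 0.3, [0.0, 0.25, 0.9, 0.31], 2))  # → [3 1]
```

With a cross-encoder, each call to `score_fn` is a full forward pass, so this brute-force loop is infeasible for large item sets.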

ADACUR: Offline Indexing of Items
The indexing step of ADACUR uses the cross-encoder model f_θ to score each item against a fixed set of k_q anchor/train queries Q_train, yielding the score matrix R_anc:

    R_anc(q, i) = f_θ(q, i), ∀(q, i) ∈ Q_train × I

Each column of E_I := R_anc ∈ R^{k_q × |I|} corresponds to a k_q-dimensional latent item embedding.
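The indexing step can be sketched in a few lines of NumPy. Here a synthetic low-rank matrix stands in for the cross-encoder scores (an assumption for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, k_q = 100, 8                      # |I| items, k_q anchor/train queries
# Synthetic low-rank scores standing in for f_theta(q, i):
Uq, Vi = rng.normal(size=(k_q, 4)), rng.normal(size=(4, n_items))
R_anc = Uq @ Vi                            # R_anc[q, i] = f_theta(q, i)
E_I = R_anc                                # column i = k_q-dim latent embedding of item i
assert E_I.shape == (k_q, n_items)
```

In practice each entry of `R_anc` is one cross-encoder forward pass, so this indexing cost is paid once, offline.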

ADACUR: Test-time inference
The baseline method ANNCUR computes the latent test-query embedding e_q_test using C_q_test, an |I_anc|-dimensional vector containing the cross-encoder scores of q_test with a set of anchor items I_anc, and the subset of columns of R_anc corresponding to the anchor items:

    e_q_test = ((R_anc[:, I_anc])^+)^T C_q_test

Finally, ANNCUR approximates the score for a query-item pair (q_test, i) using the dot-product of the query embedding e_q_test and the item embedding E_I[:, i]:

    ŝ(q_test, i) = ⟨e_q_test, E_I[:, i]⟩

Algorithm 1 ADACUR k-NN Search
1: Input: (q_test, R_anc, N_R, B_CE, A)
2:   q_test: Test query
3:   R_anc: Matrix containing CE scores between Q_train and I
4:   B_CE: Total cross-encoder (CE) call budget
5:   A: Algorithm to use for selecting (anchor) items
6:   N_R: Number of iterative search rounds
7: Output: Ŝ: Approximate scores of q_test with all items; I_anc: Retrieved (anchor) items with CE scores in C_test
8: k_s ← B_CE / N_R                              ▷ Num. of items to sample per round
9: I_anc ← INIT(I, k_s)                          ▷ Initial set of anchor items
10: C_test ← [f_θ(q_test, i)]_{i ∈ I_anc}        ▷ CE scores of I_anc for q_test
11: for j ← 2 to N_R do
12:   e_q ← ((R_anc[:, I_anc])^+)^T C_test       ▷ Revise latent test-query embedding
13:   Ŝ ← e_q^T E_I                              ▷ Update approximate scores for all items
14:   I_select ← SELECT(A, Ŝ, I \ I_anc, k_s)    ▷ TopK or SoftMax over unscored items (§2.2)
15:   C_test ← C_test ∥ [f_θ(q_test, i)]_{i ∈ I_select}
16:   I_anc ← I_anc ∪ I_select
17: return Ŝ, I_anc, C_test

The main bottleneck at test-time inference is the number of items scored using the cross-encoder for the given test query. For a given budget B_CE of cross-encoder calls, ANNCUR splits the budget into two parts: it uses k_i cross-encoder calls to compare against anchor items (chosen uniformly at random or using heuristic methods), then retrieves B_CE − k_i items using approximate scores and re-ranks them using exact cross-encoder scores.
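The CUR-style score approximation above can be sketched in NumPy. This is a sketch rather than the authors' code, and the exactly low-rank score matrix is a contrived illustration in which the recovery happens to be exact:

```python
import numpy as np

def anncur_scores(R_anc, C_test, anchor_idx):
    """Approximate all item scores for a test query from its anchor-item scores."""
    # e_q^T = C_test^T (R_anc[:, I_anc])^+  -- the latent test-query embedding
    e_q = np.linalg.pinv(R_anc[:, anchor_idx]).T @ C_test
    return e_q @ R_anc                      # s_hat(q, i) = <e_q, E_I[:, i]>

# Illustration on an exactly rank-4 score matrix: approximation recovers exact scores.
rng = np.random.default_rng(0)
Uq, Vi = rng.normal(size=(8, 4)), rng.normal(size=(4, 100))
R_anc = Uq @ Vi
u = rng.normal(size=4)                      # latent factor of a held-out test query
exact = u @ Vi                              # exact scores of the test query on all items
anchors = rng.choice(100, size=10, replace=False)
approx = anncur_scores(R_anc, exact[anchors], anchors)
assert np.allclose(approx, exact)
```

With a real cross-encoder the score matrix is only approximately low-rank, and the approximation error concentrates on whichever items are poorly covered by the anchors, which is exactly the problem ADACUR targets.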
In contrast, our proposed approach ADACUR uses the cross-encoder call budget to adaptively retrieve and score items over N_R rounds, repurposing the retrieved items as anchor items, as shown in Algorithm 1. ADACUR begins by sampling the first batch of k_s = B_CE / N_R (anchor) items uniformly at random. The first batch of items can also be selected using baseline retrieval methods such as BM25 or dual-encoders. In the j-th round, the items retrieved up to round j−1 are used as anchor items to revise the test-query embedding, which in turn is used to update the approximate scores (line 13 in Algorithm 1). The items selected so far are then masked out, and the next batch of k_s items in round j is retrieved using the updated approximate scores in one of the following two ways:

• TopK: Greedily pick the top-k_s items according to approximate scores.
• SoftMax: Convert approximate item scores into a probability distribution using softmax and sample k_s items without replacement.

Finally, ADACUR returns the top-k items based on exact cross-encoder scores1 from the set of retrieved (anchor) items as the approximate k-NN items. We refer interested readers to Appendix B.5 for a discussion of the time complexity of ADACUR.
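The multi-round loop can be sketched as follows, a minimal NumPy sketch of Algorithm 1 with the TopK rule and a uniformly random first batch (`f(i)` is any callable standing in for the cross-encoder score of item i for the test query):

```python
import numpy as np

def adacur_search(f, R_anc, n_rounds, budget, rng):
    """Sketch of ADACUR with TopK selection over n_rounds under a CE call budget."""
    n = R_anc.shape[1]
    ks = budget // n_rounds
    anchors = list(rng.choice(n, size=ks, replace=False))   # INIT: random first batch
    ce_scores = {i: f(i) for i in anchors}                  # exact CE scores so far
    for _ in range(1, n_rounds):
        C = np.array([ce_scores[i] for i in anchors])
        e_q = np.linalg.pinv(R_anc[:, anchors]).T @ C       # revise query embedding
        s_hat = e_q @ R_anc                                 # update approximate scores
        s_hat[anchors] = -np.inf                            # mask already-scored items
        new = np.argsort(-s_hat)[:ks]                       # TopK selection
        ce_scores.update({i: f(i) for i in new})            # spend CE calls on new items
        anchors += list(new)
    return anchors, ce_scores
```

Returning the top-k items of `ce_scores` by exact score gives the approximate k-NN result; the SoftMax variant would replace the `argsort` line with sampling proportional to a softmax over `s_hat`.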

Experiments
In our experiments, we evaluate the proposed approach and baselines on the task of finding k-nearest neighbor items as per a given cross-encoder. We experiment with two cross-encoders: one trained for zero-shot entity linking and another trained on information retrieval datasets.
Experimental Setup We run evaluation on domains YuGiOh, StarTrek, and Military from ZESHEL, a zero-shot entity linking dataset (Logeswaran et al., 2019), and domains SciDocs and HotpotQA from BEIR, a zero-shot information retrieval benchmark (Thakur et al., 2021). We use two cross-encoder models trained on labeled training data from the corresponding benchmark and evaluate separately on each domain on the task of finding k-NN cross-encoder items. For each ZESHEL domain, we randomly split the query set into a train/anchor set (Q_train) and a test set (Q_test), while for BEIR domains, we use pseudo-queries released as part of the benchmark as train/anchor queries and evaluate on queries in the official test split. We refer the reader to Table 1 for additional details.
Baselines We compare our proposed approach with the following baseline retrieval methods. Dual-Encoders (DE): Query-item scores are computed using the dot-product of embeddings produced by a learned deep encoder model. The DE is used for initial retrieval followed by re-ranking with the cross-encoder. We report results for DE_BASE, a dual-encoder trained on the training domains of the corresponding dataset, and the following two dual-encoder models trained on the target domain via distillation from the cross-encoder.
• DE^CE_BASE: DE_BASE finetuned on the target domain via distillation using the cross-encoder (see Appendix B.2.2).
• DE^CE_BERT: DE initialized with BERT (Devlin et al., 2019) and trained only on the target domain via distillation using the cross-encoder.
ANNCUR: k-NN search method proposed by Yadav et al. (2022) in which anchor items are chosen uniformly at random. We additionally experiment with ANNCUR_DE_BASE, which uses top-scoring items retrieved using DE_BASE as anchor items.
Evaluation Metric Following previous work (Yadav et al., 2022), we evaluate all approaches using Top-k-Recall@B_CE, defined as the fraction of k-NN items retrieved under a test-time cost budget B_CE, where cost is the number of cross-encoder calls made during inference. DE baselines use the entire budget of B_CE calls for re-ranking retrieved items with exact cross-encoder scores, ANNCUR splits the budget between scoring k_i anchor items and re-ranking B_CE − k_i retrieved items with exact cross-encoder scores, and ADACUR uses the budget to adaptively search for k-NN items.
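The metric itself is straightforward to compute; a minimal sketch:

```python
import numpy as np

def top_k_recall(exact_scores, retrieved, k):
    """Fraction of the true top-k items (by exact scores) present in the retrieved set."""
    true_topk = set(np.argsort(-np.asarray(exact_scores))[:k].tolist())
    return len(true_topk & set(retrieved)) / k

# True top-2 items are indices 0 and 2; only index 0 was retrieved.
print(top_k_recall([0.9, 0.1, 0.8, 0.5], retrieved=[0, 3], k=2))  # → 0.5
```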
For ADACUR, unless noted otherwise, we use N_R = 5 with the TopK method for retrieving items using approximate scores, and retrieve the first batch of items uniformly at random (ADACUR) or using DE_BASE (ADACUR_DE_BASE). We refer readers to Appendix B for implementation details and details on the training and parameterization of the cross-encoder and dual-encoder models used in our experiments.

Sampling anchor items using DE_BASE For k = 1, 10, Top-k-Recall for both ANNCUR and ADACUR can be further improved by leveraging baseline retrieval models such as DE_BASE for retrieving all anchor items and the first batch of anchor items, respectively, instead of sampling them uniformly at random. ADACUR_DE_BASE always performs better than ANNCUR_DE_BASE, which in turn performs better than re-ranking items retrieved using DE_BASE. Thus, for a given cost budget B_CE, even when a strong baseline retrieval model such as DE_BASE is available, using it to select the first batch of B_CE/N_R items and then adaptively retrieving more items over the remaining N_R − 1 rounds performs better than merely re-ranking the B_CE top-scoring items retrieved using DE_BASE.

Effect of number of rounds Figure 3 shows Top-k-Recall, and Figure 4 shows the total inference latency of ADACUR (primary y-axis), for various values of the test-time cross-encoder call budget B_CE.
The secondary y-axis in Figure 4 shows the fraction of total inference time spent on each of the three main steps. Although the matrix multiplication step used to update approximate scores is linear in the number of items in the domain, we observe that it is a negligible fraction of overall latency on GPUs, even for domain=HotpotQA with 5 million items (see Figure 5), as GPUs enable significant speedups even for brute-force computation of this step.

Conclusion
In this paper, we presented an adaptive search strategy that incrementally builds a query embedding to approximate cross-encoder scores and performs k-NN search using the approximate scores over several rounds. Our approach is designed to reduce the approximation error for the top-scoring items and hence improves k-NN search recall when retrieving items based on the approximate scores. We perform an in-depth empirical analysis of the proposed approach in terms of both retrieval quality and efficiency.

Limitations
Building the index for ADACUR is more expensive than building a traditional dual-encoder index due to the computation of the dense cross-encoder score matrix (see §2.1). We have successfully run our approach on up to 5 million items, but scaling to billions of items is an interesting direction for future work. Dual-encoder-based retrieve-and-rerank baselines can benefit from training the dual-encoder on multiple domains. It is not clear whether data from multiple domains can be leveraged to improve the performance of the proposed approach on a given target domain, although, in any case, cross-encoders tend to be more robust to domain shift than dual-encoder-only retrieval.

Ethics Statement
In this paper, we tackle the task of finding the k-nearest neighbor items for a given query when query-item scores are computed using a black-box similarity function such as a cross-encoder model. The cross-encoder scoring function may have certain biases

B.1.2 BEIR
We follow the training setup used by Hofstätter et al. (2020). We first train three teacher cross-encoders initialized with albert-large-v2 (Lan et al., 2020), bert-large-whole-word-masking, and bert-base-uncased (Devlin et al., 2019), and compute soft labels on 40 million (query, positive item, negative item) triplets in the MS-MARCO dataset (Bajaj et al., 2016). We then train our cross-encoder model, parameterized using a 6-layer MINI-LM model (Wang et al., 2020), via distillation, using the average scores of the three teacher models as the target signal and minimizing the mean-squared error between predicted and target scores. We use training scripts available as part of the sentence-transformers4 repository to train the cross-encoder model and use the dot-product-based scoring mechanism for cross-encoders proposed by Yadav et al. (2022).

B.2.1 ZESHEL dataset
We report results for DE baselines as reported in Yadav et al. (2022). The DE models were initialized using bert-base-uncased and contain separate query and item encoders, thus containing a total of 2 × 110M parameters.

B.2.2 BEIR benchmark
For BEIR domains, we use a dual-encoder model5 released as part of the sentence-transformers repository as DE_BASE. This dual-encoder model was initialized using distilbert-base (Sanh et al., 2019) and trained on the MS-MARCO dataset. This DE_BASE is not trained on the target domains SciDocs and HotpotQA used for our k-NN experiments.
We finetune DE_BASE via distillation on the target domain to obtain the DE^CE_BASE model. Given a set of training queries Q_train from the target domain, we retrieve the top-100 or top-1000 items for each query, score the items with the cross-encoder model, and train the dual-encoder by minimizing the cross-entropy loss between predicted query-item scores (from the DE) and target query-item scores (from the cross-encoder). Training DE^CE_BASE with 1K queries and 100 or 1000 items per query takes around 2 hours and 10 hours, respectively, on an Nvidia RTX8000 GPU with 48GB memory. We train DE^CE_BASE for 10 epochs when using the top-100 items per query and for 4 epochs when using the top-1000 items per query, using the AdamW optimizer (Loshchilov and Hutter, 2019) with learning rate 1e-5.
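The distillation objective can be sketched as a cross-entropy between the two score distributions over one query's retrieved items. This is a simplified NumPy sketch; the actual training backpropagates through learned encoders:

```python
import numpy as np

def distill_loss(de_scores, ce_scores):
    """Cross-entropy between DE and CE score distributions over one query's items."""
    def softmax(x):
        z = np.exp(x - np.max(x))
        return z / z.sum()
    p_ce = softmax(np.asarray(ce_scores, dtype=float))  # target (cross-encoder)
    p_de = softmax(np.asarray(de_scores, dtype=float))  # prediction (dual-encoder)
    return float(-(p_ce * np.log(p_de + 1e-12)).sum())

# A DE that matches the CE distribution gets a lower loss than one contradicting it.
ce = [3.0, 0.0, -1.0]
assert distill_loss(ce, ce) < distill_loss([-3.0, 0.0, 1.0], ce)
```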

B.3 ANNCUR Implementation details
For ANNCUR, we report results for the optimal split of the cross-encoder call budget B_CE between scoring k_i anchor items and retrieving B_CE − k_i items for re-ranking. We experiment with k_i ∈ {iB_CE/10 : 1 ≤ i ≤ 9}. If the retrieved items contain a subset of the anchor items, for which exact cross-encoder scores have already been computed, we retrieve more than B_CE − k_i items using approximate scores and compute exact cross-encoder scores for them until the entire cross-encoder call budget for the re-ranking step is exhausted.
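The grid of budget splits can be enumerated directly; a trivial sketch, where `evaluate_recall` is a hypothetical callback returning Top-k-Recall for a given split:

```python
def best_budget_split(B_CE, evaluate_recall):
    """Try k_i in {i*B_CE/10 : 1 <= i <= 9}; keep the split with the best recall."""
    splits = [i * B_CE // 10 for i in range(1, 10)]
    return max(splits, key=lambda k_i: evaluate_recall(k_i, B_CE - k_i))

# With a toy evaluator peaking at k_i = 30, the grid search recovers that split.
assert best_budget_split(100, lambda k_i, n_rerank: -abs(k_i - 30)) == 30
```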
For HotpotQA, we restrict our k-NN search to the top-10K items w.r.t. DE_BASE for ADACUR_DE_BASE. For ZESHEL domains and SciDocs, we do not use any such heuristic and search over all items in the corresponding domain.

B.5 Time Complexity of ADACUR
The offline indexing step for ADACUR takes O(k_q |I| C_fθ) time, as it involves computing exact cross-encoder scores for all |I| items in the target domain against k_q anchor queries, where computing each cross-encoder score takes C_fθ units of time.
At test time, we are given a budget B_CE on the number of cross-encoder calls. Each of the N_R rounds of inference with ADACUR involves approximating all item scores for the test query q_test, followed by sampling the next batch of k_s = B_CE/N_R items using the updated approximate scores. In the j-th round, the score approximation step involves computing the pseudo-inverse of a k_q × j k_s-dimensional matrix (line 12 in Algorithm 1), which takes O(C^{k_q, j k_s}_inv) time, followed by a matrix multiplication step to compute the updated approximate scores (line 13 in Algorithm 1), which takes O(C^{k_q, j k_s, |I|}_mul) time. The time taken to compute cross-encoder scores for the next batch of k_s items is O(k_s C_fθ). Thus, the total inference latency for retrieving items over N_R rounds under a budget of B_CE cross-encoder calls is

    O(B_CE C_fθ + Σ_{j=1}^{N_R} (C^{k_q, j k_s}_inv + C^{k_q, j k_s, |I|}_mul))

Overhead of ADACUR
Figure 5 shows the breakdown of ADACUR's inference latency in terms of the time spent computing cross-encoder scores and the overhead of computing the matrix pseudo-inverse in line 12 and updating approximate scores via matrix multiplication in line 13 of Algorithm 1. Empirically, we observe that the primary bottleneck at inference time is computing cross-encoder scores for query-item pairs, and that ADACUR's overhead accumulates linearly as the number of rounds increases. The overhead is mostly dominated by computing the pseudo-inverse (line 12 in Algorithm 1), a step that is independent of the target domain size. The matrix multiplication step (line 13 in Algorithm 1) has a linear dependence on the number of items in the target domain, but it is a negligible fraction of the overall running time as it can be significantly sped up using GPUs.
For ZESHEL domains, we use a cross-encoder parameterized using bert-base (Devlin et al., 2019) and observe that each cross-encoder call takes an amortized time of ∼7ms on an Nvidia 2080ti GPU when scores are computed in batches of size 50. Computing each cross-encoder score sequentially, i.e., with batch size 1, takes ∼13ms per score. We did not observe any further reduction in amortized time per score when increasing the batch size beyond 50.
The amortized time per cross-encoder call is approximately 6ms and 2ms for SciDocs and HotpotQA, respectively, when using batch size 50 and a MINI-LM-based (Wang et al., 2020) cross-encoder. The difference in time per cross-encoder score between SciDocs and HotpotQA is due to the difference in average query-item sequence length.

C Additional Results and Analysis
C.1 TopK vs SoftMax for ADACUR

Figure 6 shows Top-k-Recall for ADACUR on domain=YuGiOh with |Q_train| = 500 when using the TopK and SoftMax strategies for sampling items based on approximate scores (see §2.2 for details). The TopK strategy, which greedily picks the top-k items based on approximate scores, yields superior recall compared to sampling items using a softmax over approximate scores.

C.2 Anchor Item Sampling with Oracle
ADACUR performs retrieval over multiple rounds using approximate cross-encoder scores, and uses the items retrieved based on the approximate scores as anchor items to improve the approximation, and hence the retrieval, in subsequent rounds. In this section, to better understand the effect of anchor items on the approximation of cross-encoder scores and on the subsequent retrieval, we run experiments where the anchor item sampling method has oracle access to the exact cross-encoder scores of all items for the given test query. We experiment with the following strategies for sampling k_i anchor items for a given test query:

• TopK_O^{k_m,ε}: Mask out the top-k_m items w.r.t. exact cross-encoder scores, select (1−ε)k_i anchor items greedily starting from rank k_m + 1, and sample the remaining εk_i anchor items uniformly at random.

• SoftMax_O^{k_m,ε}: Mask out the top-k_m items w.r.t. exact cross-encoder scores, sample (1−ε)k_i anchor items using a softmax over exact cross-encoder scores, and sample the remaining εk_i anchor items uniformly at random.

For a given test-time cross-encoder call budget B_CE, we select k_i anchor items, compute approximate cross-encoder scores using the chosen anchor items, and then retrieve B_CE − k_i items based on the approximate scores. We experiment with k_i ∈ {iB_CE/10 : 1 ≤ i ≤ 9} and report results for the best budget split.

The CUR factorization used to compute the approximate scores incurs negligible approximation error on anchor items, and hence on the top-k items when these items are part of the anchor set, as shown in Figures 7 and 9. For the TopK_O^{k,0} and SoftMax_O^{k,0} sampling strategies, since the top-k items are not part of the anchor set, CUR incurs a much higher approximation error for the top-k items (see examples in Figures 9a and 9b), resulting in the poor Top-k-Recall shown in Figure 8a.

Sampling with a softmax over exact scores yields an anchor set with a diverse score distribution, whereas greedily selecting top-scoring items using exact scores results in an anchor set with items having similar cross-encoder scores. However, as shown in Figures 9c and 9d, both of these sampling strategies can result in overestimating the scores of all items, even irrelevant ones (i.e., items beyond the top-k), due to insufficient representation of the irrelevant items in the anchor set. Retrieval based on the approximate scores may therefore struggle to retrieve relevant k-NN items, especially for larger values of k such as k = 100, when the anchor items are chosen using oracle strategies such as TopK_O^{k_m,0}. Figures 9e and 9f, where ε = 75% of 50 items are sampled uniformly at random, show that overestimating the scores of irrelevant items can be avoided by sampling a fraction of anchor items uniformly at random to increase the diversity of the anchor set. As shown in Figures 8b and 8c, Top-k-Recall for both SoftMax_O^{0,ε} and TopK_O^{0,ε} generally improves as ε, the fraction of random items in the anchor set, increases. Since SoftMax_O^{0,ε} already samples a diverse set of anchor items, increasing ε yields only marginal improvement, whereas for TopK_O^{0,ε}, increasing ε yields significant improvements due to the increased diversity of the anchor set. A small drop in performance is observed for larger values of ε, as increasing ε beyond a threshold causes some of the top-k items to be excluded from the anchor set; this results in a poorer approximation of the scores of the missing top-k items and hence poorer retrieval recall, since retrieval is done using the approximate scores.
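The oracle strategies can be sketched as follows. This is a sketch of TopK_O; SoftMax_O would replace the greedy slice with softmax sampling over exact scores:

```python
import numpy as np

def topk_oracle(exact_scores, k_i, k_m, eps, rng):
    """TopK_O(k_m, eps): skip the top-k_m items, greedily take (1-eps)*k_i items
    starting from rank k_m + 1, and fill the remaining eps*k_i uniformly at random."""
    order = np.argsort(-np.asarray(exact_scores))           # items by descending score
    n_greedy = int(round((1 - eps) * k_i))
    greedy = order[k_m:k_m + n_greedy]                      # ranks k_m+1 .. k_m+n_greedy
    pool = np.setdiff1d(np.arange(len(exact_scores)), greedy)
    random_part = rng.choice(pool, size=k_i - n_greedy, replace=False)
    return np.concatenate([greedy, random_part])

anchors = topk_oracle(np.arange(100), k_i=50, k_m=10, eps=0.5,
                      rng=np.random.default_rng(0))
assert len(set(anchors.tolist())) == 50                     # k_i distinct anchor items
```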

Effect of diversity in anchor items
Finally, the optimal strategy for choosing the set of anchor items is one that strikes a balance between selecting anchor items with diverse cross-encoder scores and greedily picking top-k items. ADACUR improves over ANNCUR because greedily picking top-scoring items according to approximate scores to expand the anchor set increases the likelihood of the ground-truth k-NN items joining the anchor set, and this likelihood improves after each round as the score approximation improves. At the same time, ADACUR achieves diversity in the anchor items as a result of sampling items uniformly at random in the first round and of error in the approximate scores, as shown in Figure 11.

C.3 Comparison with Multi-Vector Models
Multi-vector models (Khattab and Zaharia, 2020; Ma et al., 2021) produce multiple embeddings for each query and item. For a given query q and item i, the query-item score is computed using simple functions such as the average similarity or the sum of maximum similarities between the sets of embeddings for q and i. We also note that while multi-vector models such as MuVER can be more accurate than single-embedding models such as DE_BASE, they incur significant memory overhead for storing query/item embeddings. For instance, using 15 embeddings per item with 768-dimensional embeddings would take around 250GB of space for the 5 million items of HotpotQA.
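A quick back-of-envelope check of that storage figure, assuming fp32 embeddings:

```python
# 5M items x 15 embeddings/item x 768 dims x 4 bytes (fp32)
bytes_total = 5_000_000 * 15 * 768 * 4
print(bytes_total / 1e9)  # → 230.4 (GB), in line with the ~250GB quoted above
```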

C.4 Results for TF-IDF baseline
TF-IDF: All queries and items are embedded using a TF-IDF vectorizer trained on item descriptions, and the top-k items are retrieved using the dot-product of the sparse query and item embeddings.
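A minimal sketch of this baseline, assuming scikit-learn is available (the toy item descriptions are purely illustrative):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

items = ["red eyes black dragon", "blue eyes white dragon", "starship enterprise crew"]
vec = TfidfVectorizer().fit(items)           # train the vectorizer on item descriptions
item_emb = vec.transform(items)              # sparse item embeddings
q_emb = vec.transform(["white dragon"])      # embed the query with the same vocabulary
scores = (q_emb @ item_emb.T).toarray()[0]   # dot-product of sparse embeddings
top_1 = int(np.argmax(scores))               # index of the best-matching item
```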
For domains in ZESHEL, we report results for the TF-IDF baseline, for ANNCUR with anchor items chosen using the TF-IDF baseline (ANNCUR_TF-IDF), and for ADACUR with the first batch of anchor items chosen using the TF-IDF baseline (ADACUR_TF-IDF).
4 https://github.com/UKPLab/sentence-transformers

We refer readers to Yadav et al. (2022) for details related to the training of all DE model variants on the ZESHEL dataset.

Figure 5: ADACUR inference latency versus number of rounds for two different domains. The primary bottleneck at inference time is the time taken to compute cross-encoder (CE) scores for query-item pairs at test time, and the overhead for ADACUR accumulates linearly as the number of rounds increases. See §B.5 for a detailed discussion.


Figure 6: Top-k-Recall for ADACUR on YuGiOh, |Q_train| = 500, for the different strategies for sampling items based on approximate scores described in §2.2.

Figure 7: Average approximation error of the CUR matrix factorization on test queries for domain=YuGiOh and |Q_train| = 500 when choosing k_i = 50 anchor items uniformly at random (ANNCUR), using oracle strategies from §C.2, and for ADACUR when sampling anchor items over five rounds. Approximation error is computed as the average absolute difference between approximate and exact item scores.

Figure 9: Scatter plot showing approximate versus exact cross-encoder scores for a query from domain=YuGiOh when choosing k_i = 50 anchor items using oracle strategies from §C.2 and |Q_train| = 500. The top-k for k = 1, 10, 100 w.r.t. exact cross-encoder scores are annotated with text along with vertical lines, different color bands indicate the ordering of items w.r.t. approximate scores, and anchor items are shown in blue.
Figure 10 shows Top-k-Recall for DE_BASE, DE^CE_BASE, ADACUR_DE_BASE, and MuVER (Ma et al., 2021), a recent multi-vector model trained on the ZESHEL dataset. For MuVER, we use the pretrained checkpoint released by Ma et al. (2021) with the view-merging inference strategy described therein. While MuVER can be more accurate than DE_BASE, DE^CE_BASE, obtained by finetuning DE_BASE on the target domain, outperforms MuVER, and our proposed method ADACUR_DE_BASE yields the best Top-k-Recall versus inference cost trade-offs for all values of k.
Figures 13, 14, and 15 show Top-k-Recall for domains YuGiOh, StarTrek, and Military, respectively, for |Q_train| ∈ {100, 500, 2000}. For each baseline retrieval method, ADACUR always performs better than ANNCUR, which in turn generally performs better than merely re-ranking items retrieved using the corresponding baseline retrieval method. In most cases, Top-k-Recall for ADACUR_DE_BASE > ANNCUR_DE_BASE > DE_BASE, and ADACUR_TF-IDF > ANNCUR_TF-IDF > TF-IDF.
(a) Sampling 50 anchor items adaptively for ADACUR (over five rounds) versus uniformly at random for ANNCUR.
(b) Sampling 200 anchor items adaptively for ADACUR (over five rounds) versus uniformly at random for ANNCUR.

Figure 11: Scatter plot showing approximate versus exact cross-encoder scores for a query from domain=YuGiOh, |Q_train| = 500, when choosing k_i = 50 and 200 anchor items with ADACUR over five rounds and uniformly at random with ANNCUR. The top-k for k = 1, 10, 100 w.r.t. exact cross-encoder scores are annotated with text, different color bands indicate the ordering of items w.r.t. approximate scores, and anchor items are shown in blue. With ADACUR, the first batch of anchor items in Figures 11a-1 and 11b-1 is chosen uniformly at random, and in subsequent rounds, the items with the highest approximate scores are chosen. Note that the approximation error for top-scoring items improves significantly when the 50 anchor items are chosen adaptively (see Figure 11a-5), with the improvement being much more significant than that from merely increasing the number of anchor items sampled uniformly at random from 50 (Figure 11a-6) to 200 (Figure 11b-6).

Figure 13: Top-k-Recall for ADACUR (using five rounds) and baselines for domain=YuGiOh. Each subfigure corresponds to a different number of train/anchor queries (|Q_train|).

Figure 14: Top-k-Recall for ADACUR (using five rounds) and baselines for domain=StarTrek. Each subfigure corresponds to a different number of train/anchor queries (|Q_train|).

Figure 15: Top-k-Recall for ADACUR (using five rounds) and baselines for domain=Military. Each subfigure corresponds to a different number of train/anchor queries (|Q_train|). Note that DE_BASE has high Top-1-Recall values because domain=Military is included in the set of ZESHEL training domains used to train both DE_BASE and the cross-encoder model.
Recall for ADACUR saturates at around N_R = 5-10 rounds while incurring negligible overhead in inference latency. As shown in Figure 4, cross-encoder score computation is the main bottleneck in test-time inference, taking ∼7ms per score2. Increasing N_R to large values such as 100 can incur up to 25% overhead, with step (a) contributing significantly to this overhead, although the time taken by matrix multiplication in step (b) remains a negligible fraction of the overall latency.

Table 1: Statistics on the number of items (I) and the number of queries in the train and test splits for each domain. The train-query (Q_train) split refers to queries used for distilling dual-encoder models or for indexing items with ADACUR and ANNCUR. For ZESHEL domains, we create train-test splits by splitting the queries in each domain uniformly at random, and we test with three different splits containing 100, 500, or 2000 train queries. For BEIR domains, we use pseudo-queries released as part of the benchmark as train queries (Q_train) and run k-NN evaluation on test queries from the official test split (as per the BEIR benchmark) of these domains. For HotpotQA, we use the first 1K of a total of 7K test queries, and we use all 1K test queries for SciDocs.

entity and negative entities mined using the dual-encoder. We refer readers to Yadav et al. (2022) for further details on cross-encoder training. We perform k-NN experiments on the ZESHEL domains YuGiOh, StarTrek, and Military, of which only Military was part of the training data used to train the cross-encoder model; YuGiOh and StarTrek are among the original ZESHEL test domains, and the cross-encoder model was not trained on them.

Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems.
Lajanugen Logeswaran, Ming-Wei Chang, Kenton Lee, Kristina Toutanova, Jacob Devlin, and Honglak Lee. 2019. Zero-shot entity linking by reading entity descriptions. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.

Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In International Conference on Learning Representations (ICLR).

Xinyin Ma, Yong Jiang, Nguyen Bach, Tao Wang, Zhongqiang Huang, Fei Huang, and Weiming Lu. 2021. MuVER: Improving first-stage entity retrieval with multi-view entity representations. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing.

Michael W. Mahoney and Petros Drineas. 2009. CUR matrix decompositions for improved data analysis. Proceedings of the National Academy of Sciences, 106:697-702.

Yingqi Qu, Yuchen Ding, Jing Liu, Kai Liu, Ruiyang Ren, Wayne Xin Zhao, Daxiang Dong, Hua Wu, and Haifeng Wang. 2021. RocketQA: An optimized training approach to dense passage retrieval for open-domain question answering. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

Stephen Robertson, Hugo Zaragoza, et al. 2009. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval, 3(4).

Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. 2021. BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2).