Hybrid Inverted Index Is a Robust Accelerator for Dense Retrieval

Inverted file structure is a common technique for accelerating dense retrieval. It clusters documents based on their embeddings; during search, it probes nearby clusters w.r.t. an input query and only evaluates documents within them by subsequent codecs, thus avoiding the expensive cost of exhaustive traversal. However, clustering is always lossy, which results in relevant documents being missed from the probed clusters and hence degrades retrieval quality. In contrast, lexical matching, such as overlaps of salient terms, tends to be a strong feature for identifying relevant documents. In this work, we present the Hybrid Inverted Index (HI$^2$), where embedding clusters and salient terms work collaboratively to accelerate dense retrieval. To make the best of both effectiveness and efficiency, we devise a cluster selector and a term selector to construct compact inverted lists and efficiently search through them. Moreover, we leverage simple unsupervised algorithms as well as end-to-end knowledge distillation to learn these two modules, with the latter further boosting effectiveness. Based on comprehensive experiments on popular retrieval benchmarks, we verify that clusters and terms indeed complement each other, enabling HI$^2$ to achieve lossless retrieval quality with competitive efficiency across various index settings. Our code and checkpoints are publicly available at https://github.com/namespace-Pt/Adon/tree/HI2.


Introduction
Recently, dense retrieval has become the de-facto paradigm for high-quality first-stage text retrieval, serving as a fundamental component in various information access applications such as search engines (Zou et al., 2021), recommender systems (Zhao et al., 2022), and question answering systems (Karpukhin et al., 2020). Specifically, dense retrievers encode queries and documents into latent embeddings in the semantic space using bi-encoders, and retrieve relevant documents based on embedding similarity. In practice, they rely on Approximate Nearest Neighbor (ANN) indexes to avoid the expensive traversal of all document embeddings for each input query, a.k.a. the brute force search (Johnson et al., 2019).
There are numerous ANN options, e.g., the hashing based ones (Datar et al., 2004; Wang et al., 2018), the tree based ones (Bentley, 1975; Wang et al., 2014), the graph based ones (Wang et al., 2012; Malkov and Yashunin, 2018), and the vector quantization (VQ) based ones (Jégou et al., 2011a,b). Among all these alternatives, the VQ based indexes, exemplified by IVF-PQ, are particularly praised for their high running efficiency in terms of both query latency and space consumption, wherein the inverted file structure (IVF) is an indispensable component (Jégou et al., 2011a).
IVF partitions all document embeddings into disjoint clusters by KMeans. During search, it finds the clusters nearest to an input query and evaluates the documents within them by subsequent codecs (e.g., PQ). By increasing the number of clusters to scan, one may expect higher retrieval quality since the relevant document is more likely to be included, yet also higher query latency since there are more documents to evaluate (Jégou et al., 2011a). On top of this basic idea, recent studies improve the accuracy of IVF by grouping the cluster embeddings and skipping the least promising groups (Baranchuk et al., 2018), creating duplicated records for boundary embeddings (Chen et al., 2021), and learning the cluster assignments end-to-end by knowledge distillation (Xiao et al., 2022a). Despite these improvements, IVF still exhibits limited retrieval quality, especially when high efficiency is needed. This is because the clustering is too lossy to include relevant documents in the few clusters closest to the query. What's worse, it is not cost-effective to probe more clusters, which sacrifices much efficiency for minor effectiveness improvements. To better illustrate these points, we take a concrete example.
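To make the probing logic concrete, here is a minimal sketch of basic IVF search in NumPy. The data structures (a centroid matrix and per-cluster document-id lists) and function names are illustrative stand-ins, and the final step uses exact "Flat" scoring rather than PQ:

```python
import numpy as np

def ivf_search(query_emb, centroids, cluster_to_docs, doc_embs, n_probe, top_k):
    """Basic IVF probing: rank centroids, visit the n_probe closest
    clusters, and evaluate only the documents inside them."""
    cluster_scores = centroids @ query_emb          # (L,) centroid scores
    probed = np.argsort(-cluster_scores)[:n_probe]  # ids of probed clusters
    # Gather candidate document ids from the probed inverted lists.
    candidates = np.concatenate([cluster_to_docs[c] for c in probed])
    # Exact ("Flat") scoring of the candidates; a real system would use PQ.
    doc_scores = doc_embs[candidates] @ query_emb
    order = np.argsort(-doc_scores)[:top_k]
    return candidates[order], doc_scores[order]
```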
Example 1 In Figure 1, we showcase the recall-latency trade-off derived from changing the number of clusters to visit in the basic IVF and the distilled IVF (the best IVF method so far (Xiao et al., 2022a)). We use the "Flat" codec, which reranks the candidate documents by brute force. As such, any retrieval loss against brute force search (denoted as "Flat") is due to the failure of IVF.
Two critical facts can be observed. First, despite improvements from end-to-end distillation, both IVF methods suffer from much poorer retrieval quality at low query latency. With 20ms latency, IVF-Flat and DistillIVF-Flat achieve recalls of 0.758 and 0.862, both of which lag far behind the 0.927 from brute force search. Second, probing more clusters marginally improves recall but significantly increases latency. To promote recall from 0.899 to 0.909, DistillIVF-Flat doubles the query latency (from 90ms to around 200ms). Consequently, there is plenty of room to optimize IVF to achieve lossless retrieval quality with high efficiency.
In contrast to cluster proximity, extensive research has demonstrated that lexical matching, e.g., overlaps of salient terms between queries and documents, tends to be a strong feature for identifying relevant documents (Robertson and Zaragoza, 2009; Lin and Ma, 2021; Formal et al., 2021). Moreover, a complementary effect has been observed from combining lexical and semantic matching in hybrid retrieval systems (Kuzi et al., 2020; Gao et al., 2020; Shen et al., 2022; Zhang et al., 2023).
In this work, we explore the potential of unifying embedding clusters and salient terms in a Hybrid Inverted Index (HI$^2$) for the acceleration of dense retrieval. Specifically, each document reference is indexed in inverted lists of two types of entries: embedding clusters and salient terms. When searching, the input query is dispatched to both types of inverted lists. Documents within them are merged and evaluated by subsequent codecs.
For effectiveness, HI$^2$ needs to include relevant documents in the dispatched inverted lists; for efficiency, HI$^2$ requires these inverted lists to be small enough to avoid significant overhead during post-hoc evaluation. Both call for constructing compact inverted lists and efficiently searching through them. To this end, we devise a cluster selector and a term selector, which accurately and efficiently pick out only a few clusters and terms for indexing and searching, respectively.
As for the implementation of the cluster and term selectors, we show that simple unsupervised algorithms, i.e., KMeans and BM25, work surprisingly well, whereby HI$^2$ already substantially outperforms previous IVF methods with competitive efficiency. Moreover, we propose to leverage neural networks for their realization and end-to-end learning with a knowledge distillation objective. This approach further boosts the retrieval quality, enabling HI$^2$ to remarkably and consistently surpass other ANN indexes on popular retrieval benchmarks, i.e., MS MARCO (Nguyen et al., 2016) and Natural Questions (Kwiatkowski et al., 2019).
Our contributions are summarized as follows: • We propose the Hybrid Inverted Index, which combines embedding clusters and salient terms for accelerating dense retrieval.
• We devise tailored techniques, i.e., the cluster selector, the term selector, and the joint optimization, to guarantee the effectiveness and efficiency of HI$^2$.
• We evaluate HI$^2$ with extensive experiments and verify its robust advantage across implementation variations, indexing/searching configurations, and embedding models.

Related Work
• Dense Retrieval. In the last four years, the rapid development of pre-trained language models, e.g., BERT (Devlin et al., 2019), has significantly pushed forward the progress of dense retrieval, making it increasingly popular for high-quality first-stage retrieval (Zhao et al., 2022; Zhu et al., 2023). Dense retrievers encode queries and documents into dense vectors (i.e., embeddings) in the same latent space, where semantic relevance is measured by embedding similarity. Recent studies further enhance their retrieval quality by retrieval-oriented pre-training (Wang et al., 2022; Xiao et al., 2022b; Gao and Callan, 2022), delicate negative sampling (Xiong et al., 2021; Qu et al., 2021; Zhan et al., 2021b), and knowledge distillation from more powerful rankers (Zhang et al., 2022; Lu et al., 2022; Qu et al., 2021).
• ANN Indexes. In practice, relevant documents usually need to be retrieved from a massive collection. Consequently, dense retrieval must rely on Approximate Nearest Neighbor (ANN) indexes to avoid the expensive brute force search. ANN indexes can be realized via different strategies: 1) the hashing based ones (Datar et al., 2004; Weiss et al., 2008; Wang et al., 2018); 2) the tree based ones (Bentley, 1975; Wang et al., 2014; Muja and Lowe, 2014); 3) the graph based ones (Dong et al., 2011; Wang et al., 2012; Malkov and Yashunin, 2018); 4) the vector quantization (VQ) based ones (Ge et al., 2014; Jégou et al., 2011a,b; Baranchuk et al., 2018). Among these options, the VQ based indexes are particularly preferred for massive-scale retrieval owing to their high efficiency in terms of both query latency and space consumption (Johnson et al., 2019).
• VQ Index Optimization. Despite their competitive efficiency, VQ-based indexes are prone to limited retrieval quality when low latency is desired. In recent years, continuous efforts have been dedicated to alleviating this problem, which can be categorized into two threads. One thread designs advanced heuristics for clustering and evaluation. For example, Jégou et al. (2011b) and Baranchuk et al. (2018) add another refinement stage over the quantized embeddings and skip less promising clusters according to tailored heuristics. Chen et al. (2021) create duplicated references for boundary embeddings to improve recall with high efficiency. The other thread optimizes the VQ index towards retrieval quality with a cross-entropy loss instead of minimizing the reconstruction loss. For example, Zhan et al. (2021a) and Xiao et al. (2021) jointly learn the query encoder and the product quantizer by contrastive learning. Xiao et al. (2022a) further improve the accuracy by leveraging knowledge distillation for joint optimization. However, all these methods stick to conventional IVF to organize the search space, which is suboptimal as shown in Example 1. In this work, our proposed Hybrid Inverted Index supports efficient identification of relevant documents through both semantic and lexical matching. Note that our work is orthogonal to works on efficient inverted index access (Broder et al., 2003; Mallia et al., 2022) and hence can be combined with them for further acceleration.
• Hybrid Retrieval. Recently, there have been emergent recipes for the union of semantic (dense) and lexical (sparse) features. Some of them are direct ensembles of dense and sparse retrievers (Ma et al., 2021; Kuzi et al., 2020); others use enhanced optimization objectives, e.g., adversarial hard negatives and distillation, to jointly learn from semantic/lexical features (Gao et al., 2020; Shen et al., 2022; Zhang et al., 2023). However, they all rely on separate sparse and dense indexes, and interpolate the scores from the two indexes. Different from them, HI$^2$ combines semantic and lexical features at the index level, and estimates scores universally by specific codecs. Meanwhile, HI$^2$ may benefit from the enhanced optimization objectives in these works, which we leave for future work.

Dense Retrieval
Given a collection of documents $\mathcal{D} = \{D_i\}_{i=1}^{|\mathcal{D}|}$, dense retrieval aims to retrieve the top $R$ relevant documents from $\mathcal{D}$ in response to an input query $Q$. Specifically, each document $D \in \mathcal{D}$ and query $Q$ is encoded into its embedding $e_D, e_Q \in \mathbb{R}^h$ by a document encoder and a query encoder, respectively. Next, relevance is measured by the inner product between them, whereby the top $R$ ranked documents are returned.
In reality, it is impractical to evaluate every document (computing $\langle e_Q, e_D \rangle$) for each input query (i.e., the brute force search), which results in exceedingly high latency and resource consumption. Instead, ANN indexes are used to avoid exhaustively scanning all documents and to accelerate the relevance measurement by approximation.
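For reference, the brute force ("Flat") search takes only a few lines with Faiss, the toolkit used in our efficiency evaluations; the random vectors below are merely placeholders for real encoder outputs:

```python
import numpy as np
import faiss  # the toolkit used for our ANN evaluations

h = 768  # embedding dimension of BERT-base encoders
doc_embs = np.random.rand(10000, h).astype("float32")  # stand-in embeddings
query_embs = np.random.rand(4, h).astype("float32")

index = faiss.IndexFlatIP(h)   # exact inner-product ("Flat") search
index.add(doc_embs)
scores, doc_ids = index.search(query_embs, 100)  # top-100 documents per query
```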

Inverted File Structure and Product Quantization
Among all alternative ANN indexes, the Vector Quantization (VQ) based ones are particularly popular for massive-scale retrieval.They consist of two basic modules: the inverted file structure (IVF) and the product quantization (PQ).
To avoid exhaustive search, IVF partitions all documents into disjoint clusters $\mathcal{C} = \{C_i\}_{i=1}^{L}$ by KMeans, where each cluster is associated with an embedding $e_{C_i} \in \mathbb{R}^h$. For the query $Q$, documents within the closest $K_C$ clusters are evaluated by the subsequent codec (PQ by default):

$$A(Q) = \{\, D \mid D \in C_i,\; C_i \in \mathrm{TopK}_{K_C}(\{\langle e_Q, e_{C_i} \rangle\}_{i=1}^{L}) \,\}. \tag{1}$$

To accelerate relevance estimation, PQ compresses the document embedding into discrete integer codes according to a codebook $v \in \mathbb{R}^{m \times k \times h/m}$. It splits $e_D$ into $m$ fragments $\{e_D^j\}_{j=1}^{m}$, then quantizes each fragment to the closest codeword in $v$:

$$\theta_j^* = \operatorname*{arg\,min}_{t \in \{1,\dots,k\}} \| e_D^j - v_{j,t} \|_2, \quad j = 1, \dots, m. \tag{2}$$

Therefore, only the global codebook $v$ and the codeword assignments $\theta^*$ need to be stored, which is much smaller than the full-precision embeddings.

Finally, the relevance is evaluated by:

$$\langle e_Q, e_D \rangle \approx \sum_{j=1}^{m} \langle e_Q^j, v_{j,\theta_j^*} \rangle, \tag{3}$$

where $e_Q^j$ is the $j$-th query embedding fragment. Since the inner product between $e_Q^j$ and any codeword $v_{j,*}$ can be cached once computed, the relevance estimation approximated by PQ is much faster.
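The following NumPy sketch illustrates the PQ encode/score path described above; shapes and names are illustrative, and a production implementation would vectorize over all documents and reuse the cached per-query lookup table:

```python
import numpy as np

def pq_encode(doc_emb, codebook):
    """Split the embedding into m fragments and map each fragment to the id
    of its closest codeword (Eq. 2). codebook has shape (m, k, h/m)."""
    m, k, sub = codebook.shape
    frags = doc_emb.reshape(m, sub)
    dists = np.linalg.norm(codebook - frags[:, None, :], axis=-1)  # (m, k)
    return dists.argmin(axis=1)  # m integer codes, one per fragment

def pq_score(query_emb, codes, codebook):
    """Approximate <e_Q, e_D> as a sum of fragment-codeword inner products
    (Eq. 3); the (m, k) lookup table can be cached once per query."""
    m, k, sub = codebook.shape
    q_frags = query_emb.reshape(m, sub)
    table = np.einsum("ms,mks->mk", q_frags, codebook)
    return table[np.arange(m), codes].sum()
```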
By increasing the number of clusters to scan ($K_C$), higher retrieval quality can be achieved because the relevant document is more likely to be included in $A(Q)$. Yet, latency increases at the same time, as more documents need to be evaluated. Conventional IVF falls short in including the relevant document given a small $K_C$; meanwhile, it must sacrifice substantial efficiency for minor retrieval quality improvements. In this work, we propose an alternative that alleviates these problems.

Hybrid Inverted Index
The framework of the Hybrid Inverted Index (HI$^2$) is shown in Figure 2. HI$^2$ organizes the search space with two types of inverted lists: embedding clusters ($\mathcal{C} = \{C_i\}_{i=1}^{L}$) and salient terms ($\mathcal{T} = \{T_v \mid v \in \mathcal{V}\}$, where $\mathcal{V}$ is the term vocabulary). Each document reference is stored in the inverted lists of 1 cluster and $K_{T_1}$ terms. When searching, the input query $Q$ is dispatched to the inverted lists of $K_C$ clusters and $K_{T_2}$ terms. Documents within them are merged and evaluated by PQ. Formally,

$$A(Q) = \Big( \bigcup_{C_i \in \mathcal{C}_Q} C_i \Big) \cup \Big( \bigcup_{T_v \in \mathcal{T}_Q} T_v \Big), \tag{5}$$

where $\mathcal{C}_Q$ denotes the $K_C$ dispatched clusters and $\mathcal{T}_Q$ the $K_{T_2}$ dispatched terms. To determine which clusters/terms to use for indexing the document and dispatching the query, HI$^2$ employs two modules: a cluster selector and a term selector. They can be implemented with simple unsupervised algorithms (resulting in HI$^2_{unsup}$) or neural networks (resulting in HI$^2_{sup}$). This flexible implementation scheme injects high practicability into HI$^2$. In the following, we elaborate on the two modules (§4.1 and §4.2) and the supervised optimization of their neural network implementation (§4.3).
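The search procedure of Eq. 5 can be sketched as follows; the selector interfaces and the codec object are hypothetical stand-ins for the modules described in §4.1 and §4.2:

```python
def hi2_search(query_emb, query_terms, cluster_selector, term_selector,
               cluster_lists, term_lists, codec, k_c, k_t2, top_k):
    """Dispatch the query to K_C cluster lists and K_T2 term lists,
    merge their documents, and let the codec rank the merged set (Eq. 5)."""
    clusters = cluster_selector.top_clusters(query_emb, k_c)
    terms = term_selector.top_terms(query_terms, k_t2)
    candidates = set()
    for c in clusters:
        candidates.update(cluster_lists[c])
    for t in terms:
        candidates.update(term_lists[t])
    # The codec (PQ by default) scores and ranks the merged candidates.
    return codec.rank(query_emb, sorted(candidates))[:top_k]
```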

Cluster Selector
This module selects 1 cluster for indexing the document and $K_C$ clusters for dispatching the query to. Specifically, it associates each cluster $C_i$ with an embedding $e_{C_i} \in \mathbb{R}^h$, then scores each cluster by the inner product with the document embedding $e_D$ or the query embedding $e_Q$: $\langle e_*, e_{C_i} \rangle$. The document is indexed to the cluster with the highest score. When searching, the query is dispatched to the top $K_C$ clusters:

$$\mathcal{C}_Q = \mathrm{TopK}_{K_C}(\{\langle e_Q, e_{C_i} \rangle\}_{i=1}^{L}). \tag{6}$$

For HI$^2_{unsup}$, the cluster embeddings $\{e_{C_i}\}_{i=1}^{L}$ are produced by KMeans over all document embeddings. For HI$^2_{sup}$, they are initialized with KMeans and optimized on-the-fly by the objective in §4.3.
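A minimal sketch of this module, assuming the centroid matrix comes from KMeans (the interface is illustrative):

```python
import numpy as np

class ClusterSelector:
    """Centroids come from KMeans (HI^2_unsup) and may be fine-tuned
    end-to-end (HI^2_sup); cluster_embs has shape (L, h)."""
    def __init__(self, cluster_embs):
        self.cluster_embs = cluster_embs

    def assign(self, doc_emb):
        # Index the document into its single highest-scoring cluster.
        return int(np.argmax(self.cluster_embs @ doc_emb))

    def top_clusters(self, query_emb, k_c):
        # Dispatch the query to the top-K_C clusters (Eq. 6).
        scores = self.cluster_embs @ query_emb
        return np.argsort(-scores)[:k_c]
```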

Term Selector
This module selects $K_{T_1}$ terms for indexing the document and $K_{T_2}$ terms for dispatching the query. There are two concerns in designing the term selector: 1) the selected terms must be representative w.r.t. the input, so that lexical matching between the query and the document can be effectively captured; 2) the term selection for the query must be efficient enough to avoid excessive online overhead.
Therefore, for the document $D$, the term selector first tokenizes it into $\{d_i\}_{i=1}^{|D|}$ where $d_i \in \mathcal{V}$, then it estimates the score of each unique term $v \in \mathcal{V}$ in $D$ with BM25 (HI$^2_{unsup}$) or BERT (HI$^2_{sup}$). Formally,

$$s_{v,D} = \begin{cases} \dfrac{\mathrm{tf}(v, D)\,(\alpha + 1)}{\mathrm{tf}(v, D) + \alpha \left(1 - \beta + \beta \frac{|D|}{\mathrm{avgdl}}\right)} \cdot \mathrm{IDF}(v), & \text{HI}^2_{unsup} \\[2ex] \max_{i:\, d_i = v} f\big(\mathrm{BERT}(D)_i\big), & \text{HI}^2_{sup} \end{cases} \tag{7}$$

where $\alpha, \beta$ are hyper-parameters, $\mathrm{avgdl}$ is the average document length, $f(\cdot)$ is a two-layer MLP of $\mathbb{R}^h \to \mathbb{R}^1$ with ReLU activation, and $\mathrm{BERT}$ denotes encoding by the BERT model (Devlin et al., 2019). As such, the top $K_{T_1}$ scored terms are used for indexing the document. Besides, the average score of each term across all documents ($s_v$) is stored.
The query $Q$ is tokenized into $\{q_i\}_{i=1}^{|Q|}$ likewise, but it is not processed with any complex computation to save online cost. For short queries, all constituent terms are selected; for long queries, the terms with the top $K_{T_2}$ average scores are selected:

$$\mathcal{T}_Q = \mathrm{TopK}_{K_{T_2}}(\{s_{q_i}\}_{i=1}^{|Q|}). \tag{8}$$
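Below is an illustrative sketch of the unsupervised term selector, spelling out the BM25 variant of Eq. 7 with $\alpha$, $\beta$ playing the usual k1/b roles; the query side only looks up cached average scores, so dispatching stays cheap:

```python
from collections import Counter

class UnsupTermSelector:
    """BM25 scores (Eq. 7) pick the K_T1 indexing terms per document;
    queries only look up cached average term scores."""
    def __init__(self, idf, avgdl, alpha=0.82, beta=0.68):
        self.idf, self.avgdl = idf, avgdl
        self.alpha, self.beta = alpha, beta
        self.avg_score = {}  # s_v, accumulated while indexing documents

    def doc_terms(self, doc_tokens, k_t1):
        tf = Counter(doc_tokens)
        norm = 1 - self.beta + self.beta * len(doc_tokens) / self.avgdl
        scores = {v: self.idf.get(v, 0.0) * f * (self.alpha + 1)
                     / (f + self.alpha * norm) for v, f in tf.items()}
        return sorted(scores, key=scores.get, reverse=True)[:k_t1]

    def top_terms(self, query_tokens, k_t2):
        terms = set(query_tokens)
        if len(terms) <= k_t2:  # short query: keep all of its terms
            return list(terms)
        return sorted(terms, key=lambda v: self.avg_score.get(v, 0.0),
                      reverse=True)[:k_t2]
```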

Joint Optimization
HI$^2_{sup}$ involves learning the cluster embeddings in the cluster selector, and the MLP and BERT in the term selector. We propose a knowledge distillation objective for jointly training these parameters towards retrieval quality. Concretely, we sample a subset of documents for the query ($\tilde{\mathcal{D}} \subseteq \mathcal{D}$), then employ a powerful teacher $\Theta$ to produce accurate estimations of their relevance. Finally, we enforce the cluster selector and the term selector to produce similar estimations via KL divergence:

$$\mathcal{L}_{KL} = \mathrm{KL}\big(P_\Theta \,\|\, P_c\big) + \mathrm{KL}\big(P_\Theta \,\|\, P_t\big). \tag{9}$$

Following (Xiao et al., 2022a), we simply choose off-the-shelf embeddings as teachers. Denoting the softmax operator over $\tilde{\mathcal{D}}$ as $\mathrm{sm}$, the teacher estimations are:

$$P_\Theta = \mathrm{sm}\big(\langle e_Q^\Theta, e_D^\Theta \rangle\big). \tag{10}$$

The cluster selector estimates relevance by query-cluster embedding similarity:

$$P_c = \mathrm{sm}\big(\langle e_Q, e_{C_{\phi(D)}} \rangle\big), \tag{11}$$

where $\phi(D)$ is the cluster index of the document.
The term selector estimates relevance by term-score vector similarity:

$$P_t = \mathrm{sm}\big(\langle s_Q, s_D \rangle\big), \tag{12}$$

where $s_Q$ and $s_D$ are the score vectors over the vocabulary derived from Eq. 7. Note that here both queries and documents are processed the same way.
Additionally, since the document cluster assignment $\phi(D)$ is fixed, we add a commitment loss to the final loss to keep the document embedding close to its associated cluster, which is a common practice for learning quantization (van den Oord et al., 2017):

$$\mathcal{L} = \mathcal{L}_{KL} + \sum_{D \in \tilde{\mathcal{D}}} \| e_D - e_{C_{\phi(D)}} \|_2^2. \tag{13}$$
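A PyTorch sketch of this joint objective is given below; the tensor shapes, the commitment weight, and the helper name are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def hi2_distill_loss(teacher_scores, cluster_scores, term_scores,
                     doc_embs, assigned_centroids, commit_weight=1.0):
    """KL-distill the teacher's relevance distribution into both selectors
    (Eqs. 9-12), plus a commitment term that keeps each document embedding
    close to its fixed centroid (Eq. 13). Score tensors: (batch, |D~|)."""
    teacher = F.softmax(teacher_scores, dim=-1)
    kl_cluster = F.kl_div(F.log_softmax(cluster_scores, dim=-1),
                          teacher, reduction="batchmean")
    kl_term = F.kl_div(F.log_softmax(term_scores, dim=-1),
                       teacher, reduction="batchmean")
    commit = (doc_embs - assigned_centroids).pow(2).sum(-1).mean()
    return kl_cluster + kl_term + commit_weight * commit
```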

Experiments
In this section, we first introduce our experimental settings, then carry out extensive experiments to investigate the following research questions (RQ). RQ1: How do the effectiveness and efficiency of HI$^2$ compare with those of baseline methods? RQ2: Do clusters and terms complement each other for identifying relevant documents? RQ3: How robust is HI$^2$ across different embedding models?

Experimental Settings
• Datasets. We use two popular benchmark datasets. 1) MS MARCO (Nguyen et al., 2016): we use the passage track, including 502,939 training queries and 6,980 evaluation queries (dev set); the corpus size is 8,841,823. 2) Natural Questions (Kwiatkowski et al., 2019): we follow the split of DPR (Karpukhin et al., 2020), resulting in 79,168 training queries and 3,610 testing queries; the corpus size is 21,015,324.
For evaluating retrieval efficiency, we compute the average query latency (QL) and the overall index size (IS). Our evaluations are based on the same batch size, thread number, and toolkits (Faiss (Johnson et al., 2019) for ANN indexes and Pyserini (Lin et al., 2021) for sparse models). Note that the latency of HI$^2$ is on par with that of IVF-OPQ given the same number of candidates to evaluate, because the term selector dispatches the query with simple heuristics that introduce very little overhead.
• Implementation Details. For all methods involving clustering, we set the number of clusters $L$ to 10,000 and the number of probed clusters during search to 100 (except for HI$^2$). For all methods involving PQ, we set the number of fragments $m$ to 96 and the number of sub-clusters $k$ to 256. For HI$^2$, we use BERT's vocabulary (Devlin et al., 2019) as the term vocabulary $\mathcal{V}$, resulting in 30,522 unique terms in total. $K_{T_2}$ is always set to 32 for both HI$^2_{unsup}$ and HI$^2_{sup}$. For HI$^2_{unsup}$, we use KMeans over all document embeddings to produce the cluster embeddings $\{e_{C_i}\}_{i=1}^{L}$, BM25 to produce the term scores $s_v$ with $\alpha = 0.82$, $\beta = 0.68$, and OPQ (Ge et al., 2014) as the evaluation codec, all of which are unsupervised algorithms. $K_C$ is set to 25 and $K_{T_1}$ to 15. For HI$^2_{sup}$, we initialize the cluster embeddings with KMeans and optimize them afterward. Note that the cluster assignment $\phi(D)$ is fixed once initialized. We use bert-base-uncased for the term selector. The passage is tokenized to 128 tokens before encoding. We employ the distilled OPQ (Xiao et al., 2022a) as the evaluation codec. $K_C$ is set to 30 and $K_{T_1}$ to 3. More details are in the appendix. For reproducibility, we release our source code and model checkpoints at https://github.com/namespace-Pt/Adon/tree/HI2.

Main Analysis (RQ1)
We report the overall evaluation results in Table 1.
On the one hand, our hybrid inverted index demonstrates superior effectiveness over baseline ANN indexes. Specifically, HI$^2_{unsup}$, which relies solely on unsupervised algorithms, improves the Recall@100 of IVF-OPQ (the basic unsupervised VQ index) by 14%, and improves that of Distill-VQ (the strongest supervised VQ index in the literature) by 8%. It even surpasses the powerful HNSW index by 3 absolute points in Recall@100, which is a more valuable metric for first-stage retrieval than MRR. Moreover, the neural network implementation and the end-to-end knowledge distillation further unleash its potential, as HI$^2_{sup}$ further amplifies the margins over the ANN baselines. Remarkably, HI$^2_{sup}$ achieves on-par retrieval quality with its brute-force-search teacher (RetroMAE on MS MARCO and AR2 on NQ), and surpasses many well-established sparse and dense retrievers.
On the other hand, the efficiency of our hybrid inverted index is also satisfactory. Its query latency is the second lowest on both datasets, accelerating the brute force search (Flat) by hundreds of times and only slightly falling behind HNSW. Notably, the latency is even lower than that of VQ-based indexes, because HI$^2$ needs to evaluate fewer candidate documents. Besides, HI$^2$ possesses a moderate index size, which is bigger than the VQ baselines since more document references need to be stored, yet much smaller than Flat or HNSW since it does not need to store full-precision embeddings.
As such, we have showcased the outstanding effectiveness and efficiency of HI$^2$ under one specific setting. Next, we examine the effectiveness-efficiency trade-off of HI$^2$ and the ANN baselines (we exclude IVF-PQ and IVF-JPQ: the former is too weak and the latter is similar to IVF-OPQ). Specifically, for VQ indexes, we change the number of clusters to visit; for HNSW, we change the number of neighbors to visit; for HI$^2$, we change the number of terms to index ($K_{T_1}$) and the number of clusters to dispatch ($K_C$). Since the index size is static, we measure Recall@100 as effectiveness and average query latency as efficiency. The resulting trade-off curves are reported in Figure 3.
From the figure, HI$^2_{unsup}$ performs on par with the powerful HNSW across various index settings, as their recalls are almost identical given the same latency. Both of them significantly outperform the VQ baselines. Moreover, HI$^2_{sup}$ brings substantial improvement over HI$^2_{unsup}$ and HNSW, achieving higher recall with lower latency. Meanwhile, it efficiently approaches the brute-force-search effectiveness. In contrast, the VQ baselines need to largely increase latency to marginally improve recall, while still lagging far behind brute force search.
Based on the above analysis, we answer RQ1: HI$^2$ achieves lossless retrieval quality against brute force search, with low query latency and a small index size, significantly and consistently outperforming baseline ANN indexes and retaining these advantages across indexing/searching configurations.

Ablation Analysis (RQ2)
To answer RQ2, we study the individual contributions of embedding clusters and salient terms. Specifically, we disable the inverted lists corresponding to terms and clusters, respectively, denoted as w.o. Term and w.o. Clus. Other configurations are kept the same. We plot their recall-latency trade-off curves in Figure 4.
Two critical facts can be observed. First, salient terms tend to be better features for organizing the search space than embedding clusters, as the w.o. Clus variants significantly and consistently outperform the w.o. Term ones. Thus, our claim that embedding clusters alone fall short in effectively identifying relevant documents is well justified. Second, salient terms and embedding clusters indeed complement each other, as HI$^2_{unsup}$ and HI$^2_{sup}$ beat their "homogeneous" variants in terms of both effectiveness and efficiency. Therefore, we answer RQ2: embedding clusters and salient terms complement each other for more effective and efficient identification of relevant documents.

Robustness Analysis (RQ3)
In Figure 3, we have shown the robust advantage of HI$^2$ across different index configurations. For practical usage, it is also important to evaluate the robustness of HI$^2$ given different embedding models.
In Table 2, we report the performance of HI$^2$ and selected strong baselines with RetroMAE and AR2 as the embedding models. We can notice that HI$^2_{sup}$ always achieves the best recall among all ANN indexes, which is very close to that of brute force search. Besides, HI$^2_{unsup}$ performs on par with the strong HNSW index, which uses full-precision embeddings for evaluation. As for efficiency, HI$^2$ resides in a sweet spot with the second-lowest query latency and a relatively small index size, substantially smaller than Flat and HNSW but slightly bigger than the VQ baselines. Additionally, the performance of the ANN baselines is unstable across embedding models: higher retrieval quality with brute force search does not result in higher retrieval quality with ANN acceleration. For example, the recall of AR2 Flat is inferior to that of RetroMAE Flat on MS MARCO, yet this trend reverses when ANN baselines are applied, i.e., AR2 IVF-OPQ is better than RetroMAE IVF-OPQ. By comparison, the performance of HI$^2$ is stable: higher brute-force-search effectiveness corresponds to higher effectiveness of HI$^2$, regardless of the embedding model.
In summary, we answer RQ3: HI$^2$ enjoys high robustness and stability across different embedding models, consistently surpassing strong ANN baselines with competitive efficiency and aligning well with brute force search.

Conclusion
In this work, we propose the hybrid inverted index, which reformulates conventional IVF by unifying embedding clusters and salient terms to accelerate dense retrieval. We devise tailored techniques for cluster selection, term selection, and joint optimization. With comprehensive experiments, we verify the effectiveness and efficiency of HI$^2$, which consistently outperforms strong ANN baselines across implementation variations, indexing/searching configurations, and embedding models. Moreover, we demonstrate that embedding clusters and salient terms are complementary for identifying relevant documents, which may inspire further research on the combination of semantic and lexical features.

A Baselines

• Dense Retrievers. Dense retrievers encode documents and queries into dense embeddings, then estimate relevance with embedding similarity. For each input query, all document embeddings are evaluated. DPR (Karpukhin et al., 2020): the most basic dense retriever. ANCE (Xiong et al., 2021): enhances DPR with hard negatives mined from the previous model snapshot. CoCondenser (Gao and Callan, 2022): retrieval-oriented pre-training of the encoder model to compress more information into the embedding. AR2 (Zhang et al., 2022): adversarially trains the encoder and a ranker with knowledge distillation. RetroMAE (Xiao et al., 2022b): retrieval-oriented pre-training of the encoder model with a shallow decoder and a representation bottleneck.

B Implementation Details
For all methods involving clustering, we set the number of clusters $L$ to 10,000 and the number of probed clusters during search to 100 (except for HI$^2$). For all methods involving PQ, we set the number of fragments $m$ to 96 and the number of sub-clusters $k$ to 256, which results in a 32-times smaller size than the full-precision embeddings. For HI$^2$, we use BERT's vocabulary (Devlin et al., 2019) as the term vocabulary $\mathcal{V}$, resulting in 30,522 unique terms in total. $K_{T_2}$ is always set to 32 for both HI$^2_{unsup}$ and HI$^2_{sup}$. For HI$^2_{unsup}$, we use KMeans over all document embeddings to produce the cluster embeddings $\{e_{C_i}\}_{i=1}^{L}$, BM25 to produce the term scores $s_v$ with $\alpha = 0.82$, $\beta = 0.68$, and OPQ (Ge et al., 2014) as the evaluation codec, all of which are unsupervised algorithms. $K_C$ is set to 25 and $K_{T_1}$ to 15. For HI$^2_{sup}$, we initialize the cluster embeddings with KMeans and optimize them afterwards. Note that the cluster assignment $\phi(D)$ is fixed once initialized. We use bert-base-uncased for the term selector. The passage is tokenized to 128 tokens before encoding. We employ the distilled OPQ (Xiao et al., 2022a) as the evaluation codec. $K_C$ is set to 30 and $K_{T_1}$ to 3. For training HI$^2_{sup}$, we use the annotated ground-truth document $D^+$, 7 hard negatives sampled from the BM25 top-200 results, and in-batch negatives to form $\tilde{\mathcal{D}}$.
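As a quick sanity check on the stated 32-times compression (assuming $h = 768$ float32 embeddings, as with BERT-base, and one byte per code since $k = 256$):

$$\frac{768 \times 4~\text{bytes (full precision)}}{96 \times 1~\text{byte (PQ codes)}} = \frac{3072}{96} = 32.$$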
In practice, we find that the terms selected by HI$^2_{sup}$ result in much "denser" inverted lists than those of HI$^2_{unsup}$. In other words, some terms may be frequently selected by many passages, translating into extremely large inverted lists. This is especially the case for Natural Questions. Therefore, on NQ, we prune the oversized inverted lists to a moderate size, inspired by static index pruning techniques (Nguyen, 2009). Concretely, after indexing all documents, we count the size of each term-side inverted list, then take the size at the $\gamma$-th percentile ($\gamma$ defaults to 0.996) as the threshold, whereby inverted lists bigger than the threshold are identified as oversized. Next, we order the document references of each oversized inverted list ascendingly by their individual scores for the specific term, and prune references from the head until the size of the inverted list equals the threshold.
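The pruning procedure can be sketched as follows; the data structures are illustrative:

```python
import numpy as np

def prune_term_lists(term_lists, term_doc_scores, gamma=0.996):
    """Truncate term-side inverted lists larger than the gamma-th
    percentile of list sizes, dropping the lowest-scoring references."""
    sizes = np.array([len(docs) for docs in term_lists.values()])
    threshold = int(np.percentile(sizes, gamma * 100))
    for term, docs in term_lists.items():
        if len(docs) > threshold:
            # Keep the references whose score for this term is highest.
            docs.sort(key=lambda d: term_doc_scores[term][d], reverse=True)
            term_lists[term] = docs[:threshold]
    return term_lists
```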

C Codec Analysis
Apart from the default PQ, HI$^2$ can be combined with other codecs. In Table 3, we compare PQ with the most powerful yet most expensive Flat codec. It can be observed that both HI$^2_{unsup}$ and HI$^2_{sup}$ benefit from the more powerful codec. This indicates that HI$^2$ returns high-quality candidates that are universally applicable to different codecs. It also reveals that the current PQ codec is still lossy. However, there is no free lunch: the powerful Flat codec comes with higher latency and a larger index size, which is unfavorable for index efficiency. In summary, we again verify the practicality of HI$^2$: one may flexibly balance between higher effectiveness and higher efficiency.

Figure 1: Recall-latency trade-off example for existing IVF methods on MS MARCO Passage. Better indexes should be located at the lower right corner.

Figure 2: The framework of the Hybrid Inverted Index (HI$^2$). Each document reference is indexed in inverted lists of two types of entries: embedding clusters and salient terms. When searching, the input query is dispatched to both types of inverted lists. Documents within them are merged and evaluated by the subsequent codec (PQ).

Figure 4: Effectiveness-efficiency trade-off of HI$^2$ variants on MS MARCO (A) and Natural Questions (B).

Table 1: Overall evaluation results. Statistically significant results within the ANN group compared with HI$^2_{sup}$ (paired t-test with p < 0.05) are marked with *. Best results are in bold; second-best results are underlined. QL denotes the average query latency; IS denotes the index size.

Table 2: Evaluation of HI$^2$ and strong ANN baselines with different embedding models (Emb.).

Table 3: Evaluation of HI$^2$ with different codecs.