Accelerating Learned Sparse Indexes Via Term Impact Decomposition

Novel inverted index-based learned sparse ranking models provide more effective, but less efficient, retrieval than traditional ranking models like BM25. In this paper, we introduce a technique we call postings clipping to improve the query efficiency of learned representations. Our technique amplifies the benefit of dynamic pruning query processing techniques by accounting for the changed term importance distributions of learned ranking models. The new clipping mechanism accelerates top-k retrieval by up to 9.6× without any loss in effectiveness.


Introduction
Sparse term importance representations such as DeepImpact (Mallia et al., 2021) and uniCOIL (Gao et al., 2021; Lin and Ma, 2021) have enabled the use of effective transformer-based text representations that can match the effectiveness of recent dense text representations (Karpukhin et al., 2020; Qu et al., 2021) while still being supported by inverted indexes and their query operations. This is important because inverted indexes have been optimized through 40 years of research to provide search functionality in distributed settings at web scale, providing a variety of time, space, and retrieval quality tradeoffs, while also supporting efficient updates, advanced querying modes such as phrase matching or filtering, and good scalability, all of which are crucial in real-world settings (Risvik et al., 2013; Tonellotto et al., 2018).
One of the key techniques that enables efficient top-k query processing in inverted indexes is storing additional metadata about index term importance scores (also referred to as impacts), seeking to facilitate the bypassing of the majority of the matching documents, and thus allowing faster retrieval than would be possible via exhaustive disjunctive processing. For example, dynamic pruning algorithms such as MaxScore (Turtle and Flood, 1995) and WAND (Broder et al., 2003) store the index-wide maximum impact of each term; at query time, these impacts can be used to rapidly estimate document scores, allowing documents that have no prospect of entering the current min-heap of k results to be bypassed.
Traditional similarity models such as BM25 guarantee that frequent terms have low importance scores, a symbiotic relationship that allows fast query processing. On the other hand, recent transformer-based learned term importance techniques such as DeepImpact are not constrained by term occurrence frequency when assigning importance scores to terms in documents. For example, consider the term "does" in an English corpus. Traditional models such as BM25 would assign a low impact for this term in all documents, since it occurs so frequently, a direct effect of inverse document frequency (IDF). On the other hand, learned models are trained to exploit the contextual information of a passage to assign term impacts, resulting in high importance scores for even the most common terms. Table 1 demonstrates this behavior for DeepImpact, where passages extracted from MSMARCO-v1 contain normalized impact scores of widely varying magnitudes. While DeepImpact assigns impacts from 0.91 (very important) all the way down to 0.02 (not important) for the term "does", the equivalent maximum impacts observed over our two BM25-based indexes (see Section 4) are 0.21 (BM25) and 0.004 (DocT5Query). Figure 1 further highlights the pervasive nature of this issue. Learned representations such as DeepImpact or uniCOIL assign high term importance to even very frequent terms (such as "does"), whereas BM25 always assigns low importance to such terms. This divergent behavior substantially reduces the ability of MaxScore and WAND to bypass low-impact documents during querying, with both techniques relying on maximum list-wise impact scores to prune the search space.

Impact  MSMARCO-v1 Passage
0.91    Does are the females in the deer family of mammals, individually called a doe (pronounced doe as in toe). The plural of doe can also be doe. Does is also the word meaning the present tense of the verb do. Pronounced duz as opposed to the pronunciation for the female deer.
0.71    You spelled it right in the question: does. The word does (performs action) is a verb, and the plural of the noun doe (female deer). Many people mix up two similar words: dose and does. Dose is a noun and is the amount of medication prescribed. Does is a verb, a form of to do.
0.59    Job Seekers The District of Columbia Department of Employment Services (DOES) was created to develop Jobs for People and People for Jobs. DOES provides job seekers with a number of employment opportunities through its American Job Centers.
0.50    Take your medication exactly as prescribed. Taking higher does of Benzedrine may cause a change in a person's sex drive, allergic reactions, chills, depression, irritability or mood swings, and problems with the digestive system.
0.02    when my dog passes wind its the worst smell ever. does anyone know how to stop it smelling so bad. Add your answer.

Table 1: Sample MSMARCO-v1 passages and the normalized impact assigned by DeepImpact to the word "does", which occurs in 61% of all passages. High impact scores can be assigned to common terms legitimately (based on the context of the term), or may be caused by misspellings, acronyms, or homonyms.
Contribution We adapt the MaxScore and WAND dynamic pruning mechanisms to enable efficient query processing for learned term importance schemes such as DeepImpact via a simple technique we call term impact decomposition. We describe partitioning schemes that separate the postings for each term into two groups: those that are high-impact, and are likely to result in documents being scored; and those that are low-impact, and more likely to be associated with bypassed documents. We also present a new form of impact decomposition that we call postings clipping. When integrated into the retrieval engine, impact decomposition allows almost 10× faster top-k term-based querying, with negligible increases to index storage costs, and no effect on result quality.

Background
Term-Based Similarity Many retrieval similarity formulations can be expressed as a sum over per-document query-term impacts, computed for document d and query Q as

    S(d, Q) = C(d) + Σ_{t ∈ Q} w_{t,d} ,    (1)

where C(d) is a static score component associated with document d, and w_{t,d} is the importance, or impact, of term t in document d (see, for example, Zobel and Moffat, 2006). The values of w_{t,d} might be pre-computed at indexing time and stored in the inverted index in quantized form, or might be computed via a function F() from raw index statistics such as term frequency and document length.
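To make the additive structure of Eqn. 1 concrete, here is a minimal Python sketch of the scoring computation; the dictionary-based `impacts` and `static_score` structures are illustrative stand-ins of our own, not part of any system described in this paper.

```python
# A minimal sketch of Eqn. 1: S(d, Q) = C(d) + sum of w_{t,d} over t in Q.
# The in-memory data layout below is assumed for illustration only.

def score(doc_id, query_terms, impacts, static_score):
    """Score one document as C(d) plus the sum of its query-term impacts."""
    s = static_score.get(doc_id, 0.0)                 # C(d)
    for t in query_terms:
        s += impacts.get(t, {}).get(doc_id, 0.0)      # w_{t,d}; 0 if t not in d
    return s

impacts = {"does": {7: 0.91, 12: 0.02}, "deer": {7: 0.44}}
print(score(7, ["does", "deer"], impacts, {}))        # 1.35
```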

Learned Sparse Models
The recent development of pre-trained contextualized language models (LMs) has resulted in impressive benefits in search effectiveness, albeit with higher retrieval cost than traditional lexical models (MacAvaney et al., 2019; Pradeep et al., 2021; Khattab and Zaharia, 2020). This has motivated recent work on making transformer-based ranking more efficient (Cohen et al., 2022; Karpukhin et al., 2020; Zhan et al., 2021). Different solutions have been proposed to address this performance bottleneck, including the application of approximate nearest neighbor search on dense representations (see, for example, Izacard et al., 2020; Zhan et al., 2022; Yamada et al., 2021). Another approach is to apply LMs to improve the effectiveness of inverted index-based "sparse" representations.
Document expansion is one such innovation. It uses LMs to predict expansion terms to add to each document, in an effort to address the vocabulary mismatch problem (Zhao, 2012), while still applying traditional scoring regimes like BM25. Currently, DocT5Query (Nogueira and Lin, 2019) and TILDE (Zhuang and Zuccon, 2021b,a) are the most effective expansion methods.
LMs can also be used to learn term importance directly. Early approaches such as DeepCT (Dai and Callan, 2019) learned an updated term frequency value which could be plugged into the existing ranking model. Other more effective approaches such as DeepImpact (Mallia et al., 2021), uniCOIL (Lin and Ma, 2021; Ma et al., 2022), TILDE (Zhuang and Zuccon, 2021a), and SPLADE (Formal et al., 2021b,a) have been devised which predict the impact of each term within each document (that is, they learn the value of w_{t,d} in Eqn. 1). These models are all tuned to optimize the downstream retrieval task, but differ in their vocabulary structures, document expansion techniques, and query expansion strategies. For example, DeepImpact first expands the documents in the collection via DocT5Query, and then directly estimates a single impact for each token in each document. The model is trained by directly optimizing the sum of the query term impacts to maximize the score difference between relevant and non-relevant documents for a query. Similar in spirit, uniCOIL also performs weighting on the query terms, such that document ranking becomes a weighted sum over term impacts.
We focus on DeepImpact and uniCOIL as effective learned representations, but our methods are also applicable to other learned sparse techniques.
Indexing An inverted index stores one postings list I_t for each distinct term t in the given text collection, with each postings list containing a sequence of postings of the form I_{t,i} = ⟨d_{t,i}, w_{t,i}⟩, where d_{t,i} is the document number of the i-th document containing t, and w_{t,i} can be taken to be the corresponding impact score (see Eqn. 1). These lists are normally stored in increasing document order, and compressed using integer compression techniques; see Zobel and Moffat (2006) and Pibiri and Venturini (2021) for examples and further explanation.

Figure 2: Dynamic pruning on a two-term query (term A, top left, and term B, top right) for top k = 2 retrieval. At the start of processing, the heap threshold θ is −∞. After processing the three documents shown in green (bottom) we have θ > U_A, and documents that contain only term A can be bypassed from now on.
Querying To retrieve the top-k highest scoring documents for a bag-of-terms query Q consisting of q = |Q| query terms, a document-at-a-time processing regime is often used (Tonellotto et al., 2018). All q postings lists are opened concurrently, each with a local cursor to step through the postings. Each document d that is encountered is fully scored at that time, and a min-heap is maintained of the k highest-scoring documents encountered so far. Once all q postings lists are exhausted, the k documents in the heap can either be directly presented to the user or passed to another processing phase for a more sophisticated similarity computation that re-ranks that initial answer set.
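The following Python sketch shows this exhaustive document-at-a-time regime, under the simplifying assumption that each postings list is an in-memory, document-ordered list of (docnum, impact) pairs; the function and variable names are ours, not PISA's.

```python
import heapq

def daat_topk(postings_lists, k):
    """Exhaustive DAAT: fully score every document that appears in any list."""
    cursors = [0] * len(postings_lists)           # one cursor per query term
    heap = []                                     # min-heap of (score, docnum)
    while True:
        # The next candidate is the smallest docnum under any live cursor.
        frontier = [pl[c][0] for pl, c in zip(postings_lists, cursors)
                    if c < len(pl)]
        if not frontier:
            break
        d = min(frontier)
        score = 0.0
        for i, pl in enumerate(postings_lists):   # fully score document d
            if cursors[i] < len(pl) and pl[cursors[i]][0] == d:
                score += pl[cursors[i]][1]
                cursors[i] += 1
        if len(heap) < k:
            heapq.heappush(heap, (score, d))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, d))
    return sorted(heap, reverse=True)

A = [(1, 3.0), (4, 1.0), (9, 2.0)]                # postings for term A
B = [(4, 5.0), (7, 4.0)]                          # postings for term B
print(daat_topk([A, B], k=2))                     # [(6.0, 4), (4.0, 7)]
```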
MaxScore Algorithm 1 provides details of document-at-a-time query processing, and also introduces the MaxScore dynamic pruning mechanism of Turtle and Flood (1995), structured in a manner that allows the development we propose in Section 3. In this description the U_t per-term upper bounds are a static attribute of the collection, established at indexing time, and I_{t,c[t]} is the "current" posting for term t, indicated by the cursor c[t], with each index list I_t ordered by increasing document number.

Algorithm 1 Standard MaxScore. Input is a set of q postings lists I_t, with I_{t,i} = ⟨d, w⟩ the docnum and impact score of the i-th posting for the t-th term; and a vector U_t = max_i{I_{t,i}.w}, the maximum impact for the t-th term.
1: active ← {0 . . . q − 1}    // active terms
2: passive ← { }               // passive terms
3: sum_pass ← 0                // sum of passive U_t values
4: heap ← { }                  // heap of "best so far"
   ⋮
   sum_pass ← sum_pass + U_y   // when term y joins passive

Figure 2 helps explain the pseudo-code. In the diagram a two-term query is being processed, consisting of postings for term A (top left) and term B (top right), and seeking the highest-scoring k = 2 documents. The index also records U_A and U_B, the maximum impact contributions of A and B across the collection. Once the first three documents in the union set of A and B have been scored, the k-th largest known document score, denoted by θ, is greater than U_A. After that point no further documents that contain term A alone need be considered; all candidates for scoring must contain B. In terms of Algorithm 1, term A is thus permanently moved from the active set to the passive set (steps 25-30) to record this change of status.
Algorithm 1 includes a number of subtleties. The ordering assumed at step 25 is constant, and computed once upon query commencement, rather than at each loop iteration. As well, steps 25 to 30, shown as executing after every document has been scored, can be carried out infrequently without affecting the correctness of the top-k result set. For example, they might trigger only every 100 or 1000 iterations of the main while loop at step 7.
The key invariant in Algorithm 1 is that contributions from passive terms alone cannot yield a document score large enough to make it into the current top-k answer set. That means that postings that appear only in passive postings lists can be bypassed, achieved at step 11 by the function SeekGEQ(I_t, d), which advances the cursor c[t] until a document number ≥ d is found in I_t. Processing terminates when all of the postings associated with the active terms have been consumed. At that time, the required top-k documents are all in the heap.
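Because the full listing of Algorithm 1 is not reproduced above, the following simplified Python sketch (our reconstruction, not the paper's pseudo-code or PISA's implementation) illustrates the essential MaxScore behavior: candidates are drawn only from active lists, passive cursors are advanced with a SeekGEQ-style seek, and a term becomes passive once the summed upper bounds of the prospective passive set cannot exceed the heap threshold θ.

```python
import heapq

def seek_geq(pl, c, d):
    """Advance cursor c over postings list pl to the first docnum >= d."""
    while c < len(pl) and pl[c][0] < d:
        c += 1
    return c

def maxscore_topk(postings_lists, k):
    """Rank-safe top-k with MaxScore-style bypassing of passive-only documents."""
    U = [max(w for _, w in pl) for pl in postings_lists]   # per-term upper bounds
    # Static term order; Section 3 argues for length (not U_t) as the sort key.
    order = sorted(range(len(postings_lists)),
                   key=lambda t: -len(postings_lists[t]))
    cursors = [0] * len(postings_lists)
    heap, theta, n_passive = [], float("-inf"), 0
    while True:
        # Candidate documents come from active lists only; documents that
        # appear solely in passive lists are bypassed.
        frontier = [postings_lists[t][cursors[t]][0] for t in order[n_passive:]
                    if cursors[t] < len(postings_lists[t])]
        if not frontier:
            break
        d = min(frontier)
        score = 0.0
        for t in order[n_passive:]:                        # active contributions
            pl = postings_lists[t]
            if cursors[t] < len(pl) and pl[cursors[t]][0] == d:
                score += pl[cursors[t]][1]
                cursors[t] += 1
        for t in order[:n_passive]:                        # passive: seek, then add
            pl = postings_lists[t]
            cursors[t] = seek_geq(pl, cursors[t], d)
            if cursors[t] < len(pl) and pl[cursors[t]][0] == d:
                score += pl[cursors[t]][1]
                cursors[t] += 1
        if len(heap) < k:
            heapq.heappush(heap, (score, d))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, d))
        if len(heap) == k:
            theta = heap[0][0]
        # Grow the passive set while its summed bounds cannot beat theta.
        while (n_passive < len(order) - 1 and
               sum(U[t] for t in order[:n_passive + 1]) <= theta):
            n_passive += 1
    return sorted(heap, reverse=True)

A = [(1, 3.0), (4, 1.0), (9, 2.0)]
B = [(4, 5.0), (7, 4.0)]
print(maxscore_topk([A, B], k=2))   # [(6.0, 4), (4.0, 7)]; docnum 9 is bypassed
```

On this two-list example the result matches the exhaustive DAAT snippet, but once term A becomes passive its final posting (document 9) is never scored.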
WAND The WAND dynamic pruning mechanism (Broder et al., 2003) makes use of similar logic. But instead of labeling entire terms as being active or passive, it constantly rearranges the list cursors according to their next documents, in effect treating individual postings as being passive or active. That means that it can be more flexible in determining which postings combinations might yield scores greater than θ, and hence is more discerning in terms of which documents need scoring. Those gains must be offset against the additional cost of maintaining the list cursors in sorted order. Petri et al. (2013) give pseudo-code for WAND pruning.
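As a companion sketch (again ours, under the same in-memory assumptions), WAND's central step chooses a pivot: the cursors are examined in order of their current docnum, and term upper bounds are accumulated until they exceed θ; no document before the pivot can possibly beat the threshold.

```python
def find_pivot(cursor_state, theta):
    """cursor_state: one (current_docnum, U_t) pair per unfinished term.
    Returns the smallest docnum whose prefix of upper bounds exceeds theta,
    or None when no remaining document can enter the top-k heap."""
    acc = 0.0
    for docnum, upper in sorted(cursor_state):     # sort by current docnum
        acc += upper
        if acc > theta:
            return docnum                          # the pivot document
    return None

print(find_pivot([(3, 1.0), (8, 2.5), (20, 4.0)], theta=3.0))   # 8
```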
Block-Max WAND Even more precise control over which documents need to be fully scored is achieved if localized upper bounds (denoted U_{t,b}) are used as well as whole-of-list U_t values. In the BlockMax-WAND (BMW) and Variable BlockMax-WAND (VBMW) approaches there are multiple U_{t,b} values stored for each postings list, each of which provides a localized maximum impact bound for a block of contiguous postings (Ding and Suel, 2011; Mallia et al., 2017; Mallia and Porciani, 2019). During querying, global U_t values are used to select a candidate document, and the U_{t,b} values are then used to refine the score estimate before checking whether the document should be scored or bypassed. Thus, storing these additional bounds allows more documents to be bypassed, albeit with increased processing required to handle the complex decision logic that arises, and additional space costs to store the localized bounds.
High-Impact List Segments and Priming Several authors have proposed explicitly or implicitly splitting postings lists into two (or more) parts, a high-impact segment H(t) and a low-impact segment L(t) to facilitate efficient processing; see, for example, Strohman and Croft (2007), Ding and Suel (2011), Daoud et al. (2016), Daoud et al. (2017), Kane and Tompa (2018) and Mackenzie et al. (2022a).
Another technique, known as priming (Kane and Tompa, 2018; Petri et al., 2019), improves query performance by estimating lower bounds on the final heap threshold θ: if the value of the k-th highest impact (or of the k′-th highest impact, for some k′ > k) for any of the q query terms is known, then the heap threshold θ can be initialized to the largest of those (up to) q values; it is certain that there will be k or more documents in the collection that score at least that highly, even in the absence of any term overlaps. Moreover, if those k′ high-impact postings are maintained as a separate postings list, then the q high-impact list segments can be resolved against each other before any low-impact postings are considered, which might further lift the value of θ used when the q low-impact postings lists are employed to finalize the query.

Impact Decomposition
This section introduces the notion of postings list splitting, and shows how it can be combined with both MaxScore and the WAND variants. We then introduce a new technique, postings clipping, which replicates the high-impact postings rather than separating them from the low-impact postings. It has the benefit of allowing more precise score estimations, and hence faster pruned querying.
List Splitting Ding and Suel (2011), and later Daoud et al. (2016) and Kane and Tompa (2018), note that each postings list I_t can be split into two parts, denoted here as H(t) and L(t), with H(t) containing the postings with the highest impacts for t, and L(t) containing all the remaining ones. Since H(t) and L(t) are disjoint, query processing algorithms can treat them as independent terms.
The top part of Figure 3 illustrates list splitting. The complete set of postings for some term t (left) is reduced from (in the example) 21 postings to 17 postings to form L(t), with the other 4 postings assigned to H(t); each of L(t) and H(t) is then indexed as an independent list. A number of splitting rules can be considered. For example, a fixed fraction of the original list might be taken; or the split could be based on local or global threshold scores. In this work, we take a fixed fraction, set at 1/64 (based on preliminary experimentation) and respecting quantized impact levels, so that |H(t)| is maximized subject to |H(t)| ≤ |I_t|/64, and also subject to the smallest impact in H(t) being greater than U_L(t). We also only apply splitting to lists with more than 256 postings, as short lists are always handled quickly.
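A minimal sketch of this splitting rule, under our usual (docnum, impact) list assumption: the cut level is the (⌊|I_t|/64⌋ + 1)-th largest impact, so that H(t) takes at most 1/64 of the postings and never splits a run of equal quantized impacts.

```python
def split_list(postings, fraction=64, min_len=256):
    """Split a document-ordered postings list into (L, H) as described above."""
    if len(postings) <= min_len:                  # short lists are never split
        return postings, []
    budget = len(postings) // fraction            # |H| <= |I_t| / fraction
    impacts = sorted((w for _, w in postings), reverse=True)
    threshold = impacts[budget]                   # largest impact allowed in L
    H = [p for p in postings if p[1] > threshold]
    L = [p for p in postings if p[1] <= threshold]
    return L, H

# Tiny demonstration with the defaults overridden:
L, H = split_list([(1, 9), (2, 3), (3, 9), (4, 1), (5, 3)], fraction=2, min_len=0)
print(L)   # [(2, 3), (4, 1), (5, 3)]
print(H)   # [(1, 9), (3, 9)]; every impact in H exceeds U_L = 3
```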
Where the impact score distribution of the postings is skewed and has a long tail, splitting results in reduced variance inside each part. The maximum term importance U_t stored for any list I_t is intended to approximate the distribution of the impacts of the postings in that list; hence storing two upper bounds, one for H(t) and one for L(t), allows a better approximation to the underlying distribution. Note also that list splitting is performed at indexing time and results in only a modest increase in index size. At query time, each term is mapped to (one or) two postings lists, with at most twice as many cursors to maintain, but the same total number of postings to be processed.
MaxScore, WAND, and BMW Our first observation is simply that MaxScore should be implemented so that the static ordering over terms assumed at step 25 of Algorithm 1 is by decreasing list length, rather than by the more usual increasing U_t, respecting the separation of these concepts that was noted above (that is, IDF is not obeyed by learned sparse models). The MaxScore pseudo-code presented earlier already shows this adaptation.
There are then a number of ways of proceeding when list splitting is considered. The simplest option is to ignore any knowledge of the list pairings, and allow a q-term query to be processed in the standard document-at-a-time manner over as many as 2q postings lists (Kane and Tompa, 2018). In terms of MaxScore, any combination of low- and high-impact lists might be in passive, with the remainder in active. However, the use of the U_t limits to decide whether a document that is in an active list should be scored remains valid: no document that might generate a similarity score greater than θ, and thus should get scored, will be bypassed. On the other hand, when the L(t) list for one of the terms is in passive (and because it is longer, it will enter earlier), only the postings in H(t) can now trigger a document scoring caused by term t, and hence there is a very real capacity for additional documents to be bypassed. Similar considerations arise with WAND and BMW: in all three processing modes, the mere act of splitting the lists introduces the possibility of accelerated query processing, without risking any loss in terms of answer set correctness.
As an orthogonal enhancement, priming can be applied whenever any high-impact list contains k or more postings, that is, whenever |H(t)| ≥ k. If that holds, then

    θ ← max_{t ∈ Q : |H(t)| ≥ k} (the k-th highest impact in H(t))    (2)

can be used as a priming value for the heap bound, without risking the integrity of the top-k answers.
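A sketch of this priming rule (Eqn. 2) over the high-impact segments, again assuming in-memory (docnum, impact) lists:

```python
def prime_threshold(high_lists, k):
    """Largest k-th-highest impact across the query's high-impact segments."""
    theta = float("-inf")
    for H in high_lists:
        if len(H) >= k:                    # only lists with |H(t)| >= k qualify
            kth = sorted((w for _, w in H), reverse=True)[k - 1]
            theta = max(theta, kth)
    return theta

# At least two documents contain this term with impact >= 3.0, so the heap
# threshold can safely start at 3.0 rather than -inf.
print(prime_threshold([[(1, 5.0), (4, 3.0), (9, 2.0)]], k=2))   # 3.0
```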
Next, if additional bookkeeping operations can be tolerated, it is also possible to compute what we denote as smart bounds. When the low-impact list for some term t first joins passive, the variable sum_pass is correctly increased by U_L(t). But if and when the partner term H(t) also joins passive, increasing sum_pass by U_H(t) is needlessly pessimistic, since no document can appear in both L(t) and H(t). Hence, the correct second increment associated with term t is U_H(t) − U_L(t). In the case of MaxScore, the corresponding smart bounds are easily computed, and are required only occasionally, when a postings list is moved from active to passive. However, for WAND and BMW the estimations must be modified much more frequently, and while smart bounds can certainly be computed, their benefit is less clear. One key part of the experimentation in Section 4 is to quantify the relationship between document scoring and bounds manipulation. Ding and Suel (2011) and Kane and Tompa (2018) also noted the idea of smart bounds estimation in their descriptions of list splitting, but they did not consider MaxScore-based processing.
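The bookkeeping itself is small; here is a sketch of the MaxScore-side increments, using our own names and assuming that L(t), being longer, always becomes passive before H(t):

```python
def passive_increment(part, U_L, U_H):
    """Amount to add to sum_pass when one half of a split list joins passive."""
    if part == "L":
        return U_L                # first increment: the low-impact bound
    # No document appears in both halves, so the pair's joint upper bound is
    # max(U_L, U_H) = U_H, and H(t) should add only the difference.
    return U_H - U_L

sum_pass = 0.0
sum_pass += passive_increment("L", U_L=2.0, U_H=5.0)   # L(t) joins: +2.0
sum_pass += passive_increment("H", U_L=2.0, U_H=5.0)   # H(t) joins: +3.0
print(sum_pass)                                        # 5.0, i.e. U_H, not U_L + U_H
```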
Postings Clipping Our additional proposal, denoted postings clipping, is illustrated in the bottom half of Figure 3. Rather than partitioning the set of postings in I_t across L(t) and H(t), every posting remains in L(t), and we "clip" the high-impact postings by slicing them into two parts, forming a posting pair. The base part remains in L(t) as a posting with an impact equal to U_L(t), the maximum score contribution permitted in L(t); and the second component of the pair becomes a new posting in H(t), to account for the "trimmed" part of the original impact value and retain the same total.
This arrangement has the singular advantage of no longer requiring any smart bounds management, or equivalent run-time manipulation of score estimates. Smart bounds are needed in the list splitting approach of Kane and Tompa (2018) to adjust for the constraint that no document can appear in both L(t) and H(t), and hence that U_L(t) + U_H(t) is an over-estimate (by an addend of U_L(t)) of t's true upper bound U_t. But with postings clipping, U_H(t) is instead set to the maximum residual amount across all of t's postings, and hence we have U_t = U_L(t) + U_H(t). In turn, that means that when queries are being processed, the lists L(t) and H(t) can be treated as if they were derived from completely independent terms, with all interactions between them handled by the underlying processing logic, be that MaxScore, WAND, or BMW.
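A minimal sketch of the clipping transformation, under the same assumptions as the earlier snippets; here `U_L` is the clip level (the largest impact permitted to remain in L(t)), and the residuals placed in H(t) guarantee that U_t = U_L(t) + U_H(t) by construction.

```python
def clip_list(postings, U_L):
    """Clip a document-ordered postings list: every posting stays in L (capped
    at U_L), and each trimmed residual becomes a new posting in H."""
    L, H = [], []
    for d, w in postings:
        if w > U_L:
            L.append((d, U_L))          # base part, capped at the clip level
            H.append((d, w - U_L))      # residual ("trimmed") part
        else:
            L.append((d, w))
    return L, H

L, H = clip_list([(1, 9), (2, 3), (3, 7), (4, 1)], U_L=3)
print(L)   # [(1, 3), (2, 3), (3, 3), (4, 1)]
print(H)   # [(1, 6), (3, 4)]
# max impact in L plus max impact in H is 3 + 6 = 9, the original U_t, so no
# smart-bounds correction is ever required.
```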
That is, while there are more total postings to be stored and processed, the change from list splitting with smart bounds to postings clipping substantially simplifies the query-time processing logic. Indeed, with the exception of priming, which can still be applied on the basis noted in Eqn. 2, a MaxScore-based postings clipping implementation remains exactly as shown by the logic provided in Algorithm 1. The result is that, as we demonstrate in Section 4, quite dramatic reductions in query processing times for learned sparse retrieval models can be achieved.
Figure 4 crystallizes the difference between list splitting and postings clipping. In the left pane (list splitting) the U_H(t) values rise as U_L(t) increases, plotted over the set of MSMARCO-v1 query terms; whereas in the right pane (postings clipping) U_H(t) becomes increasingly constrained as U_L(t) grows. The difference affects the pruning bounds estimation, and while it can be partially ameliorated by smart bounds adjustments, the postings clipping mechanism is more precise.

Experiments
We now describe experiments that quantify the benefits arising from the postings clipping approach. Our experiments make use of both the MSMARCO-v1 (8.8 million passages) and MSMARCO-v2 (138.4 million passages) collections, four representative ranking algorithms, and the PISA query processing system, which was recently shown to outperform the commonly used Anserini system for document-at-a-time retrieval over learned sparse indexes (Mackenzie et al., 2021). Full details of the experimental setup are provided in Appendix A.
Index Size Table 2 reports the space consumption of each index/model combination for both the default index and the index with postings clipping. Since clipping is applied only to postings lists with more than 256 elements, and even then only adds 1/64 as many new postings, the space overhead compared to the default index is negligible. For instance, the largest overhead, 600 MiB added to the ≈ 33 GiB index for the uniCOIL model on MSMARCO-v2, represents an increase of only 1.8%.

Table 4: Query processing times, all in average milliseconds per query, for the MSMARCO-v2 collection, four retrieval models, and three dynamic pruning approaches. The fastest time in each column is highlighted in blue, and the best of the three baseline approaches in each column is shown in black. The speedups in the last row are the ratio between the black and blue values in that column.

Query Speed
Table 3 presents query processing times recorded for the MSMARCO-v1 collection and DeepImpact retrieval, with response latency measured as average milliseconds per query, and with the three blocks of values corresponding to three dynamic query pruning approaches. Within each block, we systematically add heuristics. First to be added in the MaxScore block is static term ordering based on length rather than on maximum impact score; then the list splitting mechanism is added, with 1/64 of the postings in each list longer than 256 extracted and placed in the high-impact list H(t); then the application (where possible) of Eqn. 2 to set an initial heap threshold; then the further addition of smart bounds. Finally, the last row in each block shows the combination of postings clipping, again with 1/64 of the postings taken into H(t), in conjunction with priming (and length-based ordering for MaxScore). Both WAND and BMW apply the smart bounds adjustments during the pivoting step, and have no equivalent of the MaxScore static sorting step. The fastest query time in each of the six sections is highlighted in blue.
As can be seen, for DeepImpact retrieval the fastest approach in five of the six table sections is MaxScore pruning with postings clipping. That combination takes less than half the time of standard MaxScore processing. The gains from postings clipping are smaller for WAND and VBMW, in part because both algorithms exhibit greater sensitivity to the doubling of the number of query terms.
Table 4 then applies all four retrieval models to the large MSMARCO-v2 collection. The six rows correspond to the first and last rows in each block of Table 3, with the first row in each pair showing "standard" retrieval, applying an inverted index and a dynamic pruning method; the second row then compares that baseline against what can be achieved by postings clipping, priming using the same U_L(t) information that arises from the clipping, and (in the case of MaxScore) the matched static sorting. The best baseline in each column is shown in black, and the best overall time in each column in blue. Compared to a standard computation, postings clipping creates speedups of between two and nearly ten in connection with the two most expensive models, with MaxScore plus postings clipping being the best overall method for both k = 10 shallow retrieval and k = 1000 deep retrieval.
Retrieval Quality All of the enhancements investigated above are rank-safe. That is, the changes to the indexing structures and query processing regimes shown in Tables 3 and 4 do not degrade the quality of results compared to the unmodified algorithms, making the speedups even more attractive to search practitioners. Detailed effectiveness results for the four retrieval models are presented in Appendix B.

Conclusion and Future Work
To keep up with increasingly large volumes of data, search practitioners require sophisticated structures and processing algorithms, so that response times can remain acceptable. In this paper, we have demonstrated the speed benefits that arise through the use of a new technique we call postings clipping. We have established new benchmarks for querying speed, with minimal cost overheads, for both shallow k = 10 and deep k = 1000 retrieval. Our techniques can also be embedded as part of a multi-phase processing stack, and are applicable both to normal term-based search and to retrieval via learned sparse approaches.

Limitations
This paper modifies existing inverted index-based storage and query processing schemes to handle the different impact distributions produced by learned index representations. We have not explored how adjusting the training objective of models such as DeepImpact could produce impact distributions that directly target efficient query processing algorithms exploiting list upper bounds. Such approaches, if fruitful, would potentially mitigate the need for the techniques proposed in this paper.
Table 1 indicates that some of the latency problems arise from learned representations distinguishing between different semantic meanings of words, correctly assigning high importance to terms based on context. We have not explored incorporating these pre-index-construction insights into the proposed splitting and subsequent query processing schemes, and instead have relied solely on numeric impact values. It is possible that making splitting decisions in conjunction with the learning process might lead to even better outcomes.
Resource constraints have meant that we have restricted our investigation to the DeepImpact- and uniCOIL-based learned sparse representations. While we believe our techniques will provide similar benefits for other learned sparse retrieval techniques such as TILDE (Zhuang and Zuccon, 2021a) and SPLADE (Formal et al., 2021b,a), we have not explored those approaches as part of this work.
Finally, our investigation explored how split lists can be used to prime the initial heap threshold θ. Recent work has shown that more accurate predictions can further accelerate querying on traditional ranking models (Petri et al., 2019; Mallia et al., 2020). To determine whether these approaches translate to learned sparse models, we applied idealized "oracle" thresholds to our experimental framework (see Appendix B for details). While the results are promising (up to a 2.1× speedup over the best clipping results), we have not examined how such thresholds might be accurately predicted in practice.

Table 5: Query processing times, all in average milliseconds per query, for the MSMARCO-v1 collection, four retrieval models, and three dynamic pruning approaches, with the same structure and interpretation as Table 4. The fastest time in each column is highlighted in blue, and the best of the three baseline approaches in each column is shown in black. The speedups in the last row are the ratio between the black and blue values in that column.


A Experimental Setup

• uniCOIL (Lin and Ma, 2021) employs TILDE (Zhuang and Zuccon, 2021b,a) document expansion, and learns per-term weights according to a simplified (1-dimensional) COIL model (Gao et al., 2021). Unlike DeepImpact, uniCOIL also applies term weighting at query time, transforming bag-of-words queries into weighted queries (and resulting in weighted bag-of-words ranking; see Section 2).
The learned sparse models work with pre-quantized scores, and so we also pre-computed and quantized the BM25 and DocT5Query indexes into integer impact scores in the range [0, 255] using uniform quantization (Anh et al., 2001). All experimentation then involved computing document scores as (weighted) sums of impacts. Note also that DocT5Query, DeepImpact, and uniCOIL were all fine-tuned on the MSMARCO-v1 training data; those same models are then applied in a zero-shot manner to MSMARCO-v2.
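For reference, here is a sketch of one form of uniform quantization consistent with this description; the scheme of Anh et al. (2001) has several variants, so this particular mapping is an assumption of ours, not their precise definition.

```python
def quantize(scores, levels=256):
    """Map positive real-valued impacts uniformly onto integers in [0, levels-1]."""
    top = max(scores)                                     # global maximum impact
    return [min(levels - 1, int(levels * s / top)) for s in scores]

print(quantize([0.004, 0.21, 0.91]))   # [1, 59, 255]
```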
Setting Hyperparameters In order to decide the split points for use in our experimentation, we ran a preliminary experiment trying splits of 1/p for p ∈ {8, 16, 32, 64, 128, 256}, using the DeepImpact ranker and the MSMARCO-v1 dev queries. While all split values resulted in large efficiency improvements, p = 64 was the best choice. We then fixed p = 64 for all remaining collections and experiments, and did not further tune this value.
Reproducibility Our list splitting, clipping, and priming contributions were all implemented inside the C++ PISA framework; this modified version of PISA is available at https://github.com/jmmackenzie/postings-clipping. Scripts for downloading and pre-processing data, computing split points, building indexes, and running the experiments are also available in that repository to facilitate reproducibility. Our experimentation is all based on widely available datasets (Bajaj et al., 2018; Ma et al., 2022).

B Additional Measurements and Results
Query Speed Table 5 provides a set of timings for the MSMARCO-v1 collection, in the same format as was employed in Table 4. A similar pattern of behavior arises, demonstrating that our findings apply to both the smaller MSMARCO-v1 and the larger MSMARCO-v2 collections. Unsurprisingly, the observed speedup ratios for MSMARCO-v1 are typically less than those measured for MSMARCO-v2.
Effectiveness Table 6 presents effectiveness scores for the four retrieval models, as measured within our experimental framework. While the emphasis in this paper is on efficiency rather than effectiveness, it is interesting to note the strong improvements that the neurally augmented DocT5Query method and the two learned sparse methods obtain relative to the standard BM25 approach. Those substantial gains arise from a combination of document-level term expansion and the non-linear context-based relationships that are uncovered between term frequency and term impact. Our implementations achieve effectiveness scores similar to those previously reported for these three recent techniques; see, for example, Mackenzie et al. (2021) and Ma et al. (2022).

Table 6: Effectiveness of the different models on both collections, using the official metrics associated with each, and runs of length k = 1000.
Idealized Initial Thresholds Threshold estimation is a technique that improves the efficiency of query processing (Mallia et al., 2020; Petri et al., 2019). Like priming, it enables better skipping over unimportant documents during index traversal by providing an initial minimum threshold score that a document needs to obtain to be considered during ranking; unlike priming, however, various unsafe alternatives can be used for predicting initial thresholds. Table 7 demonstrates the potential speedup if the initial heap threshold for each query could (in an omniscient manner) be set at exactly the final score of the k-th most similar document; that is, if the priming process could be clairvoyant. The substantial difference in execution times achievable, up to 2.1× relative to the clipping runs shown in Table 4, indicates that more accurate initial threshold prediction mechanisms are a promising direction for further accelerating learned sparse retrieval.

Figure 1: Normalized maximum list impact distribution stratified by list-length buckets [2^b, 2^{b+1}). Classical schemes such as BM25 and DocT5Query exhibit small maximum impacts for long lists, whereas DeepImpact and uniCOIL assign high importance to terms regardless of their document frequency. Note the irregular scale on the horizontal axis.

Figure 4: Bounding scores for list splitting (left) and postings clipping (right) using DeepImpact, with U_H(t) plotted as a function of U_L(t) for the unique terms occurring in the MSMARCO-v1 queries.

Table 2: Index space requirement, in GiB, for default inverted indexes and for those with postings clipping. Results are shown for both collections and all four ranking models.

Table 3: Query processing times, all in average milliseconds per query, for the MSMARCO-v1 collection and the DeepImpact retrieval model. Algorithmic enhancements are cumulative, stepping down each of the three blocks in the table, except for postings clipping, which is an independent enhancement relative to smart bounds. Similar relativities were also observed for median query times, and for 90% and 99% tail latencies.


Table 7: Demonstrating the potential of accurate threshold estimation on the MSMARCO-v2 collection and the DeepImpact model, assuming clairvoyant pre-knowledge for each query. If the final heap threshold could be predicted accurately, further speedups are possible.