NAIL: Lexical Retrieval Indices with Efficient Non-Autoregressive Decoders

Neural document rerankers are extremely effective in terms of accuracy. However, the best models require dedicated hardware for serving, which is costly and often not feasible. To avoid this serving-time requirement, we present a method of capturing up to 86% of the gains of a Transformer cross-attention model with a lexicalized scoring function that requires only 10^-6% of the Transformer's FLOPs per document and can be served using commodity CPUs. When combined with a BM25 retriever, this approach matches the quality of a state-of-the-art dual-encoder retriever that still requires an accelerator for query encoding. We introduce NAIL (Non-Autoregressive Indexing with Language models), a model architecture that is compatible with recent encoder-decoder and decoder-only large language models, such as T5, GPT-3, and PaLM. This architecture can leverage existing pre-trained checkpoints and can be fine-tuned to efficiently construct document representations that do not require neural processing of queries.


INTRODUCTION
We attempt to answer the following question: to what extent can the computationally-intensive inference in modern neural retrieval systems be pushed entirely to indexing time?
Neural networks have revolutionized information retrieval, both with powerful reranking models that cross-attend to query and document, and with dual-encoder models that map queries and documents to a shared vector space, leveraging approximate nearest neighbor search for top-k retrieval. The strongest systems typically use a dual-encoder for retrieval followed by a cross-attention reranker to improve the ordering. However, both these components tend to be built on increasingly large Transformers [13, 14, 26, 31] and thus rely on dedicated accelerators to process queries quickly at serving time. In many application settings, this may be impractical or costly, and as we will show, potentially unnecessary.
In particular, we explore a retrieval paradigm where documents are indexed by predicted query token scores. As a result, scoring a query-document pair (q, d) simply involves looking up the scores for the tokens of q associated with d in the index. While the scores are predicted by a neural network, the lookup itself involves no neural network inference and so can be far faster than other approaches. However, this naturally means that there can be no cross-attention between a specific query and document, or even a globally learned semantic vector space. Given these shortcomings, it would seem surprising that such a model, which offloads all neural network computation to indexing time, could be a practical alternative to its more expensive neural counterparts.
In addition, while we want to make use of large pre-trained language models, which have been shown to generalize well over a number of language and retrieval tasks [1, 2, 26, 29, 34], a key challenge is that they have universally adopted a sequence-to-sequence architecture, which is not obviously compatible with precomputing query scores. Naive approaches are either computationally infeasible (scoring all possible queries) or rely on sampling a small, incomplete set of queries (as in Lewis et al. [22]).
To overcome this challenge, we introduce a novel use of a non-autoregressive decoder architecture that is compatible with existing Transformer-based language models (whether encoder-decoder or decoder-only [2]). It allows the model, in a single decode step, to score all vocabulary items in parallel. This makes document indexing with our model approximately as expensive as indexing with the document encoders used in recent dual-encoder retrieval systems [6, 14, 26]. We call the retrieval system based on this proposed model nail (Non-Autoregressive Indexing with Language models).
We summarize our contributions as follows: (1) We advance prior work on learned sparse retrieval by leveraging pretrained encoder-decoder LMs with a novel non-autoregressive decoder. (2) We describe a range of experiments using the BEIR benchmark [39] that explore the performance and efficiency of our model as a reranker and as a retriever compared with a variety of existing systems. As a reranker, nail can recover 86% of the performance of a large cross-attention reranker [27], while requiring 10^-6% of the inference-time FLOPS per query. As a retriever, nail has an extremely high upper bound for recall, exceeding the performance of all other retrievers in the zero-shot setting. Finally, by using BM25 as a retriever and nail as a reranker, we can match state-of-the-art dual-encoders [14, 26] with 10^-4% of the inference-time FLOPS. (3) We propose our model as a preferred solution when significant compute is available at indexing time, but not on-demand at serving time, and we provide a cost analysis that illustrates when our approach could be preferred to previous work that harnesses LLMs.

RELATED WORK
There has been much work in information retrieval leveraging neural networks in recent years, more than we can adequately cover in this paper. For a comprehensive overview, we refer the reader to the survey by Hambarde and Proença [12]. In this section we describe methods that minimize the use of expensive neural computation at query inference time, which are typically methods of sparse retrieval, focusing on those that leverage large language models.

LM-based Term Weighting
Bag-of-words models, such as TF-IDF and BM25 [36], use term weighting based on corpus statistics to determine the relevance of document terms to query terms. Our work can be seen as a way to construct document term weights that are both (1) unconditional with respect to the query, and (2) indexed using lexicalized features (specifically, we use a vector of token scores). As a result, this type of document representation can be precomputed (at indexing time) and does not require expensive computation at query time. Prior work on leveraging language models to produce such lexicalized term weighting can be roughly divided into two groups: those with just document-side encoders, and those with both query-side and document-side encoders.
Examples of the first group include DeepCT [4], DeepTR [43], DeepImpact [24], Tilde v2 [44], and Splade-doc [6]. These systems are examples of the model paradigm we are exploring, in which all neural network computation happens at indexing time. Our work can be seen as an attempt to update these systems (which use word2vec embeddings or encoder-only language models) to modern encoder-decoder architectures. Splade-doc is the most recent (and most performant) of these, so it is in many cases the most useful point of comparison for our work. We include results for the best version of Splade-doc [19].

LM-based Document Expansion
Another way to improve retrieval indices with the help of language models is to perform document expansion. This consists of augmenting a document with terms that do not occur in its original text but are likely to be useful for retrieval. When used in combination with a lexicalized retrieval index, document expansion can be implemented without additional query-time computational requirements. Recent examples of LM-based document expansion systems include Doc2Query [30] and Doc2Query-T5 [28].
Other forms of document expansion include the Probably Asked Questions database [22] which, via an expensive offline system, uses a generative language model to produce lists of questions for every document in the corpus.
We agree with Lin and Ma that document expansion typically improves the quality of retrieval systems, irrespective of the representation used [23]. Our approach, however, makes no assumptions about which terms should be used to index a document, allowing the model to score all tokens in the vocabulary.

Non-autoregressive decoders
Non-autoregressive sequence-to-sequence models have been previously proposed and studied, particularly in the context of machine translation [11, 20, 40], motivated by the computational complexity of standard autoregressive decoding, which requires a decode step per generated token. Non-autoregressive decoding breaks the inter-step dependency and thus provides two computational benefits: (1) a single step through the decoder can produce outputs for more than one position, and (2) computation can be easily parallelized since there are no time-wise dependencies between computations.
While these systems use non-autoregressive decoding to perform iterative generation of text, we know of no existing work that uses non-autoregressive decoding to produce document representations or for retrieval purposes.

NAIL MODEL
A major goal of this work is to investigate retrieval methods that forego neural computation, and the need for specialized accelerator hardware, at query time. As such, we focus on a method that uses a large neural model to precompute the required representations of the retrieval items (documents) ahead of time. Then, at retrieval time, the method performs only basic featurization (e.g., tokenization) of the queries.
Specifically, we investigate query-document scoring functions that score the compatibility of a query-document pair with the inner product of separate featurizations of the query, φ_q(q), and of the document, φ_d(d):

score(q, d) = ⟨φ_q(q), φ_d(d)⟩    (1)

This form is familiar from both traditional lexicalized retrieval and from more recent work on dense retrieval. In lexicalized retrieval (e.g., TF-IDF and BM25) [36, 37], φ_q and φ_d assign non-zero scores to sub-strings of q and d. In dense retrieval [14, 16, 26], φ_q and φ_d are neural networks that map q and d to dense vectors. Note that this formulation does not allow for deeper interactions between q and d, as in typical cross-encoder scorers, since these cannot be computed efficiently, and without an accelerator, at query time.

We investigate an alternative formulation of Equation 1 to both traditional lexicalized retrieval and dense retrieval. In this formulation, φ_d can be an arbitrarily complex neural network, but φ_q must be a sparse featurization that can be quickly computed on commodity CPUs. This way, it is possible to push all costly neural network inference to indexing time and avoid the need for accelerators at serving time. For this paper, we choose φ_q to be a simple tokenizer, but we believe that our results could also extend to more complex sparse featurizations.

Figure 1: Our model adapts the T5 encoder-decoder architecture to predict query token scores given an input passage. The encoder (a) reads an input passage prepended with a static prompt. The decoder (b) can be initialized from a pretrained T5 checkpoint, but the architecture is modified in a few ways to be non-autoregressive: the only inputs are the standard position embeddings, the decoding is parallelized for efficiency, and the output at each position is the full distribution over the vocabulary. Finally, we take a max over the position axis (c) to produce a vector of token scores corresponding to the multi-hot vector of tokens appearing in the target query.
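Concretely, the scoring form in Equation 1 can be sketched as follows. This is an illustrative sketch, not the paper's implementation: we assume φ_q is a tokenizer producing a multi-hot vector over the vocabulary, and φ_d is a precomputed vector of per-token document scores.

```python
import numpy as np

VOCAB_SIZE = 32000  # size of the T5 vocabulary used by nail

def phi_q(query_token_ids):
    """Multi-hot featurization of a query: 1.0 for each distinct token."""
    v = np.zeros(VOCAB_SIZE, dtype=np.float32)
    v[list(set(query_token_ids))] = 1.0
    return v

def score(query_token_ids, doc_token_scores):
    """Inner product of the query multi-hot vector with precomputed
    per-token document scores; equivalent to summing the stored score
    of each distinct query token, i.e., a CPU-friendly index lookup."""
    return float(phi_q(query_token_ids) @ doc_token_scores)
```

At query time this reduces to a few additions per document, with no neural network inference.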

Independent prediction of query tokens
Given the choice of φ_q described above, we need to learn a function φ_d that can assign high scores to tokens that are likely to occur in a query associated with the input document, and low scores to tokens that are unlikely to appear in such a query. This goal differs from related work on query prediction for document expansion [22, 29], where only a few likely query terms are added to the set of document terms.

Instead of aiming to predict a small number of queries that are related to d, we aim to predict a featurization of d that can be used to score any query. Given that an important motivation of this work is to make use of large pretrained language models, we must also investigate how best to adapt the sequence-to-sequence generative architecture that most such models have adopted [1, 2, 33, 34]. In particular, Transformer-based language models adopt an autoregressive decoding strategy, where the model predicts a single token position at a time, conditioned on the output of previous predictions. A naive decoding strategy of decoding every possible target query ahead of time is not computationally feasible, requiring 32,000^16 ≈ 10^72 decode steps.
How do we generate document representations, using a sequence-to-sequence architecture, in a computationally efficient way?

To do this, while also making use of pre-trained Transformer language models, we modify the decoder stack to support independent predictions of the output tokens (also known in the literature as non-autoregressive decoding [11, 20]). In addition, we modify the output of the model so that instead of generating a token sequence, it generates a sequence of score vectors over the vocabulary. We use this predicted sequence of score vectors as the representation of the document d in our system.

Our model architecture is illustrated in Figure 1. In this model, each output token is predicted independently from other output tokens, and is conditioned only on the input sequence and positional information. This allows the model to produce output for all positions in parallel. In addition, because the output representation is no longer a single token, but scores over the entire vocabulary, we can obtain a representation for scoring any possible query q in a single step of the decoder.
The nail model is based on the T5 architecture [34] and, for the experiments in Section 5, we start with pre-trained T5 checkpoints. There are several ways to use such a model to predict feature scores. nail uses the T5 vocabulary, consisting of 32,000 tokens, as its featurization. In order to quickly score all 32,000 tokens, we modify the baseline model in two ways: (1) The standard encoder-decoder model proceeds autoregressively, predicting the next token based on the previously predicted tokens. Each output token additionally conditions on a relative position embedding based on the current decode position. Here, instead, there are a fixed number of decode positions which all proceed simultaneously, conditioning only on the input and a fixed position embedding. (2) In both the standard T5 model and our adaptation of it, each token position outputs a distribution over the entire output vocabulary. Normally, this produces a single sequence of tokens by sampling or taking the maximum-probability token at each position. Here, we instead pool over all positions, taking the maximum score produced at any position for each token.
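The pooling in modification (2) can be sketched as below. This is a minimal illustration, not the actual model code, assuming the non-autoregressive decoder has already produced logits for all positions in one parallel step:

```python
import numpy as np

def pool_token_scores(decoder_logits):
    """Collapse per-position vocabulary distributions into a single
    score vector: for each vocabulary item, keep the maximum score
    produced at any decode position (step (c) in Figure 1).

    decoder_logits: array of shape [num_positions, vocab_size],
    produced by one parallel pass through the decoder.
    Returns an array of shape [vocab_size]."""
    return decoder_logits.max(axis=0)
```

The resulting vector is the document representation stored in the index.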
A simpler alternative would be to have the model decode for only a single position and then use the produced distribution as the scores for each token. However, we found that the model was able to represent a more diverse and better-performing distribution of query tokens when it could distribute its predictions over multiple output positions.

Contrastive training
Similar to previous work that has trained dual encoders for retrieval, we utilize negative training examples in order to do contrastive learning. In particular, we assume training data of the form D = {(q_0, d_0^+, D_0^-), ..., (q_N, d_N^+, D_N^-)}, made up of triples that associate a query q_i with a positive passage d_i^+ and a set of k negative passages D_i^- = {d_{i:0}^-, ..., d_{i:k}^-}. The negative passages generally represent passages that are related to the query but are poorer retrievals than the positive passage.
We train nail by assembling D into batches of m examples and calculating an in-batch softmax that includes both positive and negative passages from the batch [26]. Let b_j be a single batch of m examples and let P_j be the set of all positive and negative candidate passages in this batch. The per-example loss for a query q and positive passage d^+ drawn from batch b_j is

L(q, d^+) = -log [ exp(score(q, d^+)) / Σ_{d ∈ P_j} exp(score(q, d)) ]

and we train the model to incrementally minimize the per-batch loss, summed over all m examples in the batch. Note that the number of explicit negative passages can vary under this setup, as the positive passage for each query serves as an implicit negative passage for every other query. More details about the exact training setup are given in the following section.
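The per-example in-batch softmax loss above can be sketched as follows; a hedged illustration, assuming `scores` holds score(q, d) for every candidate passage d in the batch:

```python
import numpy as np

def in_batch_softmax_loss(scores, positive_idx):
    """Per-example contrastive loss: negative log-softmax of the
    positive passage's score against all candidate passages in the
    batch (the query's own hard negatives plus every other query's
    positives and negatives).

    scores: shape [num_candidates], score(q, d) for each candidate d.
    positive_idx: index of the positive passage d+ in `scores`."""
    m = scores.max()  # subtract max for numerical stability
    log_z = np.log(np.sum(np.exp(scores - m))) + m
    return float(log_z - scores[positive_idx])
```

Summing this quantity over the batch gives the per-batch loss minimized during training.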

EXPERIMENTAL SETUP AND MODEL TRAINING
We describe the procedures for model training and evaluation in this section. To train the nail model, we have empirically found it beneficial to perform two stages of training: (1) a pre-training stage that uses self-supervised tasks over a large, unlabeled text corpus, and (2) a fine-tuning stage that relies on question-answering data with explicit hard negatives. We present the details of each of the training stages in Sections 4.1 and 4.2.
Our model is implemented within the T5X framework [35] and, in all experiments, we initialize model weights with published T5.1.1 checkpoints [34], which have been pre-trained for approximately one million steps on a span-corruption language modeling task. Our models were trained on 64 chips of the TPU v4 accelerator. Unless otherwise noted, the nail model size used in the experiments is XL, with roughly 3 billion parameters; we saw no further gains from increasing the parameter count.
In order to be compatible with existing T5 checkpoints, we also adopt the T5 vocabulary and its attendant SentencePiece tokenizer [18]. The vocabulary consists of roughly 32,000 tokens extracted from an English-focused split of Common Crawl.

Pre-training
For pretraining, we combine two closely related self-supervision tasks for retrieval: the inverse cloze task and independent cropping [14, 21]. Both of these tasks take in a passage from a document and generate a pair of spans of text, forming a positive example. One of the generated spans serves as a pseudo-query and the other serves as a pseudo-passage. In the case of independent cropping, two contiguous spans of text are sampled from the passage. As the spans are selected in a conditionally independent way, overlaps between them are possible. For the inverse cloze task, a contiguous span is initially selected from the passage, forming a pseudo-query. The second span encompasses the remainder of the passage, with the sub-sequence selected in the first span omitted.
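The two span-pair generators can be sketched as below. This is a simplified illustration over token lists; the span length is a hypothetical parameter, and the exact sampling details may differ from our implementation.

```python
import random

def independent_cropping(tokens, span_len=64):
    """Sample two contiguous spans independently; because the starts
    are drawn independently, the spans may overlap."""
    def crop():
        start = random.randrange(max(1, len(tokens) - span_len + 1))
        return tokens[start:start + span_len]
    return crop(), crop()  # (pseudo-query, pseudo-passage)

def inverse_cloze(tokens, span_len=64):
    """Pseudo-query is one contiguous span; pseudo-passage is the rest
    of the passage with that span removed."""
    start = random.randrange(max(1, len(tokens) - span_len + 1))
    query = tokens[start:start + span_len]
    passage = tokens[:start] + tokens[start + span_len:]
    return query, passage
```

Each generated pair forms a positive example; other pairs in the batch serve as random negatives.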
In both tasks, we use the C4 corpus [34], a cleaned version of Common Crawl's web crawl corpus. In each training batch, half of the examples are from the independent cropping task and the other half are from the inverse cloze task. In addition, each target has a single correct corresponding input, and all other inputs serve as (random) negatives.
We found this pre-training to be very important for calibrating language model scores to lexical retrieval scores. One possible reason is that while highly frequent words (stop words) typically have a high score under LMs, they are known to be insignificant or harmful in ranking retrievals, independent of the context or inputs in which they occur. Additional discussion of the need for pre-training can be found in Section 6.2. We run pre-training for 500k steps on batches of 2048 items, the largest size we are able to fit into accelerator memory.

Fine-tuning
We finetune our model on the MS-MARCO dataset [25]. It consists of roughly 500,000 queries, each with a corresponding set of gold passages (typically one per query), along with a set of 1,000 negative passages produced by running a BM25 system over the full corpus of 8.8M passages. For constructing training batches, we use the gold passage as the positive, along with a small sample of the BM25 candidate passages as hard negatives.
As we will discuss in Section 6.1, the performance on the MS-MARCO and BEIR evaluation sets differs based on the number of hard negatives included with each example in the batch. Each example contains between 3 and 63 hard negative passages and one positive passage; as in pre-training, each batch consists of 2048 total passages and thus a variable number of queries. For instance, with 64 passages per example, each batch has 32 queries; with 4 passages per example, each batch has 512 queries. As with pre-training, the other passages in the batch also serve as random negatives.
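The arithmetic relating the fixed passage budget to the number of queries per batch can be sketched as:

```python
def batch_composition(total_passages=2048, hard_negatives_per_query=63):
    """Each training example contributes one positive passage plus its
    hard negatives, so the number of queries per batch is the passage
    budget divided by (1 + hard negatives per query)."""
    passages_per_query = 1 + hard_negatives_per_query
    return total_passages // passages_per_query

# With 63 hard negatives, each query occupies 64 passages: 32 queries/batch.
# With 3 hard negatives, each query occupies 4 passages: 512 queries/batch.
```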
We used early stopping for checkpoint selection, using as the criterion the loss value on a held-out set from the MS-MARCO development slice of the data. In most cases, this required fewer than 30K steps. As this amounts to multiple epochs, we use different hard negative passages for each example in each epoch. These are sampled according to their score, overrepresenting higher-scoring passages.

Evaluation Methodology
For evaluation, we focus on the public, readily available datasets in the BEIR [39] suite that have baseline numbers present in the leaderboard, totaling 12 distinct datasets. We specifically target BEIR since it contains a heterogeneous set of retrieval datasets and, equally importantly, evaluates these datasets in a zero-shot setting. While neural models have made huge gains over BM25 on in-domain data, BEIR shows that a variety of neural retrievers underperform relative to BM25 on out-of-domain data.
BEIR results are typically presented as two separate tasks, where most systems are only evaluated on either the reranking variant or the full retrieval variant. In the full retrieval variant, systems must retrieve over the provided corpus of document passages, which ranges from a few thousand to a few million, and they are generally evaluated on both their recall@100 and their nDCG@10 [15], providing a view into their ability to retrieve the gold passages into the top 100 and the ordering of the top ten passages, respectively. In the reranking variant, models do not have to do retrieval, and the recall@100 is fixed to the performance of an off-the-shelf BM25 system, so only nDCG@10 is reported.
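For reference, the two metrics can be sketched as below. This is a minimal illustration using linear gain for nDCG, which is one common variant; evaluation toolkits may differ in details such as the gain function.

```python
import math

def ndcg_at_k(ranked_relevances, k=10):
    """nDCG@k: discounted cumulative gain of the top-k ranking,
    normalized by the DCG of the ideal (relevance-sorted) ranking."""
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal if ideal > 0 else 0.0

def recall_at_k(retrieved_ids, gold_ids, k=100):
    """Fraction of gold passages present in the top-k retrievals."""
    return len(set(retrieved_ids[:k]) & set(gold_ids)) / len(gold_ids)
```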

EXPERIMENTAL EVALUATION
In this section, we compare the proposed nail system to other systems that have published results on BEIR. To compare with some sparse systems that have not been evaluated on BEIR datasets, we also make use of the MS-MARCO passage ranking task. We focus on answering the following questions:
• How does nail perform as a reranker, particularly when compared to much more expensive neural reranker systems?
• What is the potential for using nail for full corpus retrieval? Can nail representations be effectively sparsified?
• How does nail compare to recent term weighting retrieval systems that use neural or language models?
• How does nail compare with a similarly trained dual-encoder system that uses an expensive query-side encoder?

Table 2: Reranking results. Note that we use the BM25 candidates from the ElasticSearch system. Results for all systems excluding nail are copied from the BEIR reranking leaderboard.

Reranking
In the reranking BEIR task, each system must rerank the 100 passages returned by an off-the-shelf BM25 system.
Baselines In this section we divide approaches into two types of systems: lexical-based approaches and cross-encoders. In the cross-encoder category, we compare to MonoT5-3B [27] and MiniLM-L6. MiniLM-L6 is a BERT-based model trained on MS-MARCO using a cross-encoder classifier. MonoT5-3B uses a T5-based model fine-tuned on MS-MARCO, using a generative loss for reranking.
Results Table 2 shows the reranking results. The baseline comparison for nail's performance here is BM25 alone: using BM25 without a reranker is the only other method that does not need to run a neural network for each query. We see that nail improves over BM25 fairly consistently. The improvement on MS-MARCO, which has in-domain training data, is especially striking. On BEIR, nail improves performance on 10 out of the 12 datasets, increasing the average score by over 5 points.
While cross-encoder models are much more powerful, they are also much more expensive: they have to run inference on all 100 documents for each query. Thus, nail uses 8 to 9 orders of magnitude fewer FLOPS than the cross-encoder models, corresponding to almost 1 trillion fewer FLOPS for a single query. Moreover, nail significantly closes the gap between the BM25 baseline and the top-performing cross-encoder rerankers, capturing 86% of the gains on MS-MARCO and 45% of the gains on the broader suite of BEIR tasks. Thus, it presents an attractive alternative to expensive rerankers when compute is limited.

Full Corpus Retrieval
In the full corpus retrieval task, each system must retrieve and rank over each dataset's entire passage corpus. Because nail is very cheap to run as a reranker, it is reasonable to compare the BM25+nail results from Section 5.1 to direct retrieval systems that do not include a reranking step, but typically consume many orders of magnitude more FLOPs at query time. Table 3 presents this comparison.
As nail could be used to populate an inverted index, we investigate how well nail works when scoring all candidates in the corpus, which is an upper bound for a nail-only retrieval system. These results are presented as nail-exh in Table 3.

Table 3: BEIR nDCG@10 and recall@100 results on the full retrieval task. The SPLADE-doc+ results are previously unpublished, corresponding to the model described in [19], and were obtained via correspondence with the authors. Other numbers are obtained from their respective publications.
We later present a brief investigation into the effect of sparsification of the nail output, to further understand the potential for using nail to populate a sparse inverted index for retrieval.
Baselines For full retrieval, we compare nail to lexical-based and dual-encoder systems.
GTR-XXL [26] is one of the largest and best-performing dual-encoder systems publicly available. It is pre-trained on a large, non-public corpus of 2 billion question-answer pairs scraped from the web, and fine-tuned on MS-MARCO. Contriever is another dual-encoder system, which employs a novel self-supervised pretraining task [14] and is also fine-tuned on MS-MARCO; we describe it in more detail in Section 5.4.
SPLADE v2 [6] develops a query encoder and a document encoder that produce sparse representations, differing from dense dual-encoder systems. The query and document representations in SPLADE v2 are used for slightly different objectives: the query encoder performs query expansion into the BERT vocabulary, and the document encoder produces sparse document representations for indexing. This system is trained via distillation from a cross-encoder reranker, and finally fine-tuned on MS-MARCO.
ColBERT v2 adopts a late-interaction model that produces multi-vector representations for both queries and documents. In this model, per-token affinities between query and document tokens are scored using per-token representations. This model is also trained via distillation from a cross-encoder reranker.
Besides BM25 and nail, SPLADE-doc+ is the only other retriever that does not require neural network inference at query time. This model is a variant of SPLADE v2 in which the query encoder is dropped and only the document encoder is used [19]. As with SPLADE v2, SPLADE-doc+ is trained using distillation from a cross-encoder reranker, with additional fine-tuning on MS-MARCO.
Results Table 3 shows the results for nDCG@10 and recall@100 on BEIR full corpus retrieval for all systems that report them. We stratify the results into two sets: (1) MS-MARCO, which, with the exception of BM25, is used as a training dataset, and (2) the average over all the other BEIR datasets, which are evaluated zero-shot.
On the out-of-domain BEIR tasks, BM25+nail beats all but one of the neural retrieval systems, despite not needing to encode the query with a neural network at query time and being limited in recall to BM25. Additionally, we note that nail-exh outperforms all other retrieval systems on the recall@100 metric, suggesting potential for a nail-based retriever that uses nail to populate an inverted index. However, given its lower nDCG@10 than BM25+nail, this may only be worthwhile to implement if combined with a different reranker. Note that while this recall@100 result is highest for nail on the out-of-domain BEIR tasks, nail does worse than other models like GTR-XXL on the in-domain MS-MARCO task. This is likely due, in part, to the training recipes used by other work to optimize for MS-MARCO performance, including model distillation and large non-public corpora of QA pairs. When comparing to the other system that does not require query-time use of an encoder, SPLADE-doc, we observe that nail lags behind on the in-domain evaluation, but outperforms SPLADE-doc on both metrics of the zero-shot datasets in BEIR. As with many of the other retrievers, the SPLADE-doc model was distilled from a cross-attention reranker teacher trained on MS-MARCO, which may account for this in-domain gain in performance [7, 9].
Sparsification To further explore the potential for using nail for full retrieval, we experiment with a naive approach to sparsifying nail document representations. Specifically, we simply order tokens by their scores and keep the top-k scoring tokens.
Figure 2 demonstrates the effect on the recall@100 metric of reducing the number of terms per document from the original vocabulary of 32 thousand tokens down to 100 tokens. For both MS-MARCO and the other BEIR datasets, recall@100 falls considerably when using only the top 100 tokens. Nonetheless, with only two thousand tokens we are able to maintain the same level of performance for MS-MARCO and roughly 97% of the recall performance on BEIR. This observation, along with the results in Table 3, suggests that nail could be used to populate an efficient inverted index for retrieval with little loss of recall. Such an index could serve as a more powerful alternative to BM25. We leave this to future work.
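The naive top-k sparsification described above can be sketched as follows; an illustrative version, not our exact implementation:

```python
import numpy as np

def sparsify_top_k(token_scores, k=2000):
    """Keep only the k highest-scoring vocabulary tokens of a document
    representation, dropping the rest; returns (token_ids, scores) in
    descending score order, suitable for inverted-index posting lists."""
    top_ids = np.argpartition(-token_scores, k)[:k]  # unordered top-k
    order = np.argsort(-token_scores[top_ids])       # sort descending
    top_ids = top_ids[order]
    return top_ids, token_scores[top_ids]
```

Only the retained (token, score) pairs would need to be stored in the index.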

Comparison to Term Weighting Models
In this work we are primarily interested in zero-shot retrieval evaluations, which is why we focus on BEIR. However, there are a few recently proposed retrieval systems that also make use of language models to compute term weights. In this section, we compare these systems to nail using the MS-MARCO passage retrieval task. The metrics typically used in this task are the mean reciprocal rank with a cutoff of 10 (MRR@10), measuring the ranking of the top ten results, and recall@1000. Table 4 contains the results. For nail, we report both the version which uses BM25 retrievals (in that case, the recall metric is derived from the BM25 system) and the system described in the previous section which uses exhaustive scoring. The results demonstrate that both nail-exh and BM25+nail outperform the other term weighting models on the MRR@10 metric for the MS-MARCO passage ranking task. With respect to the recall metric, nail-exh clearly improves over the previous systems. Exhaustive scoring is much more expensive than the other systems shown; however, given the sparsification results shown in Figure 2, we believe a sparse version of nail would be competitive with the models presented.

Table 4: Evaluation on the MS-MARCO dev set for the passage ranking task. Numbers reported are taken from corresponding publications. Except for the DeepImpact system, all results are obtained with no document expansion, while DeepImpact results include doc2query-T5 [28] document expansion terms.

Table 5: We obtain Contriever reranking performance by using their released model and ranking the same set of BM25 candidates as nail. The average BEIR nDCG@10 does not include MS-MARCO.

Comparison to Contriever
There are several confounding factors in comparing the systems presented in Tables 2 and 3. As mentioned, each system uses different training recipes and training data, while also having slightly different architectures. Training techniques used by the baselines in this work include unsupervised pretraining, hard-negative mining, and distillation from a cross-attention teacher. These factors can make it difficult to pinpoint the cause of the variance in performance across models.
However, nail and Contriever [14] share training recipes to a large extent, both having a similar pretraining stage followed by fine-tuning on MS-MARCO. Contriever is a recently introduced dual-encoder model that inspired the pretraining task in this work. Architecturally, however, nail and Contriever are quite different: nail's query representation is not learned and is tied to the fixed set of vocabulary terms; this approach is potentially less powerful than a fully learned dense representation.
A summary of the comparison is available in Table 5. We observe that on the BEIR reranking task, nail matches both the in-domain and zero-shot performance of the Contriever model, despite lacking a query-time neural network. Without using BM25 for initial retrievals, both methods perform slightly worse on nDCG@10 for the zero-shot BEIR tasks, but they remain comparable.

Figure 3: Improvement over BM25 versus extra FLOPS to score one query on the BEIR retrieval task. nail and MonoT5 use BM25 retrievals; SPLADE-v2 uses its own retrievals over the full corpus. Note that the vast majority of the computation for SPLADE and dual encoders is in encoding the query; reranking BM25 retrievals would not reduce computation.

Performance versus query-time FLOPS
We have motivated this work by asking how much large language models can be leveraged at indexing time while keeping query-time computational costs small enough for a commodity CPU. As the results in this section show, there are tradeoffs between reranking accuracy improvements and computational cost. To illustrate this tradeoff, we plot the percentage nDCG@10 improvement over BM25 against query-time FLOPS in Figure 3. In general, we think lexicalized approaches like nail provide another interesting point on this curve, where much higher performance than BM25 can be achieved for only slightly more compute. Note that Lassance and Clinchant [19] discuss smaller versions of Splade; see Table 1 for the approximate reduction.

ALTERNATE TRAINING RECIPES
Our primary goal has been to determine the extent to which the performance of an expensive neural network can be captured in a fast, sparse featurization for general-purpose retrieval. Accordingly, we have prioritized a training recipe that is aligned with previous work and well suited to the multi-domain BEIR task. However, the performance of learned retrievers as rerankers is very sensitive to the exact nature of the training recipe, and in this section we analyze the choices we made and the associated trade-offs in BEIR and MS-MARCO performance.

Effects of hard-negative selection during fine-tuning
One key choice in contrastive learning is the distribution of negative examples used in Equation 2. This is commonly a combination of hard negatives, which are chosen to be challenging for a single example, and batch negatives, which are drawn from the distribution of all positive and hard-negative candidates across training examples [16,32,41]. Our pretraining task (described in Section 4.1) does not use hard negatives; however, the MS-MARCO fine-tuning task includes hard negatives created by running BM25 retrieval over the set of candidate passages. Table 6 shows how BEIR and MS-MARCO results change as we vary the number of MS-MARCO hard negatives sampled during fine-tuning. As this number increases, MS-MARCO performance also increases, until it matches the performance of the cross-attention rerankers in Table 2 when 63 hard negatives are sampled for each training example. However, increasing the number of MS-MARCO hard negatives also hurts BEIR performance.
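As a rough illustration, a contrastive objective over one positive and a pool of hard and batch negatives can be written as a softmax cross-entropy over candidate scores. The sketch below is generic, not the exact form of Equation 2, and the scores are toy values:

```python
import numpy as np

def contrastive_loss(scores, positive_index=0):
    """Softmax cross-entropy over one positive and many negatives.

    `scores` holds the model's score for the positive candidate followed
    by the scores of hard negatives and batch negatives for one example.
    """
    scores = np.asarray(scores, dtype=np.float64)
    # Numerically stable log-sum-exp over all candidates.
    m = scores.max()
    log_z = m + np.log(np.exp(scores - m).sum())
    return log_z - scores[positive_index]

# One positive, two hard negatives, two batch negatives (toy values).
loss = contrastive_loss([4.0, 2.5, 1.0, 0.5, -1.0])
```

Raising the number of hard negatives per example enlarges this candidate pool, sharpening the in-domain decision boundary at the risk of overfitting to the MS-MARCO negative distribution.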

Effects of pretraining and fine-tuning
The training recipe, presented in Section 4.1, has two stages beyond the language-model training from Raffel et al. [34]. Table 7 shows that both stages benefit both the BEIR and MS-MARCO results. However, NAIL still yields a substantial improvement over BM25 across the BEIR tasks using only the pretraining task. This is encouraging because the pretraining data are heuristically generated rather than relying on human relevance labels, so the task can be trivially applied to new domains. The MS-MARCO results are, unsurprisingly, more dependent on fine-tuning on MS-MARCO: pretrained nail does not outperform BM25 on MS-MARCO without fine-tuning. More sophisticated methods of synthetic training-data generation, such as Promptagator [5], could also help improve nail further, but we leave this to future work.

QUALITATIVE ANALYSIS
In this section, we present a qualitative analysis of the tokens that score highest according to the nail model for a given input. We choose the Natural Questions (NQ) subset of the BEIR benchmark for this analysis, as the queries tend to be complete questions that are easily interpretable. Table 8 shows the percentage of nail's top predicted tokens that appear in the passage input to the nail model, along with the gold query that is paired with this passage in the NQ development set. Figure 4 presents the top predicted terms for a randomly sampled set of passages. Almost all of the tokens in both the input passages and the unseen query are present in nail's top 1000 predictions (Table 8). However, tuning towards MS-MARCO significantly increases the number of query tokens predicted in the top 100 and 1000 positions, while simultaneously reducing the number of passage tokens predicted. This is unsurprising: the fine-tuning stage represents a domain shift from the pretraining task, which predicts document tokens, toward predicting query tokens. One indication of this shift is the increase in the prevalence of 'wh' words (what, who, where) among the top terms from the fine-tuned model in Figure 4.
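The analysis above ranks every vocabulary token by the score the model assigns it for a passage. A minimal sketch of extracting the top-k predicted tokens from such a score vector (the vocabulary and weights below are made up for illustration, not actual model outputs):

```python
import numpy as np

# Toy vocabulary and a hypothetical per-token score vector that a
# nail-style model might emit for one passage at indexing time.
vocab = ["who", "sang", "eagle", "album", "1975", "the", "released"]
token_scores = np.array([2.1, 1.8, 2.5, 1.2, 0.4, -0.3, 1.5])

def top_predicted_tokens(scores, vocab, k=3):
    """Return the k vocabulary tokens with the highest predicted scores."""
    order = np.argsort(-scores)[:k]
    return [vocab[i] for i in order]

print(top_predicted_tokens(token_scores, vocab))  # ['eagle', 'who', 'sang']
```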
Figure 4 also illustrates some other interesting shifts in nail's output during fine-tuning. For example, in Example (3) the pretrained model predicts many dates associated with the Eagles (e.g., album release years). These are likely to occur in adjacent passages in the same document as the input passage, so they are good predictions for the pre-training task (Section 4.1). However, they are very unlikely to occur in queries associated with the input passage, and thus they are replaced in the fine-tuned predictions with terms that are more likely to occur in queries targeting the passage ('sang', 'sing', 'wrote', 'who', 'released').
Figure 4 also illustrates nail's ability to predict the type of query that is likely to be paired with a given passage. Passages containing definitions, such as the one presented in Example (1), are highly associated with the wh-word 'what'. On the other hand, passages about individuals or groups of individuals (Examples (3) and (4)) are more highly associated with 'who'.
Finally, the predicted terms in Figure 4 contain many small surface-form variations of the same root word, with different segmentations and capitalizations treated separately by the query tokenizer. For example, the tokens 'chic', 'chi', 'CHI', 'Ch', 'ch', 'CH' in Example (2) vary only in segmentation and capitalization: the featurization does not abstract over diverse surface forms. Future work could examine more efficient and discriminative featurizations than the tokenization used in this work.

Figure 4: Sample of top token predictions from the pre-trained-only and pre-trained+fine-tuned nail models, for a few evaluation exemplars from the Natural Questions evaluation set included in BEIR. We display the question associated with each answer passage for the benefit of the reader, but it is not shown to the model. We have explicitly removed stop words and non-words (control sequences). Note that due to the use of the SentencePiece tokenizer [18], tokens do not necessarily correspond to full words.

CONCLUDING REMARKS
We introduce a new model for sparse, lexicalized retrieval, called nail. With nail, we are able to adapt expensive pretrained sequence-to-sequence language models that use Transformer architectures (e.g., T5, PaLM, GPT-3) for document indexing. The main elements of nail are (1) the use of a non-autoregressive decoder, and (2) the use of vocabulary-based representations for documents and queries. We train nail using a query prediction task, finding that pretraining on self-supervised retrieval is critical for good performance.
With nail, we study the tradeoffs of offloading expensive neural computation entirely to indexing time, allowing serving to operate cheaply and without accelerators. Evaluating retrieval on BEIR, we show that the nail approach is as effective as recent dual-encoder systems and captures up to 86% of the performance gains of a cross-attention model on MS-MARCO, while serving requests on commodity CPUs.
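The serving-time cost is low because, once document term weights are precomputed at indexing time, scoring a query reduces to a few sparse lookups per query token. A minimal sketch of this kind of lexical scoring (the index contents and weights are illustrative, not actual nail outputs):

```python
# Precomputed at indexing time: a sparse token -> weight map per document
# (the weights here are illustrative, not actual nail outputs).
index = {
    "doc1": {"fishin": 3.2, "dark": 2.7, "band": 1.9, "song": 1.1},
    "doc2": {"eagle": 2.8, "album": 2.0, "song": 1.4},
}

def score(query_tokens, doc_weights):
    """Lexical score: sum of the document's weights for the query tokens."""
    return sum(doc_weights.get(t, 0.0) for t in query_tokens)

query = ["who", "sang", "fishin", "dark"]
ranked = sorted(index, key=lambda d: score(query, index[d]), reverse=True)
# doc1 scores 3.2 + 2.7 = 5.9; doc2 scores 0.0
```

In practice such a map would be served from an inverted index, so only documents sharing at least one token with the query are touched.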

Figure 2: Effect of sparsification of the document representation on recall@100, using a top-k strategy.

Table 1: Estimated FLOPS required to score a (query, document) pair [3]. Note that for dual-encoder and lexical systems, document representations are precomputed. The query is assumed to be of length 16 tokens, and the document of length 128 tokens. Note that the standard versions of Splade-v2 and Contriever are based on BERT-base.

Table 2: BEIR results on the reranking task (top 100 results from BM25).

Table 5: Comparison of Contriever and nail on BEIR and MS-MARCO.

Table 6: Effect of varying the number of hard negatives on the reranking evaluation for MS-MARCO and BEIR. The BEIR average is computed without MS-MARCO.

Table 7: Effect of pretraining on nail for the BEIR reranking task. The BEIR nDCG@10 metric is the average score over datasets, excluding MS-MARCO.

Table 8: Percent of NQ query and gold-passage tokens contained in the top 100 and 1000 scores from nail.