CoRT: Complementary Rankings from Transformers

Many recent approaches towards neural information retrieval mitigate their computational costs by using a multi-stage ranking pipeline. In the first stage, a number of potentially relevant candidates are retrieved using an efficient retrieval model such as BM25. Although BM25 has proven to deliver decent performance as a first-stage ranker, it tends to miss relevant passages. In this context we propose CoRT, a simple neural first-stage ranking model that leverages contextual representations from pretrained language models such as BERT to complement term-based ranking functions while causing no significant delay at query time. Using the MS MARCO dataset, we show that CoRT significantly increases the candidate recall by complementing BM25 with missing candidates. Consequently, we find that subsequent re-rankers achieve superior results with fewer candidates. We further demonstrate that passage retrieval using CoRT can be realized with surprisingly low latencies.


INTRODUCTION
The successful development of neural ranking models over the past few years has rapidly advanced the state of the art in information retrieval [5,12]. One key aspect of this success is the exploitation of query-document interactions based on representations from self-supervised language models (LMs) [9,26,28]. While early models use static word embeddings for this purpose [7,11,35], more recent models incorporate contextualized embeddings from transformer-based models such as BERT [19,21,27]. However, models using such query-document interactions have an important shortcoming: to calculate a single relevance score for one passage in response to a given query, a forward pass through an often massive neural network is necessary [21]. Hence, it is not feasible to rank an entire corpus with such an interaction-focused approach. Instead, it is common practice to re-rank relatively small subsets of potentially relevant candidates. These candidates are retrieved by a scalable first-stage ranker, commonly a term-based bag-of-words model such as BM25. However, such a first-stage ranker may act as a "gate-keeper" [37], which effectively hinders the re-ranking model from discovering more relevant documents. We argue that relevant passages for some queries - especially if they are naturally formulated questions - can only be retrieved effectively if a) soft matching between strongly related terms is involved and b) the model recognizes that a term's meaning may change in interaction with its context. Hence, we suggest employing a neural ranking approach that acts as a complement to existing term-based models and uses candidates from both to compile a more complete candidate set for subsequent re-rankers. Boytsov et al. [2] utilized nearest neighbor search based on IDF-weighted averages of static word embeddings for queries and documents to retrieve re-ranking candidates in addition to BM25.
In contrast to that, we propose COmplementary Rankings from Transformers (CoRT), a neural first-stage ranking model and framework that leverages contextual representations from transformer-based language models [9,32] with the goal of complementing term-based first-stage rankings. CoRT optimizes an underlying text encoder towards representations that reflect the concept of relevance through vector similarity. The model is specifically trained to act complementary to BM25 by sampling negative examples from BM25 rankings. To our knowledge, CoRT is the first representation-focused neural ranking approach that makes use of the self-attention mechanism [32] of a pretrained language model [9] to produce context-sensitive vector representations of fixed size per query and document.
We study the characteristics of CoRT with four types of experiments based on the MS MARCO dataset. First, we measure various ranking metrics and compare the results with various first-stage ranking baselines. In the course of this, we demonstrate the complementary portion of relevant candidates that were added due to CoRT. Second, we combine the candidates from CoRT and BM25 with a state-of-the-art re-ranker based on BERT [9,21] and investigate how many candidates are needed to saturate the ranking quality. Third, we train CoRT with various representation sizes and measure the impact on the first-stage ranking quality. Finally, we measure the retrieval latencies of CoRT with two retrieval modalities: a distributed exhaustive search on four GPUs and an approximate search based on a graph-based nearest-neighbor index with pruning heuristics [15]. We consider our study on the relationship between the re-ranking quality, the number of candidates used, and the quality of the candidates as one of our main contributions. We also contribute a framework for training and deploying a representation-focused and transformer-based first-stage ranker that complements term-based candidate retrievers, for which we will provide an open-source implementation 1 .

RELATED WORK
Neural ranking approaches employ neural networks to rank documents in response to a query. In this section we describe key concepts of neural ranking and reference exemplary works for each of them. Subsequently, we present neural first-stage ranking approaches that allow a direct comparison with our results.

Key Concepts of Neural Ranking
According to Guo et al. [11], neural ranking approaches can be categorized into two types of models depending on their architecture. Representation-focused approaches [14,31,37] aim to build representations of queries and documents, which are used to predict relevance scores with a simple distance or similarity measure. In this context, exploiting local interactions between neighboring terms is often seen as an important technique for good representations [31,37]. Related models often follow the idea of Siamese Networks [3], where identical neural networks are joined at their outputs using a distance measure [8,30]. Here, pair-wise learning objectives are an effective training modality [12,37]. In contrast to point-wise objectives [8,21], where relevance prediction is modeled as binary classification, pair-wise objectives optimize the relative preference between positive and negative documents [8,12,37].
Models of the interaction-focused type exploit interactions between query and document terms [7,11,21,35]. In contrast to representation-focused approaches, this requires one forward pass through the whole model for each potentially relevant document. However, this kind of interaction usually leads to superior ranking quality [11,12,27]. Many related models employ a dedicated layer to enforce interactions via soft matching between query and document terms [7,11,35]. Another approach is to use the attention mechanism [32], or more specifically, a pretrained transformer encoder such as BERT [9] to exploit both local and query-document interactions [21,27].
Recently, some hybrid approaches have been proposed that combine typical representation-focused techniques with interaction-focused approaches to reduce computational cost. Gao et al. [10] propose a model architecture comprising three modules for document understanding, query understanding and relevance judging as part of their framework EARL. The understanding modules produce token-level representations, which can be cached as usual in representation-focused approaches. The relevance judging module, on the other hand, uses those representations to apply query-document interactions more quickly when document representations are cached. Each module is a stack of transformer layers [32], initialized with weights from BERT. Khattab and Zaharia [16] propose a related approach, namely ColBERT. The model architecture incorporates a relatively inexpensive max-similarity mechanism instead of a shallow transformer network to perform query-document interactions. The authors propose to store token representations in an Approximate Nearest Neighbor (ANN) index [15] to quickly retrieve only those documents that have token representations in proximity to those of the query. Thus, ColBERT can be described as a re-ranker that brings its own first-stage retrieval mechanism, allowing a full ranking in less than a second (458 ms on MS MARCO [16]).

Neural First-stage Ranking
A first-stage ranker can be characterized as an efficient full ranker that is used to retrieve documents for a subsequent re-ranker. In this context, various neural approaches have been proposed with the goal of overcoming the limitations of traditional ranking functions. Many of them make use of existing infrastructure for term-based ranking functions based on inverted indexing [20]. Zamani et al. [37] propose SNRM, a representation-focused approach with sparse representations that can be used for inverted indexing as if each feature dimension corresponds to a term in a bag-of-words representation. SNRM uses pretrained GloVe word embeddings [26] to model soft-matched n-grams, which are then encoded and aggregated into a sparse representation. Nogueira et al. [23] predict queries for given documents to expand those documents by corresponding query terms. In their first work, known as doc2query, they used a sequence-to-sequence (seq2seq) transformer model [32]. Nogueira and Lin [22] reported large effectiveness gains for their follow-up model docTTTTTquery by replacing the employed seq2seq model with T5 [29]. Another approach aims to predict near-optimal document term weights as a function of a term's context. DeepCT, proposed by Dai and Callan [6], utilizes BERT to predict those context-aware weights based on associated queries in the training data. Inverted indexing is only applicable for either sparse representations or approaches that extend or re-weight existing bag-of-words representations. Representation-focused models using dense representations can instead employ an ANN index, which heuristically prunes documents that are unlikely to be in the top proximity of the query representation, to realize low response latencies [2,13].

PROPOSED APPROACH
With CoRT we describe a first-stage ranking model that acts as a complementary ranker to existing term-based retrieval models such as BM25. To achieve this, we make use of local interactions [31] and sample negative training examples from BM25 rankings.

Architecture
The model architecture of CoRT, illustrated in Figure 1, follows the idea of a Siamese Neural Network [3]. Thus, passages and queries are encoded using the identical model with shared weights, except for one detail: the passage encoder and the query encoder use different segment embeddings [9]. CoRT computes relevance scores as angular similarity between query and passage representations. The parameters of the encoder are trained using a pair-wise ranking objective.

Encoding
CoRT can incorporate any encoder of a BERT-like language model as its underlying text encoder. Here, we choose a pretrained ALBERT [17] encoder for its smaller model size, its more demanding sentence coherence pretraining task, and the better first-stage ranking results it yielded compared to BERT throughout our early-stage experiments. The tokenizer of ALBERT is a WordPiece tokenizer [34] including the special tokens [CLS] and [SEP] known from BERT. From the text encoder we seek a single representation vector for the whole passage or query, which we call the context representation. From ALBERT we take the last-layer representation of the [CLS] token for this purpose. The context representation obtained from the underlying encoder for an arbitrary string s is denoted by ψ(s) ∈ R^h, where h is the output representation size.
(AL)BERT's language modeling approach involves training of sentence coherence, for which segment embeddings are used to signal different input segments. Although we only feed single segments to the encoder, i.e. a query or a passage, we use segment embeddings to allow the model to represent queries and passages differently. We refer to the variants using the passage and query segment embeddings (illustrated in Figure 1) as the context encoder functions ψ_p and ψ_q for passages and queries, respectively. The context representation is further projected to the desired representation size m using a linear layer followed by a tanh activation function. Thus, the complete passage encoder function is φ_p(p) := tanh(W ψ_p(p) + b) where W ∈ R^(h×m) and b ∈ R^m are the parameters of the linear layer. The query encoder φ_q is defined analogously with shared parameters.
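The projection step can be illustrated with a minimal NumPy sketch. The sizes, the random initialization, and the stand-in [CLS] vector below are hypothetical; in the actual model the context representation comes from the ALBERT encoder and W, b are learned.

```python
import numpy as np

rng = np.random.default_rng(0)

h, m = 768, 128  # hypothetical encoder output size and target representation size
W = rng.normal(0.0, 0.02, size=(h, m))  # projection parameters, randomly
b = np.zeros(m)                         # initialized here for illustration only

def project(context_repr):
    # Project the [CLS] context representation to size m and squash with tanh.
    return np.tanh(context_repr @ W + b)

cls = rng.normal(size=h)  # stand-in for the encoder's last-layer [CLS] output
rep = project(cls)
assert rep.shape == (m,)
assert np.all(np.abs(rep) < 1.0)  # tanh bounds every component to (-1, 1)
```

Because tanh bounds every component, the resulting vectors are well suited for the similarity-based scoring described in the next section.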

Training
Training CoRT corresponds to updating the parameters of the encoder towards a representation that reflects relevance between queries and documents through vector similarity. Each training sample is a triple comprising a query q, a positive passage p+ and a negative passage p−. While positive passages are taken from relevance assessments, negative passages are sampled from term-based rankings (i.e. BM25) to support the complementary property of CoRT. The relevance score s(q, p) for a query-passage pair (q, p) is calculated using the angular cosine similarity function. 2 As illustrated in Figure 1, the training objective is to score the positive example p+ at least by the margin γ higher than the negative one p−. We use the regular triplet margin loss function as part of our batch-wise loss function, which is defined as: L(q, p+, p−) := max(0, γ − s(q, p+) + s(q, p−)) (2). Inspired by Oh Song et al. [24], we aim to take full advantage of the whole training batch. For each query, each passage in the batch is used as a negative example except for its own positive one. Thus, the batch-wise loss function can be defined as the sum of L(q_i, p_i+, p) over all batch samples i and all passages p in the batch other than p_i+, where (q_i, p_i+, p_i−) denotes the triple of the i-th sample in the batch and b the number of samples per batch. We found this technique makes the training process more robust against exploding gradients, so the model can be trained without gradient clipping [38]. It also positively affects the first-stage ranking results 3 .
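A simplified sketch of the batch-wise objective is shown below. The angular similarity formulation (1 − arccos(cosine)/π) is one common variant and is assumed here to match the function referenced in footnote 2; for brevity, only the other positives in the batch serve as in-batch negatives.

```python
import numpy as np

def angular_sim(a, b):
    # Angular cosine similarity in [0, 1]; an assumed formulation:
    # 1 - arccos(cosine similarity) / pi.
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return 1.0 - np.arccos(np.clip(cos, -1.0, 1.0)) / np.pi

def batch_triplet_loss(queries, positives, margin=0.1):
    # For each query, every other positive passage in the batch serves as an
    # in-batch negative (simplified: BM25-sampled negatives are omitted here).
    terms = []
    for i, q in enumerate(queries):
        s_pos = angular_sim(q, positives[i])
        for j, p in enumerate(positives):
            if j != i:
                terms.append(max(0.0, margin - s_pos + angular_sim(q, p)))
    return sum(terms) / len(terms)

# Orthogonal toy vectors: each positive scores 1.0 and each in-batch
# negative 0.5, so every hinge term is clamped to zero.
queries = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
assert batch_triplet_loss(queries, queries) == 0.0
```

With a larger margin the hinge terms become active, which is how the objective pushes positives above negatives by at least the margin.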

Indexing and Retrieval
For retrieval with CoRT, each passage p must be encoded by the passage encoder φ_p. Subsequent normalization of each vector allows us to use the dot product as a proxy score function for s, which is sufficient to accurately compile the ranking. Given a query q, we calculate its representation φ_q(q) and the dot product of it with each normalized passage vector. From those, the k highest scores are selected and sorted to form the CoRT ranking. This procedure can be implemented in a heavily parallelized fashion using GPU matrix operations. Alternatively, the passage representations can be indexed in an ANN index to avoid exhaustive similarity search. Finally, we combine the resulting ranking of CoRT with the respective BM25 ranking by zipping the positions, beginning with CoRT, until a new ranking of the same length has been arranged. During this process, each passage that was already added from the other ranking is omitted.
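The zip-merging step described above can be sketched as follows, using toy passage ids:

```python
def zip_merge(cort_ranking, bm25_ranking):
    # Interleave both rankings, beginning with CoRT, skipping passages already
    # taken from the other ranking, and truncate to the original length.
    merged, seen = [], set()
    for cort_pid, bm25_pid in zip(cort_ranking, bm25_ranking):
        for pid in (cort_pid, bm25_pid):
            if pid not in seen:
                seen.add(pid)
                merged.append(pid)
    return merged[:len(cort_ranking)]

# Passages 2 and 3 appear in both rankings and are kept only once.
assert zip_merge([1, 2, 3, 4], [2, 5, 3, 6]) == [1, 2, 5, 3]
```

Note that the interleaving keeps CoRT's top candidate first and alternates positions, so high-ranked candidates from either ranker survive the truncation.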

EXPERIMENTS
We conducted four experiments studying the ranking quality and recall of CoRT, the connection between the number of candidates and re-ranking effectiveness, the impact of the representation size m, and CoRT's retrieval latencies. Finally, we outline a competitive end-to-end ranking setup with CoRT and a BERT-based re-ranker.

Datasets
MS MARCO Passage Retrieval. The Microsoft Machine Reading Comprehension [1] dataset for passage ranking was introduced in 2018 and provides a benchmark for passage retrieval with real-world queries and passages gathered from Microsoft's Bing search. It comprises 8.8M passages sampled from web pages and about 1M queries that are formulated as questions. The objective is to rank those passages high that were labeled as relevant for answering the respective question. The annotations, however, are sparse. There are 530k positive relevance labels distributed over 808k queries in the training set, whereby most queries are associated with one passage. Only 25k queries are associated with more than one passage. The validation and evaluation sets, dev and eval, comprise 101k queries each. An official subset of dev, called dev.small, comprises 6980 queries and 7437 relevance labels and is often used for publicly reported evaluations. We follow this convention and use dev.small for testing. The associated labels for eval, however, are not publicly available. The dataset does not contain any true negatives; hence, any unlabeled passage is assumed to be negative. This means there might be situations where assumed negative passages are actually more relevant than labeled ones. The creators suggest using the mean reciprocal rank cut at the tenth position (MRR@10) as the primary evaluation measure. Additionally, we measure NDCG@20 [20] as a less punishing ranking quality measure and the recall at various positions to indicate how many relevant passages a re-ranker would miss when the respective ranking is used for candidate selection.
TREC 2019 Deep Learning Track. This evaluation set provides manual relevance assessments per query for a set of 43 MS MARCO queries. Each assessment corresponds to a rating on a scale from 0 (not relevant) to 3 (perfectly relevant). The passages to be rated by human assessors were selected with a pooling strategy.
The pool comprises the top-10 rankings of submitted runs plus at least 100 passages selected by a special tool that models relevance based on already-found relevant passages. We adopt the evaluation metrics MRR (uncut), NDCG@10 and MAP from the official TREC overview [5]. In contrast to the original MS MARCO benchmark, this evaluation set provides dense annotations, but only for a few queries.
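For clarity, the two rank-based measures used most throughout the evaluation can be sketched as follows (a minimal sketch with toy passage ids, not the official evaluation script):

```python
def mrr_at_k(ranking, relevant, k=10):
    # Reciprocal rank of the first relevant passage within the top k, else 0.
    for rank, pid in enumerate(ranking[:k], start=1):
        if pid in relevant:
            return 1.0 / rank
    return 0.0

def recall_at_k(ranking, relevant, k):
    # Fraction of the relevant passages that appear in the top k positions.
    return len(set(ranking[:k]) & relevant) / len(relevant)

ranking = [7, 3, 9, 1]
relevant = {3, 1}
assert mrr_at_k(ranking, relevant) == 0.5        # first relevant hit at rank 2
assert recall_at_k(ranking, relevant, k=3) == 0.5
assert recall_at_k(ranking, relevant, k=4) == 1.0
```

MRR rewards placing one relevant passage very high, while recall@k measures how many relevant passages a downstream re-ranker gets to see at all, which is why recall is the focus for candidate retrieval.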

First-Stage Ranking
We trained CoRT as described in Section 3.3 using a representation size of m = 768. In this section, we discuss the first-stage ranking results of our model using the datasets and their associated metrics described in Section 4.1.

MS MARCO Passage Retrieval
The results of CoRT and its baselines on the MS MARCO passage retrieval task (dev.small) are reported in Table 1. Next to the obvious choice of BM25 as a baseline, we include DeepCT [6], doc2query [23] and its successor docTTTTTquery [22]. All three are recent first-stage rankers with average retrieval latencies below 100 ms per query on the MS MARCO passage corpus. The metrics MRR@10 and NDCG@20 reveal a quite decent ranking quality for the standalone CoRT ranker. Both only slightly increase due to merging with BM25 (CoRT BM25). The recall, however, increases by a large margin. Since CoRT's primary use is candidate retrieval rather than standalone ranking, we pay particular attention to the recall at various cuts. From the perspective of BM25, the absolute increase in recall due to merging with CoRT ranges between 15.1 (RECALL@50) and 9.2 (RECALL@1000) percentage points, which underlines that CoRT contributes a substantial number of relevant candidates that BM25 alone misses.

Candidate Re-ranking
We re-rank candidates from both BM25 and CoRT BM25 to study the impact of the candidates on a subsequent interaction-focused re-ranking. By varying the number of candidates, we investigate when adding more candidates becomes ineffective. The corresponding metrics, reported in Table 3, have been calculated based on MS MARCO (dev.small).
Re-ranking Model. Similar to [21], we use a simple binary classifier based on BERT. The model takes a query-passage pair and yields a relevance confidence. The pair (q, p) is concatenated into one token sequence with two segments (as is conventional in BERT). This sequence is processed by the BERT encoder, while the [CLS] embedding of the last layer, which we denote with η(q, p), is projected to a single classification logit. We then apply the sigmoid activation function σ to obtain the relevance confidence for query q and passage p. This procedure can be formalized as ρ(q, p) = σ(W′ η(q, p) + b′), where W′ ∈ R^(h×1) and b′ ∈ R are the parameters of a linear layer with a single output activation. To form a ranking at inference time, we sort the candidates by the model's confidence. Following [21], this model is trained using a point-wise objective. We sample query-passage pairs, each associated with a binary relevance label y ∈ {0, 1}, and minimize the binary cross-entropy loss −(y log ρ(q, p) + (1 − y) log(1 − ρ(q, p))).
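The scalar head of this classifier can be sketched in a few lines. The logit value below is hypothetical; in the model it would be produced by the linear layer on top of the last-layer [CLS] embedding.

```python
import math

def sigmoid(x):
    # Map a raw logit to a relevance confidence in (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

def bce_loss(confidence, label):
    # Binary cross-entropy for one query-passage pair with label y in {0, 1}.
    return -(label * math.log(confidence) + (1 - label) * math.log(1 - confidence))

logit = 2.0  # hypothetical classification logit
confidence = sigmoid(logit)
assert abs(confidence - 0.8808) < 1e-3
# A confident prediction fits a positive label much better than a negative one.
assert bce_loss(confidence, 1) < bce_loss(confidence, 0)
```

Since the sigmoid is monotonic, sorting candidates by confidence at inference time is equivalent to sorting by the raw logit.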
Re-ranking Results. The results, reported in Table 3, show superior ranking quality in terms of MRR@10 for candidates from CoRT BM25. This is especially true if low numbers of candidates are used. We also notice an earlier saturation 4 of MRR@10 for CoRT BM25, which is illustrated in Figure 2. Only 64 candidates from CoRT BM25 are sufficient to achieve top results with this re-ranker. In contrast, 256 candidates from BM25 are needed to reach the point of saturation, which translates into quadrupled re-ranking time. We also report the recall for the top-20 re-ranked positions (RECALL@20) and the recall for all candidates that were available to the re-ranker (RECALL@ALL).

Impact of Representation Size
As described in Section 3.2, CoRT projects the context representation of the underlying encoder to an arbitrary representation size m. This size determines the size of the final index and also influences the retrieval latency. The total size of the encoded corpus is easy to calculate. For example, with m = 128 and the MS MARCO corpus, the index size (without overhead) amounts to 8.8M passages × 128 floats/passage × 4 bytes/float ≈ 4.5 × 10^9 bytes ≈ 4.2 GB. Thus, m is proportional to the total size, and reducing m to 64 would halve the memory footprint. If m is small, however, it is more difficult to attain the margin objective (Eq. 2). Thus, m can be used as a trade-off between ranking quality and computational effort / resource cost. We investigate the relation between the representation size and the ranking quality by conducting identical training runs with different values of m. The results in Table 4 show that MRR@10 already saturates at m = 128. However, if m is reduced to 64 or below, a loss in ranking quality can be noticed. We also report the corresponding metrics for CoRT BM25 to illustrate whether the loss in ranking quality due to the size reduction is intercepted by BM25. Indeed, the loss of recall when reducing m is much lower if we use the compound ranking, indicating that many candidates we lose due to representation size reduction are already covered by BM25. We conclude that m = 64 would still perform similarly to m = 768 in a re-ranking setting.
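The back-of-the-envelope index size calculation above can be reproduced directly:

```python
def index_size_bytes(num_passages, m, bytes_per_float=4):
    # Raw index size: one float32 vector of dimension m per passage,
    # without any index overhead.
    return num_passages * m * bytes_per_float

size = index_size_bytes(8_800_000, 128)
assert size == 4_505_600_000          # about 4.5e9 bytes
assert round(size / 2**30, 1) == 4.2  # about 4.2 GiB
```

Because the size is linear in m, halving m (e.g. from 128 to 64) halves the memory footprint, as stated above.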

Latency Measurement
We propose two methods for the deployment of CoRT. The first exhaustively calculates similarity scores using multiple GPUs, while the second incorporates an Approximate Nearest Neighbor (ANN) index. We measure retrieval latencies of those methods and compare them with BM25 as a representative of term-based retrieval models based on inverted indexing. Approaches that are based on the bag-of-words model, such as DeepCT or doc2query, have latencies slightly greater than or equal to BM25. We conduct the latency measurement based on the top-1000 retrieval for the 6980 queries of the dev.small split while using the MS MARCO passage corpus containing around 8.8M passages. Since some approaches profit from batch computing, we also measure the latency for batches of 32 queries. As representation size, we have chosen m = 128 because it is the smallest representation size investigated in Section 4.4 that does not hurt the ranking quality of CoRT BM25.
Lucene BM25 Baseline. As retrieval latency baseline, we use a Lucene index generated by the Anserini toolkit [36]. The retrieval was performed on an Intel Core i9-9900KS with 16 logical cores (8 physical) and enough memory to fit the whole corpus. Single queries were processed using the single-threaded search function, while batch-wise search has been performed with 16 threads.
First-Stage Ranking using multiple GPUs. Multiple GPUs can be used to deploy CoRT for fast large-scale ranking. We propose to uniformly distribute the vector representations of the corpus over the available GPUs. Each GPU ranks its own partition of the corpus as described in Section 3.4. Afterwards, the results for each partition are aggregated by selecting the top-k candidates with the highest scores.
First-Stage Ranking using ANN. Since CoRT operates on vector similarities, it can make use of ANN search. We measure the retrieval latency and the loss of ranking quality that occurs due to the imperfection of pruning heuristics.
For this purpose, we use a graph-based index with a special optimization method called ONNG [15]. An implementation of this method is publicly available as part of the NGT library 5 . To control the trade-off between retrieval latency and accuracy, we can tweak the search range coefficient ε.
Latency Results. The latency measurements are reported in Table 5. For CoRT, the total retrieval latency per query consists of two factors: query encoding and the actual retrieval. The query encoding has to be performed by the query encoder, which we highly recommend running on a GPU. The latency of the actual retrieval depends on the retrieval methods described above. The exhaustive search using 4 GPUs takes 17 ms for a single query. Together with the encoding, the total retrieval time per query sums up to 17 ms + 8 ms = 25 ms, which is below the BM25 baseline. However, this is only possible due to the ability of GPUs to perform massively parallel computing. Furthermore, we observe a substantial increase in efficiency when processing queries batch-wise on the GPUs: the retrieval of 32 queries at once takes only about twice as long as a single query. The tested BM25 index, on the other hand, seems to suffer from multiprocessing overhead or other computational limitations when operating on a single instance. The latencies for the ANN index have been measured with three different values for the search range coefficient ε. While this significantly affects the retrieval latency, only slight differences in the quality of the first-stage ranking can be observed.
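The aggregation step of the multi-GPU deployment can be sketched as follows; the scores and passage ids are toy values, and in practice each partition's local top-k would be computed on its own GPU.

```python
import heapq

def aggregate_topk(partition_results, k):
    # Each GPU returns the local top-k of its corpus partition as
    # (score, passage_id) pairs; select the global top-k across all partitions.
    return heapq.nlargest(k, (hit for part in partition_results for hit in part))

gpu0 = [(0.9, 11), (0.7, 12)]  # hypothetical local results per GPU
gpu1 = [(0.8, 21), (0.6, 22)]
assert aggregate_topk([gpu0, gpu1], 3) == [(0.9, 11), (0.8, 21), (0.7, 12)]
```

Since each partition only ships its local top-k to the host, the aggregation cost is independent of the corpus size.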

End-to-end Retrieval
Intrigued by the exceptional ratio of retrieval latency to ranking quality of ColBERT's full-ranking approach [16], we used our above findings to create a competitive end-to-end ranking setup. We suggest re-ranking the top-64 candidates from CoRT BM25 with m = 128, retrieved by an ANN index (ε = 0.4). The end-to-end latency comprises 8 ms for query encoding, 71 ms for CoRT retrieval based on ONNG, 38 ms for BM25 retrieval 6 , and 192 ms for re-ranking. As reported in Table 5, we outperform ColBERT's end-to-end ranking performance in terms of MRR@10 and retrieval latency 7 . CoRT's representations for the MS MARCO corpus only weigh 4.3 GB when m is set to 128, or 7.0 GB when indexed in an ONNG index. The size of the query encoder amounts to about 50 MB, which is due to ALBERT's parameter sharing. To compile the full CoRT BM25 candidates, the corresponding BM25 index is needed, which amounts to 2.2 GB on disk. Although more memory is needed to deploy and operate both indexes, this is by far less than the 154 GB footprint reported by Khattab and Zaharia [16] for ColBERT's end-to-end approach.

CONCLUSION
In this paper, we propose CoRT, a framework and neural first-stage ranking model that combines term-based retrieval models with the benefits of local interactions in a neural ranking model to compile improved re-ranking candidates. As a result, we observe high recall measures on our candidates, improving re-ranking results in multi-stage pipelines. At the same time, we are able to decrease the number of candidates without hurting the end-to-end ranking performance. Our further experiments reveal the sweet spots for CoRT's representation size and the number of re-ranking candidates. We also propose two deployment strategies for CoRT and measure their performance in terms of efficiency and effectiveness. Finally, we demonstrate that CoRT can be used to create a highly competitive multi-stage ranking pipeline.
Our implementation uses PyTorch and HuggingFace's Transformers [33] as deep learning libraries. PyTorch has been built from source to ensure computation that is as fast as possible. All BM25 rankings were generated with the Anserini toolkit [36]. Anserini ensures reproducibility by providing optimized parameter sets and ranking scripts based on Apache Lucene for several datasets, including MS MARCO.
CoRT Training Details. We trained CoRT based on the pretrained ALBERT model "albert-base-v2", which is the lightest available version in HuggingFace's repository 8 [33]. Each model has been trained for 10 epochs, where each epoch includes all queries that are associated with at least one relevant document, plus one randomly sampled positive and one negative passage each. Most queries are only associated with one relevant passage, though. Negative examples are sampled from the corresponding top-100 BM25 ranking to support the complementary property of our model. We filter out positively labeled passages as well as the k = 8 highest ranks for their relatively high probability of actually being relevant. Due to the high computational effort, this parameter was not tuned systematically. However, we achieved 0.7 p.p. higher MRR@10 and 1.2 p.p. higher RECALL@100 on the MS MARCO dataset when training with k = 8 compared to k = 0. As usual for BERT-based models, we use the ADAM optimizer including the weight decay fix [18] with the default parameters β1 = 0.9, β2 = 0.999, ϵ = 10^−6, a weight decay rate of 0.1, and a linearly decreasing learning rate schedule starting at 2 × 10^−6 after 2,000 warm-up steps. We train mini-batches of b = 6 samples (triples) while accumulating the gradients of 100 mini-batches before performing one update step. The triplet margin γ (Eq. 2 in Section 3.3) has been set to γ = 0.1, which has been tuned in the range [0.01, 0.2].
Re-ranker Training Details. Our BERT re-ranking experiment utilized the pretrained "bert-base-uncased" model, hosted by HuggingFace [33]. We used the same optimizer settings as for CoRT except for the learning rate, which we empirically set to 5 × 10^−5. The batch size has been set to 8, and we accumulated the gradients of 16 batches before performing one update step. Originally, we trained dedicated model instances for the varying numbers of candidates in Section 4.3.
8 https://huggingface.co/transformers/pretrained_models.html
However, we found that a model that is trained with negatives from the top-1000 BM25 ranking generally performs better than a model that only uses the top-100 candidates. Hence, all re-ranking models have been trained on random negatives from the top-1000 candidates of the corresponding first-stage ranking. Table 6 shows top-1 retrieval examples of CoRT and BM25. The first query exemplifies the advantage of local interactions in the query encoder. We hypothesize that the query could successfully be "interpreted" as a question about density although the term density was not included. The second query is an example where BM25 works well due to favorable keywords in the passage. Although CoRT's top result is not labeled, it clearly is somewhat relevant to the question. Since the passage misses the keyword "insane", it is difficult to retrieve for a term-based model. We hypothesize that, due to the terms "hallucinations" and "paranoia", CoRT can correctly match the context in this example.