Hybrid and Collaborative Passage Reranking

In passage retrieval systems, the initial retrieval results may be unsatisfactory and can be refined by a reranking scheme. Existing solutions to passage reranking focus on enriching the interaction between query and each passage separately, neglecting the context among the top-ranked passages in the initial retrieval list. To tackle this problem, we propose a Hybrid and Collaborative Passage Reranking (HybRank) method, which leverages the substantial similarity measurements of upstream retrievers for passage collaboration and incorporates the lexical and semantic properties of sparse and dense retrievers for reranking. Moreover, built on off-the-shelf retriever features, HybRank is a plug-in reranker capable of enhancing arbitrary passage lists, including previously reranked ones. Extensive experiments demonstrate stable performance improvements over prevalent retrieval and reranking methods, and verify the effectiveness of the core components of HybRank.


Introduction
Information retrieval is a fundamental component within the field of natural language processing (Chen et al., 2017). Retrieval aims to search a set of candidate documents from a large-scale corpus, and thus efficient, high-recall retrieval is required to cover as many relevant documents as possible. Traditionally, retrieval has been dominated by lexical methods like TF-IDF and BM25 (Robertson and Zaragoza, 2009), which treat queries and documents as sparse bag-of-words vectors and match them at the token level. Recently, neural networks have become prevalent in information retrieval, where queries and documents are encoded into dense contextualized semantic vectors (Huang et al., 2020; Karpukhin et al., 2020; Ren et al., 2021a; Zhang et al., 2022), and retrieval is then performed with highly optimized vector search algorithms (Johnson et al., 2021). Our code is available at https://github.com/zmzhang2000/HybRank.
Although numerous efforts have been dedicated to retrieval, the inherent efficiency requirement restricts the interaction between query and passage to a shallow level, leading to unsatisfactory retrieval results. Thus, in typical reranking (Nogueira and Cho, 2020; Sun et al., 2021), the query and a passage are concatenated and fed into a Transformer (Vaswani et al., 2017) pre-trained on a large corpus to estimate a more fine-grained relevance score, further enhancing the retrieval results with richer interaction. These methods consider each passage in isolation, ignoring the context of the retrieved passage list. Some learning-to-rank (Rahimi et al., 2016; Xia et al., 2008) and pseudo-relevance feedback (Zamani et al., 2016; Zhai and Lafferty, 2001) methods utilize the ordinal relationship or listwise context of retrieved documents to further refine the retrieval. Moreover, the necessity of integrating listwise context has been confirmed in multi-stage recommendation systems (Liu et al., 2022).
Inspired by the success of listwise modeling and collaborative filtering (Goldberg et al., 1992) in recommendation systems, we find that collaboration also exists among the passages in the retrieval list and has not been fully exploited. Intuitively, for a specific query, the set of passages relevant to the query tends to describe the same entities, events and relations (Lee et al., 2019), while irrelevant passages outside this set involve multifarious objects. Therefore, a passage is more likely to be relevant to the query if most other passages share similar content with it. Similarities between passages can be naturally derived from retrievers, such as BM25 scores in sparse retrievers and the dot product of embeddings in dense retrievers.
In addition, sparse and dense retrieval methods emphasize distinct linguistic aspects: sparse retrieval relies on lexical overlap, while dense retrieval focuses on semantic and contextual relevance. Several researchers have attempted to integrate the merits of these two types of methods. Karpukhin et al. (2020), Lin et al. (2020) and Luan et al. (2021) exploit the linear combination of the two types of retrieval scores. Seo et al. (2019), Khattab and Zaharia (2020) and Santhanam et al. (2022) index smaller units within sentences, i.e., words or phrases, to obtain fine-grained similarity. Gao et al. (2021a) and Yang et al. (2021) retrain dense retrievers from scratch with the supervision of sparse signals. Nevertheless, the linear score combination lacks sufficient interaction, indexing smaller units sacrifices efficiency due to the tremendous number of embeddings, and rebuilding retrievers discards their original ranking capability.
To fully exploit the context of the retrieved passage list and explore a more thorough ensemble of heterogeneous retrievers, we propose a Hybrid and Collaborative Passage Reranking (HybRank) method, which leverages the collaboration within the retrieved passages and incorporates diverse properties of retrievers for reranking. Our method is a flexible plug-in reranker that can be applied to arbitrary passage lists, including those that have already been reranked by other methods. In this work, without loss of generality, we employ the two most representative types of retrievers: a sparse and a dense retriever. Given a query and an initial retrieval list, we first extract similarities between them and a set of anchor texts via both the sparse and the dense retriever. We project and group these similarities to form a set of hybrid and collaborative sequences, each corresponding to the query or a passage. Afterwards, the relevance scores between the query and the passages are evaluated in light of these sequences.
Extensive experiments demonstrate the consistent performance improvement brought by HybRank over passage lists from prevalent retrievers and strong rerankers. We conduct ablation studies on the collaborative information, feature hybrid, anchor-wise interaction and the number of anchor passages, verifying the impact and indispensability of these components in HybRank.

Method
In mainstream information retrieval systems, the first-stage retrieval is designed to fetch a coarse candidate list from a large corpus C. Inevitably, false positives, i.e., irrelevant passages, are returned in the first-stage retrieval list. To improve the precision of retrieval systems, the follow-up reranking procedure aims to distinguish the relevant passages from the others in the retrieval list. This paper focuses on the reranking stage.
Formally, given a query $q$ and an initial passage list $P = [p_1, p_2, \ldots, p_N]$ from an upstream retriever, the reranking task is to reorder the passage list by reassigning scores $S = [s_1, s_2, \ldots, s_N]$ to these passages. We denote positive passages in the list as $P^+$ and negative ones as $P^-$. In this section, we present the details of HybRank. The pipeline of HybRank is illustrated in Figure 1.

Preliminaries
Sparse Retrieval Traditionally, text retrieval has been dominated by token matching, where texts are encoded into high-dimensional sparse vectors using statistical information about tokens. The most commonly used sparse retrieval methods include TF-IDF, BM25 and so forth. We adopt the BM25 score as the similarity metric for sparse retrieval due to its robustness and popularity.
Specifically, given the query $q$ and the document $d$, the BM25 score between $q$ and $d$ is obtained by summing the BM25 weights over the terms co-occurring in $q$ and $d$:
$$f^s(q, d) = \sum_{t \in q \cap d} \frac{c_{t,d}}{k_1 \left( (1-b) + b \frac{|d|}{l} \right) + c_{t,d}} \cdot w_t^{\mathrm{RSJ}}, \quad (1)$$
where $t$ is a term, $w_t^{\mathrm{RSJ}}$ is $t$'s Robertson-Spärck Jones weight, $c_{t,d}$ is the frequency of $t$ in $d$, $|d|$ is the document length and $l$ is the average length of all documents in the collection. $k_1$ and $b$ are tunable parameters. Refer to Robertson and Zaragoza (2009) for more details about BM25.
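As a concrete illustration, the following minimal Python sketch computes Eqn. 1 for a tokenized query and document. It is not the implementation used in this paper: the tokenization and the corpus statistics (`doc_freq`, `num_docs`, `avg_len`) are assumed inputs, the default parameters `k1=0.9, b=0.4` are common MS MARCO values rather than values from this work, and the RSJ weight is approximated by its IDF-like form, as is standard when no relevance information is available.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, doc_freq, num_docs, avg_len,
               k1=0.9, b=0.4):
    """Minimal BM25 (Eqn. 1): saturated term frequency times RSJ weight,
    summed over terms co-occurring in query and document."""
    counts = Counter(doc_terms)
    score = 0.0
    for t in set(query_terms) & set(doc_terms):
        # RSJ weight without relevance information reduces to an IDF-like form.
        w_rsj = math.log((num_docs - doc_freq[t] + 0.5) / (doc_freq[t] + 0.5))
        norm = k1 * ((1 - b) + b * len(doc_terms) / avg_len)
        score += counts[t] / (norm + counts[t]) * w_rsj
    return score

# Toy usage with a two-document "corpus".
df = {"king": 1, "england": 1, "cat": 1}
print(bm25_score(["king", "england"], ["king", "of", "england"],
                 df, num_docs=2, avg_len=3.0))
```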
Dense Retrieval Owing to the flexibility of task-specific representations provided by learnable parameters, recent works leverage neural networks to encode text into dense vectors and search similar documents for queries in vector space. Typically, the query and document are encoded separately, and the relevance score is measured by the similarity of their embeddings. Any neural architecture capable of encoding text into a single fixed-length vector is suitable for dense retrieval. We use the predominant Transformer (Vaswani et al., 2017) encoder and dot product similarity, formulated as
$$f^d(q, d) = T_q(q)^\top T_d(d), \quad (2)$$
where $T_q(\cdot)$ and $T_d(\cdot)$ are Transformer encoders for queries and documents, respectively. Dot product similarity permits offline pre-encoding of a large corpus and efficient retrieval via a highly optimized vector nearest-neighbor search library (Johnson et al., 2021).
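For intuition, a minimal sketch of this pipeline with FAISS (Johnson et al., 2021) is shown below. The random matrices stand in for the encoder outputs $T_d(\cdot)$ and $T_q(\cdot)$; the embedding dimension and corpus size are illustrative assumptions.

```python
import numpy as np
import faiss  # highly optimized vector search (Johnson et al., 2021)

d = 768                                                   # assumed encoder dimension
corpus_emb = np.random.rand(10000, d).astype("float32")   # stand-in for T_d(corpus)
query_emb = np.random.rand(1, d).astype("float32")        # stand-in for T_q(q)

index = faiss.IndexFlatIP(d)       # inner (dot) product, matching Eqn. 2
index.add(corpus_emb)              # offline pre-encoding of the corpus
scores, ids = index.search(query_emb, 100)  # top-100 initial passage list P
```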

Hybrid and Collaborative Sequence
For a specific query, relevant passages tend to describe the same entities, events and relations as the query (Lee et al., 2019). In other words, most passages in the retrieval list resemble the true positives. Inspired by the success of collaborative filtering (Goldberg et al., 1992) in recommendation systems, we utilize the similarities between passages to distinguish the positive passages in the retrieval list.
Collaborative Sequence Similarity measurements can be naturally derived from retrievers, e.g., the BM25 score in sparse retrievers and the dot product in dense retrievers, as described in Section 2.1. We compute the similarity between each passage and a set of anchors, which in this work are the top-$L$ passages of the retrieval list and will collaborate to distinguish the positive passages. These similarity scores between passages can be pre-computed, as HybRank utilizes off-the-shelf retrievers. Denoting the similarity score between passages $p_i$ and $p_j$ as $f_{ij} \in \mathbb{R}$, the passage $p_i$ can be represented as a sequence of similarity scalars $X_{p_i} = [f_{i1}, f_{i2}, \ldots, f_{iL}]$.

Nevertheless, according to our observation, the similarity scalars within a retrieval list tend to concentrate in a small range. This is a reasonable phenomenon, as retrievers fetch relatively similar passages from the large corpus. To obtain more distinctive features, we employ a temperature softmax to stretch the distribution of similarities, followed by a min-max normalization that scales them into the range $[-1, 1]$. These two transforms are formulated as
$$\tilde{f}_{j} = \frac{\exp(f_{j} / t)}{\sum_{k=1}^{L} \exp(f_{k} / t)}, \qquad x_{j} = 2 \cdot \frac{\tilde{f}_{j} - \min_k \tilde{f}_{k}}{\max_k \tilde{f}_{k} - \min_k \tilde{f}_{k}} - 1, \quad (3)$$
where $t$ is the temperature. Subscripts of $x_{p_i}$ are omitted for brevity.
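The following NumPy sketch illustrates Eqn. 3 for the similarities of one passage to its anchors. The function name and example scores are hypothetical; the temperatures follow the values reported later in the implementation details (100 for sparse, 10 for dense similarity).

```python
import numpy as np

def stretch_and_scale(f, t):
    """Eqn. 3: temperature softmax to stretch concentrated similarities,
    then min-max normalization into [-1, 1]."""
    z = np.exp(f / t - np.max(f / t))   # numerically stable softmax
    z = z / z.sum()
    return 2 * (z - z.min()) / (z.max() - z.min()) - 1

# Similarities of one passage to L = 5 anchors, e.g. BM25 scores (t = 100)
# or dot products (t = 10).
x = stretch_and_scale(np.array([12.3, 11.8, 9.1, 8.7, 8.2]), t=100)
print(x)  # stretched and scaled to [-1, 1]
```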
Feature Hybrid The similarity metrics of sparse and dense retrievers concentrate on lexical overlap and semantic relevance, respectively. To combine the lexical and semantic properties embedded in sparse and dense retrievers, we mix their similarity scores by stacking them in a channel manner. Formally, we substitute a similarity vector for the similarity scalar:
$$x_{ij} = [f^s_{ij}, f^d_{ij}] \in \mathbb{R}^2,$$
where $f^s_{ij}$ is the sparse similarity computed as Eqn. 1 and $f^d_{ij}$ is the dense similarity computed as Eqn. 2. After that, the representation of passage $p_i$ becomes a sequence of similarity vectors $X_{p_i} = [x_{i1}, x_{i2}, \ldots, x_{iL}] \in \mathbb{R}^{L \times 2}$. Additionally, we map the similarity vectors in the sequence to $D$ dimensions with a trainable linear projection:
$$e_{ij} = x_{ij} W,$$
where $W \in \mathbb{R}^{2 \times D}$ is a learnable parameter and $e_{ij} \in \mathbb{R}^D$ are embedded similarities. Thereafter, passage $p_i$'s representation becomes a sequence of similarity embeddings $E_{p_i} = [e_{i1}, e_{i2}, \ldots, e_{iL}] \in \mathbb{R}^{L \times D}$, which comprises the similarity information between $p_i$ and the anchor passages originating from both the sparse and the dense retriever. These similarities deliver substantial information for the collaboration of passages and carry both the lexical and semantic properties of the retrievers. With the same procedure, we compute the similarities between the query and the anchors, and derive the query representation $E_q = [e_{q1}, e_{q2}, \ldots, e_{qL}] \in \mathbb{R}^{L \times D}$. Note that the similarities from the sparse and dense retrievers are stretched and normalized individually before the linear projection, as described in Eqn. 3. Consequently, we obtain $N + 1$ collaborative sequences in total, each representing a passage or the query and consisting of its lexical and semantic similarity information with the $L$ anchor passages.
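A minimal PyTorch sketch of the channel-wise stacking and linear projection is given below. The input tensors are placeholders for already-normalized similarities, and `nn.Linear` realizes the learnable projection $W$ (up to PyTorch's transposed weight convention).

```python
import torch
import torch.nn as nn

L, D = 100, 64                 # number of anchors, embedding dimension
f_sparse = torch.rand(L)       # normalized BM25 similarities to the anchors
f_dense = torch.rand(L)        # normalized dot-product similarities

x = torch.stack([f_sparse, f_dense], dim=-1)  # (L, 2): similarity vectors x_ij
proj = nn.Linear(2, D, bias=False)            # learnable projection W
E = proj(x)                                   # (L, D): similarity embeddings e_ij
```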

Interaction and Aggregation
Following the prevalent sequence similarity learning paradigm in natural language processing (Reimers and Gurevych, 2019; Gao et al., 2021b), we measure the relevance of the query and a passage with their collaborative sequences in vector space. We obtain these vector representations via anchor-wise interaction and sequence aggregation in HybRank.
Anchor-wise Interaction The $j$-th elements $e_{*j}$ of the collaborative sequences $E_*$ indicate the similarities between the retrieved passages and the $j$-th anchor passage. The importance of these anchors varies, since they are picked with a single strategy. Specifically, an anchor deserves more consideration if it shows strong correlation with a majority of the retrieved passages, and vice versa.
To assess the quality of the anchor passages, we conduct anchor-wise interaction. Concretely, for each position $j$, we collect the $j$-th similarity embeddings $e_{*j}$ from the query sequence and every passage sequence, and refine them with a Transformer encoder, denoted as
$$[e'_{qj}; e'_{1j}; e'_{2j}; \ldots; e'_{Nj}] = \mathrm{Trans}_{\mathrm{inter}}(e_{qj}; e_{1j}; e_{2j}; \ldots; e_{Nj}),$$
where $e'_{*j} \in \mathbb{R}^D$. Position embeddings are added to $e_{*j}$ according to its rank $*$ to retain the passage rank information. Subsequently, the similarity embedding sequences $E_*$ are converted to the refined sequences $E'_*$.

Sequence Aggregation We encode these sequences into dense vectors by aggregating the enhanced similarity embeddings. To be specific, we prepend a [CLS] embedding to the collaborative sequence, feed the extended sequence into another Transformer encoder and use the output at [CLS] as the representation of $p_i$, formulated as
$$h_{p_i} = \mathrm{Trans}_{\mathrm{aggr}}\left(e_{[\mathrm{CLS}]} \oplus E'_{p_i}\right)_{[\mathrm{CLS}]},$$
where $E'_{p_i} \in \mathbb{R}^{L \times D}$ and $\oplus$ denotes the concatenation operation. $h_{p_i} \in \mathbb{R}^D$ is the vector representation of passage $p_i$. The query representation $h_q \in \mathbb{R}^D$ is derived analogously.
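The two stages can be sketched with standard PyTorch Transformer encoders as follows. This is a schematic, not the released implementation: position embeddings are omitted, the [CLS] embedding is a zero placeholder rather than a learned parameter, and the layer sizes follow the implementation details reported later ($D = 64$, 8 heads, inner dimension 256, 2 interaction layers and 1 aggregation layer).

```python
import torch
import torch.nn as nn

N, L, D = 100, 100, 64          # passages in the list, anchors, embedding dim
E = torch.rand(N + 1, L, D)     # collaborative sequences for query + N passages

def make_layer():
    return nn.TransformerEncoderLayer(d_model=D, nhead=8,
                                      dim_feedforward=256, batch_first=True)

trans_inter = nn.TransformerEncoder(make_layer(), num_layers=2)
trans_aggr = nn.TransformerEncoder(make_layer(), num_layers=1)

# Anchor-wise interaction: attend across the N+1 sequences at each anchor j,
# i.e., a column-wise attention on the (N+1) x L similarity matrix.
E = trans_inter(E.transpose(0, 1)).transpose(0, 1)  # (L, N+1, D) and back

# Sequence aggregation: prepend [CLS] and attend along the L anchors,
# i.e., a row-wise attention; the [CLS] output is the dense vector h.
cls = torch.zeros(N + 1, 1, D)                      # placeholder [CLS] embedding
H = trans_aggr(torch.cat([cls, E], dim=1))[:, 0]    # (N+1, D)
h_q, h_p = H[0], H[1:]                              # query and passage vectors
```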
Receptive Field and Complexity Interestingly, from another perspective, the anchor-wise interaction and sequence aggregation are equivalent to a column-wise and a row-wise attention applied to the matrix formed by the similarities of the query, passages and anchors. A global receptive field is provided by these two axial attentions (Ho et al., 2019). Consequently, the similarity vectors $x_{ij}$ perceive each other, and the vector representations of the query and passages are aware of the collaborative information from the others.
A more direct approach to obtaining a global receptive field is element-wise interaction. Concretely, we could feed the concatenation of all sequences $E$ into a single Transformer encoder, and output representations for each passage and the query via multiple separate [CLS] tokens. However, due to the self-attention operation in the Transformer, the computational complexity of element-wise interaction reaches $O(N^2 L^2)$. In contrast, our method reduces the complexity to $O(N^2 L + N L^2)$ by decomposing the element-wise attention on the similarity matrix into axial attentions; e.g., with $N = L = 100$, roughly $10^8$ pairwise interactions are reduced to about $2 \times 10^6$. Note that the complexity can be further reduced to $O(NL + NL)$, i.e., $O(NL)$, by leveraging linear Transformers (Katharopoulos et al., 2020; Wang et al., 2020) instead of vanilla Transformers.

Reranking and Training
Reranking Considering that the query and passages have been converted into dense vectors encoding collaborative information, we have several alternatives for judging vector similarity as the relevance score between the query and a passage. We use dot product in this work, and thus the relevance score between query $q$ and passage $p_i$ is computed by
$$s_i = h_q^\top h_{p_i}.$$
The passages are then sorted in descending order of their relevance scores $s_i$ with the query.
Training To assign high scores to relevant passages and low scores to irrelevant ones, HybRank needs to pull the representations of relevant passages toward the query, while pushing the representations of irrelevant ones as far from the query as possible. As there may exist more than one positive passage in the list, the vanilla softmax loss cannot be directly applied to HybRank. We adopt the supervised contrastive loss (Khosla et al., 2020) to cope with multiple positives, which performs the summation over positives outside the log function in the softmax. The loss is formulated as
$$\mathcal{L} = -\frac{1}{|P^+|} \sum_{p_i \in P^+} \log \frac{\exp(s_i / \tau)}{\sum_{p_j \in P} \exp(s_j / \tau)},$$
where $|P^+|$ is the number of positive passages in the retrieval list and $\tau$ is a tunable temperature.
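A compact PyTorch sketch of the scoring and loss computation might look as follows. `hybrank_loss` is a hypothetical helper name, and the inputs stand in for the aggregated query and passage vectors from the previous section.

```python
import torch

def hybrank_loss(h_q, h_p, pos_mask, tau=0.07):
    """Supervised contrastive loss (Khosla et al., 2020) over the list:
    the sum over positives is taken outside the log."""
    s = h_p @ h_q / tau                        # relevance scores s_i = h_q . h_{p_i}
    log_prob = s - torch.logsumexp(s, dim=0)   # log-softmax over the whole list P
    return -log_prob[pos_mask].mean()          # average over the |P+| positives

h_q, h_p = torch.rand(64), torch.rand(100, 64)
pos_mask = torch.zeros(100, dtype=torch.bool)
pos_mask[[0, 3]] = True                        # two positive passages in the list
loss = hybrank_loss(h_q, h_p, pos_mask)
```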

Datasets
Natural Questions (Kwiatkowski et al., 2019) consists of real English questions from the Google search engine with golden passages from English Wikipedia pages and answer span annotations. Following the settings of Karpukhin et al. (2020), we report the test set top-k accuracy (R@k), which evaluates the percentage of queries whose top-k retrieved passages contain the answers.

MS MARCO (Bajaj et al., 2018) includes English queries from Bing search logs and was originally designed for machine reading comprehension. Following previous works (Qu et al., 2021; Ren et al., 2021b), we evaluate the dev set R@k as well as the Mean Reciprocal Rank (MRR), i.e., the average reciprocal rank of the first retrieved relevant passage.

TREC 2019/2020 (Craswell et al., 2020b,a) originate from the TREC 2019/2020 Deep Learning (DL) Track. These two tracks provide additional Bing search queries and require retrieving passages from the MS MARCO corpus. We use the official setting and evaluate the NDCG@10 of HybRank trained on MS MARCO on their test sets.
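For reference, the R@k and MRR metrics used above can be computed as in the following sketch, where `is_relevant` holds per-query boolean relevance lists over the ranked passages; it is an illustrative implementation of the metric definitions, not the official evaluation scripts (NDCG@10 is omitted).

```python
def recall_at_k(is_relevant, k):
    """Top-k accuracy (R@k): fraction of queries whose top-k list
    contains at least one relevant passage."""
    return sum(any(rels[:k]) for rels in is_relevant) / len(is_relevant)

def mrr_at_k(is_relevant, k=10):
    """Mean Reciprocal Rank: average reciprocal rank of the first
    relevant passage (contributes 0 if none appears in the top k)."""
    total = 0.0
    for rels in is_relevant:
        for rank, rel in enumerate(rels[:k], start=1):
            if rel:
                total += 1.0 / rank
                break
    return total / len(is_relevant)

# Two toy queries with ranked relevance judgments.
print(recall_at_k([[0, 1, 0], [1, 0, 0]], k=1))  # 0.5
print(mrr_at_k([[0, 1, 0], [1, 0, 0]]))          # (1/2 + 1/1) / 2 = 0.75
```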

Implementation Details
HybRank is a flexible plug-in reranker that can be applied to arbitrary passage lists, including those that have already been reranked by other methods. Thus, we test HybRank not only against retrieval systems but also against systems with other rerankers in them. We adopt dense retrievers that outperform sparse ones after elaborate pre-training (Chang et al., 2020; Gao and Callan, 2021, 2022) and fine-tuning (Sachan et al., 2021), as well as strong cross-encoder based rerankers, to initialize the passage list. We simply select all passages in the initial list as anchors; the impact of anchor passages is discussed in Section 3.4. These methods are implemented using the RocketQA toolkit and the Pyserini toolkit (Lin et al., 2021a), which is built on Lucene and FAISS (Johnson et al., 2021).
The hyper-parameters in HybRank are as follows. The temperature $t$ in the feature normalization is set to 100 and 10 for sparse and dense similarity, respectively. We randomly initialize a 2-layer Transformer encoder for $\mathrm{Trans}_{\mathrm{inter}}$ and a 1-layer one for $\mathrm{Trans}_{\mathrm{aggr}}$ using Huggingface Transformers (Wolf et al., 2020). The embedding dimension, MLP inner-layer dimension and number of heads are 64, 256 and 8, respectively, resulting in 0.22M parameters in total. The temperature $\tau$ in the loss function is 0.07. We adopt the Adam optimizer with an initial learning rate of $1 \times 10^{-3}$ and a warm-up ratio of 0.1, followed by cosine learning rate decay. We use gradient clipping of 2 and weight decay of $1 \times 10^{-6}$. We train the model for 100 epochs with batch size 32, which takes about 13 hours on Natural Questions and 4 days on MS MARCO. All experiments are conducted on a single NVIDIA RTX 3090 GPU.
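The optimization setup described above could be assembled roughly as follows. This is a sketch under assumptions, not the released training script: the model and the number of steps per epoch are placeholders, and the warm-up/decay schedule is one plausible realization of the stated recipe.

```python
import math
import torch

model = torch.nn.Linear(2, 64)        # placeholder for the HybRank model
steps_per_epoch, epochs = 1000, 100   # steps_per_epoch is an assumed value

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-6)
total = epochs * steps_per_epoch
warmup = int(0.1 * total)             # warm-up ratio 0.1

def lr_lambda(step):
    if step < warmup:                 # linear warm-up
        return step / max(1, warmup)
    progress = (step - warmup) / max(1, total - warmup)
    return 0.5 * (1 + math.cos(math.pi * progress))  # cosine decay

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
# Per training step: loss.backward();
# torch.nn.utils.clip_grad_norm_(model.parameters(), 2.0);
# optimizer.step(); scheduler.step(); optimizer.zero_grad()
```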

Results
Table 1 and Table 2 summarize the performance of HybRank and the baselines on the Natural Questions, MS MARCO and TREC 2019/2020 datasets. More detailed evaluation results are listed in Appendix B. Some of the adopted retrieval baselines involve both sparse and dense similarity from different perspectives: DPR (Karpukhin et al., 2020) selects hard negative samples from passages returned by BM25; FiD-KD (Izacard and Grave, 2021) starts its iterative training with passages retrieved using BM25; TCT-ColBERT-v1 (Lin et al., 2020) and TCT-ColBERT-v2 (Lin et al., 2021b) explore an alternative approximation for the linear combination of dense and sparse retrieval.

Table 2 reports the reranking performance of HybRank on MS MARCO and TREC 2019/2020 from a single run, with HybRank built upon DistilBERT-KD (Hofstätter et al., 2021a), ANCE (Xiong et al., 2021), TCT-ColBERT-v1 (Lin et al., 2020), TAS-B (Hofstätter et al., 2021b), TCT-ColBERT-v2 (Lin et al., 2021b), RocketQA (Qu et al., 2021) and RocketQAv2 (Ren et al., 2021b); improvements brought by HybRank are highlighted in bold.

From the results, we observe that HybRank yields consistent improvements over upstream retrievers and even rerankers. In general, HybRank based on stronger baselines produces better reranking results. For example, HybRank built upon the retriever of RocketQA outperforms HybRank built upon DPR-Multi on Natural Questions, and the same phenomenon can be observed with most retrievers. Additionally, HybRank built upon systems with a reranker further improves the performance on both datasets. These results prove the advantage of reranking based on arbitrary off-the-shelf retrievers and even other reranked results, which distinguishes HybRank from other rerankers.
The most surprising aspect of these results is that, in spite of inferior reranking results, low-scoring retrievers gain larger relative improvements from HybRank than high-scoring ones. This can be explained by the fact that HybRank relies heavily on the complementary information provided by sparse similarity: low-scoring retrievers receive relatively more valuable information from sparse similarity than high-scoring retrievers, and accordingly gain larger improvements over the upstream retrievers. We discuss the sparse-dense hybrid further in Section 3.4.

Analysis
In this section, we discuss the impact of the core components of HybRank: the hybrid and collaborative features, the anchor-wise interaction and the number of anchor passages. All experiments are conducted on the Natural Questions dataset with the DPR-Multi retriever.
Collaborative Feature The main difference between HybRank and other works is that it leverages the collaborative information between retrieved passages. To verify the impact of passage collaboration on reranking, we omit the collaborative feature in "w/o collab" by substituting only query-passage similarities for the collaborative sequences, i.e., representing each passage as a one-token sequence according to its similarity with the query. Besides, we exclude the query-passage similarity in "w/o q-p" by representing the query via a learnable token rather than an aggregated collaborative sequence. The results are presented in Table 3, where "retriever" denotes the assessment of the initial passage list. Table 3 indicates that "w/o collab" shows an appreciable gain over "retriever", demonstrating that query-passage similarity is an essential and indispensable feature for HybRank. The most remarkable phenomenon is that "w/o q-p" surpasses "retriever" by a large margin, despite the fact that "w/o q-p" is completely unaware of the query. Namely, HybRank is able to distinguish the positives even with only the collaborative information among passages. Furthermore, standing on the shoulders of query-passage similarity, HybRank achieves even better results than "w/o collab", which sufficiently substantiates the reranking capability of the collaborative information.
Anchor-wise Interaction Apart from the collaborative sequence, anchor-wise interaction provides extra collaboration between sequences. We eliminate $\mathrm{Trans}_{\mathrm{inter}}$ and directly aggregate the linearly projected collaborative sequences to study the effectiveness of anchor-wise interaction. Table 3 shows a noticeable drop in performance without anchor-wise interaction. The discrepancy can be attributed to the restricted receptive field: "w/o inter" individually encodes each collaborative sequence of the query and passages into dense vectors without anchor-wise interaction. In this manner, the relevance of these sequences is evaluated only in a vector space where the sequence information is severely compressed and not expressive enough. In contrast, equipped with anchor-wise interaction, HybRank obtains a global receptive field: each element in these sequences captures the context of the elements in all sequences, enabling more informative vector representations and fine-grained relevance estimation.
Feature Hybrid Although the similarities of sparse and dense retrievers reflect different linguistic aspects, i.e., lexical overlap and semantic relevance, both tend to have the collaborative property. Hence, it is natural and easy to mix sparse and dense retrieval from the perspective of collaboration. To illustrate the complementarity of sparse and dense features and the necessity of feature hybrid in HybRank, we separately validate the effect of the two individual features and their hybrid. The ablations are conducted not only on an initial passage list retrieved by the dense retriever, but also on a list retrieved by the sparse retriever for integrity and comparison. Identical trends can be observed in the two settings in Table 4: the performance gains are limited when the retrievers used for passage retrieval and similarity computation are the same, but increase dramatically when they differ. Furthermore, slight additional improvements can be seen with the hybrid of the two features in both settings. These phenomena reveal that the main performance gains originate from the retriever different from the one in the retrieval stage, while the same type plays only an auxiliary role. Consequently, we draw the credible conclusion that different types of similarities provide additional complementary information over the initial passage list.
Moreover, regardless of the feature used, HybRank achieves better results on passage lists retrieved by the dense retriever than by the sparse one, as more positives are contained in the densely retrieved list. This also corroborates the finding of Section 3.3 that a superior initial passage list leads to better reranking results with HybRank.

Number of Anchor Passages
We evaluate the performance of HybRank under different numbers of anchors to study their impact. What can be clearly seen in Figure 2 is a consistent growth in performance as the anchor number $L$ increases. The underlying philosophy is that with more anchor passages, the passage list can derive more agreement to facilitate the collaboration between passages and alleviate the distraction from noisy ones. The positive correlation between performance and anchor number indicates the effect of collaborative information in the retrieval list.
Despite the consistent growth with the anchor number, the rate of performance increase begins to slow down when the number of anchors exceeds 60. Anchor passages are used for deriving collaborative information, and thus with more diverse anchors we obtain more distinctive collaborative features. As the anchor number approaches 100, the diversity of passages levels off, leading to stable performance with larger anchor numbers.
As $L$ increases to a very large number, the average relevance of the anchors degrades to a low level. A legitimate concern is that a poor-quality anchor set would pollute the collaboration. Due to the $O(L^2)$ computational complexity of sequence aggregation in HybRank, it is hard to directly perform experiments with large $L$, but we simulate a poor-quality anchor set by randomly selecting anchor passages from the corpus C. "r/d anchor" in Table 3 indicates that random anchors slightly improve the performance but still lag far behind the relevant anchors, demonstrating the benefits of collaborative information and the predominance of anchor quality.
Nevertheless, the selection of anchor passages is flexible. Ideally, more elaborate anchor passage selection, e.g., clustering the passages of the corpus and selecting a fixed number of cluster centroids as anchors, would further enhance the performance and efficiency of HybRank. We leave the exploration of other anchor selection strategies as future work.

Text Retrieval
Retrieval is the first stage of information retrieval, which requires high recall to cover as many relevant documents as possible in the retrieval list. Traditional sparse approaches like TF-IDF and BM25 (Robertson and Zaragoza, 2009) rely on lexical overlap between queries and documents. Although they have dominated the field of text retrieval for a long time, these sparse methods suffer from the lexical gap (Berger et al., 2000), namely, the synonymy problem. To tackle this issue, earlier techniques (Nogueira et al., 2019; Dai and Callan, 2020) adopt neural networks to reinforce the sparse methods. Recently proposed dense retrieval approaches (Karpukhin et al., 2020; Xiong et al., 2021) directly encode the query and passages into dense vectors via a dual-encoder, which captures the semantics of text and enables low-latency search via highly optimized algorithms, e.g., FAISS (Johnson et al., 2021).
These two types of methods are not mutually exclusive, and one's weakness is the other's strength. Some researchers combine the sparse and dense methods via score ensembles, improved training, or trade-off models between sparse and dense retrieval. Karpukhin et al. (2020) sample hard negatives from a sparse retriever for the training of a dense retriever. Seo et al. (2019), Khattab and Zaharia (2020) and Santhanam et al. (2022) index terms or phrases instead of documents for more fine-grained similarity and higher efficiency. Lin et al. (2020) and Luan et al. (2021) explore the linear sparse-dense score combination and its alternatives. Gao et al. (2021a) and Yang et al. (2021) leverage lexical matching or token-level interaction signals to train the dense retriever.
However, among these methods, score ensembling lacks sufficient interaction between the sparse and dense methods, indexing smaller units sacrifices efficiency, and retraining one type of retriever with the help of the other discards its original ranking capability. In contrast, our method can be applied to arbitrary passage lists, incorporating the lexical and semantic properties of off-the-shelf retrievers while ensuring generality and flexibility.

Text Reranking
The second-stage reranking is based on the results of the retrieval system and aims to perform a more fine-grained comparison within the retrieval list. Typically, a cross-encoder is utilized to capture the interactions between query and passage at the token level. Nogueira and Cho (2020) and Sun et al. (2021) adopt BERT (Devlin et al., 2019) to achieve token-level interactions with the attention mechanism (Vaswani et al., 2017). To reduce the massive computation overhead (Reimers and Gurevych, 2019), Khattab and Zaharia (2020) and Gao et al. (2020) propose lightweight interactions on dense representations from retrievers. While based on first-stage retrieval, these methods individually compute the relevance of each retrieved passage, omitting the extra information implied by the whole list and requiring multiple runs.
Several pseudo-relevance feedback approaches (He and Ounis, 2009; Zamani and Croft, 2016; Zamani et al., 2016) aim to refine the query model with the top-retrieved documents. Listwise context is also well explored in multi-stage recommendation systems (Liu et al., 2022), such as PRM (Pei et al., 2019), which regards each item as a token, learns the mutual influence between items using self-attention and reranks all items together. Different from prior approaches, we extract collaborative features from the retrieval list, represent the query and each passage as hybrid and collaborative sequences, and measure the relevance between the query and passages using these sequences from the perspective of collaboration.

Conclusion
We introduce HybRank, a hybrid and collaborative passage reranking method. HybRank extracts the similarities between texts via off-the-shelf retrievers to form hybrid and collaborative sequences as the representations of the query and passages. Efficient reranking is then based on these sequences, which incorporate the lexical and semantic properties of sparse and dense retrievers. Extensive experiments confirm the effectiveness of HybRank on arbitrary passage lists, and detailed ablation studies investigate the impact of its core components. We hope our work can provide inspiration for researchers in the field of information retrieval and steer more exploration of the collaboration and correlation between texts.

Limitations
We evaluate HybRank on the Natural Questions, MS MARCO and TREC 2019/2020 datasets, which focus on English open-domain question answering. Although none of the components in HybRank are specifically designed for English, the verification of HybRank on other languages is limited. Moreover, there are more general information retrieval tasks requiring diversity or broader coverage in the returned results. Considering the possible lack of the collaborative property, whether HybRank can generalize to these high-coverage retrieval tasks remains inconclusive.
As the Transformer encoder architecture is adopted in the sequence interaction and aggregation, the computation cost would be unacceptable when the length of the passage list or the number of anchors is too large. This is also the reason why we only conduct experiments with anchor numbers of no more than 100. Besides, HybRank only uses similarities computed by off-the-shelf retrievers as input features, and thus lacks sufficient interaction between the raw inputs; the performance of HybRank may be limited by the capability of the upstream retrievers. How to incorporate the interaction of raw inputs into HybRank while avoiding massive computation cost remains an open problem for further investigation.

Ethics Statement
This work focuses on improving the ranking results of passage retrieval systems. Retrieval is a fundamental component of many downstream tasks. However, it poses risks in terms of bias, misuse and misinformation due to its yet inaccurate results. Selection bias resulting from data collection, e.g., lexical bias, may exist in the adopted datasets. Additionally, as the reranking approach in this work is built upon off-the-shelf retrievers, bias may ensue from the upstream retrievers.

A Datasets Details
Natural Questions is under the CC BY-SA 3.0 license. MS MARCO and TREC 2019/2020 are under the CC BY-SA 4.0 license. The statistics of these datasets are presented in Table 5.

B Full Evaluation Results
We present the full evaluation results on Natural Questions, MS MARCO and TREC 2019/2020 in Tables 6 and 7.

C Reranking Cases
We present reranking cases in Figure 3 and Figure 4. The first line in each figure is the query sentence. We illustrate the distribution of positives in the passage list before and after reranking.
Blue squares indicate positive passages, while white squares stand for negative passages in the retrieval list. We only show the top-50 out of 100 passages in these lists due to space limitations. Following the positive distribution, we list the raw texts of several reranked passages for the question. As observed from the distribution visualization and the rank changes of passages, the positive distributions shift toward the front of the lists, in line with the quantitative analysis in Section 3.3. The ranks of many positive passages are raised by a large margin. Besides, it is apparent that positive passages tend to describe the same entities, events and relations, as discussed in Section 1. Case 1 in Figure 3 involves "the king of England", while case 2 in Figure 4 is about "Where's Waldo".

Figure 1: Illustration of the HybRank pipeline. For a specific query, the passage list is initialized by an arbitrary retriever and may have been reranked by another reranker before HybRank. We display a 5-passage list as an example. First, similarities between the query, passages and anchors are derived from the sparse and dense retrievers. Then, these similarities are converted to hybrid and collaborative sequences as the representations of the query and passages. Finally, these sequences are encoded into dense vectors via interaction and aggregation, and the reranking scores are obtained by the dot product between the dense vectors of the query and each passage.

Figure 2: Impact study on the number of anchor passages. We conduct experiments on the test set of Natural Questions with anchor numbers 5, 20, 40, 60, 80 and 100. The metric at anchor number 0 denotes the assessment of the initial retrieval list.

Table 3: Results of the ablation study on collaborative features, anchor-wise interaction and anchor passages on the test set of Natural Questions.

Table 4: Results of the ablation study on feature hybrid on the test set of Natural Questions.

Table 5: Statistics of the Natural Questions, MS MARCO and TREC 2019/2020 datasets.