RocketQAv2: A Joint Training Method for Dense Passage Retrieval and Passage Re-ranking

In various natural language processing tasks, passage retrieval and passage re-ranking are two key procedures in finding and ranking relevant information. Since both procedures contribute to the final performance, it is important to jointly optimize them in order to achieve mutual improvement. In this paper, we propose a novel joint training approach for dense passage retrieval and passage re-ranking. A major contribution is that we introduce dynamic listwise distillation, where we design a unified listwise training approach for both the retriever and the re-ranker. During the dynamic distillation, the retriever and the re-ranker can be adaptively improved according to each other’s relevance information. We also propose a hybrid data augmentation strategy to construct diverse training instances for the listwise training approach. Extensive experiments show the effectiveness of our approach on both MSMARCO and Natural Questions datasets. Our code is available at https://github.com/PaddlePaddle/RocketQA.


Introduction
Recently, dense passage retrieval has become an important approach in the task of passage retrieval (Karpukhin et al., 2020) to identify relevant contents from a large corpus. The underlying idea is to represent both queries and passages as low-dimensional vectors (a.k.a., embeddings), so that the relevance can be measured via embedding similarity. Additionally, a subsequent procedure of passage re-ranking is widely adopted to further improve the retrieval results by incorporating a re-ranker (Luan et al., 2021). Such a two-stage procedure is particularly useful in a variety of natural language processing tasks, including question answering (Xiong et al., 2020b), dialogue systems (Ji et al., 2014; Henderson et al., 2017) and entity linking (Gillick et al., 2019).
Following a retrieve-then-rerank pipeline, the dense retriever in passage retrieval and the re-ranker in passage re-ranking jointly contribute to the final performance. Although the two modules work as a pipeline during the inference stage, it has been found useful to train them in a correlated manner. For example, the retriever with a dual-encoder can be improved by distilling from the re-ranker with a more capable cross-encoder architecture, and the re-ranker can be improved with training instances generated from the retriever (Huang et al., 2020; Gao et al., 2021b). Therefore, there is increasing attention on correlating the training of the retriever and re-ranker in order to achieve mutual improvement (Metzler et al., 2021; Huang et al., 2020). Typically, these attempts train the two modules in an alternating way: fixing one module and then optimizing the other. It would be preferable to mutually improve the two modules in a joint training approach. However, the two modules are usually optimized in different ways, so joint learning cannot be trivially implemented.
* Equal contribution. The work was done when Ruiyang Ren was doing internship at Baidu. † Corresponding authors.
Specifically, the retriever is usually trained by sampling a number of in-batch negatives to maximize the probabilities of positive passages and minimize the probabilities of the sampled negatives (Xiong et al., 2020a; Karpukhin et al., 2020), where the model is learned by considering the entire list of positives and negatives (called the listwise approach). In comparison, the re-ranker is usually learned in a pointwise or pairwise manner (Nogueira and Cho, 2019; Nogueira et al., 2019b), where the model is learned based on a single passage or a pair of passages. To address this issue, our idea is to unify the learning approach for both retriever and re-ranker. Specifically, we adopt the listwise training approach for both retriever and re-ranker, where the relevance scores are computed according to a list of positive and negative passages. In addition, the listwise training approach is expected to include diverse and high-quality training instances, which can better represent the distribution of all the passages in the whole collection. Thus, it requires more effective data augmentation to construct the training instances for listwise training.
To this end, we present a novel joint training approach for dense passage retrieval and passage re-ranking (called RocketQAv2). The major contribution of our approach is the novel dynamic listwise distillation mechanism for jointly training the retriever and the re-ranker. Based on a unified listwise training approach, we can readily transfer relevance information between the two modules. Unlike previous distillation methods that usually freeze one module, our approach enables the two modules to adaptively learn relevance information from each other, which is the key to mutual improvement in joint training. Furthermore, we design a hybrid data augmentation strategy to generate diverse training instances for the listwise training approach.
The contributions of this paper can be summarized as follows:
• We propose a novel approach that jointly trains the dense passage retriever and passage re-ranker. It is the first time that joint training has been implemented for the two modules.
• We make two major technical contributions by introducing dynamic listwise distillation and hybrid data augmentation to support the proposed joint learning approach.
• Extensive experiments show the effectiveness of our proposed approach on both MSMARCO and Natural Questions datasets.

Related Work
Recently, dense passage retrieval has demonstrated better performance than traditional sparse retrieval methods (e.g., TF-IDF and BM25) on the task of passage retrieval. Existing approaches to learning a dense passage retriever can be divided into two categories: (1) self-supervised pre-training for retrieval (Guu et al., 2020) and (2) fine-tuning pre-trained language models (PLMs) on labeled data (Lu et al., 2020; Karpukhin et al., 2020; Xiong et al., 2020a; Luan et al., 2021). Our work follows the second class of approaches, which show better performance with less cost. There are two important tricks to train an effective dense retriever: (1) incorporating hard negatives during training (Karpukhin et al., 2020; Xiong et al., 2020a) and (2) distilling the knowledge from a cross-encoder-based re-ranker into a dual-encoder-based retriever (Izacard and Grave, 2020; Yang and Seo, 2020).
Based on the retrieved passages from a retriever, PLM-based re-rankers with the cross-encoder architecture have recently been applied to passage re-ranking to improve the retrieval results (Qiao et al., 2019; Nogueira and Cho, 2019; Yan et al., 2019), and yield substantial improvements over the traditional methods.
Apart from considering the above two tasks separately, it has been shown that passage retrieval and passage re-ranking are actually highly related and dependent (Huang et al., 2020; Khattab and Zaharia, 2020). The retriever needs to capture the relevance knowledge from the re-ranker, and the re-ranker should be specially optimized according to the preceding results of the retriever. Some efforts studied the possibility of leveraging the dependency between retriever and re-ranker, and tried to enhance the connection between them in an alternating way (Huang et al., 2020). Furthermore, several studies attempted to jointly train the retriever and the reader for open-domain question answering (Guu et al., 2020; Sachan et al., 2021; Karpukhin et al., 2020).
Different from the prior studies, our method is a joint learning architecture of the dense passage retriever and the re-ranker.

Methodology
In this section, we describe a novel joint training approach for dense passage retrieval and passage re-ranking (called RocketQAv2).

Overview
In this work, we consider two tasks including dense passage retrieval and passage re-ranking, which are described as follows.
Given a query $q$, the aim of dense passage retrieval is to retrieve the $k$ most relevant passages from a large collection of $M$ text passages. The dual-encoder (DE) architecture is widely adopted by prior works (Karpukhin et al., 2020; Luan et al., 2021), where two separate dense encoders $E_P(\cdot)$ and $E_Q(\cdot)$ are used to map passages and queries to $d$-dimensional real-valued vectors (a.k.a., embeddings), and then an index of all passage embeddings is built for efficient retrieval. The similarity between the query $q$ and the passage $p$ is defined using the dot product:

$$s_{de}(q, p) = E_Q(q)^{\top} \cdot E_P(p). \tag{1}$$

Given a list of candidate passages retrieved by a passage retriever, the aim of passage re-ranking is to further improve the retrieval results with a re-ranker, which estimates a relevance score $s(q, p)$ measuring the relevance level of a candidate passage $p$ to a query $q$. Among the implementations of the re-ranker, a cross-encoder (CE) based on PLMs usually achieves superior performance (Nogueira and Cho, 2019; Qiao et al., 2019), since it can better capture the semantic interactions between the passage and the query, but it requires more computation than the dual-encoder. To compute the relevance score $s_{ce}(q, p)$, a special token [SEP] is inserted between $q$ and $p$, and then the representation at the [CLS] token from the cross-encoder is fed into a learned linear function.
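The dot-product scoring used by the dual-encoder can be illustrated with a minimal sketch. It assumes the query and passage embeddings have already been produced by the two encoders; the toy vectors, the helper names `dot` and `retrieve_top_k`, and the brute-force scan are all hypothetical stand-ins (a real system would query a pre-built index of passage embeddings rather than scan every passage).

```python
from typing import List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product of two embeddings of the same dimension d."""
    if len(u) != len(v):
        raise ValueError("embeddings must share the same dimension")
    return sum(a * b for a, b in zip(u, v))


def retrieve_top_k(query_emb: Sequence[float],
                   passage_embs: Sequence[Sequence[float]],
                   k: int) -> List[Tuple[int, float]]:
    """Score every passage by its dot product with the query, keep the k best."""
    scored = [(idx, dot(query_emb, p)) for idx, p in enumerate(passage_embs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional embeddings standing in for encoder outputs (hypothetical).
query = [1.0, 0.0, 2.0]
passages = [
    [0.5, 0.1, 0.0],  # score 0.5
    [1.0, 0.0, 1.0],  # score 3.0
    [0.0, 3.0, 0.2],  # score 0.4
]
top2 = retrieve_top_k(query, passages, k=2)
print(top2)  # passage 1 first, then passage 0
```

Because passage embeddings do not depend on the query, they can be computed once offline, which is what makes the dual-encoder cheaper at inference time than the cross-encoder.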
Usually, the passage retriever and the passage re-ranker are learned in either a separate or an alternating way (i.e., fixing one and then training the other). To achieve joint training, we introduce dynamic listwise distillation (Section 3.2), which can adaptively improve both components in a joint optimization process. To support the listwise training, we further propose hybrid data augmentation (Section 3.3) for generating diverse and high-quality training instances. Based on the two major contributions, we present the learning procedure in Section 3.4 and related discussion in Section 3.5.

Dynamic Listwise Distillation
Since the re-ranker adopts the more capable cross-encoder architecture, it has become a common strategy to distill the knowledge from the re-ranker into the retriever. However, in prior studies (Karpukhin et al., 2020; Xiong et al., 2020a), the retriever and re-ranker are usually learned in different ways, and the parameters of the re-ranker are frozen, so the two components cannot be jointly optimized for mutual improvement. Considering this issue, we design a unified listwise training approach to learn both the retriever and the re-ranker, and dynamically update the parameters of both the re-ranker and the retriever during distillation. In this way, the two components can adaptively improve each other. We call this approach dynamic listwise distillation. Next, we describe its details.
Formally, given a query $q$ in a query set $Q$ and the corresponding list of candidate passages (instance list) $P_q = \{p_{q,i}\}_{1 \le i \le m}$ related to query $q$, we can obtain the relevance scores $S_{de}(q) = \{s_{de}(q, p)\}_{p \in P_q}$ and $S_{ce}(q) = \{s_{ce}(q, p)\}_{p \in P_q}$ of the query $q$ and the passages in $P_q$ from the dual-encoder-based retriever and the cross-encoder-based re-ranker, respectively. Then, we normalize them in a listwise way to obtain the corresponding relevance distributions over candidate passages:

$$\tilde{s}_{de}(q, p) = \frac{e^{s_{de}(q, p)}}{\sum_{p' \in P_q} e^{s_{de}(q, p')}}, \tag{2}$$

$$\tilde{s}_{ce}(q, p) = \frac{e^{s_{ce}(q, p)}}{\sum_{p' \in P_q} e^{s_{ce}(q, p')}}. \tag{3}$$

The main idea is to adaptively reduce the difference between the two distributions from the retriever and the re-ranker so that they mutually improve each other. To achieve this adaptive mutual improvement, we minimize the KL-divergence between the two relevance distributions $\{\tilde{s}_{de}(q, p)\}$ and $\{\tilde{s}_{ce}(q, p)\}$ from the two modules:

$$\mathcal{L}_{KL} = \sum_{q \in Q,\, p \in P_q} \tilde{s}_{de}(q, p) \cdot \log \frac{\tilde{s}_{de}(q, p)}{\tilde{s}_{ce}(q, p)}. \tag{4}$$

Additionally, we provide ground-truth guidance for the joint training.
Specifically, we also adopt a cross-entropy loss for the re-ranker based on passages in $P_q$ with supervised information:

$$\mathcal{L}_{sup} = -\frac{1}{N} \sum_{q \in Q,\, p^{+}} \log \frac{e^{s_{ce}(q, p^{+})}}{e^{s_{ce}(q, p^{+})} + \sum_{p^{-}} e^{s_{ce}(q, p^{-})}}, \tag{5}$$

where $N$ is the number of training instances, and $p^{+}$ and $p^{-}$ denote the positive passage and negative passage in $P_q$, respectively. We combine the KL-divergence loss and the supervised cross-entropy loss defined in Eq. (4) and Eq. (5) to obtain the final loss function:

$$\mathcal{L}_{final} = \mathcal{L}_{KL} + \mathcal{L}_{sup}. \tag{6}$$

Figure 1 illustrates dynamic listwise distillation. The re-ranker is optimized with labeled lists (Eq. (5)), and it produces relevance distributions to train the retriever (Eq. (4)). Unlike RocketQA, which relies on hard pseudo-labeled data, we utilize soft labels (i.e., estimated relevance distributions) for relevance distillation. Besides, we dynamically update the parameters of the re-ranker in order to adaptively synchronize the two modules for mutual improvement. To distinguish it from the previous static distillation based on pseudo labels, we call our method dynamic listwise distillation.
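To make the loss computation concrete, the following minimal sketch evaluates the listwise normalization, the KL-divergence term, and the supervised cross-entropy term for a single query. It only computes loss values from toy scores; in actual training the scores would come from the two models and an automatic-differentiation framework would propagate gradients into both. The function names, the toy scores, the single-positive assumption, and the choice of the retriever distribution as the first argument of the KL-divergence are assumptions of this sketch.

```python
import math
from typing import List, Sequence


def listwise_softmax(scores: Sequence[float]) -> List[float]:
    """Normalize raw relevance scores over one candidate list."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for two distributions over the same candidate list."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def supervised_loss(ce_scores: Sequence[float], positive_index: int) -> float:
    """Negative log-probability of the labeled positive under the re-ranker."""
    return -math.log(listwise_softmax(ce_scores)[positive_index])


def joint_loss(de_scores: Sequence[float], ce_scores: Sequence[float],
               positive_index: int) -> float:
    """KL term between the two listwise distributions plus the supervised term."""
    p_de = listwise_softmax(de_scores)
    p_ce = listwise_softmax(ce_scores)
    return kl_divergence(p_de, p_ce) + supervised_loss(ce_scores, positive_index)


# Toy scores for one query: one positive (index 0) and three hard negatives.
de_scores = [2.0, 1.0, 0.5, -1.0]   # hypothetical retriever scores
ce_scores = [4.0, 0.5, 0.0, -2.0]   # hypothetical re-ranker scores
loss = joint_loss(de_scores, ce_scores, positive_index=0)
print(round(loss, 4))
```

Over a batch, the per-query supervised terms would be averaged and the KL terms summed before being added together.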

Hybrid Data Augmentation
To perform dynamic listwise distillation, we need to generate the candidate passage list $P_q$ for query $q$. Since our approach relies on listwise training, we expect the candidate passage list to include diverse and high-quality candidate passages, which may better represent the distribution of all the passages in the whole collection. Prior works (Xiong et al., 2020a; Karpukhin et al., 2020) demonstrate that it is important to include hard negatives in the candidate passage list. Basically, ANCE (Xiong et al., 2020a) and DPR (Karpukhin et al., 2020) introduce randomly sampled hard negatives, while RocketQA incorporates denoised hard negatives. Inspired by prior works, we design a hybrid data augmentation method to construct diverse training instances by incorporating both random sampling and denoised sampling.
As shown in Figure 2, our proposed hybrid data augmentation includes both undenoised and denoised instances. First, we utilize the RocketQA retriever to retrieve the top-n passages from the corpus. For undenoised instances, we randomly sample the undenoised hard negatives from the retrieved passages and include ground-truth positives. For denoised instances, we utilize the RocketQA re-ranker to remove the predicted negatives with low confidence scores. We also include denoised positives that are predicted as positives by the RocketQA re-ranker with high confidence scores.
Compared with previous methods, our data augmentation method utilizes more ways (undenoised or denoised) to generate both positives and negatives, which improves the diversity of the instance list $P_q$. In particular, we mainly focus on including hard negatives. This is particularly important for dynamic listwise distillation, since weak negatives are easy to identify and therefore bring little additional gain to either module.
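The two kinds of instance lists described above can be sketched as follows. This is only an illustration under stated assumptions: the function names, the re-ranker probabilities, and the confidence thresholds (0.9 and 0.1) are hypothetical and are not taken from the paper. The sketch reads "removing predicted negatives with low confidence" as keeping only negatives that the re-ranker scores confidently low.

```python
import random
from typing import Dict, List, Set


def undenoised_instance(retrieved: List[str], gold: Set[str],
                        num_negatives: int, rng: random.Random) -> List[str]:
    """One ground-truth positive followed by randomly sampled hard negatives."""
    pool = [pid for pid in retrieved if pid not in gold]
    negatives = rng.sample(pool, min(num_negatives, len(pool)))
    return [rng.choice(sorted(gold))] + negatives


def denoised_instance(retrieved: List[str], reranker_prob: Dict[str, float],
                      num_negatives: int, rng: random.Random,
                      pos_threshold: float = 0.9,   # hypothetical threshold
                      neg_threshold: float = 0.1    # hypothetical threshold
                      ) -> List[str]:
    """Positive and negatives chosen only where the re-ranker is confident."""
    positives = [pid for pid in retrieved if reranker_prob[pid] >= pos_threshold]
    negatives = [pid for pid in retrieved if reranker_prob[pid] <= neg_threshold]
    if not positives:
        return []  # no confident positive, so no denoised instance for this query
    sampled = rng.sample(negatives, min(num_negatives, len(negatives)))
    return [rng.choice(positives)] + sampled


# Toy top-n retrieval result and re-ranker probabilities (hypothetical values).
retrieved = ["p1", "p2", "p3", "p4", "p5", "p6"]
gold = {"p2"}
reranker_prob = {"p1": 0.95, "p2": 0.97, "p3": 0.50,
                 "p4": 0.05, "p5": 0.02, "p6": 0.08}
rng = random.Random(0)
raw_list = undenoised_instance(retrieved, gold, num_negatives=3, rng=rng)
clean_list = denoised_instance(retrieved, reranker_prob, num_negatives=2, rng=rng)
print(raw_list, clean_list)
```

In this toy data the undenoised list may sample `p1` as a negative even though the re-ranker considers it relevant, while the denoised list never contains the ambiguous passage `p3`; mixing both kinds of lists is what gives the training data its diversity.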

Training Procedure
In this section, we present the training procedure of our approach, which is illustrated in Figure 3. We first initialize the retriever and re-ranker with the learned dual-encoder and cross-encoder of RocketQA. Then, we utilize the retriever and re-ranker in RocketQA to generate the training data via hybrid data augmentation in Section 3.3. Finally, we perform dynamic listwise distillation to jointly optimize the retriever and re-ranker following Section 3.2. During distillation, the retriever and re-ranker are mutually optimized according to the final retrieval performance. After the training stage, we can apply the retriever and re-ranker for inference in a pipeline.

Discussion
In this section, we discuss the comparison with RocketQA.
This work presents an extended contribution to RocketQA, called RocketQAv2. As seen above, RocketQAv2 reuses the network architecture and important training tricks of RocketQA. A significant improvement is that RocketQAv2 incorporates a joint training approach for both the retriever and the re-ranker via dynamic listwise distillation. For dynamic listwise distillation, RocketQAv2 designs a unified listwise training approach, and utilizes soft relevance labels for mutual improvement. Such a distillation mechanism is able to simplify the training process, and also provides the possibility of end-to-end training of the entire dense retrieval architecture.

Experiments
In this section, we first describe the experimental settings, then report the main experimental results, ablation study, and detailed analysis.

Experimental Setup
Datasets We adopt two public datasets on dense passage retrieval and passage re-ranking, including MSMARCO (Nguyen et al., 2016) and Natural Questions (Kwiatkowski et al., 2019). Table 1 lists the statistics of the datasets. MSMARCO was originally designed for multiple passage machine reading comprehension, and its queries were sampled from Bing search logs. Based on the queries and passages in MSMARCO Question Answering, MSMARCO Passage Ranking for passage retrieval and ranking was created. Natural Questions (NQ) was originally introduced for open-domain QA. This corpus consists of real queries from the Google search engine along with their long and short answer annotations from the top-ranked Wikipedia pages. DPR (Karpukhin et al., 2020) selected the queries that had short answers and processed all the Wikipedia articles as the collection of passages. In our experiments, we reuse the NQ version created by DPR.
Evaluation Metrics Following previous work, we adopt Mean Reciprocal Rank (MRR) and Recall at top k ranks (Recall@k) to evaluate the performance of passage retrieval. MRR calculates the averaged reciprocal of the rank at which the first positive passage is retrieved. Recall@k calculates the proportion of questions for which the top k retrieved passages contain positives.
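The two metrics can be computed as in the minimal sketch below. The function names and the toy rankings are hypothetical; the sketch assumes that each query has a ranked list of passage IDs and a set of positive passage IDs, and that a query whose first positive falls outside the top k contributes zero to MRR@k.

```python
from typing import Dict, List, Set


def mrr_at_k(rankings: Dict[str, List[str]],
             positives: Dict[str, Set[str]], k: int) -> float:
    """Mean reciprocal rank of the first positive passage within the top k."""
    total = 0.0
    for qid, ranked in rankings.items():
        for rank, pid in enumerate(ranked[:k], start=1):
            if pid in positives[qid]:
                total += 1.0 / rank
                break  # only the first positive counts
    return total / len(rankings)


def recall_at_k(rankings: Dict[str, List[str]],
                positives: Dict[str, Set[str]], k: int) -> float:
    """Fraction of queries whose top k passages contain at least one positive."""
    hits = sum(
        1 for qid, ranked in rankings.items()
        if any(pid in positives[qid] for pid in ranked[:k])
    )
    return hits / len(rankings)


# Toy rankings for three queries (hypothetical passage IDs).
rankings = {"q1": ["a", "b", "c"], "q2": ["d", "e", "f"], "q3": ["g", "h", "i"]}
positives = {"q1": {"b"}, "q2": {"d"}, "q3": {"z"}}

print(mrr_at_k(rankings, positives, k=3))     # (1/2 + 1 + 0) / 3 = 0.5
print(recall_at_k(rankings, positives, k=1))  # only q2 hits at rank 1
print(recall_at_k(rankings, positives, k=3))  # q1 and q2 hit within top 3
```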

Model Specifications
Our retriever and re-ranker largely follow ERNIE-2.0 base, which is a BERT-like model with 12-layer transformers and introduces a continual pre-training framework on multiple pre-training tasks. As described in the previous section, the retriever is initialized with the parameters of the dual-encoder in the first step of RocketQA, and the re-ranker is initialized with the parameters of the cross-encoder in the second step of RocketQA.

Results on Passage Retrieval
In this part, we first describe the baselines for comparison, then report the results on passage retrieval.

Baselines
For a comprehensive comparison, we choose as baselines state-of-the-art approaches covering both sparse and dense passage retrievers. The sparse retrievers include the traditional retriever BM25 (Yang et al., 2017) and six traditional retrievers enhanced by neural networks, including doc2query (Nogueira et al., 2019c), DeepCT (Dai and Callan, 2019), docTTTTTquery (Nogueira et al., 2019a), GAR, UHD-BERT (Jang et al., 2021) and COIL (Gao et al., 2021a). Both doc2query and docTTTTTquery employ neural query generation to expand documents. In contrast, GAR employs neural generation models to expand queries, and UHD-BERT is empowered by extremely high dimensionality and controllable sparsity. Different from them, DeepCT and COIL utilize BERT to learn the term weights or inverted lists.

Results
The results of different passage retrieval methods are presented in Table 2. It can be observed that: (1) Among all methods, the RocketQAv2 retriever and PAIR outperform the other baselines by a large margin. PAIR is a contemporaneous work with RocketQAv2, which obtains improvement by pre-training on out-of-domain data. RocketQAv2 outperforms PAIR on MRR@10 and Recall@5, which we attribute to dynamic listwise distillation enabling the retriever to capture the re-ranker's ability of passage ranking at top ranks. Our model is trained with complete in-domain training data. Different from the baselines, we adopt a listwise training approach to jointly train both retriever and re-ranker and couple the two models by dynamic listwise distillation with hybrid data augmentation.

Table 3: Passage re-ranking performance on the MSMARCO dataset.
Method                                  PLM         #Candidates  Retriever   MRR@10
(Khattab and Zaharia, 2020)             BERTbase    1000         BM25        34.9
BERTlarge (Nogueira and Cho, 2019)      BERTlarge   1000         BM25        36.5
RepBERT (Zhan et al., 2020)             BERTlarge   1000         RepBERT     37.7
Multi-stage (Nogueira et al., 2019b)    BERTbase    1000         BM25        39.0
CAKD (Hofstätter et al., 2020)          DistilBERT  1000         BM25        39.0
ME-BERT (Luan et al., 2021)             BERTlarge   1000         ME-BERT     39.5
ME-HYBRID (Luan et al., 2021)           BERTlarge   1000         ME-HYBRID   39.4
TFR-BERT                                BERTlarge   1000         BM25        40.5
RocketQA                                ERNIEbase   50           RocketQA    40.9
RocketQAv2 (re-ranker)
(2) We notice that different PLMs are used in different approaches, as shown in the second column of Table 2. In our approach, we use ERNIE base as the backbone model. To examine the effect of the backbone model, we replace the BERT base used in DPR with ERNIE base, and name the resulting model DPR-E. We observe that although both methods employ the same backbone PLM, our method significantly outperforms DPR-E, indicating that the backbone PLM is not the main factor behind the improvement.
(3) Among sparse retrievers, we find that COIL outperforms other methods, and it appears to be a robust sparse baseline that performs well on the two datasets. We also observe that sparse retrievers overall perform worse than dense retrievers. Such a finding has also been reported in prior studies (Xiong et al., 2020a; Luan et al., 2021), and it indicates the effectiveness of the dense retrieval approach.

Results on Passage Re-ranking
In this part, we first describe the baselines for comparison, then report the results on passage re-ranking.
Among these methods, BM25 is a term-based method, and the rest are BERT-based neural methods. Since RocketQA does not report re-ranking results, we use the open-source re-ranker in the RocketQA repository for evaluation. For comparison, we report the results of the RocketQAv2 re-ranker based on the BM25 retriever with 1000 candidates, the RocketQA retriever with 50 candidates and the RocketQAv2 retriever with 50 candidates.
The prior works follow the two-stage approach (i.e., retrieve-then-rerank), where a passage retriever retrieves a (usually large) list of candidates from the passage collection in the first stage. In the second stage, a more expensive model (e.g., BERT-based cross-encoder) re-ranks the candidates. Note that the retrievers in baseline models may be differently designed. Table 3 summarizes the passage re-ranking performance of the RocketQAv2 re-ranker and all baselines on the MSMARCO dataset.
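The two-stage setup can be sketched as a small pipeline in which a first-stage ranking is truncated to a fixed number of candidates and then re-ordered by a more expensive scoring function. Everything here is a hypothetical stand-in: `toy_overlap_score` is a simple term-overlap function used in place of a cross-encoder, and the passages and first-stage ranking are invented for illustration.

```python
from typing import Callable, Dict, List


def retrieve_then_rerank(query: str,
                         first_stage_ranking: List[str],
                         passages: Dict[str, str],
                         rerank_score: Callable[[str, str], float],
                         num_candidates: int) -> List[str]:
    """Keep the top candidates from the retriever, then sort them by re-ranker score."""
    shortlist = first_stage_ranking[:num_candidates]
    return sorted(shortlist,
                  key=lambda pid: rerank_score(query, passages[pid]),
                  reverse=True)


def toy_overlap_score(query: str, passage: str) -> float:
    """Hypothetical stand-in for a cross-encoder: fraction of query terms in the passage."""
    q_terms = set(query.lower().split())
    p_terms = set(passage.lower().split())
    return len(q_terms & p_terms) / max(len(q_terms), 1)


passages = {
    "p1": "rockets use liquid fuel",
    "p2": "dense passage retrieval with dual encoders",
    "p3": "passage retrieval and passage re-ranking pipeline",
}
query = "passage re-ranking pipeline"
first_stage = ["p1", "p2", "p3"]  # hypothetical retriever output, best first

print(retrieve_then_rerank(query, first_stage, passages, toy_overlap_score, 3))
print(retrieve_then_rerank(query, first_stage, passages, toy_overlap_score, 2))
```

With three candidates the re-ranker can promote `p3` to the top, but with only two candidates `p3` never reaches the second stage, which illustrates why the number of candidates and the quality of the first-stage retriever both affect re-ranking results.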

Results
As we can see, the RocketQAv2 re-ranker significantly outperforms all the competitive methods, demonstrating that the re-ranker benefits from our joint learning process, in which it is optimized to fit the relevance distribution of the retriever with dynamic listwise distillation. Moreover, if we use the RocketQAv2 re-ranker to replace the RocketQA re-ranker and apply it to the retrieval results of the RocketQA retriever, the RocketQAv2 re-ranker brings a 0.9 percentage point improvement compared with the RocketQA re-ranker. This also demonstrates the effectiveness of the RocketQAv2 re-ranker. Additionally, if we apply the RocketQAv2 re-ranker to the top 1000 candidates retrieved by BM25, the performance is significantly better than that of other base models, and comparable to that of other large models.

Detailed Analysis
Apart from the above results, we also conduct a detailed analysis of both dynamic listwise distillation and hybrid data augmentation.

Analysis on Distillation
In this section, we analyze how the retriever's results change when the optimization form in dynamic listwise distillation is replaced.
Dynamic or Static? To examine the effect of dynamic optimization in distillation, we utilize a well-trained cross-encoder-based re-ranker as a teacher model to perform static distillation, and compare it with dynamic listwise distillation. During static distillation, the parameters of the re-ranker are not updated and the retriever captures the relevance knowledge from the re-ranker in a traditional knowledge distillation manner. As shown in Table 4, training with static distillation brings a performance drop. It demonstrates that dynamic optimization of both retriever and re-ranker enables them to share relevance distributions with each other and brings a significant performance improvement.
Listwise or Pointwise? To study the effect of the listwise training approach, we replace it with the pointwise training approach for the re-ranker during joint training. In this case, the training approaches of the retriever and the re-ranker are actually different. The re-ranker is mainly optimized by the pointwise relevance scores of instances in $P_q$, while the retriever has to learn the relevance from the re-ranker in a listwise way. Table 4 shows that the pointwise training approach brings a performance drop. It demonstrates that the listwise training approach is more suitable for our joint training architecture than the pointwise one, since it can better simulate the relevance distribution in dynamic listwise distillation.

Analysis on Hybrid Data Augmentation
In this section, we conduct a detailed analysis for the hybrid data augmentation.
The Effect of Denoised Instances In order to examine the effect of hybrid training data, we remove the denoised instances from the training data and only use the undenoised data for joint training. Table 4 shows a performance drop in terms of MRR@10 without denoised instances, which indicates that training data generated in different ways better represent the distribution of all the passages in the whole collection, and improve the performance especially on the metrics at top ranks.

The Number of Hard Negatives
In hybrid data augmentation, we focus on obtaining diverse hard negatives. In our experiments, we find that the number of hard negatives significantly affects the performance of our joint training approach. As described in the previous section, for each query, we sample one positive instance, and the rest of the instances in the instance list $P_q$ are hard negatives. Thus, the effect of the number of hard negatives should be equivalent to the effect of the number of instances. Figure 4 shows the effect of the number of instances on both passage retrieval and passage re-ranking. From Figure 4, we can observe that a larger number of instances (i.e., number of hard negatives) improves the performance. The result demonstrates that an instance list $P_q$ with more instances can better represent the distribution of all the passages in the whole collection.
Incorporation of In-batch Negatives For further study, we examine the effect of in-batch negatives in the joint training process. Besides the hard negatives, we incorporate in-batch sampling during the joint training process, which can increase the number of negatives for each query. Although the queries have additional in-batch negatives, we did not observe any performance improvement.

Conclusion
This paper has presented a novel joint training approach for dense passage retrieval and passage re-ranking. To implement the joint training, we have made two important technical contributions, namely dynamic listwise distillation and hybrid data augmentation. Such an approach is able to enhance the mutual improvement between the retriever and the re-ranker, and can also simplify the training process. Extensive results have demonstrated the effectiveness of our approach. To our knowledge, it is the first time that the retriever and re-ranker are jointly trained in a unified architecture, which provides the possibility of training the entire retrieval architecture in an end-to-end way.