PAIR: Leveraging Passage-Centric Similarity Relation for Improving Dense Passage Retrieval

Recently, dense passage retrieval has become a mainstream approach to finding relevant information in various natural language processing tasks. A number of studies have been devoted to improving the widely adopted dual-encoder architecture. However, most of the previous studies only consider query-centric similarity relation when learning the dual-encoder retriever. In order to capture more comprehensive similarity relations, we propose a novel approach that leverages both query-centric and PAssage-centric sImilarity Relations (called PAIR) for dense passage retrieval. To implement our approach, we make three major technical contributions by introducing formal formulations of the two kinds of similarity relations, generating high-quality pseudo labeled data via knowledge distillation, and designing an effective two-stage training procedure that incorporates passage-centric similarity relation constraint. Extensive experiments show that our approach significantly outperforms previous state-of-the-art models on both MSMARCO and Natural Questions datasets.


Introduction
With the recent advances of pre-trained language models, dense passage retrieval techniques (representing queries and passages in a low-dimensional semantic space) have significantly outperformed traditional term-based techniques (Guu et al., 2020; Karpukhin et al., 2020). As the key step in finding relevant information, dense passage retrieval has been shown to effectively improve performance in a variety of tasks, including question answering (Xiong et al., 2020b), information retrieval (Luan et al., 2021; Khattab and Zaharia, 2020), dialogue (Ji et al., 2014; Henderson et al., 2017) and entity linking (Gillick et al., 2019).

* Equal contribution. † The work was done when Ruiyang Ren was doing an internship at Baidu. ‡ Corresponding authors. Our code is available at https://github.com/PaddlePaddle/RocketQA

Figure 1: An illustrative case of a query q, its positive passage p+ and negative passage p−: (a) Query-centric similarity relation enforces s(q, p+) > s(q, p−); (b) Passage-centric similarity relation further enforces s(p+, q) > s(p+, p−), where s(p+, q) = s(q, p+). We use the distance (i.e., dissimilarity) for visualization: the longer the distance is, the less similar it is.

Typically, the dual-encoder architecture is used to learn dense representations of queries and passages, and the dot-product similarity between the two representations serves as the ranking measurement for retrieval. A number of studies have been devoted to improving this architecture (Guu et al., 2020; Karpukhin et al., 2020; Xiong et al., 2020a) for dense passage retrieval. Previous studies mainly consider learning a query-centric similarity relation, which tries to increase the similarity s(q, p+) between a query and a positive (i.e., relevant) passage while decreasing the similarity s(q, p−) between the query and a negative (i.e., irrelevant) passage.
We argue that query-centric similarity relation ignores the relation between passages, and it brings difficulty to discriminate between positive and negative passages. To illustrate this, we present an example in Figure 1, where a query q and two passages p + and p − are given. As we can see in Figure 1(a), although query-centric similarity relation can enforce s(q, p + ) > s(q, p − ) and identify the positive passages in this case, the distance (i.e., dissimilarity) between positive and negative passages is small. When a new query is issued, it is difficult to discriminate between positive passage p + and negative passage p − .
Considering this problem, we propose to further learn a passage-centric similarity relation to enhance the dual-encoder architecture. The basic idea is shown in Figure 1(b), where we set an additional similarity relation constraint s(p+, q) > s(p+, p−): the similarity between query q and positive passage p+ should be larger than that between positive passage p+ and negative passage p−. In this way, the model can better learn the similarity relations among the query, positive passages and negative passages. Although the idea is appealing, it is not easy to implement due to three major issues. First, it is unclear how to formalize and learn both query-centric and passage-centric similarity relations. Second, it requires large-scale and high-quality training data to incorporate the passage-centric similarity relation. However, it is expensive to manually label data. Additionally, there might be a large number of unlabeled positives even in the existing manually labeled datasets (Qu et al., 2020), and it is likely to bring false negatives when sampling hard negatives. Finally, learning the passage-centric similarity relation (an auxiliary task) is not directly related to the query-centric similarity relation (the target task). From a multi-task learning viewpoint, multi-task models often perform worse than their single-task counterparts (Alonso and Plank, 2017; McCann et al., 2018; Clark et al., 2019). Hence, the training procedure needs a more elaborate design.
To this end, in this paper, we propose a novel approach that leverages both query-centric and PAssage-centric sImilarity Relations (called PAIR) for dense passage retrieval. In order to address the aforementioned issues, we have made three important technical contributions. First, we design formal loss functions to characterize both query-centric and passage-centric similarity relations. Second, we propose to generate pseudo-labeled data via knowledge distillation. Third, we devise a two-stage training procedure that utilizes the passage-centric similarity relation during pre-training and then fine-tunes the dual-encoder according to the task goal. The improvements in these three aspects make it possible to effectively leverage both kinds of similarity relations for improving dense passage retrieval.
The contributions of this paper can be summarized as follows: • We propose an approach that simultaneously learns query-centric and passage-centric similarity relations for dense passage retrieval. It is the first time that passage-centric similarity relation has been considered for this task.
• We make three major technical contributions by introducing formal formulations, generating high-quality pseudo-labeled data and designing an effective training procedure.
• Extensive experiments show that our approach significantly outperforms previous state-of-the-art models on both MSMARCO and Natural Questions datasets.

Related Work
Recently, dense passage retrieval has demonstrated better performance than traditional sparse retrieval methods (e.g., TF-IDF and BM25). Different from sparse retrieval, dense passage retrieval represents queries and passages as low-dimensional vectors (Zhao et al., 2022; Guu et al., 2020; Karpukhin et al., 2020), typically in a dual-encoder architecture, and uses the dot product as the similarity measurement for retrieval. The existing approaches for dense passage retrieval can be divided into two categories: (1) unsupervised pre-training for retrieval, and (2) fine-tuning only on labeled data. In the first category, different pre-training tasks for retrieval were proposed. Lee et al. (2019) proposed a specific approach to pre-training the retriever with an unsupervised task, namely the Inverse Cloze Task (ICT), and then jointly fine-tuned the retriever and a reader on labeled data. REALM (Guu et al., 2020) proposed a new pre-training approach, which jointly trained a masked language model and a neural retriever. Different from these, our proposed approach utilizes pseudo-labeled data generated via knowledge distillation in the pre-training stage, and the quality of the generated data is high (see Section 4.6).
In the second category, the existing approaches fine-tune pre-trained language models on labeled data (Karpukhin et al., 2020; Luan et al., 2021). Both DPR (Karpukhin et al., 2020) and ME-BERT (Luan et al., 2021) used in-batch random sampling and hard negative sampling by BM25, while ANCE (Xiong et al., 2020a), NPRINC (Lu et al., 2020) and RocketQA (Qu et al., 2020) explored more sophisticated hard negative sampling approaches. Izacard and Grave (2020) and Yang et al. (2020) leveraged a reader and a cross-encoder, respectively, for knowledge distillation on labeled data. RocketQA found that a large batch size can significantly improve the retrieval performance of dual-encoders. ColBERT (Khattab and Zaharia, 2020) incorporated light-weight attention-based re-ranking at the cost of increased space complexity.
The existing studies mainly focus on learning the similarity relation between queries and passages, while ignoring the relation among passages. This makes it difficult for the model to discriminate between positive and negative passages. In this paper, we propose an approach that simultaneously learns query-centric and passage-centric similarity relations.

Methodology
In this section, we present an approach that leverages both query-centric and PAssage-centric sImilarity Relations (called PAIR) for dense passage retrieval.

Overview
The task of dense passage retrieval (Karpukhin et al., 2020) is described as follows. Given a query q, we aim to retrieve the k most relevant passages {p_j}_{j=1}^{k} from a large collection of M passages. For this task, the dual-encoder architecture is widely adopted (Karpukhin et al., 2020; Qu et al., 2020), where two separate encoders E_Q(·) and E_P(·) are used to represent the query q and the passage p as d-dimensional vectors. A dot product then measures the similarity between q and p based on their embeddings:

s(q, p) = E_Q(q)^T · E_P(p).    (1)

Previous studies mainly capture the query-centric similarity relation. As shown in Figure 1, the passage-centric similarity relation reflects important evidence for improving retrieval performance. Therefore, we extend the original query-centric learning framework by leveraging the passage-centric similarity relation.
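As a minimal illustration, the dot-product ranking in Eq. (1) can be sketched with toy numpy vectors standing in for the encoder outputs E_Q(q) and E_P(p); the embeddings and dimensionality here are made up for illustration, not produced by an actual encoder:

```python
import numpy as np

def similarity(query_emb, passage_embs):
    """Dot-product similarity s(q, p) between one d-dimensional query
    embedding and a batch of passage embeddings, as in Eq. (1)."""
    return passage_embs @ query_emb  # shape: (num_passages,)

# Toy 4-dimensional embeddings (stand-ins for encoder outputs).
q = np.array([1.0, 0.0, 1.0, 0.0])
passages = np.array([
    [1.0, 0.0, 1.0, 0.0],  # identical to the query -> highest score
    [0.0, 1.0, 0.0, 1.0],  # orthogonal -> lowest score
])
scores = similarity(q, passages)
ranking = np.argsort(-scores)  # passage indices, most similar first
```

At retrieval time, the top-k entries of `ranking` would be returned as the candidate passages.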
To develop our approach, we need to address the issues described in Section 1, and we consider three aspects to extend. First, we design a new loss function that considers both query-centric and passage-centric similarity relations. Second, we utilize knowledge distillation to obtain large-scale and high-quality pseudo-labeled data to capture more comprehensive similarity relations. Third, we design a two-stage training procedure to effectively learn the passage-centric similarity relation and improve the final retrieval performance.

Defining the Loss Functions
Our approach considers two kinds of losses, namely query-centric loss and passage-centric loss, as shown in Figure 2. The two kinds of losses are characterized by the two different similarity relations, query-centric similarity relation and passage-centric similarity relation.

Query-centric Loss
The query-centric similarity relation regards the query q as the center and pushes the negative passage p− farther away than the positive passage p+. That is:

s^(Q)(q, p+) > s^(Q)(q, p−),    (2)

where s^(Q)(q, p+) and s^(Q)(q, p−) represent the similarities of the relevant and irrelevant passages to query q, and they are defined the same as s(q, p) in Eq. (1). Following Karpukhin et al. (2020) and Qu et al. (2020), we learn the query-centric similarity relation by optimizing the query-centric loss, i.e., the negative log likelihood of the positive passage:

L_Q = − log [ exp(s^(Q)(q, p+)) / ( exp(s^(Q)(q, p+)) + Σ_{p−} exp(s^(Q)(q, p−)) ) ].    (3)

As shown in Figure 1, for a given query there might exist negative passages similar to the positive passage, making it difficult to discriminate between positive and negative passages. Hence, we further incorporate the passage-centric loss to address this issue.
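The query-centric loss can be sketched in a few lines of numpy; the similarity values below are toy numbers rather than encoder outputs:

```python
import numpy as np

def query_centric_loss(s_pos, s_negs):
    """Eq. (3): negative log likelihood of the positive passage, with
    the softmax normalized over the positive and the sampled negatives.

    s_pos:  similarity s(q, p+) for one query (a float)
    s_negs: similarities s(q, p-) to the sampled negative passages"""
    logits = np.array([s_pos] + list(s_negs))
    # log-sum-exp with max subtraction for numerical stability
    m = logits.max()
    log_z = m + np.log(np.exp(logits - m).sum())
    return log_z - s_pos  # = -log softmax(logits)[0]

loss = query_centric_loss(s_pos=5.0, s_negs=[1.0, 0.5])
```

The loss approaches zero as s(q, p+) grows relative to every s(q, p−), which is exactly the ordering that Eq. (2) asks for.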
Passage-centric Loss
The aim of learning the passage-centric similarity relation is to push the negative passage p− farther from the positive passage p+, making the similarity between positive passage p+ and query q larger than the similarity between positive passage p+ and negative passage p−. Formally, we introduce the following passage-centric similarity relation:

s^(P)(p+, q) > s^(P)(p+, p−),    (4)

where s^(P)(p+, q) and s^(P)(p+, p−) are defined via the dot product of the corresponding embeddings, as in Eq. (1). Similarly, we learn the passage-centric similarity relation by optimizing the passage-centric loss, i.e., the negative log likelihood of the query:

L_P = − log [ exp(s^(P)(p+, q)) / ( exp(s^(P)(p+, q)) + Σ_{p−} exp(s^(P)(p+, p−)) ) ].    (5)

By comparing Eq. (3) and Eq. (5), we can observe that the difference between the two losses lies in the normalization part: Eq. (3) normalizes over the negatives of the query, while Eq. (5) normalizes over the negatives of the positive passage.

The Combined Loss
We present an illustrative sketch of the above two loss functions in Figure 2. Next, we propose to simultaneously learn both the query-centric and passage-centric similarity relations in Eq. (2) and Eq. (4). Therefore, we combine the query-centric and passage-centric loss functions defined in Eq. (3) and Eq. (5) to obtain the final loss function:

L = (1 − α) · L_Q + α · L_P,    (6)

where α is a hyper-parameter tuned in experiments. By considering the passage-centric similarity relation, our approach becomes more capable of discriminating between a positive passage and a highly similar yet irrelevant passage, as shown in Figure 1(b).
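A minimal numpy sketch of the combined objective, assuming the convex-combination form L = (1 − α)·L_Q + α·L_P with the positive similarity placed in position 0 of each logits list; the values are illustrative:

```python
import numpy as np

def nll_of_first(logits):
    """-log softmax(logits)[0]: NLL of the item in position 0."""
    logits = np.asarray(logits, dtype=float)
    m = logits.max()
    return (m + np.log(np.exp(logits - m).sum())) - logits[0]

def pair_loss(q_logits, p_logits, alpha=0.1):
    """Combined loss L = (1 - alpha) * L_Q + alpha * L_P  (Eq. (6)).

    q_logits: [s(q, p+), s(q, p-_1), ...]   -> query-centric loss, Eq. (3)
    p_logits: [s(p+, q), s(p+, p-_1), ...]  -> passage-centric loss, Eq. (5)"""
    return (1 - alpha) * nll_of_first(q_logits) + alpha * nll_of_first(p_logits)

loss = pair_loss([3.0, 1.0], [2.0, 2.0], alpha=0.1)
```

With alpha = 0 the objective reduces to the pure query-centric loss, and with alpha = 1 to the pure passage-centric loss, so the hyper-parameter interpolates between the two relations.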

Dual-encoder with Shared Parameters
Most of the existing studies equip the dual-encoder with two separate encoders (E_Q and E_P) for queries and passages, respectively (see Eq. (1)). In this case, the two encoders may project queries and passages into two different spaces. However, to simultaneously model the query-centric and passage-centric similarity relations, the representations of queries and passages should be in the same space. Otherwise, the similarity between passages and the similarity between queries and passages are not comparable. Therefore, we propose using encoders that share the same parameters and structure for both queries and passages, i.e., E_Q(·) = E_P(·).
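A toy sketch of the shared-parameter idea: one parameter table encodes both queries and passages, so both representations live in the same space. The bag-of-embeddings encoder below is a made-up stand-in for the shared ERNIE encoder, not the paper's implementation:

```python
import numpy as np

class SharedEncoder:
    """Toy shared dual-encoder: one set of parameters encodes both
    queries and passages, so E_Q(.) = E_P(.) by construction."""

    def __init__(self, vocab_size=1000, dim=8, seed=0):
        rng = np.random.default_rng(seed)
        self.table = rng.normal(size=(vocab_size, dim))  # shared parameters

    def encode(self, token_ids):
        # Mean-pool token embeddings into one d-dimensional vector.
        return self.table[np.asarray(token_ids)].mean(axis=0)

enc = SharedEncoder()
E_Q = E_P = enc.encode          # queries and passages use the same encoder
q_vec = E_Q([1, 2, 3])
p_vec = E_P([1, 2, 3])          # identical text -> identical vector
```

Because both sides go through the same parameters, s(q, p+) and s(p+, p−) are dot products in the same space and can be compared directly, which Eq. (4) requires.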

Generating the Pseudo-labeled Training Data via Knowledge Distillation
By optimizing both the query-centric loss and the passage-centric loss, we can capture more comprehensive similarity relations. However, more similarity relation constraints require large-scale and high-quality training data for optimization. Additionally, there might be a large number of unlabeled positives even in the existing manually labeled datasets (Qu et al., 2020), and it is likely to bring false negatives when sampling hard negatives. Hence, we propose to generate pseudo-labeled training data via knowledge distillation.

Cross-encoder Teacher Model
The teacher model is used to generate large-scale pseudo-labeled data. Following RocketQA (Qu et al., 2020), we adopt the cross-encoder architecture to implement the teacher, which takes as input the concatenation of a query and a passage and models the semantic interaction between the query and passage representations. Such an architecture has been demonstrated to be more effective than the dual-encoder architecture in characterizing query-passage relevance. We follow Qu et al. (2020) to train the cross-encoder teacher on the labeled data.
Generating Pseudo Labels
In this paper, we follow Qu et al. (2020) to obtain positives and hard negatives for unlabeled queries. First, we retrieve the top-k candidate passages for each unlabeled query from the corpus with an efficient retriever, DPR (Karpukhin et al., 2020), and score them with the well-trained cross-encoder (i.e., the teacher model). We set two thresholds s_pos and s_neg (s_pos > s_neg) for positives and hard negatives, respectively. Then, for each query, a candidate passage with a score above s_pos or below s_neg is labeled as positive or negative, respectively. Note that we also apply this procedure to the labeled corpus to obtain more positives and reliable hard negatives, because there might be a large number of unlabeled positives even in the existing manually labeled datasets (Qu et al., 2020), and sampling hard negatives from them is likely to bring false negatives.
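The thresholding step can be sketched as a simple filter over cross-encoder scores; the passage IDs and score values below are hypothetical:

```python
def pseudo_label(candidates, s_pos=0.9, s_neg=0.1):
    """Split cross-encoder-scored candidates into positives and hard
    negatives using the two score thresholds; candidates scored between
    the thresholds are discarded as unreliable.

    candidates: iterable of (passage_id, cross_encoder_score) pairs"""
    positives = [pid for pid, s in candidates if s > s_pos]
    negatives = [pid for pid, s in candidates if s < s_neg]
    return positives, negatives

# Hypothetical teacher scores for three retrieved candidates of one query.
cands = [("p1", 0.97), ("p2", 0.55), ("p3", 0.03)]
pos, neg = pseudo_label(cands)
```

Here "p2" is dropped: its score falls between s_neg and s_pos, so the teacher is not confident enough either way, which is how a strict threshold setting keeps the pseudo labels clean.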

Two-stage Training Procedure
Although the passage-centric loss (Eq. (5)) is able to incorporate additional relevance evidence, it is not directly related to the final task goal (i.e., the query-centric similarity relation). Therefore, we design a two-stage training procedure that incorporates the passage-centric loss in the pre-training stage and then optimizes only the task-specific loss (i.e., the query-centric loss) in the fine-tuning stage. We present an illustration of the two-stage training procedure in Figure 3. Next, we describe the detailed training procedure.
Pre-training In the pre-training stage, we train the dual-encoder by optimizing the loss function L in Eq. (6) (i.e., a combination of the query-centric and passage-centric losses). The pseudo-labeled data from the unlabeled corpus is adopted as the pre-training data (Section 3.3).
Fine-tuning In the fine-tuning stage, we only fine-tune the dual-encoder (pre-trained in the first stage) according to the query-centric loss L Q in Eq. (3). In this way, our approach focuses on learning the task-specific loss, yielding better retrieval performance. In this stage, we use both ground-truth labels and pseudo labels derived from the labeled corpus for training.
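The two-stage procedure can be summarized as a control-flow sketch; the model, data iterables and loss callables below are placeholders rather than the paper's actual training code:

```python
def two_stage_training(model, pretrain_data, finetune_data,
                       combined_loss, query_loss, epochs=(1, 1)):
    """Schematic two-stage procedure: pre-train on pseudo-labeled data
    with the combined loss (Eq. (6)), then fine-tune on ground-truth
    plus pseudo labels with the query-centric loss (Eq. (3)) only.
    All arguments are stand-ins for the real training components."""
    log = []
    for _ in range(epochs[0]):                 # Stage 1: pre-training
        for batch in pretrain_data:
            log.append(("pretrain", combined_loss(model, batch)))
    for _ in range(epochs[1]):                 # Stage 2: fine-tuning
        for batch in finetune_data:
            log.append(("finetune", query_loss(model, batch)))
    return log

# Stub run that only exercises the control flow.
trace = two_stage_training(model=None,
                           pretrain_data=[1, 2], finetune_data=[3],
                           combined_loss=lambda m, b: 0.0,
                           query_loss=lambda m, b: 0.0)
```

The point of the structure is that the passage-centric constraint shapes the representation space first, and the final optimization target is purely the task-specific (query-centric) loss.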

Experiments
In this section, we first describe the experimental settings, then report the main experimental results, ablation study and detailed analysis.

Figure 3: Overview of the proposed two-stage method: pre-training on pseudo labels, then fine-tuning with learning the query-centric similarity relation (QSR).

Experimental Settings
Datasets This paper focuses on the passage retrieval task. We conduct experiments on two public datasets: MSMARCO (Nguyen et al., 2016) and Natural Questions (Kwiatkowski et al., 2019). The statistics of the datasets are listed in Table 1. For pre-training data, we collect about 1.8 million unlabeled queries from Yahoo! Answers, ORCAS (Craswell et al., 2020), SQuAD (Rajpurkar et al., 2016), TriviaQA (Joshi et al., 2017) and HotpotQA (Yang et al., 2018). In the pre-training stage, we reuse the passage collections from the labeled corpora (MSMARCO and NQ).

Implementation Details
We conduct experiments with the deep learning framework PaddlePaddle (Ma et al., 2019) on up to eight NVIDIA Tesla V100 GPUs (with 32G RAM).

Pre-trained LMs
The dual-encoder is initialized with the parameters of ERNIE-2.0 base (Sun et al., 2020). ERNIE-2.0 has the same network structure as BERT (Devlin et al., 2019), and it introduces a continual pre-training framework with multiple pre-training tasks. The cross-encoder setting follows the cross-encoder in RocketQA (Qu et al., 2020).

Hyper-parameters
(a) Batch size: Our dual-encoder is trained with a batch size of 512 × 1 in the fine-tuning stage on NQ and 512 × 8 in other settings. We use the in-batch negative setting (Karpukhin et al., 2020).

Optimizers
We use the LAMB optimizer (You et al., 2020) to train the dual-encoder on MSMARCO, which is more suitable for the cross-batch negative setting. In other settings, we always use the ADAM optimizer (Kingma and Ba, 2015).
The coefficient α is a hyper-parameter that balances the query-centric and passage-centric losses (Eq. (6)). We searched for α from 0 to 1 with a step size of 0.1, and the model achieves the best performance when α is set to 0.1.
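The grid search over α can be sketched as follows; `evaluate` is a placeholder for training the retriever with a given α and measuring dev-set performance (e.g., MRR@10), and the toy evaluator below is rigged to peak at α = 0.1 purely for illustration:

```python
def grid_search_alpha(evaluate, step=0.1):
    """Search alpha over [0, 1] with a fixed step and return the value
    with the best score; higher `evaluate` output is assumed better."""
    steps = int(round(1 / step))
    alphas = [round(i * step, 1) for i in range(steps + 1)]
    return max(alphas, key=evaluate)

# Hypothetical evaluator whose score peaks at alpha = 0.1,
# mirroring the result reported in the paper.
best = grid_search_alpha(lambda a: -abs(a - 0.1))
```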
Table 2 presents the main results. (1) We can see that PAIR significantly outperforms all the baselines on both the MSMARCO and NQ datasets. The major difference between our approach and the baselines is that we incorporate both query-centric and passage-centric similarity relations, which capture more comprehensive semantic relations. Meanwhile, we incorporate the augmented data via knowledge distillation.
(2) We notice that baseline methods use different pre-trained LMs, as shown in the second column of Table 2. In PAIR, we use the ERNIE-base. To examine the effects of ERNIE-base, we implement DPR-E by replacing BERT-base used in DPR as ERNIE-base. From Table 2, we can observe that PAIR significantly outperforms DPR-E, although they employ the same pre-trained LM.
(3) Another observation is that the dense retrievers are overall better than the sparse retrievers. Such a finding has also been reported in prior studies (Karpukhin et al., 2020;Xiong et al., 2020a;Luan et al., 2021), which indicates the effectiveness of the dense retrieval approach.

Ablation Study
In this section, we conduct an ablation study to examine the effectiveness of each strategy in our proposed approach. We only report the results on NQ; the results on MSMARCO are similar and omitted here due to limited space.
Here, we consider five variants of our approach for comparison:

(a) w/o PSR removes the loss for passage-centric similarity relation in the pre-training stage;
(b) w/o KD removes the knowledge distillation for obtaining pseudo-labeled data and only uses the labeled data (MSMARCO and NQ) in both the pre-training and fine-tuning stages;
(c) w/ PSR FT adds the loss for passage-centric similarity relation in the fine-tuning stage;
(d) w/o SP uses separate encoders for queries and passages instead of encoders with shared parameters;
(e) w/o PT removes the pre-training stage.

Table 3 presents the results of the ablation study. We can observe the following findings:

• The performance drops in w/o PSR, demonstrating the effectiveness of learning the passage-centric similarity relation;
• The performance drops in w/o KD, demonstrating the necessity and effectiveness of knowledge distillation for obtaining large-scale and high-quality pseudo-labeled data, since the passage-centric loss tries to distinguish highly similar but semantically different passages;
• The performance slightly drops in w/ PSR FT, because the passage-centric loss is not directly related to the target task (i.e., query-based retrieval), which suggests that the passage-centric loss should only be used in the pre-training stage;
• The performance drops in w/o SP, demonstrating the effectiveness of dual-encoders with shared parameters;
• The performance significantly drops in w/o PT, demonstrating the importance of our pre-training procedure.

Analysis on Passage-centric Similarity Relation
The previous results demonstrate the effectiveness of our proposed approach PAIR. Here, we further analyze the effect of the passage-centric loss (Eq. (5)) in a more intuitive way. To examine this, we prepare two variants of our approach, namely the complete PAIR and the variant with the passage-centric loss (Eq. (5)) removed, denoted by PAIR¬PSR.

[Table 4, partially recovered:]
Query 1 (not recovered) | PAIR top-1: ". . . in pigs (swine influenza) and in birds (avian influenza) . . ." | PAIR¬PSR top-1: "H5N1 is a subtype virus which can cause illness in humans and many other animal species. A bird-adapted strain of H5N1, called HPAI A(H5N1), for . . ."
Query 2: "Where is gall bladder situated in human body?" | PAIR top-1: "The gall bladder is a small hollow organ where bile is stored . . . In humans, the pear-shaped gall bladder lies beneath the liver, although the structure and position . . ." | PAIR¬PSR top-1: "The urinary bladder is a hollow muscular organ in humans and some other animals that collects and stores urine from the kidneys before disposal by urination . . ."
Table 4: The comparison of the top-1 passages retrieved by PAIR and PAIR¬PSR, respectively. The bold words represent the main topics in the queries and passages. The italic words with a wavy underline are the right answers. The words with a straight underline show that the passages have many words in common, which may mislead the model PAIR¬PSR into selecting the wrong passage.
We first analyze how the passage-centric similarity relation (PSR) influences the similarity relations among the query, positive passage and negative passage. Figure 4 shows the comparison of PAIR and PAIR¬PSR in terms of the similarities s(p+, p−) and s(p+, q). We obtain s(p+, p−) and s(p+, q) by averaging the similarities over the top-100 retrieved passages for each query in the test set of Natural Questions. We can see that before incorporating the passage-centric similarity relation (PSR), s(p+, p−) is higher than s(p+, q); as a result, the negatives are close to the positives. After incorporating PSR, s(p+, p−) becomes lower than s(p+, q). This indicates that the passage-centric loss pulls positive passages closer to queries and pushes them farther away from negative passages in the representation space. The comparison result is consistent with the passage-centric similarity relation in Eq. (4).

Next, we further present two examples in Table 4 to understand the performance difference between PAIR and PAIR¬PSR. In the first example, the top-1 passage retrieved by PAIR has the same topic "H1N1" as the query. In contrast, the top-1 passage retrieved by PAIR¬PSR has an incorrect but highly relevant topic "H5N1". In fact, the sentences in the positive passage (retrieved by PAIR) and the negative passage (retrieved by PAIR¬PSR) share many common words. Such a negative passage is likely to mislead the retriever into yielding incorrect rankings. Hence, these two passages should be far away from each other in the representation space. This problem cannot be well solved by only considering the query-passage similarity as in existing studies. Similar observations can be found in the second example: the top-1 passage retrieved by PAIR has the same topic "gall bladder" as the query, while the top-1 passage retrieved by PAIR¬PSR is about the "urinary bladder". These results show that passage-centric similarity relations are particularly useful for discriminating between positive and hard negative passages (which are highly similar to positive passages).

Table 5: The data quality and retrieval performance under different thresholds on NQ. Acc_pos denotes the accuracy of positives and Acc_neg denotes the accuracy of negatives.

Analysis on Knowledge Distillation
In this section, we examine the influence of the thresholds on pseudo-labeled data via knowledge distillation, including the data quality and the retrieval performance. We conduct the analyses by using different positive thresholds s pos and negative thresholds s neg (See Section 3.3).
We first manually evaluate the quality of the pseudo-labeled data generated via knowledge distillation w.r.t. different threshold settings (i.e., combinations of s_neg and s_pos). For each threshold setting, we randomly select 100 queries, each of which corresponds to a positive passage and a hard-negative passage. In total, we have 4 threshold settings (as shown in Table 5) and 800 query-passage pairs. We ask two experts to manually annotate the query-passage pairs and evaluate the quality of the pseudo-labeled data; the Cohen's Kappa between the two experts is 0.9. As shown in the first two columns of Table 5, we can observe that when s_pos = 0.9 and s_neg = 0.1, the data quality is relatively good. In contrast, when setting a low value of s_pos and a high value of s_neg, the data quality becomes worse.
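Cohen's Kappa corrects raw annotator agreement for chance agreement; a small sketch for binary labels follows (the toy label lists are illustrative, not the actual annotations):

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators over binary (0/1) labels:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    p_a1 = sum(labels_a) / n           # annotator A's rate of label 1
    p_b1 = sum(labels_b) / n           # annotator B's rate of label 1
    p_e = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)  # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Toy annotations: the two raters disagree on one of eight pairs.
a = [1, 1, 0, 0, 1, 0, 1, 0]
b = [1, 1, 0, 0, 1, 0, 0, 0]
kappa = cohens_kappa(a, b)
```

A kappa of 0.9, as reported above, indicates near-perfect inter-annotator agreement.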
The last three columns of Table 5 also present the retrieval performance w.r.t. different threshold settings. When choosing a low value of s pos and a high value of s neg , the retrieval performance drops. Hence, our approach is configured with a strict threshold setting (s pos = 0.9, s neg = 0.1) in experiments to achieve good performance.

Conclusion and Future Work
This paper presented a novel dense passage retrieval approach that leverages both query-centric and passage-centric similarity relations to capture more comprehensive semantic relations. To implement our approach, we made three important technical contributions in the loss formulation, training data augmentation and the training procedure. Extensive results demonstrated the effectiveness of our approach. To our knowledge, this is the first time that the passage-centric similarity relation has been considered for dense passage retrieval. We believe the idea itself is worth exploring in designing new ranking mechanisms. In future work, we will design more principled ranking functions and apply the current retrieval approach to downstream tasks such as question answering and passage re-ranking.

Ethical Impact
The technique of dense passage retrieval is effective for question answering, where the majority of questions are informational queries. The semantic crowdedness of passages and the term mismatch between questions and passages are typical problems that create barriers for machines to accurately find information. Our technique contributes toward the goal of having machines find the answer passages to natural language questions from a large collection of documents. With these advantages also come potential downsides: Wikipedia or any potential external knowledge source will probably never fully cover the breadth of user questions. The goal is still far from being achieved, and more efforts from the community are needed for us to get there.