Sentence-aware Contrastive Learning for Open-Domain Passage Retrieval

Training dense passage representations via contrastive learning has been shown effective for Open-Domain Passage Retrieval (ODPR). Existing studies focus on further optimization through improved negative sampling strategies or extra pretraining. However, they overlook the internal representation conflicts that arise from improper modeling granularity. Specifically, we observe that a passage is often composed of multiple semantically different sentences, so modeling such a passage as a unified dense vector is not optimal. This work thus presents a refined model built on a smaller granularity, contextual sentences, to alleviate these conflicts. In detail, we introduce an in-passage negative sampling strategy to encourage diverse sentence representations within the same passage. Experiments on three benchmark datasets verify the efficacy of our method, especially on datasets where the conflicts are severe. Extensive experiments further demonstrate the good transferability of our method across datasets.


Introduction
Open-Domain Passage Retrieval (ODPR) has recently attracted the attention of researchers for its wide usage both academically and industrially (Yang et al., 2017). Provided with an extremely large text corpus composed of millions of passages, ODPR aims to retrieve a collection of the most relevant passages as the evidence for a given question.
With the recent success of pretrained language models (PrLMs) like BERT and RoBERTa (Liu et al., 2019), dense retrieval techniques have achieved significantly better results than traditional lexical methods such as TF-IDF (Ramos et al., 2003) and BM25 (Robertson and Zaragoza, 2009), which entirely neglect semantic similarity. Thanks to the Bi-Encoder structure, dense methods (Guu et al., 2020; Karpukhin et al., 2020) encode Wikipedia passages and questions separately, and retrieve evidence passages using similarity functions like the inner product or cosine similarity. Since the representations of Wikipedia passages can be precomputed, the retrieval speed of dense approaches can be on par with lexical ones.
*Corresponding author. This paper was partially supported by Key Projects of National Natural Science Foundation of China (U1836222 and 61733011).
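As a toy illustration of the Bi-Encoder retrieval step (the vectors here are hypothetical; a real system would produce them with PrLM encoders and search them with a FAISS-style index), query-time scoring reduces to a single inner product against precomputed passage vectors:

```python
import numpy as np

# Hypothetical precomputed passage embeddings (one row per passage).
passage_vecs = np.array([
    [0.9, 0.1, 0.0],   # passage 0
    [0.1, 0.8, 0.3],   # passage 1
    [0.0, 0.2, 0.9],   # passage 2
])
# Hypothetical question embedding from the question encoder.
question_vec = np.array([0.1, 0.9, 0.2])

# Inner-product similarity: one matrix-vector product per query.
scores = passage_vecs @ question_vec
ranking = np.argsort(-scores)          # best passage first
print(ranking[0])                      # -> 1
```

Because the passage matrix is fixed at query time, only the question must be encoded online, which is what puts dense retrieval latency on par with lexical methods.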
Previous approaches often pretrain the Bi-Encoder with a specially designed objective, the Inverse Cloze Task (ICT). More recently, DPR (Karpukhin et al., 2020) adopts a simple but effective contrastive learning framework, achieving impressive performance without any pretraining. Concretely, for each question q, several positive passages p+ and hard negative passages p− produced by BM25 are pre-extracted. By feeding the Bi-Encoder with (q, p+, p−) triples, DPR simultaneously maximizes the similarity between the representations of q and the corresponding p+, and minimizes the similarity between the representations of q and all p−. Following this contrastive learning framework, many researchers seek further improvements for DPR from the perspective of sampling strategies (Xiong et al., 2020; Lu et al., 2020; Tang et al., 2021; Qu et al., 2021), extra pretraining (Sachan et al., 2021), or knowledge distillation (Izacard and Grave, 2021; Yang et al., 2021).
However, these studies fail to realize that there exist severe drawbacks in the current contrastive learning framework adopted by DPR. Essentially, as illustrated in Figure 1, each passage p is composed of multiple sentences, from which multiple semantically faraway questions can be derived, forming a question set Q = {q_1, q_2, ..., q_k}. Under our investigation, such a one-to-many problem causes severe conflicts in the current contrastive learning framework, which we refer to as Contrastive Conflicts. To the best of our knowledge, this is the first work that formally studies the conflicting problems in the contrastive learning framework of dense passage retrieval.

Figure 1: An example passage, "Age of Enlightenment", from which multiple semantically different questions are derived (e.g., "Which society in England also played a significant role in public sphere and spread of Enlightenment ideas?" and "Women's education common stressed which literature genre?").

Here, we distinguish two kinds of Contrastive Conflicts.
• Transitivity of Similarity The goal of the contrastive learning framework in DPR is to maximize the similarity between the representation of a question and that of its corresponding gold passage. As illustrated in Figure 2, under Contrastive Conflicts, the current framework will unintentionally maximize the similarity between different question representations derived from the same passage, even if they are semantically different. This is exactly the cause of DPR's low performance on SQuAD (Rajpurkar et al., 2016), which has an average of 2.66 questions per passage.
• Multiple References in Large Batch Size According to Karpukhin et al. (2020), the performance of DPR benefits greatly from a large batch size in the contrastive learning framework. However, under Contrastive Conflicts, one passage can be the positive passage p+ of multiple questions (i.e., the question set Q). A large batch size therefore increases the probability that several questions from Q occur in the same batch. With the widely adopted in-batch negative technique (Karpukhin et al., 2020; Lee et al., 2021), such a p+ will simultaneously serve as both a positive sample and a negative sample for every q in Q, which is logically unreasonable.
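The transitivity effect can be illustrated numerically (a hypothetical toy update, not the actual training dynamics): if two question vectors are both driven toward the same shared passage vector, their mutual similarity rises even though the questions are unrelated.

```python
import numpy as np

# Two hypothetical question vectors for semantically different questions,
# and one shared gold-passage vector.
q1 = np.array([1.0, 0.0])
q2 = np.array([0.0, 1.0])
p  = np.array([0.7, 0.7])

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

before = cos(q1, q2)          # 0.0: the questions start orthogonal
# Training pulls each question representation toward the shared passage
# (a hypothetical gradient step toward p).
q1_t = q1 + 2.0 * (p - q1)
q2_t = q2 + 2.0 * (p - q2)
after = cos(q1_t, q2_t)
print(before, after)          # similarity between the questions increases
```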
Since the one-to-many problem is the direct cause of both conflicts, this paper presents a simple but effective strategy that breaks dense passage representations down into contextual sentence-level ones, which we refer to as Dense Contextual Sentence Representation (DCSR). Unlike long passages, a short sentence rarely gives rise to semantically faraway questions. Therefore, modeling ODPR at the granularity of contextual sentences alleviates the conflicts at their source.

Related Work
Open-Domain Passage Retrieval Open-Domain Passage Retrieval has been a hot research topic in recent years. It requires a system to extract evidence passages for a given question from a large passage corpus like Wikipedia, and is challenging as it demands both high retrieval accuracy and low latency for practical usage. Traditional approaches like TF-IDF (Ramos et al., 2003) and BM25 (Robertson and Zaragoza, 2009) retrieve evidence passages based on the lexical match between questions and passages. Although these lexical approaches meet the requirement of low latency, they fail to capture non-lexical semantic similarity, and thus perform unsatisfactorily on retrieval accuracy.
With recent advances in pretrained language models (PrLMs) like BERT and RoBERTa (Liu et al., 2019), a series of neural approaches based on cross-encoders have been proposed (Vig and Ramea, 2019; Wolf et al., 2019). Although they enjoy satisfying retrieval accuracy, their retrieval latency is often hard to tolerate in practical use. More recently, the Bi-Encoder structure has captured researchers' attention. With a Bi-Encoder, the representations of the corpus at scale can be precomputed, meeting the requirement of low latency in passage retrieval. Pretraining the Bi-Encoder with the Inverse Cloze Task (ICT) was first proposed for this setting. Later, DPR (Karpukhin et al., 2020) introduced a contrastive learning framework to train dense passage representations, achieving impressive performance on both retrieval accuracy and latency. Based on DPR, many works make further improvements either by introducing better sampling strategies (Xiong et al., 2020; Lu et al., 2020; Tang et al., 2021; Qu et al., 2021), extra pretraining (Sachan et al., 2021), or distilling knowledge from cross-encoders (Izacard and Grave, 2021; Yang et al., 2021).
Our method follows the contrastive learning line of ODPR research. Different from previous works that focus on improving the quality of negative sampling or on extra pretraining, we make improvements by directly optimizing the modeling granularity with an elaborately designed contrastive learning training strategy.
Contrastive Learning Contrastive learning has recently attracted researchers' attention in many areas. After witnessing its superiority in Computer Vision tasks (He et al., 2020), researchers in NLP are also applying this technique (Karpukhin et al., 2020; Yan et al., 2021; Giorgi et al., 2021; Gao et al., 2021). For ODPR, the research lines of contrastive learning can be divided into two types: (i) Improving the sampling strategies for positive and hard negative samples. According to Manmatha et al. (2017), the quality of positive and negative samples is of vital importance in the contrastive learning framework, so many researchers seek better sampling strategies to improve retrieval performance (Xiong et al., 2020). (ii) Improving the contrastive learning framework itself. DensePhrase (Lee et al., 2021) uses a memory bank like MoCo (He et al., 2020) to increase the number of in-batch negative samples without increasing GPU memory usage, and models the retrieval process at the phrase level rather than the passage level, achieving impressive performance.
Our proposed method follows the second research line. We investigate a particular phenomenon, Contrastive Conflicts, in the contrastive learning framework, and experimentally verify the effectiveness of mitigating such conflicts by modeling ODPR at a smaller granularity. Closest to our work, Akkalyoncu Yilmaz et al. (2019) also propose to improve dense passage retrieval based on sentence-level evidence, but their work is not in the contrastive learning line, and focuses on passage re-ranking after retrieval rather than retrieval itself.

Contrastive Learning Framework
The existing contrastive learning framework aims to maximize the similarity between the representations of each question and its corresponding gold passages.
Suppose there is a batch of n questions, n corresponding gold passages, and in total k hard negative passages. Denote the questions in the batch as q_1, q_2, ..., q_n, their corresponding gold passages as gp_1, gp_2, ..., gp_n, and the hard negative passages as np_1, np_2, ..., np_k. Two separate PrLMs are used to acquire representations for questions and passages {h_{q_1}, h_{q_2}, ...; h_{gp_1}, h_{gp_2}, ...; h_{np_1}, h_{np_2}, ...}. The training objective for each question sample q_i of the original DPR is shown in Eq (1):

L(q_i) = -\log \frac{\exp(\mathrm{sim}(h_{q_i}, h_{gp_i}))}{\sum_{j=1}^{n} \exp(\mathrm{sim}(h_{q_i}, h_{gp_j})) + \sum_{m=1}^{k} \exp(\mathrm{sim}(h_{q_i}, h_{np_m}))}    (1)

Here sim(·) can be any similarity operator that calculates the similarity between the question representation h_{q_i} and a passage representation h_{p_j}.
Minimizing the objective in Eq (1) is equivalent to (i) maximizing the similarity between each h_{q_i} and h_{gp_i} pair, and (ii) minimizing the similarity between h_{q_i} and all other h_{gp_j} (i ≠ j) and all h_{np_k}.

Figure 3: An illustration of our DCSR processing pipeline. The left part shows the contrastive training paradigm of our method, and the right part presents the inference pipeline.
As discussed previously, this training paradigm causes conflicts under the current contrastive learning framework due to (i) Transitivity of Similarity, and (ii) Multiple References in Large Batch Size.
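This in-batch objective can be sketched in a few lines of NumPy (a simplification: actual DPR produces the representations with two BERT encoders and trains with much larger batches):

```python
import numpy as np

def dpr_loss(q_reps, gp_reps, np_reps):
    """Mean of the per-question objective over the batch: for each q_i, the
    positive is gp_i; all other gold passages and all hard negatives
    serve as negatives."""
    all_p = np.concatenate([gp_reps, np_reps], axis=0)   # (n + k, d)
    sim = q_reps @ all_p.T                               # inner-product similarity
    log_z = np.log(np.exp(sim).sum(axis=1))              # softmax normalizer
    pos = sim[np.arange(len(q_reps)), np.arange(len(q_reps))]
    return float(np.mean(log_z - pos))                   # -log softmax at gp_i

# When each question is perfectly aligned with its own gold passage,
# the loss approaches zero.
q = np.eye(2) * 10.0
neg = np.zeros((1, 2))
print(dpr_loss(q, q, neg))  # close to 0
```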

Dense Contextual Sentence Representation
The cause of Contrastive Conflicts lies in the one-to-many problem: most passages are composed of multiple sentences, and these sentences do not always stick to the same topic, as depicted in Figure 1. Therefore, we propose to model passage retrieval at a smaller granularity, i.e., contextual sentences, to reduce the occurrence of the one-to-many problem.
Since contextual information is also important in passage retrieval, simply breaking passages down into sentences and encoding them independently is infeasible. Instead, following (Beltagy et al., 2020; Lee et al., 2020), we insert a special <sent> token at the sentence boundaries in each passage, and encode the passage as a whole to preserve the contextual information, which results in the following input format for each passage:

<sent> s_1 <sent> s_2 ... <sent> s_m

We then use BERT as the encoder and take the hidden states of these indicator <sent> tokens as the contextual sentence representations. For convenience of illustration, considering a given query q, we denote the corresponding positive passage in the training batch as p+, which consists of several sentences:

P = {s_1^-, ..., s_i^+, ..., s_m^-}

Similarly, we denote the corresponding BM25 negative passage as:

N = {s_1^-, s_2^-, ..., s_l^-}

Here (*)^-/+ indicates whether the sentence or passage contains the gold answer. We refine the original contrastive learning framework by creating sentence-aware positive and negative samples. The whole training pipeline is shown in the left part of Figure 3.
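A minimal sketch of this input construction (sentence splitting is assumed done; whether the marker precedes each sentence, and how the tokenizer registers it as a special token, are our assumptions about the details):

```python
SENT = "<sent>"

def format_passage(sentences):
    """Insert a <sent> marker before each sentence; the passage is still
    encoded as one sequence, so each marker's hidden state becomes a
    contextual representation of the sentence that follows it."""
    return " ".join(f"{SENT} {s}" for s in sentences)

def sent_positions(tokens):
    # Token positions whose hidden states serve as sentence representations.
    return [i for i, t in enumerate(tokens) if t == SENT]

passage = format_passage([
    "The Age of Enlightenment dominated 18th-century Europe.",
    "Coffee houses helped spread its ideas.",
])
tokens = passage.split()
print(sent_positions(tokens))  # -> [0, 8]
```

In a real implementation the positions would index into the encoder's final hidden states, yielding one dense vector per sentence from a single forward pass over the passage.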

Positives and Easy Negatives
Following Karpukhin et al. (2020), we use BM25 to retrieve hard negative passages for each question. To build a contrastive learning framework based on contextual sentences, we treat the sentence that contains the gold answer as the positive sentence (i.e., s_i^+), and randomly sample several negative sentences (random sentences from N) from a BM25 negative passage. Also, following (Karpukhin et al., 2020; Lee et al., 2021), we introduce in-batch negatives as additional easy negatives.

In-Passage Negatives
To handle the circumstance where multiple semantically faraway questions may be derived from a single passage, we want the passage encoder to generate contextual sentence representations that are as diverse as possible for sentences in the same passage. Noticing that not all sentences in a passage contain the gold answer or stick to the topic of the given query, we further introduce in-passage negatives to maximize the difference between contextual sentence representations within the same passage. Concretely, we randomly sample one sentence that does not contain the gold answer (i.e., a random sentence from P \ {s_i^+}). Note that a positive passage might not contain such a sentence; if it does not exist, this in-passage negative is substituted by another easy negative sentence from the corresponding BM25 negative passage (a random sentence from N). These in-passage negatives function as hard negative samples in our contrastive learning framework.
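The sampling strategy above can be sketched as follows (a simplification: answer matching is reduced to a substring check, sentence splitting is assumed done, and the function name is ours):

```python
import random

def build_training_sentences(pos_sents, bm25_neg_sents, answer, rng=random):
    """Return (positive, in-passage negative, easy negative) sentences
    for one question, with the BM25 fallback described above."""
    gold = [s for s in pos_sents if answer in s]
    positive = rng.choice(gold)
    in_passage = [s for s in pos_sents if answer not in s]
    if in_passage:                        # in-passage hard negative
        hard_neg = rng.choice(in_passage)
    else:                                 # fallback: easy negative from BM25 passage
        hard_neg = rng.choice(bm25_neg_sents)
    easy_neg = rng.choice(bm25_neg_sents)
    return positive, hard_neg, easy_neg

pos = ["Roger Goodell was the commissioner.", "The game was held in Indianapolis."]
neg = ["Baseball has a different commissioner."]
p, h, e = build_training_sentences(pos, neg, "Goodell")
print(p)  # -> "Roger Goodell was the commissioner."
```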

Retrieval
For retrieval, we first use FAISS (Johnson et al., 2019) to calculate the matching scores between the question and all the contextual sentence indexes. As one passage has multiple keys in the index, we retrieve the top 100 × k (k is the average number of sentences per passage) contextual sentences for inference. To turn these sentence-level scores into passage-level ones, we adopt a probabilistic design for ranking passages, which we refer to as Score Normalization.

Score Normalization After obtaining the score of each retrieved contextual sentence for each question from FAISS, we first use a Softmax operation to normalize all these similarity scores into probabilities. Given a passage P with sentences s_1, s_2, ..., s_n, and denoting the probability that each sentence contains the answer as p_{s_1}, p_{s_2}, ..., p_{s_n}, we calculate the probability that the answer is in passage P by Equation 2:

\mathrm{HasAns}(P) = 1 - \prod_{i=1}^{n} (1 - p_{s_i})    (2)
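A minimal sketch of Score Normalization, assuming the softmax is taken over all retrieved sentence scores and that per-sentence probabilities are combined independently via HasAns(P) = 1 − ∏(1 − p_si):

```python
import math

def has_ans(sentence_scores):
    """sentence_scores: {passage_id: [raw similarity scores of its sentences]}.
    Returns {passage_id: probability that the passage contains the answer}."""
    flat = [s for scores in sentence_scores.values() for s in scores]
    z = sum(math.exp(s) for s in flat)               # softmax normalizer
    result = {}
    for pid, scores in sentence_scores.items():
        probs = [math.exp(s) / z for s in scores]    # per-sentence probabilities
        # Probability that at least one sentence of P contains the answer.
        result[pid] = 1.0 - math.prod(1.0 - p for p in probs)
    return result

scores = {"P1": [0.0, 0.0], "P2": [0.0]}
print(has_ans(scores))  # P1: 1 - (2/3)^2 ~= 0.556, P2: 1/3
```

Note that under this aggregation a passage with several moderately scored sentences can outrank a passage with one high-scoring sentence, which is precisely what passage-level re-ranking needs.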
We then re-rank all the retrieved passages by HasAns(P), and select the top 100 passages for evaluation in the following experiments.

Among the commonly used ODPR benchmarks are WebQuestions (Berant et al., 2013) and TREC (Baudiš and Šedivỳ, 2015).
We evaluate our proposed method on SQuAD, TriviaQA and NQ. Regarding the previously discussed Contrastive Conflicts problem, we also analyze how frequently the conflicting phenomenon occurs in each dataset: we count the number of questions per passage, i.e., the number of times each passage is referred to as the positive sample. The corresponding results are shown in Table 1. From this table, we can see that of the three datasets we choose, SQuAD is the most severely affected by the Contrastive Conflicts problem, in that many passages occur multiple times as the positive passage for different questions. These statistics are consistent with the fact that DPR performs worst on SQuAD, while being acceptable on Trivia and NQ.

Training and Implementation Details
Hyperparameters In our main experiments, we follow the hyperparameter setting of DPR (Karpukhin et al., 2020). For semantically hard negatives, we first train a DCSR checkpoint, and use this checkpoint to acquire semantically hard negative passages from the whole Wikipedia corpus.

Incorporated with Negative Sampling
Different from other frontier research that is mainly devoted to investigating better negative sampling strategies, like ANCE (Xiong et al., 2020) and NPRINC (Lu et al., 2020), to extra pretraining (Sachan et al., 2021), or to distilling knowledge from cross-encoders (Izacard and Grave, 2021; Yang et al., 2021), our method directly optimizes the modeling granularity of DPR. Therefore, our method can be naturally incorporated with these lines of research to achieve further gains. Due to computational resource limitations, we do not attempt to replicate all these methods, but use adversarial training as an example. Following ANCE (Xiong et al., 2020), we conduct experiments on NQ and Trivia to show the compatibility of our method, listed in Table 3. With such a simple negative sampling strategy, our DCSR achieves comparable results with its DPR counterpart.

Ablation Study
To illustrate the efficacy of the proposed negative sampling strategy, we conduct an ablation study on a subset of the OpenQA Wikipedia corpus. We sample 1/20 of the whole corpus, which results in a collection of 1.05 million passages in total. For reference, we reproduce DPR and also list its results in Table 4. We compare the following negative sampling strategies of our proposed method.

+ 1 BM25 random In this setting, we randomly sample (i) one gold sentence from the positive passage as the positive sample, and (ii) one negative sentence from the negative passage as the negative sample per question.
+ 2 BM25 random In this setting, we randomly sample (i) one gold sentence from the positive passage as the positive sample, and (ii) two negative sentences from two different negative passages as two negative samples per question.
+ 1 in-passage & + 1 BM25 random In this setting, we randomly sample (i) one gold sentence from the positive passage as the positive sample, (ii) one negative sentence from the positive passage as the first negative sample, and (iii) one negative sentence from the negative passage as the second negative sample per question.
Ablations of Negative Sampling Strategy The results are shown in Table 4. (i) Under the circumstance where only 1.05 million passages are indexed, variants of our DCSR generally perform significantly better than the DPR baseline, especially on the NQ dataset (over 1% improvement on both Top-20 and Top-100) and the SQuAD dataset (8.0% improvement on Top-20 and 4.9% on Top-100), which verifies the effectiveness of resolving Contrastive Conflicts. (ii) Further, we find that increasing the number of negative samples helps little, and even introduces slight performance degradation on several metrics. (iii) The in-passage negative sampling strategy consistently boosts performance on nearly all datasets and metrics, especially on SQuAD, which is consistent with our motivation for in-passage negatives: to encourage diverse contextual sentence representations within the same passage and thereby resolve the one-to-many problem.

Ablations of Training Data
The results are shown in Table 5. (i) Trained with the hard negatives mined by DPR, our model achieves even better results on the NQ dataset. Still, this augmented dataset is sub-optimal for our model, as these hard negative samples are passage-specific, while our model prefers sentence-specific ones. (ii) We then use our previous best DCSR checkpoint to retrieve a set of sentence-specific hard negatives (marked as DCSR-hard) and train a new DCSR, which achieves further performance gains on both metrics on the NQ dataset.

Discussion
In this section, we discuss the transferability difference and the influence of Wikipedia corpus size on both DPR and our DCSR. More discussions from different aspects are presented in the Appendices, including (i) validation accuracy on dev sets in Appendix A, which is also strong evidence of alleviating Contrastive Conflicts; (ii) error analysis for SQuAD in Appendix B, which further shows the generalization ability of our method; and (iii) a case study in Appendix C, which discusses future improvements of DCSR.

Transferability
To evaluate transferability on a smaller corpus, we sample 1/20 of the Wikipedia corpus, which results in a collection of 1.05 million passages in total. We test transferability from SQuAD to Trivia and from NQ to Trivia, since, compared to Trivia, both SQuAD and NQ suffer more from Contrastive Conflicts. The results are shown in Table 6. Compared to DPR, our model enjoys significantly better transferability: in both scenarios, DPR shows an over 2% performance gap on all metrics of the transferability tests, indicating that our method generalizes much better across datasets. This phenomenon again supports our hypothesis that by modeling passage retrieval at the granularity of contextual sentences, our DCSR captures what is universal across datasets.

Table 6: Transferability comparison between our method and DPR. We train the retriever model on the SQuAD dataset or the NQ dataset, and evaluate it on Trivia QA (statistics on the left). For reference, we also list the performance where the retriever model is both trained and evaluated on Trivia QA (statistics on the right).

Corpus Size
In further experiments, we find that our method achieves substantially better performance than DPR on smaller corpora. In this experiment, we take the first 0.1 million passages, the first 1.05 million passages, and all passages from the original Wikipedia corpus, and conduct dense retrieval on these three corpora of varying size. The results are shown in Table 7.
From Table 7, first of all, our model achieves better performance than DPR in all settings, and the improvement is more significant on smaller corpora. In the setting where only 0.1 million passages are indexed, our model achieves over 2.0% absolute improvement on all metrics on both NQ and Trivia. We speculate this is due to the following two strengths of our method.
• The alleviation of Contrastive Conflicts, which we have analyzed previously.
• Modeling passage retrieval using contextual sentences enables a diverse set of indexes. Some sentences may not be the core topic of their passages, but can still be the clue for some questions.

Secondly, we observe that the performance gap between DPR and DCSR decreases as the size of the Wikipedia corpus increases. This is because, as the indexing corpus expands, many questions that cannot be answered in the small-corpus setting may find much more closely related passages in the large-corpus setting, which gradually neutralizes the positive effect brought by the second strength discussed above. Still, our model achieves better performance under the full Wikipedia setting on all datasets and all metrics.

Conclusion
In this paper, we make a thorough analysis of the Contrastive Conflicts issue in current open-domain passage retrieval. To address the issue, we propose an enhanced sentence-aware contrastive learning method that carefully generates sentence-aware positive and negative samples. We show that the dense contextual sentence representations learned by our method achieve significant performance gains over the original baseline, especially on datasets with severe conflicts. Extensive experiments show that our proposed method also enjoys better transferability, and well captures the universality across different datasets.

A Validation Accuracy
One may argue that the improvement of DCSR is due to the expansion of the indexing corpus (discussed in previous sections) rather than the alleviation of Contrastive Conflicts. In this section, we present a comparison of validation accuracy during training between DPR and our DCSR, which provides strong evidence that DCSR handles Contrastive Conflicts well.
Under 8 V100 GPUs with a batch size of 16 per GPU, the validation process can be viewed as a tiny retrieval task for both DPR and DCSR. To maintain a similar validation environment for fair comparison, we use the +1 BM25 random version of DCSR, which results in 8 × 16 = 128 questions and 2 × 8 × 16 = 256 contextual sentences in one batch. The validation process can therefore be interpreted as retrieving the most relevant contextual sentence for each question from a corpus of 256 sentences. Under such a validation task, the size of the indexing corpus is the same for both DPR and DCSR.
The result is shown in Figure 4. On both Trivia and NQ, DCSR performs consistently better than DPR by a small accuracy margin. On SQuAD, especially, our DCSR achieves higher validation accuracy than DPR within a single epoch, and attains nearly 20% higher final validation accuracy. This further verifies that the improvement of DCSR comes from a training strategy that alleviates Contrastive Conflicts, and not only from the expansion of the indexing corpus.

B Error Analysis for SQuAD
Although achieving much better performance on SQuAD than DPR, our DCSR still lags far behind its counterparts on NQ or Trivia. Interestingly, we find that the results on the SQuAD dev set are quite good and comparable to the results on NQ or Trivia. The dev-set and test-set performance of both DPR and DCSR is shown in Table 8.
By analyzing the training instances, we observe a severe distribution bias in SQuAD: SQuAD-dev and SQuAD-train share a great number of positive passages. In fact, almost all positive passages in SQuAD-dev can also be found in SQuAD-train. Of all 7,921 questions in SQuAD-dev that have at least one positive passage containing the answer, 7,624 (96.25%) of these passages' titles can be found among the positive passages of SQuAD-train. More surprisingly, 6,973 (88.03%) of the passages themselves are shared between SQuAD-train and SQuAD-dev. However, this feature is exactly what SQuAD-test lacks, resulting in relatively poor test performance. Again, this phenomenon reveals another strength of our DCSR: it enjoys better generalization ability than DPR, and is thus more robust in practical use.

C Case Study
To analyze the difference in retrieval performance between DPR and DCSR, we focus on the differing Top 1 predictions on SQuAD. We count the number of winning cases for each model; DCSR significantly outperforms DPR (893 vs. 161), as shown in Figure 5.

C.1 DCSR winning cases
On the question Who was the NFL Commissioner in early 2012?, the strengths of our DCSR are as follows.
• Capability of utilizing contextual information. The key phrases 2012 and NFL are far away from Commissioner Roger Goodell, while our DCSR is still capable of capturing such distant contextual information.
• Locating the exact sentence of the answer. This is an obvious feature of DCSR, as we are modeling on the granularity of contextual sentences.
On the contrary, due to Contrastive Conflicts, the question encoder of DPR is so severely affected that it cannot generate a fine-grained question representation. Therefore, on this question, DPR can only pick up the single key phrase commissioner, leading to a completely wrong prediction.