Unsupervised Dense Retrieval with Relevance-Aware Contrastive Pre-Training

Dense retrievers have achieved impressive performance, but their demand for abundant training data limits their application scenarios. Contrastive pre-training, which constructs pseudo-positive examples from unlabeled data, has shown great potential to solve this problem. However, the pseudo-positive examples crafted by data augmentations can be irrelevant. To this end, we propose relevance-aware contrastive learning. It takes the intermediate-trained model itself as an imperfect oracle to estimate the relevance of positive pairs and adaptively weighs the contrastive loss of different pairs according to the estimated relevance. Our method consistently improves the SOTA unsupervised Contriever model on the BEIR and open-domain QA retrieval benchmarks. Further exploration shows that our method can not only beat BM25 after further pre-training on the target corpus but also serves as a good few-shot learner. Our code is publicly available at https://github.com/Yibin-Lei/ReContriever.


Introduction
Dense retrievers, which estimate the relevance between queries and passages in a dense embedding space, have achieved impressive performance in various applications, including web search (Liu et al., 2021) and open-domain question answering (Karpukhin et al., 2020). One key factor behind the success of dense retrievers is a large amount of human-annotated training data, e.g., MS-MARCO (Bajaj et al., 2016) with over 500,000 examples. However, a recent study (Thakur et al., 2021) shows that even when trained with enormous labeled data, dense retrievers still suffer from a generalization issue: they perform relatively poorly on novel domains compared to BM25. Meanwhile, collecting human-annotated data for new domains is hard and expensive. Thus, improving dense retrievers with limited annotated data is essential, considering the significant domain variations of practical retrieval tasks.
Contrastive pre-training, which first generates pseudo-positive examples from a universal corpus and then uses them to contrastively pre-train retrievers, has shown impressive performance without any human annotations (Lee et al., 2019; Gao et al., 2021; Gao and Callan, 2022; Ram et al., 2022; Izacard et al., 2022). For instance, Contriever (Izacard et al., 2022) crafts relevant query-passage pairs by randomly cropping two spans within the same document. However, owing to the high information density of text, even nearby sentences in a document can be largely irrelevant to each other, as shown in Figure 1. These false positives may mislead the model into pulling unrelated texts together in the embedding space and further harm the quality of the learned representations.
Motivated by recent findings in computer vision that pre-training performance can be greatly boosted by reducing the effect of such false positives (Peng et al., 2022; Mishra et al., 2022), we propose the Relevance-Aware Contrastive Retriever (ReContriever). At each training step, we use the model being trained at that step as an imperfect oracle to estimate the relevance of all the positives. The losses of different positive pairs are then adaptively weighed by the estimated relevance, i.e., pairs that receive higher relevance scores obtain higher weights. Moreover, simply down-weighting irrelevant pairs would result in insufficient use of the data, since many documents would contribute little to training. Therefore, we also introduce a one-document-multi-pair strategy that generates multiple positive pairs from a single document, with the pair-weighting conducted among samples originating from the same document. This ensures that the model can learn positive knowledge from every document in the corpus.
To summarize, our contributions in this paper are three-fold: 1) We propose relevance-aware contrastive learning for dense retrieval pre-training, which aims to reduce the false positive problem.
2) Experiments show our method brings consistent improvements over the SOTA unsupervised Contriever model on 10/15 tasks of the BEIR benchmark and on three representative open-domain QA retrieval datasets. 3) Further explorations show that our method works well given no or limited labeled data. Specifically, on 4 representative domain-specialized datasets it outperforms BM25 with only unsupervised pre-training on the target corpora, and with only a few annotated samples its accuracy is on par with DPR (Karpukhin et al., 2020), which is trained on thousands of annotated examples.

Preliminary
In this section, we briefly describe the bi-encoder structure used in dense retrieval and the SOTA Contriever model, on which we build our model.
Bi-Encoder Structure Dense retrievers typically adopt a bi-encoder architecture composed of two separate encoders that transform the query and document into a single vector each. The relevance score is obtained by computing the similarity (e.g., inner product) between the encoded vectors of queries and documents. The typical way to train a dense retriever is with a contrastive loss that pulls relevant passages closer to the query and pushes irrelevant passages farther away in the embedding space. For each query, the training data involves one positive passage labeled by annotators and a pool of negative passages, which are usually random passages from the corpus.
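As a concrete illustration, the bi-encoder scoring scheme can be sketched as follows. This is a toy NumPy sketch using random token embeddings and average pooling, not the actual BERT-based implementation; the encoder stub and dimensions are assumptions for illustration.

```python
import numpy as np

# Toy bi-encoder sketch: each "encoder" average-pools token embeddings
# into a single vector, and relevance is the inner product of the two vectors.
# (Illustrative only; real dense retrievers use a transformer encoder.)

def encode(token_embeddings: np.ndarray) -> np.ndarray:
    """Average-pool token embeddings of shape (seq_len, dim) into one (dim,) vector."""
    return token_embeddings.mean(axis=0)

def relevance(query_vec: np.ndarray, doc_vec: np.ndarray) -> float:
    """Inner-product similarity between the query and document vectors."""
    return float(np.dot(query_vec, doc_vec))

rng = np.random.default_rng(0)
query_tokens = rng.normal(size=(5, 8))   # 5 query tokens, embedding dim 8
doc_tokens = rng.normal(size=(12, 8))    # 12 document tokens, embedding dim 8

q = encode(query_tokens)
d = encode(doc_tokens)
score = relevance(q, d)
print(score)
```

Because each side is encoded independently, document vectors can be pre-computed offline and searched efficiently at query time, which is the main appeal of the bi-encoder design.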
Contriever Contriever crafts pseudo-positive pairs by randomly cropping two spans of the same document. As negatives have been shown to be key to the success of retrieval training (Xiong et al., 2021), Contriever also applies the MoCo mechanism (He et al., 2020) to reuse negatives from previous batches, increasing the number of negatives. These two factors allow Contriever to obtain decent performance without any human annotations.
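The MoCo-style reuse of negatives from previous batches can be sketched as a fixed-size FIFO queue of embeddings. This is a simplified illustration only; the actual MoCo mechanism also maintains a momentum-updated key encoder, which is omitted here.

```python
from collections import deque
import numpy as np

class NegativeQueue:
    """Fixed-size FIFO queue of document embeddings from previous batches,
    used as extra negatives (simplified MoCo-style memory; no momentum encoder)."""

    def __init__(self, max_size: int):
        self.queue = deque(maxlen=max_size)  # oldest entries evicted automatically

    def enqueue(self, batch_embeddings: np.ndarray) -> None:
        """Add each embedding of the current batch to the queue."""
        for emb in batch_embeddings:
            self.queue.append(emb)

    def negatives(self) -> np.ndarray:
        """Return all queued embeddings stacked into one array."""
        return np.stack(list(self.queue))

q = NegativeQueue(max_size=4)
q.enqueue(np.ones((3, 8)))   # batch of 3 embeddings, dim 8
q.enqueue(np.zeros((3, 8)))  # exceeds max_size, so the oldest 2 are dropped
print(len(q.queue))          # queue size is capped at 4
```

The queue decouples the number of negatives from the batch size, which is why it helps contrastive retrieval training where many negatives are beneficial.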

Relevance-Aware Contrastive Learning
Our method 1) produces a larger number of positives from each document (one-document-multi-pair) and 2) forces the model to pay more attention to the pairs with higher relevance (relevance-aware contrastive loss).
One-Document-Multi-Pair Given a text snippet T, previous pre-training methods always craft only one positive (query-passage) pair (q, d^+). To exploit T more effectively, our one-document-multi-pair strategy generates n positive pairs, denoted as {(q, d_1^+), (q, d_2^+), ..., (q, d_n^+)}, from T by repeating the cropping procedure several times. We keep the query q unchanged to ensure that the relevance comparison among pairs within the same snippet, used in the following step, is fair. Building upon Contriever, we craft n pairs by randomly cropping n + 1 spans and fixing one span as the query for the remaining n spans. This strategy is easy to extend to other contrastive pre-training methods.
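Our reading of this strategy can be sketched at the token level as follows. The span-length bounds and sampling details here are assumptions for illustration, not the paper's exact cropping settings.

```python
import random

def multi_pair_crops(tokens, n_pairs, min_len=4, max_len=10, seed=0):
    """Sketch of the one-document-multi-pair strategy: randomly crop
    n_pairs + 1 spans from one document, fix the first span as the query,
    and pair it with each remaining span as a positive."""
    rng = random.Random(seed)
    spans = []
    for _ in range(n_pairs + 1):
        length = rng.randint(min_len, min(max_len, len(tokens)))
        start = rng.randint(0, len(tokens) - length)
        spans.append(tokens[start:start + length])
    query, positives = spans[0], spans[1:]
    return [(query, pos) for pos in positives]

doc = "this is a toy document with enough tokens to crop several spans".split()
pairs = multi_pair_crops(doc, n_pairs=4)
print(len(pairs))  # 4 positive pairs sharing one fixed query
```

Keeping the query span fixed across all pairs is what makes the relevance scores of the n positives directly comparable in the weighting step that follows.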
Relevance-Aware Contrastive Loss The ordinary contrastive loss for training dense retrievers is the InfoNCE loss. Given a positive pair (q, d^+) and a negative pool {d_1^-, ..., d_K^-}, the loss is

L(q, d^+) = -\log \frac{\exp(s(q, d^+)/\tau)}{\exp(s(q, d^+)/\tau) + \sum_{k=1}^{K} \exp(s(q, d_k^-)/\tau)}, (1)

where s(\cdot) and \tau denote the similarity function and the temperature parameter. The overall loss of a batch is then usually the average across all the m \times n positive pairs from m snippets: L = \frac{1}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} L(q_i, d_{ij}^+). The relevance-aware contrastive loss forces the model to focus more on true positive pairs by 1) using the model \theta being trained at the current step as an imperfect oracle to compute the relevance score s_\theta(q, d^+) of every pair; and 2) adaptively assigning weights to different pairs according to the estimated relevance. The relevance-aware contrastive loss L_{relevance} can then be expressed as:

L_{relevance} = \frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{n} \frac{s_\theta(q_i, d_{ij}^+)}{\sum_{k=1}^{n} s_\theta(q_i, d_{ik}^+)} L(q_i, d_{ij}^+). (2)

In this way, within each text snippet, positive pairs that the model is more confident are relevant receive more focus, and vice versa.
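Assuming the weights in Equation (2) are the raw similarity scores normalized within each snippet (consistent with the footnote noting that negative scores would be invalid), the loss for one snippet can be sketched as:

```python
import numpy as np

def info_nce(sim_pos, sim_negs, tau=0.05):
    """InfoNCE loss (Eq. 1) for one positive, given raw similarity scores."""
    logits = np.concatenate(([sim_pos], sim_negs)) / tau
    logits -= logits.max()  # subtract max for numerical stability
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())

def relevance_aware_loss(sims_pos, sims_negs, tau=0.05):
    """Sketch of the relevance-aware loss (Eq. 2) for one snippet with n
    positives: per-pair InfoNCE losses are weighted by the model's own
    relevance estimates, normalized to sum to 1 within the snippet.
    (This weighting scheme is our reconstruction, an assumption.)"""
    weights = np.asarray(sims_pos) / np.sum(sims_pos)
    losses = np.array([info_nce(p, sims_negs, tau) for p in sims_pos])
    return float(np.dot(weights, losses))

sims_pos = [0.9, 0.2, 0.6, 0.4]         # model scores for 4 positives of one query
sims_negs = np.array([0.1, 0.0, 0.05])  # model scores for shared negatives
print(relevance_aware_loss(sims_pos, sims_negs))
```

A likely true positive (score 0.9) thus dominates the snippet's loss, while a likely false positive (score 0.2) contributes only a small, down-weighted term.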

Experiments
In this section, we evaluate our model in several settings after describing our experimental setup. We consider unsupervised retrieval performance and two practical use cases: further pre-training on the target domain and few-shot retrieval. We then conduct an ablation study to separate the impact of our method's two components.
1 Equation (2) will be invalid when s_\theta(q_i, d_{ij}^+) is negative. In a preliminary study, we found the value is always positive, and thus we ignore this special case for simplicity.
• Baselines We compare our model with two types of unsupervised models: models based on contrastive pre-training and models based on auto-encoding pre-training. The former include SimCSE (Gao et al., 2021), coCondenser (Gao and Callan, 2022), Spider (Ram et al., 2022), and Contriever (Izacard et al., 2022). The latter category includes the recently proposed RetroMAE (Xiao et al., 2022). BM25 (Robertson and Zaragoza, 2009) and the uncased BERT-base model (Devlin et al., 2019) are also included for reference. We use the official checkpoints for evaluation.
• Implementation Details We apply our method to the SOTA Contriever model and use its default settings. The pre-training data is a combination of Wikipedia and CCNet (Wenzek et al., 2020), the same as Contriever's. We generate 4 positive pairs per document. Refer to Appendix A for more details. We conduct a t-test with a p-value threshold of 0.05 to compare the performance of ReContriever and our reproduced Contriever.

BEIR
The NDCG@10 of ReContriever and other fully unsupervised models across 15 public datasets of BEIR is shown in Table 1. ReContriever achieves consistent improvements over Contriever on 10/15 datasets, with significant improvements on 9 of them. Notably, it sees only very slight decreases on the datasets without improvement (e.g., FiQA, Touche, and Climate-Fever, with a decrease of at most 0.1). Moreover, our method obtains an average rank of 2.2, making it the best unsupervised dense retriever. BM25 remains a strong baseline in the fully unsupervised scenario, but ReContriever greatly narrows the gap between dense retrievers and BM25.

Open-Domain QA Retrieval
Table 2 shows the Recall of ReContriever on open-domain QA retrieval benchmarks, with the supervised DPR (Karpukhin et al., 2020) included for reference. ReContriever outperforms BM25 by a large margin, except for Recall@5 and Recall@10 on TriviaQA, where the differences are relatively small, verifying the effectiveness of our method. Moreover, among all unsupervised methods, ReContriever obtains the best performance in nearly all cases, with especially substantial improvements over Contriever. ReContriever thus promisingly narrows the gap between supervised and unsupervised models.

Practical Use Cases
In this section, we explore the applicability of ReContriever in more practical scenarios, where only texts in the target corpus (pre-training on the target domain) or very limited annotated training data (few-shot retrieval) are available.

Ablation Study
We conduct an ablation study to investigate the contributions of our proposed loss and pairing strategies within ReContriever, using 100,000 training steps. Solely adding the relevance-aware loss means estimating the relevance of N pairs from N documents and then normalizing the relevance among the N pairs within a batch, which slightly differs from Equation (2), which normalizes over the 4 pairs from the same document. As shown in Table 5, solely adding the relevance-aware contrastive loss to Contriever leads to a noticeable degradation, owing to the missing information from documents with low adjusted weights and the unstable relevance comparison without a fixed query. Applying the one-document-multi-pair strategy alone obtains a slight improvement, which can be attributed to more effective use of the unlabeled data. Combining both strategies (i.e., ReContriever) leads to an obvious improvement, demonstrating the necessity of both components of our method.

Conclusion
In this work, we propose ReContriever to further explore the potential of contrastive pre-training for reducing the demand for human-annotated data in dense retrieval. Benefiting from multiple positives from the same document as well as the relevance-aware contrastive loss, our model achieves remarkable zero-shot performance. Additional results in low-data regimes further verify its value in various practical scenarios.

Limitations
Although ReContriever narrows the gap between BM25 and unsupervised dense retrievers, it still lags behind BM25 when acting as a general-purpose retriever. This issue may make ReContriever not directly usable in a new domain, limiting its practicality. Also, as ReContriever is initialized from the BERT-base language model, it may inherit social biases (Zhao et al., 2017) and thus carries the risk of offending people from under-represented groups.

Ethics Statement
We strictly adhere to the ACL Ethics Policy. This paper focuses on reducing the false-positive problem of unsupervised dense retrieval. The datasets used in this paper are publicly available and have been widely adopted by researchers. We ensure that the findings and conclusions of this paper are reported accurately and objectively.

A Implementation Details
Basic Infrastructure Our implementation is built on PyTorch and HuggingFace Transformers, on top of the released Contriever code. Models are evaluated using the evaluation scripts provided by the BEIR (for BEIR evaluation) and Spider (for open-domain QA retrieval evaluation) GitHub repositories. The pre-training experiments are conducted on 16 NVIDIA A100 GPUs, and the few-shot experiments are conducted on a single NVIDIA A100 GPU. We report the results of a single run with a fixed random seed of 0 (the same setting as Contriever).
Details of ReContriever Following the default settings of Contriever, we pre-train ReContriever for 500,000 steps with a batch size of 2048, initializing from the uncased BERT-base model with 110 million parameters. The pre-training data is a combination of Wikipedia and CCNet (Wenzek et al., 2020). The learning rate is set to 5 × 10^-5, with a warm-up over the first 20,000 steps and a linear decay for the remaining steps. Average pooling over the whole sequence is used to obtain the final representation of a query or document. For each document, we generate 4 positive pairs. For experiments on target-domain pre-training, we initialize the model from our pre-trained Contriever/ReContriever checkpoints. To avoid overfitting, the models are further pre-trained with 5,000 warm-up steps to a learning rate of 1.25 × 10^-7 on all 4 picked datasets, with a batch size of 1024 on 8 NVIDIA A100 GPUs.
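The learning-rate schedule described above can be sketched as follows, assuming a simple linear warm-up followed by linear decay to zero; the exact schedule implementation in the Contriever codebase may differ.

```python
def learning_rate(step, peak_lr=5e-5, warmup=20_000, total=500_000):
    """Linear warm-up over the first `warmup` steps to `peak_lr`,
    then linear decay to zero over the remaining steps
    (hyper-parameters taken from the pre-training setup above)."""
    if step < warmup:
        return peak_lr * step / warmup          # warm-up phase
    return peak_lr * max(0.0, (total - step) / (total - warmup))  # decay phase

print(learning_rate(20_000))  # peak learning rate: 5e-05
```

Warming up before decaying is a common precaution for contrastive pre-training from a BERT initialization, as large early updates can destabilize the pre-trained weights.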
For the few-shot retrieval experiments, we adopt the training procedure from Karpukhin et al. (2020): exploiting BM25 negatives and not including negatives mined by the model itself (Xiong et al., 2021) for few-shot fine-tuning. The hyper-parameters are borrowed directly from Karpukhin et al. (2020), except for the batch size and the number of training epochs. We fine-tune all models for 80 epochs.
For 8 examples, the batch size is set to 8; for 32 or 128 examples, the batch size is 32.

B Dataset Statistics
Details about the number of examples in the three open-domain QA retrieval datasets are shown in Table 6.

Figure 1 :
Figure 1: A text snippet from Wikipedia, where two nearby sentences are quite irrelevant to each other. Random cropping may lead to a false positive query-passage pair.

Table 1 :
NDCG@10 on the BEIR benchmark. All models are trained without any human-annotated data. Bold indicates the best result. The average and rank across the entire benchmark are included. Four datasets are excluded because of their licenses. "†" means ReContriever performs significantly better than our reproduced Contriever, as determined by a t-test with a p-value threshold of 0.05.

Table 2 :
Recall on open-domain retrieval benchmarks. Bold: the best results across unsupervised models. "†" means ReContriever performs significantly better than our reproduced Contriever, as determined by a t-test with a p-value threshold of 0.05.

Table 3 :
NDCG@10 after further pre-training on the target-domain corpus. "⇑" denotes the gains from further pre-training. "†" means ReContriever performs significantly better than our reproduced Contriever.

Table 4 :
Few-shot retrieval on NQ. Results are reported in Recall. "†" means ReContriever performs significantly better than our reproduced Contriever.

Table 6 :
Statistics of Open-Domain QA Retrieval Datasets