LaPraDoR: Unsupervised Pretrained Dense Retriever for Zero-Shot Text Retrieval

In this paper, we propose LaPraDoR, a pretrained dual-tower dense retriever that does not require any supervised data for training. Specifically, we first present Iterative Contrastive Learning (ICoL) that iteratively trains the query and document encoders with a cache mechanism. ICoL not only enlarges the number of negative instances but also keeps representations of cached examples in the same hidden space. We then propose Lexicon-Enhanced Dense Retrieval (LEDR) as a simple yet effective way to enhance dense retrieval with lexical matching. We evaluate LaPraDoR on the recently proposed BEIR benchmark, including 18 datasets of 9 zero-shot text retrieval tasks. Experimental results show that LaPraDoR achieves state-of-the-art performance compared with supervised dense retrieval models, and further analysis reveals the effectiveness of our training strategy and objectives. Compared to re-ranking, our lexicon-enhanced approach can be run in milliseconds (22.5x faster) while achieving superior performance.


Introduction
Dense retrieval uses dense vectors to represent documents and retrieves documents by the similarity scores between query vectors and document vectors. Different from cross-encoders (Reimers and Gurevych, 2019; Gao et al., 2020; MacAvaney et al., 2020) or late-interaction models (Khattab and Zaharia, 2020; Gao et al., 2021a), which predict a match score for each query-document pair and are thus computationally costly, dense retrieval can be run in milliseconds with the help of an approximate nearest neighbor (ANN) retrieval library, e.g., FAISS (Johnson et al., 2021).
As a drawback, dense retrieval models often require large supervised datasets like MS-MARCO (Nguyen et al., 2016) (533k training examples) or NQ (Kwiatkowski et al., 2019) (133k training examples) for training. Unfortunately, Thakur et al. (2021) empirically show that models trained on one dataset suffer from an out-of-domain (OOD) problem when transferred to another. This hinders the application of dense retrieval systems. On the other hand, creating a large supervised training dataset for dense retrieval is time-consuming and expensive. For many low-resource languages, there is no existing supervised dataset for retrieval at all, and it can be extremely difficult to construct one.
The recently proposed BEIR benchmark (Thakur et al., 2021) highlights the generalization ability of text retrieval systems. The benchmark features a setting where models are trained on the large supervised dataset MS-MARCO (Nguyen et al., 2016) and then tested on 18 heterogeneous datasets covering 9 tasks. In this paper, we propose the Large-scale Pretrained Dense Zero-shot Retriever (LaPraDoR), a fully unsupervised pretrained retriever for zero-shot text retrieval. While existing dense retrievers need large supervised datasets and struggle to compete with a lexical matching approach like BM25 (Robertson and Zaragoza, 2009) for zero-shot retrieval, we take a different approach by complementing lexical matching with semantic matching. Without any supervised data, LaPraDoR outperforms all dense retrievers on BEIR. With further fine-tuning, LaPraDoR achieves state-of-the-art performance on BEIR, outperforming re-ranking despite being 22.5× and 42× faster on GPU and CPU, respectively.
Training LaPraDoR faces two challenges: (1) Training Efficiency. For large-scale pretraining, training efficiency is important. In contrastive learning, more negative instances often lead to better performance (Giorgi et al., 2021; Wu et al., 2020; Gao et al., 2021b). However, traditional in-batch negative sampling is bottlenecked by limited GPU memory. To alleviate this problem, we propose Iterative Contrastive Learning (ICoL), which iteratively trains the query and document encoders with a cache mechanism. Compared to existing solutions such as MoCo (He et al., 2020) and xMoCo (Yang et al., 2021), ICoL does not introduce extra encoders and resolves the mismatch between representation spaces, thus demonstrating superior performance.
(2) Versatility. There are different types of downstream tasks from various domains in both BEIR and real-world applications. We use a large-scale multi-domain corpus, C4 (Raffel et al., 2020), to train our LaPraDoR model. To make LaPraDoR versatile, besides conventional query-document retrieval, we also incorporate document-query, query-query, and document-document retrieval into the pretraining objective. We further share the weights between the query and document encoders and obtain an all-around encoder that fits all retrieval tasks.
To summarize, our contribution is three-fold: (1) We train LaPraDoR, an all-around unsupervised pretrained dense retriever that achieves state-of-the-art performance on the BEIR benchmark. (2) We propose Iterative Contrastive Learning (ICoL) for training a retrieval model effectively. (3) We propose Lexicon-Enhanced Dense Retrieval as an efficient way to combine BM25 with a dense retriever, compared to the widely used re-ranking paradigm.

Related Work
Dense Retrieval DPR (Karpukhin et al., 2020) initializes a bi-encoder model with BERT (Devlin et al., 2019) and achieves better results than earlier dense retrieval methods. RocketQA (Qu et al., 2021) exploits a trained retriever to mine hard negatives and then re-trains a retriever with the mined negatives. ANCE (Xiong et al., 2021) dynamically mines hard negatives throughout training but requires periodic encoding of the entire corpus. TAS-B (Hofstätter et al., 2021) is a bi-encoder trained with balanced topic-aware sampling and knowledge distillation from a cross-encoder and a ColBERT model (Khattab and Zaharia, 2020), in addition to in-batch negatives. xMoCo (Yang et al., 2021) adapts MoCo (He et al., 2020), a contrastive learning algorithm originally proposed for image representation, to text retrieval by doubling its fast and slow encoders. Although these dense retrieval systems demonstrate effectiveness on some datasets, the BEIR benchmark (Thakur et al., 2021) highlights a main drawback of these systems: their failure to generalize to out-of-domain data. This motivates pretraining as a solution for better domain generalization (Gururangan et al., 2020). Dense retrieval has also been applied to many other tasks (Guo et al., 2019, 2020).
Pretraining for Retrieval Lee et al. (2019) first propose to pretrain a bi-encoder retriever with an Inverse Cloze Task (ICT), which constructs a training pair by randomly selecting a sentence from a passage as the query and leaving the rest as the document. Chang et al. (2020) propose two pretraining tasks for Wikipedia and attempt to combine them with ICT and masked language modeling (MLM). Guu et al. (2020) pretrain a retriever and a reader together for end-to-end question answering (QA).
Very recently, DPR-PAQ (Oguz et al., 2021) highlights the importance of domain matching by using both synthetic and crawled QA data to pretrain and then fine-tune the model on downstream datasets for dialogue retrieval. Condenser (Gao and Callan, 2021a) is a new Transformer variant for MLM pretraining. It exploits an information bottleneck to facilitate learning for information aggregation. On top of that, coCondenser (Gao and Callan, 2021b) adds an unsupervised corpus-level contrastive loss to warm up the passage embedding space. Different from these works, LaPraDoR is the first pretrained retriever that does not require fine-tuning on a downstream dataset and can perform zero-shot retrieval.

Dual-Tower Architecture
Two Encoders The dual-tower architecture, as illustrated in Figure 1, is widely used in dense retrieval systems (Lee et al., 2019; Karpukhin et al., 2020; Xiong et al., 2021). The dual-tower architecture has a query encoder $E_Q$ and a document encoder $E_D$, which in our work are both BERT-like bidirectional text encoders (Devlin et al., 2019).
Compared with cross-attention models (Reimers and Gurevych, 2019; Gao et al., 2020; MacAvaney et al., 2020), the dual-tower architecture enables pre-indexing and fast approximate nearest neighbor search (detailed shortly) and is thus popular in production.
Dense Representation Given an input query (document), the query (document) encoder $E_Q$ ($E_D$) produces a fixed-size dense vector that serves as its representation for retrieval.

Similarity Function After obtaining the representations of the query $q$ and the document $d$, we use the cosine function to measure the similarity between them:

$$\mathrm{sim}(q, d) = \cos\big(E_Q(q), E_D(d)\big) = \frac{E_Q(q) \cdot E_D(d)}{\|E_Q(q)\|\,\|E_D(d)\|} \quad (1)$$

Approximate Nearest Neighbor In practice, for the dual-tower architecture, the documents are encoded offline and their dense representations can be pre-indexed by a fast vector similarity search library (e.g., FAISS, Johnson et al., 2021). The library can utilize GPU acceleration to perform approximate nearest neighbor (ANN) search in sublinear time with almost no loss in recall. Thus, compared to a cross-encoder (i.e., an encoder that accepts the concatenation of the query and every candidate document), a pre-indexed ANN-based retrieval system is at least 10 times faster (detailed in Section 4.2).
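To make the retrieval pipeline concrete, the following is a minimal sketch of a dual-tower retriever with a pre-built inner-product index. The checkpoint name, mean pooling, and the exact (flat) index type are illustrative assumptions rather than the precise choices made by LaPraDoR.

```python
import torch
import torch.nn.functional as F
import faiss
from transformers import AutoTokenizer, AutoModel

# Illustrative checkpoint; LaPraDoR itself is initialized from a 6-layer DistilBERT.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
encoder = AutoModel.from_pretrained("distilbert-base-uncased")
encoder.eval()

def encode(texts, max_length=350):
    """Encode texts into L2-normalized dense vectors (mean pooling is an assumption)."""
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=max_length, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state          # (B, L, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()     # (B, L, 1)
    emb = (hidden * mask).sum(1) / mask.sum(1)               # average over real tokens
    return F.normalize(emb, dim=-1).numpy()                  # normalized: inner product = cosine

# Offline: encode the collection and build a (here exact) inner-product index.
docs = ["Aeneas left Carthage and sailed for Italy.",
        "Mars's average distance from the Sun is roughly 230 million kilometres."]
doc_emb = encode(docs)
index = faiss.IndexFlatIP(doc_emb.shape[1])
index.add(doc_emb)

# Online: encode the query and retrieve the top-k documents by cosine similarity.
scores, ids = index.search(encode(["Where did Aeneas go when he left Carthage?"], max_length=64), 2)
print(list(zip(ids[0], scores[0])))
```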

Constructing Positive Instances
In this section, we introduce how we build positive instances with two self-supervised tasks, namely the Inverse Cloze Task (ICT) and Dropout as Positive Instance (DaPI).
Inverse Cloze Task (ICT) First introduced by Lee et al. (2019), ICT is an effective way to pretrain a text retrieval model (Chang et al., 2020). Given a passage $p$ consisting of sentences $p = \{s_1, \ldots, s_n\}$, we randomly select a sentence $s_k$ as the query $q$ and treat its context $d = \{s_1, \ldots, s_{k-1}, s_{k+1}, \ldots, s_n\}$ as the document. ICT is designed to mimic a text retrieval task where a short query is used to retrieve a longer, semantically relevant document. Also, unlike some pretraining tasks, e.g., Wiki Link Prediction or Body First Selection (Chang et al., 2020), ICT is fast and does not rely on a specific corpus format (e.g., Wikipedia), so it can be scaled to a large multi-source corpus (e.g., C4, Raffel et al., 2020).
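As a rough illustration, an ICT pair can be constructed in a few lines; the sentence splitting and the choice to always remove the selected sentence from the document are simplifying assumptions.

```python
import random

def ict_pair(sentences):
    """Build one Inverse Cloze Task pair from a sentence-split passage.

    A randomly chosen sentence acts as the pseudo-query; the remaining
    sentences form the pseudo-relevant document.
    """
    k = random.randrange(len(sentences))
    query = sentences[k]
    document = " ".join(sentences[:k] + sentences[k + 1:])
    return query, document

passage = ["Aeneas fled Troy.", "He sojourned in Carthage.", "He then sailed on to Italy."]
q, d = ict_pair(passage)  # e.g., q = "He sojourned in Carthage.", d = the other two sentences
```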
Dropout as Positive Instance (DaPI) DaPI is originally proposed in SimCSE (Gao et al., 2021c) as a simple strategy for perturbing intermediate representations, which can thus serve as data augmentation. A similar idea is also presented in Liu et al. (2021). We apply a dropout rate of 0.1 to the fully-connected layers and attention probabilities in the Transformer encoders, as in BERT (Devlin et al., 2019). The same input is fed to the encoder twice to obtain two representations, one of which is used as the positive instance of the other. Gao et al. (2021c) conduct experiments and conclude that the dropout strategy outperforms all commonly used discrete perturbation techniques, including cropping, word deletion, masked language modeling, and synonym replacement. Note that, different from SimCSE, we only calculate gradients for one of the two passes.
In our experiments, we find that adding DaPI increases memory use by only 2%, since it mostly reuses the computational graph of the ICT objective.
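A minimal sketch of DaPI is shown below, assuming a BERT-style encoder kept in training mode so that dropout stays active; the mean pooling and the helper structure are illustrative, while the one-sided gradient flow mirrors the description above.

```python
import torch
import torch.nn.functional as F

def dapi_views(encoder, batch):
    """Two embeddings of the same inputs under different dropout masks.

    `encoder` is assumed to be a BERT-style model kept in train() mode so that
    dropout (p = 0.1) stays active; mean pooling is an illustrative choice.
    Only the first pass keeps gradients, mirroring the single-pass backprop above.
    """
    encoder.train()

    def pool(output):
        mask = batch["attention_mask"].unsqueeze(-1).float()
        emb = (output.last_hidden_state * mask).sum(1) / mask.sum(1)
        return F.normalize(emb, dim=-1)

    view_1 = pool(encoder(**batch))            # gradients kept for this pass
    with torch.no_grad():
        view_2 = pool(encoder(**batch))        # second pass: different dropout mask, no gradients
    return view_1, view_2                      # view_2 is the positive instance for view_1
```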

Iterative Contrastive Learning
Previous studies (Giorgi et al., 2021; Wu et al., 2020; Gao et al., 2021b) show that the number of negative instances is critical to model performance. Since the batch size on a single GPU is limited by memory, we propose Iterative Contrastive Learning (ICoL) to allow more negative instances for better performance. We illustrate LaPraDoR training in Figure 2.
Iterative Training We iteratively train the query encoder and document encoder. To be specific, we first arbitrarily select an encoder to start training.
Here we assume training starts with the query encoder $E_Q$. The training loss consists of two terms. First, we calculate the loss for query-query retrieval with DaPI by optimizing the negative log-likelihood of the positive instance:

$$\mathcal{L}_{qq} = -\log \frac{e^{\mathrm{sim}(q_i, q_i^+)}}{e^{\mathrm{sim}(q_i, q_i^+)} + \sum_{j=1}^{n} e^{\mathrm{sim}(q_i, q_{i,j}^-)}}$$

where $q_i$ and $q_i^+$ are the same query encoded by $E_Q$ with different dropout masks; $\{q_{i,1}^-, \ldots, q_{i,n}^-\}$ is a set of randomly sampled negative instances; and $\mathrm{sim}(\cdot, \cdot)$ is the cosine similarity function defined in Equation 1.
The second term is for retrieving the corresponding document $d_i^+$ with the query $q_i$, where $q_i$ and $d_i^+$ are a pair constructed with ICT. Similarly, we optimize the negative log-likelihood of the positive instance:

$$\mathcal{L}_{qd} = -\log \frac{e^{\mathrm{sim}(q_i, d_i^+)}}{e^{\mathrm{sim}(q_i, d_i^+)} + \sum_{j=1}^{n} e^{\mathrm{sim}(q_i, d_{i,j}^-)} + \sum_{j=1}^{|\mathcal{Q}|} e^{\mathrm{sim}(q_i, d_{\mathcal{Q},j}^-)}}$$

where $\{d_{i,1}^-, \ldots, d_{i,n}^-\}$ is a set of freshly sampled documents encoded at the current step $i$, and $\{d_{\mathcal{Q},1}^-, \ldots, d_{\mathcal{Q},|\mathcal{Q}|}^-\}$ is the set of representations currently stored in the cache queue $\mathcal{Q}$. Then, we optimize the sum of the two losses with a weight coefficient $\lambda$:

$$\mathcal{L}_{E_Q} = \mathcal{L}_{qd} + \lambda\, \mathcal{L}_{qq}$$

Note that the query $q_i$ only needs to be encoded once and can be used to compute both $\mathcal{L}_{qd}$ and $\mathcal{L}_{qq}$.
After a predefined number of steps, $E_Q$ is frozen and the training of $E_D$ starts. Similarly, for a document $d_i$ encoded by $E_D$, the training objective is:

$$\mathcal{L}_{dd} = -\log \frac{e^{\mathrm{sim}(d_i, d_i^+)}}{e^{\mathrm{sim}(d_i, d_i^+)} + \sum_{j=1}^{n} e^{\mathrm{sim}(d_i, d_{i,j}^-)}}$$

$$\mathcal{L}_{dq} = -\log \frac{e^{\mathrm{sim}(d_i, q_i^+)}}{e^{\mathrm{sim}(d_i, q_i^+)} + \sum_{j=1}^{n} e^{\mathrm{sim}(d_i, q_{i,j}^-)} + \sum_{j=1}^{|\mathcal{Q}|} e^{\mathrm{sim}(d_i, q_{\mathcal{Q},j}^-)}}$$

$$\mathcal{L}_{E_D} = \mathcal{L}_{dq} + \lambda\, \mathcal{L}_{dd}$$

where $d_i^+$ and $q_i^+$ are positive instances constructed by DaPI and ICT, respectively; $\{d_{i,1}^-, \ldots, d_{i,n}^-\}$ is a set of randomly sampled document negatives; $\{q_{i,1}^-, \ldots, q_{i,n}^-\}$ is a set of freshly sampled queries encoded at step $i$; and $\{q_{\mathcal{Q},1}^-, \ldots, q_{\mathcal{Q},|\mathcal{Q}|}^-\}$ are the cached query representations. To speed up training, we apply the in-batch negatives technique (Yih et al., 2011; Henderson et al., 2017; Gillick et al., 2019), which reuses computation and trains $b$ queries/documents in a mini-batch simultaneously.
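The sketch below illustrates how one side of the ICoL objective might be computed with in-batch negatives, cached negatives, and the cosine re-scaling trick mentioned in the training details (multiplying similarities by 20). The function names and the exact combination of the two terms are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, cached_negatives=None, scale=20.0):
    """Negative log-likelihood of the positive against in-batch and cached negatives.

    All inputs are L2-normalized embeddings of shape (B, H) or (|Q|, H);
    `scale` is the re-scaling trick (cosine similarity multiplied by 20).
    """
    logits = anchor @ positive.t()                       # (B, B): in-batch negatives, diagonal = positives
    if cached_negatives is not None:
        logits = torch.cat([logits, anchor @ cached_negatives.t()], dim=1)   # append cache columns
    labels = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(scale * logits, labels)

def query_encoder_loss(q, q_dapi, d_ict, cache, lam=1.0):
    """L_{E_Q} = L_qd + lam * L_qq; lam = 1 in the paper's setup (this combination is a sketch)."""
    l_qq = info_nce(q, q_dapi)                           # query-query term with DaPI positives
    l_qd = info_nce(q, d_ict, cached_negatives=cache)    # query-document term with ICT positives + cached docs
    return l_qd + lam * l_qq
```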
Cache Mechanism To enlarge the set of negative instances, we maintain a cache queue $\mathcal{Q}$ that stores previously encoded representations, which can serve as negative instances for the current step, extending an earlier study (Wu et al., 2018). Our cache queue is implemented as a first-in-first-out (FIFO) queue with a maximum capacity $m$, a hyperparameter set based on the GPU memory size. When training with multiple GPUs, $\mathcal{Q}$ can be shared across GPUs. Since the representations in the queue are encoded with a frozen encoder and thus do not require gradients, $m$ can be set large to supplement the number of negative instances. When $\mathcal{Q}$ is full, the earliest cached representations are dequeued. When we switch training from one encoder to the other, the queue is cleared to ensure that all representations in $\mathcal{Q}$ lie in the same hidden space and are encoded with the currently frozen encoder.
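A FIFO cache of frozen-encoder representations can be sketched as follows; the class name and interface are hypothetical, but the capacity, the first-in-first-out eviction, and the clearing on encoder switch follow the description above.

```python
from collections import deque
import torch

class NegativeCache:
    """FIFO queue of representations produced by the currently frozen encoder.

    Cached vectors carry no gradients, so the capacity m (100k in the paper's
    setup) can be far larger than a mini-batch.
    """
    def __init__(self, capacity=100_000):
        self.queue = deque(maxlen=capacity)      # oldest entries are evicted automatically

    def enqueue(self, reps):
        for r in reps.detach().cpu():
            self.queue.append(r)

    def negatives(self, device):
        if not self.queue:
            return None
        return torch.stack(list(self.queue)).to(device)

    def clear(self):
        # Called when switching which encoder is trained, so every cached
        # vector stays in the hidden space of the (new) frozen encoder.
        self.queue.clear()
```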
ICoL vs. MoCo Similar to our method, MoCo (He et al., 2020) also exploits a queue for storing encoded representations. Specifically, MoCo consists of a fast encoder and a slow encoder that encode queries and documents, respectively. The slow encoder is updated as a slow moving average of the fast encoder to reduce the inconsistency of encoded document representations between training steps. A queue is maintained so that the encoded document representations can be reused in later steps as negative instances.
However, we argue that two limitations make MoCo not ideal for training a text retrieval model: (1) As pointed out by Yang et al. (2021), unlike the image matching task in the original MoCo paper, in text retrieval the queries and documents are distinct from each other and thus not interchangeable. Yang et al. (2021) propose xMoCo, which incorporates two sets of fast and slow encoders, as a simple fix for this flaw. (2) The cached representations lie in different hidden spaces. Although the slow encoders in both MoCo and xMoCo are updated with momentum, the already-encoded representations in the queue are never updated. This creates a semantic mismatch between newly encoded and cached representations and introduces noise during training. In ICoL, all representations used for contrastive learning are aligned in the same hidden space. Besides, ICoL is more flexible than xMoCo since it does not introduce additional encoders, and the weights of its query encoder and document encoder can even be shared. We compare ICoL with MoCo and xMoCo experimentally in Section 4.2.1.

Lexicon-Enhanced Dense Retrieval
Although dense retrieval achieves state-of-the-art performance, its performance degenerates significantly on out-of-domain data (Thakur et al., 2021). On the other hand, BM25 (Robertson and Zaragoza, 2009) demonstrates good performance without training. Early attempts at combining lexical matching with dense retrieval often formulate it as a re-ranking task (Nguyen et al., 2016): first, BM25 is used to recall the top-$k$ documents from the corpus; then, a cross-encoder is applied to re-rank the candidate documents. Recently, COIL (Gao et al., 2021a) highlights the importance of lexical matching and incorporates exact lexical matching into dense retrieval. Different from these works, we propose a fast and effective way, namely Lexicon-Enhanced Dense Retrieval (LEDR), to enhance dense retrieval with BM25. The BM25 similarity score is defined as:

$$\mathrm{BM25}(q, d) = \sum_{t \in q} \mathrm{IDF}(t) \cdot \frac{\mathrm{TF}_{t,d} \cdot (k_1 + 1)}{\mathrm{TF}_{t,d} + k_1 \cdot \left(1 - b + b \cdot \frac{|d|}{\mathrm{avgdl}}\right)} \cdot \frac{\mathrm{TF}_{t,q} \cdot (k_2 + 1)}{\mathrm{TF}_{t,q} + k_2}$$

where $\mathrm{TF}_{t,d}$ and $\mathrm{TF}_{t,q}$ refer to the term frequency of term $t$ in document $d$ and query $q$, respectively; $\mathrm{IDF}(t)$ is the inverse document frequency; $|d|$ is the document length and $\mathrm{avgdl}$ is the average document length in the corpus; and $b$, $k_1$, and $k_2$ are hyperparameters. For inference, we simply multiply the BM25 score with the similarity score for dense retrieval:

$$s(q, d) = \mathrm{BM25}(q, d) \cdot \mathrm{sim}(q, d)$$

In this way, we consider both lexical and semantic matching. This combination makes LaPraDoR more robust on unseen data in zero-shot learning.
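Since BM25 scores and dense cosine similarities are simply multiplied, LEDR scoring can be sketched in a few lines; the dictionaries and document IDs below are hypothetical, and the zero BM25 score outside the lexical top-1,000 follows the setup described later in the experimental settings.

```python
def ledr_score(bm25_scores, dense_scores):
    """Lexicon-Enhanced Dense Retrieval: multiply lexical and semantic scores.

    `bm25_scores` maps doc_id -> BM25 score for the lexical top-1,000 hits;
    any other document defaults to a BM25 score of 0. `dense_scores` maps
    doc_id -> cosine similarity from the dual-tower retriever.
    """
    return {doc_id: bm25_scores.get(doc_id, 0.0) * sim
            for doc_id, sim in dense_scores.items()}

bm25 = {"d1": 12.3, "d2": 9.1}                    # "d3" fell outside the lexical top-1,000
dense = {"d1": 0.62, "d2": 0.71, "d3": 0.80}
ranking = sorted(ledr_score(bm25, dense).items(), key=lambda x: x[1], reverse=True)
# -> [("d1", 7.626), ("d2", 6.461), ("d3", 0.0)]
```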

Experimental Setting
Benchmark We use BEIR (Thakur et al., 2021), a recently released benchmark for zero-shot evaluation of information retrieval models. BEIR includes 18 heterogeneous datasets and focuses on evaluating retrieval systems across different domains (bio-medical, scientific, news, social media, etc.). The benchmark uses Normalized Discounted Cumulative Gain (nDCG) (Järvelin and Kekäläinen, 2002) as the evaluation metric, a measure of ranking quality that is often used to assess the effectiveness of search algorithms and retrieval models. Details of the BEIR benchmark and the evaluation metric are included in Appendix A.
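For intuition, one common formulation of nDCG@k (with linear gain) can be computed as follows; the official BEIR evaluation uses its own scripts, so this sketch is only illustrative.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain of a ranked list of graded relevance labels."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg_at_k(ranked_relevances, k=10):
    """nDCG@k: DCG of the system ranking divided by the DCG of the ideal ranking."""
    ideal = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg(ideal[:k])
    return dcg(ranked_relevances[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0

# Graded relevance of the top-5 retrieved documents for a single (hypothetical) query.
print(round(ndcg_at_k([3, 0, 2, 0, 1], k=5), 4))
```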
Model Settings In our preliminary experiments on Wikipedia (see Table 2), we find that sharing weights between the query encoder $E_Q$ and the document encoder $E_D$ has no negative effect on downstream performance. For weight sharing between $E_Q$ and $E_D$, we simply copy the weights of $E_Q$ to $E_D$ when switching to the training of $E_D$, and vice versa. This design eliminates nearly half of the parameters. An additional benefit is that weight sharing makes the encoder versatile enough to handle not only query-document retrieval, but also query-query and document-document retrieval.
In our preliminary experiments on Wikipedia, we observed diminishing returns when increasing the model size from 6 layers to 12 or 24 layers. Thus, we initialize our encoder with the 6-layer DistilBERT (Sanh et al., 2019), which has ∼67M parameters. For BM25, we use the implementation and default settings of Elasticsearch (https://github.com/elastic/elasticsearch). BM25 scores beyond the top 1,000 retrieved texts are set to 0 to save computation.
Training Details For pretraining, we optimize the model with the AdamW optimizer with a learning rate of 2e-4. The model is trained on 16 Nvidia V100 32GB GPUs with FP16 mixed-precision training. The batch size for each GPU is set to 256. The maximum lengths for queries and documents are 64 and 350 tokens, respectively. Training switches between $E_Q$ and $E_D$ every 100 steps. The cache queue has a maximum capacity $m$ of 100k. The loss weight hyperparameter $\lambda$ is fixed to 1. For our main results, we train LaPraDoR on C4 (Raffel et al., 2020) for 1M steps, which takes about 400 hours. For the ablation study, since training on C4 is very costly, we train LaPraDoR on Wikipedia for 100k steps. When calculating the loss, we apply a re-scaling trick of multiplying the cosine similarity score by 20 for better optimization (Thakur et al., 2021). Our implementation of LaPraDoR is based on Hugging Face Transformers (Wolf et al., 2020) and Datasets (Lhoest et al., 2021).
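Putting the pieces together, the iterative schedule might look like the skeleton below. The loss and frozen-side encoding are passed in as caller-supplied callbacks standing in for the components sketched earlier, and the optimizer handling and weight copying are simplifications of the actual training loop.

```python
import torch

def train_icol(e_q, e_d, loader, cache, side_loss, encode_frozen,
               make_optimizer=lambda p: torch.optim.AdamW(p, lr=2e-4),
               switch_every=100, share_weights=True):
    """Skeleton of the iterative ICoL schedule (a sketch under stated assumptions).

    `side_loss(trained, frozen, batch, cache)` returns the current side's loss
    (e.g., L_qd/L_dq + lam * L_qq/L_dd); `encode_frozen(frozen, batch)` returns
    the frozen-side representations to be cached as future negatives.
    """
    trained, frozen = e_q, e_d
    optimizer = make_optimizer(trained.parameters())
    for step, batch in enumerate(loader):
        loss = side_loss(trained, frozen, batch, cache)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        with torch.no_grad():
            cache.enqueue(encode_frozen(frozen, batch))        # future negatives from the frozen side
        if (step + 1) % switch_every == 0:
            if share_weights:
                frozen.load_state_dict(trained.state_dict())   # copy weights when the encoders are shared
            trained, frozen = frozen, trained
            optimizer = make_optimizer(trained.parameters())   # simplification: fresh optimizer per phase
            cache.clear()                                      # keep cached vectors in one hidden space
```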
We test LaPraDoR under two settings: (1) no supervised data at all, where we directly use the pretrained model for zero-shot retrieval on BEIR; and (2) fine-tuning on MS-MARCO (Nguyen et al., 2016) followed by zero-shot transfer to the other datasets, which is the original BEIR setting. We use BEIR's official script to fine-tune LaPraDoR. The batch size is set to 75 per GPU and the learning rate is 2e-5.
Baselines For dense retrieval, we compare our model to the dual-tower models DPR (Karpukhin et al., 2020), ANCE (Xiong et al., 2021), TAS-B (Hofstätter et al., 2021), and GenQ (Thakur et al., 2021). For lexical matching, we use the BM25 results reported in Thakur et al. (2021). We also consider a late-interaction baseline, ColBERT (Khattab and Zaharia, 2020), which computes multiple contextualized embeddings for each token of the query and document and scores documents with a maximum-similarity operation to retrieve relevant documents. For re-ranking, we use the BM25+CE baseline implemented in Thakur et al. (2021), which uses BM25 to retrieve the top-100 documents and a cross-encoder model to re-rank them. As shown in Table 1, the latency of both lexical and dense retrieval is low, whereas re-ranking introduces significantly higher latency, with late interaction in between. Details of the baselines can be found in Appendix B.

Experimental Results
We list the results of LaPraDoR on the BEIR benchmark. Even without any supervised data, LaPraDoR outperforms the previous state-of-the-art for zero-shot dense retrieval, TAS-B (Hofstätter et al., 2021), on 13 of the 18 BEIR tasks, with an average advantage of 0.042, even though TAS-B applies additional query clustering and knowledge distillation. When further fine-tuned on MS-MARCO, LaPraDoR outperforms all baselines, including late interaction and re-ranking, whose GPU latency is 17.5× and 22.5× higher than that of our method, respectively. Compared to dense retrieval, we add only 0.4 GB of BM25 indices and almost no additional latency.

Effect of Iterative Contrastive Learning
We set a baseline that only uses in-batch negatives and compare our proposed Iterative Contrastive Learning (ICoL) to MoCo (He et al., 2020) and xMoCo (Yang et al., 2021) in Table 2. The "w/o ICT" variant is equivalent to the original SimCSE approach (Gao et al., 2021c). The pretraining is on Wikipedia.

Effect of Pretraining and Lexicon-Enhanced Dense Retrieval
We conduct an ablation study for both pretraining and Lexicon-Enhanced Dense Retrieval to verify the effectiveness of these designs. As shown in Table 3, Lexicon-Enhanced Dense Retrieval (LEDR) improves the performance of dense retrieval on most tasks for both fully unsupervised and fine-tuned LaPraDoR. Furthermore, as illustrated in Table 4, we test the effectiveness of the two components in our loss function. We can see that both ICT and DaPI contribute significantly to the performance of our model (p < 0.01), with ICT having the larger impact on the final performance.

Case Study
We conduct a case study to intuitively demonstrate the effectiveness of LaPraDoR. As shown in Figure 3, for Q1, the lexical method (i.e., BM25) successfully finds the corresponding document in its top-2 retrieved results. However, due to lower lexical overlap, the score of the ground truth is lower than that of the first document. Although the phrase "prepare for his departure" in the first document indicates that Aeneas has not left Carthage yet and provides strong evidence that this document is incorrect, BM25 fails to rank the ground truth correctly because it lacks the ability to perform semantic matching. By incorporating both lexical and semantic matching, LaPraDoR successfully retrieves the ground truth.

Q1:
Where did Aeneas go when he left Carthage?

BM25 (Top 1):
Dido and Aeneas are accompanied by their train. … Dido and Aeneas are together within the activity … Aeneas is stopped by the Sorceress's elf, who is disguised as Mercury … Aeneas is to wait no longer in beginning his task of creating a new Troy on Latin soil. Aeneas consents to the wishes of what he believes are the gods, but is heart-broken that he will have to leave Dido. He then goes off-stage to prepare for his departure from Carthage.

BM25 (Top 2):
After the sojourn in Carthage, the Trojans returned to Sicily where Aeneas organized funeral games to honor his father, who had died a year before.… Aeneas descended into the underworld where he met Dido (who turned away from him to return to her husband) and his father, who showed him the future of his descendants and thus the history of Rome.

LaPraDoR (Top 1):
After the sojourn in Carthage, the Trojans returned to Sicily where Aeneas organized funeral games to honor his father, who had died a year before. … Aeneas descended into the underworld where he met Dido (who turned away from him to return to her husband) and his father, who showed him the future of his descendants and thus the history of Rome.

For Q2, with its powerful semantic matching, LaPraDoR successfully retrieves the ground truth, whereas BM25 fails to distinguish among the documents that contain both the keywords Mars and Sun. On the other hand, after removing lexical matching, LaPraDoR without LEDR suffers from noise: the key entity Sun does not appear in its top-1 retrieved document. LEDR helps filter out such noise and allows the dense retriever to focus on fine-grained semantic matching. More cases from other datasets can be found in Appendix C.

Conclusion and Future Work
In this paper, we introduce LaPraDoR, an unsupervised pretrained dense retriever that achieves state-of-the-art performance on the zero-shot text retrieval benchmark BEIR. We propose Iterative Contrastive Learning (ICoL) for efficiently training LaPraDoR and Lexicon-Enhanced Dense Retrieval (LEDR) to combine lexical matching with LaPraDoR. Our experiments verify the effectiveness of both ICoL and LEDR, shedding light on a new paradigm for unsupervised text retrieval. For future work, we plan to extend unsupervised LaPraDoR to multilingual and multi-modal retrieval.

Broader Impact
Ethical Concerns LaPraDoR is trained with web-crawled data, which may contain inappropriate content. However, due to the nature of text retrieval, our retriever has lower ethical risk compared to a generative auto-regressive language model (Bender et al., 2021). Meanwhile, our unsupervised retrieval model enables high-performance text retrieval for low-resource languages where there is no supervised query-document dataset. This contributes to equality and diversity of language technology.
Carbon Footprint To conduct all experiments in this paper, we estimate that we consumed 3,840 kWh of electricity and emitted 1,420.8 kg (3,132.3 lbs) of CO$_2$. All emitted carbon dioxide has already been offset by the cloud service provider.
Figure 2:
Training of LaPraDoR with Iterative Contrastive Learning (ICoL): (a) query encoder training; (b) document encoder training. We iteratively train the query encoder and the document encoder while freezing the other (marked with an ice cube icon). For $\mathcal{L}_{qd}$ and $\mathcal{L}_{dq}$, we obtain additional negative instances from the cache queue. For each batch of data, we enqueue the representations encoded by the frozen encoder into the cache queue as future negative instances. The cache queue is cleared when switching the encoder to train from one to the other.

Q2:
What's the distance between Mars and Sun?

BM25 (Top 1):
From an observation of a transit of Venus in 1032, the Persian astronomer and polymath Avicenna concluded that Venus is closer to Earth than the Sun. In 1672 Giovanni Cassini and Jean Richer determined the distance to Mars and were thereby able to calculate the distance to the Sun. …

BM25 (Top 5):
Mars's average distance from the Sun is roughly 230 million kilometres (143,000,000 mi), and its orbital period is 687 (Earth) days …

LaPraDoR (Top 1):
Mars's average distance from the Sun is roughly 230 million kilometres (143,000,000 mi), and its orbital period is 687 (Earth) days …

LaPraDoR w/o LEDR (Top 1):
Mars is the focus of much scientific study about possible human colonization. Mars' surface conditions and past presence of water make it arguably the most hospitable planet in the Solar System besides Earth. Mars requires less energy per unit mass (delta-v) to reach from Earth than any planet, except Venus.

Table 1:
Retrieval latency and index size of different retrieval approaches (Thakur et al., 2021) on the BEIR benchmark. The estimated average retrieval latency and index sizes are for a single query in DBPedia. The encoding speed is reported on an 8-core Intel Xeon Platinum 8168 CPU @ 2.70GHz and a single Nvidia V100 GPU, respectively. "LaPraDoR FT" is a LaPraDoR model fine-tuned on MS-MARCO with the official BEIR training script.

Table 2:
Comparison of different methods for contrastive learning. The models are trained on Wikipedia.

Table 4:
Effect of ICT and DaPI in the loss function.