SimLM: Pre-training with Representation Bottleneck for Dense Passage Retrieval

In this paper, we propose SimLM (Similarity matching with Language Model pre-training), a simple yet effective pre-training method for dense passage retrieval. It employs a simple bottleneck architecture that learns to compress the passage information into a dense vector through self-supervised pre-training. We use a replaced language modeling objective, which is inspired by ELECTRA (Clark et al., 2020), to improve the sample efficiency and reduce the mismatch of the input distribution between pre-training and fine-tuning. SimLM only requires access to an unlabeled corpus and is more broadly applicable when there are no labeled data or queries. We conduct experiments on several large-scale passage retrieval datasets and show substantial improvements over strong baselines under various settings. Remarkably, SimLM even outperforms multi-vector approaches such as ColBERTv2 (Santhanam et al., 2021) which incurs significantly more storage cost. Our code and model checkpoints are available at https://github.com/microsoft/unilm/tree/master/simlm .


Introduction
Passage retrieval is an important component in applications like ad-hoc information retrieval, open-domain question answering [17], retrieval-augmented generation [20], and fact verification [38]. Sparse retrieval methods such as BM25 were the dominant approach for several decades and still play a vital role today. With the emergence of large-scale pre-trained language models (PLM) [10], increasing attention is being paid to neural dense retrieval methods [42]. Dense retrieval methods map both queries and passages into a low-dimensional vector space, where the relevance between a query and a passage is measured by the dot product or cosine similarity of their respective vectors. Like other NLP tasks, dense retrieval benefits greatly from a strong general-purpose pre-trained language model. However, general-purpose pre-training does not solve all the problems. As shown in Table 1, improved pre-training techniques that are verified on benchmarks like GLUE [39] do not result in consistent performance gains on retrieval tasks. Similar observations are also made by Lu et al. [24]. We hypothesize that, to perform robust retrieval, the [CLS] vector used for computing matching scores should encode all the essential information in the passage.

Related Work
Dense Retrieval The field of information retrieval (IR) [28] aims to find relevant information given an ad-hoc query and has played a key role in the success of modern search engines. In recent years, IR has witnessed a paradigm shift from traditional BM25-based inverted index retrieval to neural dense retrieval [42,17]. BM25-based retrieval, though efficient and interpretable, suffers from lexical mismatch between the query and passages. Methods like document expansion [30] or query expansion [1] have been proposed to help mitigate this issue. In contrast, neural dense retrievers first map the query and passages to a low-dimensional vector space and then perform semantic matching. Popular methods include DSSM [16], C-DSSM [37], and DPR [17], among others. Inference can be done efficiently with approximate nearest neighbor (ANN) search algorithms such as HNSW [27].
Some recent works [4,32,36] show that neural dense retrievers may fail to capture exact lexical match information. To mitigate this issue, Chen et al. [4] propose to use BM25 as a complementary teacher model, ColBERT [18] replaces simple dot-product matching with a more complex token-level MaxSim interaction, and COIL [14] incorporates lexical match information into the scoring component of neural retrievers. Our proposed pre-training method aims to adapt the underlying text encoders for retrieval tasks and can be easily integrated with existing approaches.
Pre-training for Dense Retrieval With the development of large-scale language model pre-training [11,6], Transformer-based models such as BERT [10] have become the de facto backbone architecture for learning text representations. However, most pre-training tasks are designed without any prior knowledge of downstream applications. Chang et al. [3] present three heuristically constructed pre-training tasks tailored for text retrieval: inverse cloze task (ICT), body first selection (BFS), and wiki link prediction (WLP). These tasks exploit the document structure of Wikipedia pages to automatically generate contrastive pairs. Other related pre-training tasks include representative words prediction [25] and contrastive span prediction [26], among others.
Another line of research builds upon the intuition that the [CLS] vector should encode all the important information in the given text for robust matching, which is also a major motivation for this paper. Such methods include Condenser [12], coCondenser [13], SEED [24], DiffCSE [5], and RetroMAE [23].

Pre-training
For pre-training, we assume a collection of passages $\mathcal{C}$, where $x \in \mathcal{C}$ denotes a single passage. Since our motivation is to have a general pre-training method, we do not assume access to any queries or human-labeled data.
The overall pre-training architecture is shown in Figure 1. Given a text sequence $x$, its tokens are randomly replaced by two sequential operations: random masking with probability $p$, denoted as $x' = \text{Mask}(x, p)$, followed by sampling from an ELECTRA-style generator $g$, denoted as $\text{Sample}(g, x')$. Due to the randomness of sampling, a replaced token can be the same as the original one. The above operations are performed twice, with potentially different replace probabilities $p_{enc}$ and $p_{dec}$, to get the encoder input $x_{enc}$ and decoder input $x_{dec}$:
$$x_{enc} = \text{Sample}(g, \text{Mask}(x, p_{enc})), \qquad x_{dec} = \text{Sample}(g, \text{Mask}(x, p_{dec}))$$
We also make sure that any token replaced in $x_{enc}$ is also replaced in $x_{dec}$, to increase the difficulty of the pre-training task.
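To make the corruption procedure concrete, below is a minimal sketch of the Mask-then-Sample operation, assuming a HuggingFace ELECTRA-style generator checkpoint; the function and variable names are illustrative and not taken from the released SimLM code.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Illustrative sketch only (not the official implementation).
tokenizer = AutoTokenizer.from_pretrained("google/electra-base-generator")
generator = AutoModelForMaskedLM.from_pretrained("google/electra-base-generator")

def mask_then_sample(input_ids: torch.Tensor, replace_prob: float) -> torch.Tensor:
    """Mask(x, p) followed by Sample(g, x'): mask tokens with probability p,
    then fill the masked positions with tokens sampled from the generator."""
    # Choose positions to replace (special tokens are ignored here for brevity).
    replace_mask = torch.rand(input_ids.shape) < replace_prob
    corrupted = input_ids.clone()
    corrupted[replace_mask] = tokenizer.mask_token_id
    with torch.no_grad():
        logits = generator(input_ids=corrupted).logits  # [batch, seq_len, vocab]
    # Sampling from the generator may return the original token, as noted above.
    sampled = torch.distributions.Categorical(logits=logits).sample()
    corrupted[replace_mask] = sampled[replace_mask]
    return corrupted

x = tokenizer("an example passage from the corpus", return_tensors="pt").input_ids
x_enc = mask_then_sample(x, replace_prob=0.3)  # encoder input
x_dec = mask_then_sample(x, replace_prob=0.5)  # decoder input
# The paper additionally ensures that positions replaced in x_enc are also
# replaced in x_dec; that constraint is omitted in this sketch.
```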
The encoder is a deep multi-layer Transformer that can be initialized with pre-trained models like BERT [10]. It takes $x_{enc}$ as input and outputs the last-layer [CLS] vector $h_{cls}$ as a representation bottleneck. The decoder is a shallow 2-layer Transformer with a language modeling head, and takes $x_{dec}$ and $h_{cls}$ as inputs. Unlike the decoder in autoregressive sequence-to-sequence models, the self-attention in our decoder is bi-directional. The pre-training task for both the encoder and the decoder is replaced language modeling, which predicts the original tokens (before replacement) at all positions, with a token-level cross-entropy loss. The encoder loss $\mathcal{L}_{enc}$ and the decoder loss $\mathcal{L}_{dec}$ are defined analogously, and the final pre-training loss is their simple sum: $\mathcal{L}_{pt} = \mathcal{L}_{enc} + \mathcal{L}_{dec}$. We do not fine-tune the parameters of the generator, as our preliminary experiments did not show any performance gain from doing so.
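A plausible form of these losses, written out from the description above (the exact normalization is our assumption):

$$\mathcal{L}_{enc} = -\frac{1}{|x|} \sum_{i=1}^{|x|} \log p\big(x_i \mid x_{enc}\big), \qquad \mathcal{L}_{dec} = -\frac{1}{|x|} \sum_{i=1}^{|x|} \log p\big(x_i \mid x_{dec}, h_{cls}\big)$$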
It is often reasonable to assume access to the target retrieval corpus before seeing any query. Therefore, we directly pre-train on the target corpus similar to coCondenser [13]. After the pre-training finishes, we throw away the decoder and only keep the encoder for supervised fine-tuning.
Since the decoder has very limited modeling capacity, it has to rely on the representation bottleneck to perform well on the pre-training task. The encoder, in turn, should learn to compress all the semantic information and pass it to the decoder through the bottleneck.

Figure 2: Illustration of our supervised fine-tuning pipeline. Note that we only use SimLM to initialize the biencoder-based retrievers. For the cross-encoder based re-ranker, we use off-the-shelf pre-trained models such as ELECTRA$_{\text{base}}$.

Fine-tuning
Compared to training text classification or generation models, training state-of-the-art dense retrieval models requires a relatively complicated procedure. Figure 2 shows our supervised fine-tuning pipeline. In contrast to previous approaches, our pipeline is relatively straightforward and does not require joint training [34] or periodically re-building the index [41]. Each stage takes the outputs of the previous stage as inputs and can be trained in a standalone fashion.
Retriever$_1$ Given a labeled query-passage pair $(q^+, d^+)$, we take the last-layer [CLS] vectors of the pre-trained encoder as their representations $(h_{q^+}, h_{d^+})$. Both in-batch negatives and BM25 hard negatives are used to compute the contrastive loss $\mathcal{L}_{cont}$, where $\mathbb{N}$ denotes the set of all negatives and $\phi(q, d)$ is a function that computes the matching score between query $q$ and passage $d$. In this paper, we use a temperature-scaled cosine similarity as $\phi$, with the temperature hyper-parameter $\tau$ set to a constant 0.02 in our experiments.
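A standard instantiation consistent with this description (our reconstruction of the similarity function and contrastive objective) is:

$$\phi(q, d) = \frac{1}{\tau} \cos\big(h_q, h_d\big), \qquad \mathcal{L}_{cont} = -\log \frac{\exp\big(\phi(q^+, d^+)\big)}{\exp\big(\phi(q^+, d^+)\big) + \sum_{d^- \in \mathbb{N}} \exp\big(\phi(q^+, d^-)\big)}$$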
Retriever$_2$ It is trained in the same way as Retriever$_1$, except that the hard negatives are mined with a well-trained Retriever$_1$ checkpoint.
Re-ranker is a cross-encoder that re-ranks the top-$k$ results of Retriever$_2$. It takes the concatenation of query $q$ and passage $d$ as input and outputs a real-valued score $\theta(q, d)$. Given a labeled positive pair $(q^+, d^+)$ and $n-1$ hard negative passages randomly sampled from the top-$k$ predictions of Retriever$_2$, we adopt a listwise loss to train the re-ranker. The cross-encoder architecture can model the full interaction between the query and the passage, making it a suitable teacher model for knowledge distillation.
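A typical listwise formulation matching this description (our reconstruction, treating $d^+$ as one of the $n$ candidates) is a softmax cross-entropy over the re-ranker scores:

$$\mathcal{L}_{rank} = -\log \frac{\exp\big(\theta(q^+, d^+)\big)}{\sum_{i=1}^{n} \exp\big(\theta(q^+, d_i)\big)}$$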
Retriever$_{\text{distill}}$ Although the cross-encoder based re-ranker is powerful, it is not scalable enough for first-stage retrieval. To combine the scalability of the biencoder with the effectiveness of the cross-encoder, we train a biencoder-based retriever by distilling knowledge from the re-ranker. The re-ranker from the previous stage is used to compute scores for both the positive pairs and the negatives mined by Retriever$_2$; these scores then serve as training data for knowledge distillation. With $n-1$ mined hard negatives, we use the KL (Kullback-Leibler) divergence $\mathcal{L}_{kl}$ as the loss function for distilling the soft labels, where $p_{ranker}$ and $p_{ret}$ are the normalized probabilities from the re-ranker teacher and the Retriever$_{\text{distill}}$ student, respectively. For training with the hard labels, we use the contrastive loss $\mathcal{L}_{cont}$ defined for Retriever$_1$. The final loss is their linear interpolation: $\mathcal{L} = \mathcal{L}_{kl} + \alpha \mathcal{L}_{cont}$.
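A plausible form of the distillation loss, following the description above (our reconstruction; the probabilities are obtained by softmax-normalizing the teacher and student scores over the $n$ candidates):

$$\mathcal{L}_{kl} = \sum_{i=1}^{n} p_{ranker}(d_i \mid q^+) \, \log \frac{p_{ranker}(d_i \mid q^+)}{p_{ret}(d_i \mid q^+)}$$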
Our pre-trained SimLM model is used to initialize all three biencoder-based retrievers, but not the cross-encoder re-ranker. Since our pre-training method only affects model initialization, it can be easily integrated into other, more effective training pipelines.
Implementation Details For pre-training, we initialize the encoder with BERT$_{\text{base}}$ (uncased version). The decoder is a two-layer Transformer whose parameters are initialized with the last two layers of BERT$_{\text{base}}$. The generator is borrowed from the ELECTRA$_{\text{base}}$ generator, and its parameters are frozen during pre-training. We pre-train for 80k steps on the MS-MARCO corpus and 200k steps on the NQ corpus, which roughly corresponds to 20 epochs. Pre-training is run on 8 V100 GPUs. With automatic mixed-precision training, it takes about 1.5 days for the MS-MARCO corpus and 3 days for the NQ corpus.
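As a rough illustration of this initialization scheme, the following sketch uses the HuggingFace transformers API; it is an assumption about how the copy could be done, not the released training code.

```python
import copy
import torch.nn as nn
from transformers import BertModel

# Encoder: 12-layer BERT-base (uncased).
encoder = BertModel.from_pretrained("bert-base-uncased")

# Shallow 2-layer decoder: copy the last two encoder layers and attach a
# language modeling head that predicts the original tokens.
decoder_layers = nn.ModuleList([copy.deepcopy(layer) for layer in encoder.encoder.layer[-2:]])
lm_head = nn.Linear(encoder.config.hidden_size, encoder.config.vocab_size)
```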

Experiments
For fine-tuning on the MS-MARCO dataset, we train for 3 epochs with a peak learning rate of $2 \times 10^{-5}$. Each batch consists of 16 queries, each paired with 1 positive passage and 15 randomly sampled hard negatives. A single shared encoder is used to encode both the queries and the passages. We start with the official BM25 hard negatives in the first training round and then switch to mined hard negatives. During inference, given a query, we use brute-force search to rank all the passages, for a fair comparison with previous works.
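The sketch below shows how a contrastive loss with shared in-batch negatives and per-query hard negatives can be computed; names, shapes, and the grouping scheme are illustrative assumptions that mirror the batch layout described above.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(q_vecs: torch.Tensor, p_vecs: torch.Tensor,
                     group_size: int = 16, temperature: float = 0.02) -> torch.Tensor:
    """q_vecs: [B, H] query [CLS] vectors.
    p_vecs: [B * group_size, H] passage vectors; for query i, the passage at
    index i * group_size is its positive, the rest of its group are hard
    negatives, and all other passages in the batch act as in-batch negatives."""
    q_vecs = F.normalize(q_vecs, dim=-1)
    p_vecs = F.normalize(p_vecs, dim=-1)
    scores = q_vecs @ p_vecs.t() / temperature          # temperature-scaled cosine
    labels = torch.arange(q_vecs.size(0), device=q_vecs.device) * group_size
    return F.cross_entropy(scores, labels)
```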
For more implementation details, please see Appendix B.

Main Results
We list the main results in Tables 2 and 3. For the MS-MARCO passage ranking dataset, the numbers are based on Retriever$_{\text{distill}}$ in Figure 2. Our method establishes a new state of the art with an MRR@10 of 41.1, even outperforming multi-vector methods like ColBERTv2. As shown in Table 4, ColBERTv2 has a 6x storage cost, as it stores one vector per token instead of one vector per passage. It also requires a customized two-stage index search algorithm during inference, whereas our method can use readily available vector search libraries.
The TREC DL datasets have more fine-grained human annotations but far fewer queries (fewer than 100 labeled queries). We find that using different random seeds can lead to a 1%-2% difference in nDCG@10. Though our model performs slightly worse than coCondenser on the 2019 split, we do not consider this difference significant.
For passage retrieval in the open-domain QA setting, a passage is considered relevant if it contains the correct answer for a given question. In Table 3, our model achieves R@20 of 84.3 and R@100 of 89.3 on the NQ dataset, which is comparable to or better than other methods. We leave end-to-end evaluation of question answering accuracy as future work. Though SimLM achieves substantial gains for biencoder-based retrieval, its success for re-ranking is not as remarkable. In Table 5, when used as initialization for re-ranker training, SimLM outperforms BERT$_{\text{base}}$ by 0.6% but still lags behind ELECTRA$_{\text{base}}$. Next, we zoom in on the impact of each stage of our training pipeline. In Table 6, we mainly compare with coCondenser [13]. With BM25 hard negatives only, we achieve an MRR@10 of 38.0, which already matches the performance of many strong models like RocketQA [31]. Model-based hard negative mining and re-ranker distillation bring further gains, which is consistent with many previous works [41,34]. We also tried an additional round of hard negative mining but did not observe any meaningful improvement.
Based on the results of Table 6, there are many interesting research directions to pursue. For example, how to simplify the training pipeline of dense retrieval systems while still maintaining competitive performance? And how to further close the gap between biencoder-based retrievers and cross-encoder based re-rankers?

Variants of Pre-training Objectives
Besides our proposed replaced language modeling objective, we also tried several other pre-training objectives, as listed below.
Enc-Dec MLM uses the same encoder-decoder architecture as in Figure 1 but without the generator. The inputs are randomly masked texts, and the pre-training objective is masked language modeling (MLM) over the masked tokens only. For a fair comparison, the mask rates are the same as in our method: 30% for the encoder and 50% for the decoder. In contrast, RetroMAE [23] uses a specialized decoding mechanism to derive supervision signals from all tokens on the decoder side.
Condenser is a pre-training architecture proposed by Gao and Callan [12]. Here we pre-train Condenser with a 30% mask rate on the target corpus.
MLM is the same as the original BERT pre-training objective with a 30% mask rate.
Enc-Dec RTD is the same as our method in Figure 1, except that we use replaced token detection (RTD) [6] as the pre-training task for both the encoder and the decoder. This variant shares some similarity with DiffCSE [5]; the main difference is that the input to the DiffCSE encoder is the original text, which makes the task much easier. Our preliminary experiments with DiffCSE pre-training do not show any improvement.

AutoEncoder attempts to reconstruct the inputs based on the bottleneck representation. The encoder input is the original text without any masking, and the decoder input consists only of [MASK] tokens and the [CLS] vector from the encoder.
BERT$_{\text{base}}$ uses the off-the-shelf checkpoint published by Devlin et al. [10]. It serves as a baseline against which to compare the various pre-training objectives.
The results are summarized in Table 7. Naive auto-encoding only requires memorizing the inputs and does not need to learn any contextualized features. As a result, it is the only pre-training objective that underperforms BERT$_{\text{base}}$. Condenser is only slightly better than simple MLM pre-training, which is possibly due to the bypassing effect of the skip connections in Condenser. Enc-Dec MLM substantially outperforms Enc-Dec RTD, showing that MLM is a better pre-training task than RTD for retrieval tasks. This is consistent with the results in Table 1. Considering the superior performance of RTD pre-trained models on benchmarks like GLUE, we believe further research is needed to investigate the reason behind this phenomenon.

Effects of Replace Rate
In our experiments, we use fairly large replace rates (30% for the encoder and 50% for the decoder), in stark contrast to the mainstream choice of 15%. In Table 8, we show the results of pre-training with different replace rates. Our model is quite robust to a wide range of values, with encoder replace rates of 30%-40% performing slightly better. Similar findings are also made by Wettig et al. [40].
One interesting extreme scenario is a 100% replace rate on the decoder side. In such a case, the decoder has no access to any meaningful context and must predict the original text based solely on the representation bottleneck. This task may be too difficult and can have negative impacts on the encoder.
Since pre-training can be costly in terms of both time and carbon emissions, an objective that converges fast is preferable. Our proposed method shares two advantages with ELECTRA [6]. First, the loss is computed over all input tokens instead of a small percentage of masked ones. Second, the issue of input distribution mismatch is less severe than with MLM, where the [MASK] token is seen during pre-training but not during supervised fine-tuning. In Figure 3, our method achieves competitive results with only 10k training steps and converges by 60k steps, while MLM still improves slowly with more steps.

On the Choice of Pre-training Corpus
For a typical retrieval task, the number of candidate passages is much larger than the number of labeled queries, and many passages are never seen during training. Take the NQ dataset as an example: it has 21M candidate passages but fewer than 80k question-answer pairs for training. In our experiments, we directly pre-train on the target corpus. Such pre-training can be regarded as implicit memorization of the target corpus in a query-agnostic way. One piece of evidence supporting this argument is that, as shown in Table 7, simple MLM pre-training on the target corpus already yields large performance gains.
An important research question is: does our method still provide benefits when pre-training on a non-target corpus? In Table 9, the largest performance gains are obtained when the pre-training and fine-tuning corpora match. If we pre-train on the MS-MARCO corpus and fine-tune on the labeled NQ dataset, or the other way around, there are still considerable improvements over the baseline. We hypothesize that this is due to the model's ability to compress information into a representation bottleneck, which is beneficial for training robust biencoder-based retrievers.

Case Analysis
To qualitatively understand the gains brought by pre-training, we show several examples in Table 10.
The BERT$_{\text{base}}$ retriever can return passages with high lexical overlap while missing some subtle but key semantic information. In the first example, the passage retrieved by BERT$_{\text{base}}$ contains keywords like "boy" and "Winnie the Pooh" but does not answer the question. In the second example, the passage retrieved by BERT$_{\text{base}}$ contains no routing number, which is the key intent of the query. Our proposed pre-training helps the model learn better semantics to answer such queries. For more examples, please see Table 14 in the Appendix.

Conclusion
This paper proposes SimLM, a novel pre-training method for dense passage retrieval. It follows an encoder-decoder architecture with a representation bottleneck in between: the encoder learns to compress all the semantic information into a dense vector and passes it to the decoder, which must perform well on the replaced language modeling task. When used as initialization in a dense retriever training pipeline, our model achieves competitive results on several large-scale passage retrieval datasets. We also provide detailed ablation analyses to show the key ingredients behind its success.
For future work, we would like to increase the model size and the corpus size to examine the scaling effects. It is also interesting to explore other pre-training mechanisms to support unsupervised dense retrieval and multilingual retrieval.
negatives only. For BERT and RoBERTa, we use the same hyperparameters as discussed in Section 4.1. For ELECTRA, we train for 6 epochs with a peak learning rate of $4 \times 10^{-5}$, since it converges much more slowly. The hyper-parameters for our proposed pre-training and fine-tuning are listed in Tables 11 and 13, respectively. The generator is initialized with the one released by the ELECTRA authors, and its parameters are frozen during pre-training.

B Implementation Details
For fine-tuning on the NQ dataset, we reuse most hyper-parameter values from the MS-MARCO training. A few exceptions are listed below. We fine-tune for 20k steps with a learning rate of $5 \times 10^{-6}$. The maximum passage length is 192. The mined hard negatives come from the top-100 predictions that do not contain any correct answer.

C Variants of Generators
In ELECTRA pre-training, the generator plays a critical role: a generator that is either too strong or too weak hurts the learnability and generalization of the discriminator. We also tried several variants of the generator. In Table 12, "frozen generator" keeps the generator parameters unchanged during our pre-training, "joint train" also fine-tunes the generator parameters, and "joint train w/ random init" uses randomly initialized generator parameters. We do not observe any significant performance difference between these variants. In our experiments, we simply use the "frozen generator" variant as it trains faster.