Condenser: a Pre-training Architecture for Dense Retrieval

Pre-trained Transformer language models (LMs) have become go-to text representation encoders. Prior research fine-tunes deep LMs to encode text sequences such as sentences and passages into single dense vector representations for efficient text comparison and retrieval. However, dense encoders require large amounts of data and sophisticated techniques to train effectively, and suffer in low-data situations. This paper finds that a key reason is that the internal attention structure of standard LMs is not ready-to-use for dense encoders, which must aggregate text information into the dense representation. We propose to pre-train towards the dense encoder with a novel Transformer architecture, Condenser, in which LM prediction CONditions on DENSE Representation. Our experiments show Condenser improves over standard LMs by large margins on various text retrieval and similarity tasks.


Introduction
Language model (LM) pre-training has been very effective in learning text encoders that can be fine-tuned for many downstream tasks (Peters et al., 2018; Devlin et al., 2019). Deep bidirectional Transformer encoder (Vaswani et al., 2017) LMs like BERT (Devlin et al., 2019) are the state of the art. Recent works fine-tune the CLS token to encode an input text sequence into a single vector representation (Karpukhin et al., 2020). The resulting model is referred to as a dense encoder or bi-encoder. Fine-tuning associates vector similarities with some practical semantics, e.g., textual similarity or relevance, so the vectors can be used for efficient text comparison or retrieval by inner product. Despite their efficiency, bi-encoders are hard to train. Even with sufficient data, bi-encoders still require carefully designed, sophisticated methods to train effectively (Xiong et al., 2021; Qu et al., 2020; Lin et al., 2020). They can also take big performance hits in low-data situations (Karpukhin et al., 2020; Thakur et al., 2020). Another common use of deep LMs is the cross-encoder, which takes a compared text pair as a single input and uses attention over all tokens to make a prediction. In contrast to the bi-encoder, the cross-encoder is easier to train and is effective in low-data settings for similarity and ranking tasks (Devlin et al., 2019).
Based on the same LM, however, bi-encoders and cross-encoders have similar language understanding capabilities. To explain the difficulty in training bi-encoders not seen in cross-encoders, we look into the internal structure of pre-trained LMs. We find that LMs like BERT, directly out of pre-training, have a non-optimal attention structure. In particular, they were not trained to aggregate sophisticated information into a single dense representation. We term the degree to which an LM's internal activations are already set up to channel its knowledge out for the target task its structural readiness. We argue bi-encoder fine-tuning is inefficient due to lacking structural readiness: many updates are spent adjusting the model's attention structure rather than learning a good representation.
Based on our observations, we propose to address structural readiness during pre-training. We introduce a novel Transformer pre-training architecture, Condenser, which establishes structural readiness by doing LM pre-training actively CONditioned on the DENSE Representation. Unlike previous works that pre-train towards a particular task, Condenser pre-trains towards the bi-encoder structure. Our results show the importance of structural readiness. We experiment with sentence similarity tasks, and with retrieval for question answering and web search. We find that under low-data setups, with identical test-time architecture, Condenser yields sizable improvements over standard LMs and shows comparable performance to strong task-specific pre-trained models. With large training data, we find Condenser retrievers optimize more easily, outperforming previous models trained with complicated techniques using only a single round of negative mining.

Related Work
Transformer Bi-encoder LM pre-training followed by task fine-tuning has become an important paradigm in NLP (Howard and Ruder, 2018). SOTA models adopt the Transformer architecture (Devlin et al., 2019; Liu et al., 2019; Lan et al., 2020). One challenge in applying deep Transformers is their computation cost when used to retrieve text from large collections. Motivated by this, Reimers and Gurevych (2019) propose SBERT, which trains a bi-encoder from BERT and uses vector product for efficient sentence similarity comparison. Transformer bi-encoders were soon also adopted as dense retrievers (Karpukhin et al., 2020; Gao et al., 2021b).
Dense Retrieval Dense retrieval compares encoded query vectors with corpus document vectors using inner product. While there are works on efficient cross-encoders (Gao et al., 2020; MacAvaney et al., 2020), such models are still too costly for full corpus retrieval. By pre-encoding the corpus into a maximum inner product search (MIPS) index (Johnson et al., 2017; Guo et al., 2020), retrieval can run online with millisecond-level latency. An alternative is the recently proposed contextualized sparse retrieval model (Gao et al., 2021a). In comparison, dense retrieval is easier to use and backed by more mature software such as FAISS (Johnson et al., 2017).
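To make the scoring rule concrete, here is a minimal brute-force sketch of maximum inner product search; `top_k_inner_product` is a name we introduce for illustration, and production systems would use an ANN library such as FAISS instead:

```python
import heapq

def top_k_inner_product(query, doc_vecs, k=2):
    """Brute-force MIPS sketch: score every document vector by its
    inner product with the query and return the k best indices."""
    scores = [
        (sum(q * d for q, d in zip(query, doc)), idx)
        for idx, doc in enumerate(doc_vecs)
    ]
    # Highest inner product first.
    return [idx for _, idx in heapq.nlargest(k, scores)]

docs = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
print(top_k_inner_product([1.0, 0.0], docs, k=2))  # → [0, 1]
```

An ANN index trades a small amount of accuracy for sub-linear search time over millions of such vectors, which is what makes online retrieval at millisecond latency possible.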
Pre-train Bi-encoder Lee et al. (2019) are among the first to show the effectiveness of Transformer bi-encoders for dense retrieval. They proposed to further pre-train BERT with the Inverse Cloze Task (ICT), which uses pairs of passage segments and full passages as pseudo training pairs. Chang et al. (2020) find ICT and other related tasks are "key ingredients" for strong bi-encoders. Their results also show that models without pre-training fail to produce useful retrieval results under low-data setups. Guu et al. (2020) propose to pre-train retriever and reader together for an end-to-end QA system. The aforementioned methods are specialized, task-specific solutions for improving bi-encoder training based on contrastive loss. This paper provides an explanation for the learning issue and presents an architecture that establishes a universal solution using general language model pre-training. We also note that language model and contrastive pre-training are orthogonal ideas. In a follow-up work, we show further improved performance by adding contrastive learning to Condenser language model pre-training (Gao and Callan, 2021).
Effective Dense Retriever Karpukhin et al. (2020) found that carefully fine-tuning BERT can produce better results than earlier pre-trained dense retrieval systems. To further improve the end performance of dense retrievers, later works look into better fine-tuning techniques. Using a learned retriever to mine hard negatives and re-train another retriever with them was found helpful (Karpukhin et al., 2020; Qu et al., 2020). ANCE (Xiong et al., 2021) actively mines hard negatives at intervals during training to prevent diminishing gradients; it allocates extra resources to repeatedly update and retrieve from the corpus retrieval index. Gao et al. (2021b) proposed to jointly learn a pair of dense and sparse systems to mitigate the capacity issue of low-dimension dense vectors. Beyond fine-tuning, using more sophisticated knowledge distillation losses to learn bi-encoders from soft labels has also been found useful (Lin et al., 2020): a teacher model is first learned and its predictions are used at training time to optimize the dense retriever. These works all aim at producing better gradient updates during training, while Condenser aims at better initializing the model. We will also show the combined improvement of Condenser and hard negatives in experiments. Another line of work questions the capacity of single-vector representations and proposes multi-vector representations (Luan et al., 2020). Capacity defines the performance upper bound and is a separate issue from training (optimization), i.e., how to reach that upper bound.

Sentence Representation
We'd also like to make a distinction from works on universal sentence representations and encoders (Kiros et al., 2015; Conneau et al., 2017; Cer et al., 2018). They are feature-based methods rather than fine-tuning (Houlsby et al., 2019). In evaluation, they focus on using the learned embedding as universal features for a wide range of tasks (Conneau and Kiela, 2018). This paper considers task-specific fine-tuning of the entire model and focuses on the target task performance.

Method
This section discusses the motivation behind Condenser, its design, and its pre-training procedure.

Preliminaries
Transformer Encoder Many recent state-of-the-art deep LMs adopt the architecture of the Transformer encoder. It takes in a text sequence, embeds it, and passes it through a stack of L self-attentive Transformer blocks. Formally, given input text x = [x_1, x_2, ...], we can write the computation iteratively as:

h^0 = Embed(x)
h^l = Transformer_l(h^{l-1}),  l = 1, ..., L

Intuitively, Transformer blocks refine each token's representation conditioning on all tokens in the sequence to effectively embed them.
Transformer LM Pre-training Many successful Transformer encoder LMs such as BERT are trained with the masked language model (MLM) task. MLM masks out a subset of input tokens and requires the model to predict them. For a masked-out token x_i at position i, its corresponding final representation h^L_i is used to predict the actual x_i. Training uses a cross-entropy loss:

L_mlm = Σ_{i ∈ masked} CrossEntropy(W h^L_i, x_i)

A special token, typically referred to as CLS, is prepended and encoded with the rest of the text.
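As a concrete illustration of the MLM corruption step, the following sketch implements the BERT-style 80/10/10 masking scheme; `mask_tokens` is our own hypothetical helper, not code from any released implementation:

```python
import random

def mask_tokens(tokens, vocab, mask_rate=0.15, rng=None):
    """BERT-style masking sketch: select ~15% of positions; of those,
    80% become [MASK], 10% a random vocabulary token, 10% stay
    unchanged. Returns the corrupted sequence and the targets the
    model must predict (position -> original token)."""
    rng = rng or random.Random(0)
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            targets[i] = tok
            r = rng.random()
            if r < 0.8:
                corrupted[i] = "[MASK]"
            elif r < 0.9:
                corrupted[i] = rng.choice(vocab)
            # else: keep the original token, but still predict it
    return corrupted, targets
```

The model only receives the corrupted sequence; the loss is computed at the target positions using their final hidden states.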

Issues with Transformer Encoder
Recall that in Transformers, all tokens, including CLS, receive information from other tokens in the sequence only through attention. Attention patterns therefore define how effectively CLS can aggregate information. To understand the attentive behaviors of CLS, we borrow the analysis of BERT from Clark et al. (2019): 1) in most middle layers, the CLS token has attention patterns similar to other text tokens and is not attended to by other tokens; 2) only in the last layer does CLS have uniquely broad attention over the entire sequence, to perform the NSP task. In other words, the CLS token remains dormant in many middle layers and reactivates only in the last round of attention. We argue that an effective bi-encoder should actively aggregate information of different granularity from the entire sentence through all layers, and this structure in standard pre-trained LMs is not immediately ready for fine-tuning. We will verify this claim with experiments in section 4 and with a quantitative analysis of the attention of BERT, ICT, and the proposed Condenser in section 5.

Condenser
Building upon Transformer encoder LMs, which condition on left and right context (Devlin et al., 2019), we present the bi-encoder pre-training architecture Condenser, which actively CONditions on a DENSE Representation during LM pre-training.
Model Design Like a Transformer encoder, Condenser is parametrized as a stack of Transformer blocks, shown in Figure 1. We divide them into three groups: L_e early encoder backbone layers, L_l late encoder backbone layers, and L_h Condenser head layers. The input is first encoded by the backbone:

[h^early_cls; h^early] = Encoder_early([cls; embed(x)])
[h^late_cls; h^late] = Encoder_late([h^early_cls; h^early])

The Condenser head then conditions on the late CLS together with the early token representations:

[h^cd_cls; h^cd] = Head([h^late_cls; h^early])

and the head outputs h^cd are used to make the MLM predictions. We follow the masking scheme of Devlin et al. (2019).
Within Condenser, the late encoder backbone can further refine the token representations, but it can only pass new information to the head through h^late_cls, the late CLS. The late CLS representation is therefore required to aggregate the information newly generated in the late backbone, on which the head then conditions to make LM predictions. Meanwhile, by skip-connecting the early layers, we relieve CLS of the burden of encoding local information and the syntactic structure of the input text, focusing it on the global meaning of the input. The layer numbers L_e and L_l control this separation of information.
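The information flow described above, with the late CLS as the only channel into the head and the early token states skip-connected around the late layers, can be sketched structurally as follows (layers are stood in for by plain functions; `condenser_forward` is a hypothetical name for illustration, not the authors' code):

```python
def condenser_forward(x, early_layers, late_layers, head_layers):
    """Structural sketch of Condenser. `x` is a list of per-position
    states with x[0] playing the role of CLS; each layer maps a list
    of states to a list of states."""
    h = x
    for layer in early_layers:
        h = layer(h)
    h_early = h                      # token states to skip-connect
    for layer in late_layers:
        h = layer(h)
    late_cls = h[0]                  # the only channel into the head
    # The head sees the late CLS plus the *early* token states.
    head_in = [late_cls] + h_early[1:]
    for layer in head_layers:
        head_in = layer(head_in)
    return head_in
```

Because the head's token states come from the early layers, any information the late layers add can only reach the MLM predictions through the late CLS, which is exactly what forces CLS to aggregate.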

The architecture of Condenser is inspired by Funnel Transformer (Dai et al., 2020), which is itself inspired by U-net (Ronneberger et al., 2015) from computer vision. Funnel Transformer reduces sequence length by a factor of 4 during the forward pass and uses a 2-layer Transformer to decode the length-compressed sequence onto a skip-connected full-length representation. Funnel Transformer was designed to speed up pre-training, while our Condenser learns dense information aggregation.
Fine-tuning The Condenser head is a pre-train-time component and is dropped during fine-tuning. Fine-tuning trains the late CLS h^late_cls and back-propagates gradients into the backbone. In other words, a Condenser reduces to its encoder backbone, effectively becoming a Transformer encoder for fine-tuning; the head is only used to guide pre-training. During fine-tuning, Condenser has identical capacity to a similarly structured Transformer. In practice, Condenser can be a drop-in weight replacement for a typical Transformer LM like BERT.
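In weight-file terms, the reduction to the backbone can be as simple as filtering out head parameters before fine-tuning; the `head.` key prefix below is an assumed naming convention for illustration, not the actual checkpoint layout:

```python
def strip_condenser_head(state_dict):
    """Keep only backbone weights; drop pre-train-only head parameters.
    Assumes (hypothetically) that head parameter keys start with 'head.'."""
    return {k: v for k, v in state_dict.items() if not k.startswith("head.")}

weights = {"backbone.layer.0.w": 1, "head.layer.0.w": 2}
print(strip_condenser_head(weights))  # keeps only the backbone key
```

The remaining keys line up with a plain Transformer encoder of the same depth, which is what makes Condenser a drop-in replacement for BERT weights.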

Condenser from Transformer Encoder
In this paper, we opted to initialize Condenser with pre-trained Transformer LM weights. This accommodates our compute budget, avoiding the huge cost of pre-training from scratch, and also gives us a direct comparison to the original LM. Given a pre-trained LM, we initialize the entire Condenser backbone with its weights and randomly initialize the head. To prevent gradients back-propagated from the random head from corrupting the backbone weights, we place a semantic constraint by performing MLM also on the backbone's late outputs:

L^c_mlm = Σ_{i ∈ masked} CrossEntropy(W h^late_i, x_i)

The intuition behind this constraint is that encoding the per-token representations h^late and the sequence representation h^late_cls share a similar mechanism and will not interfere with each other; as a result, h^late can still be used for LM prediction. The full loss is then defined as the sum of the two MLM losses:

L = L_mlm + L^c_mlm

where L_mlm is the MLM loss computed on the Condenser head outputs. The output projection matrix W is shared between the two MLM losses to reduce the total number of parameters and memory usage.
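A minimal numerical sketch of the two MLM losses sharing one projection matrix W, assuming toy 2-dimensional hidden states and a 2-word vocabulary (`mlm_loss` is a hypothetical helper, not the training code):

```python
import math

def mlm_loss(hidden_states, targets, W):
    """Cross-entropy MLM loss over the masked positions in `targets`
    (position -> gold vocab id). Each row of W scores one vocab word."""
    loss = 0.0
    for pos, gold in targets.items():
        logits = [sum(w * h for w, h in zip(row, hidden_states[pos]))
                  for row in W]
        log_z = math.log(sum(math.exp(l) for l in logits))
        loss += log_z - logits[gold]
    return loss / max(len(targets), 1)

# One shared W scores both the head outputs and the late backbone outputs.
W = [[1.0, 0.0], [0.0, 1.0]]
h_head, h_late = [[2.0, 0.0]], [[0.0, 2.0]]
total = mlm_loss(h_head, {0: 0}, W) + mlm_loss(h_late, {0: 1}, W)
```

Sharing W means both losses push the same output embedding space, so the constraint on the late outputs regularizes rather than fights the head loss.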

Experiments
In this section, we first describe details of how we pre-train Condenser from BERT. Our fine-tuning experiments then look into the impact of Condenser under low- and high-data setups. To evaluate low data, we sample smaller training sets by sub-sampling the original train set, following prior work. We keep dev/test sets unchanged across runs for direct comparison. We first validate our model on short sentence-level tasks, then evaluate retrieval in open question answering and web search tasks following prior works (Xiong et al., 2021). We examine how swapping the original BERT with Condenser improves performance, and how the improvements compare to various improved training techniques.

Pre-training
We initialize the Condenser backbone layers from the popular 12-layer BERT base model and initialize only a 2-layer head from scratch. Pre-training runs with the procedure described in subsection 3.4. We use an equal split: 6 early layers and 6 late layers. We pre-train over the same data as BERT: English Wikipedia and the BookCorpus. This ensures BERT and Condenser differ only in architecture, for direct comparison. We train for 8 epochs with AdamW, a learning rate of 1e-4, and a linear schedule with warmup ratio 0.1. Due to compute budget limits, we were not able to tune the optimal layer split, head size, or training hyperparameters, but leave that to future work. We train on 4 RTX 2080 Ti GPUs with gradient accumulation. The procedure takes roughly a week to finish. After pre-training, we discard the Condenser head, resulting in a Transformer model with the same architecture as BERT. All fine-tuning experiments share this single pre-trained weight.
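The learning-rate schedule used here (linear warmup over the first 10% of steps, then linear decay to zero) can be sketched as:

```python
def linear_warmup_lr(step, total_steps, peak_lr=1e-4, warmup_ratio=0.1):
    """Linear warmup to peak_lr over the first warmup_ratio of steps,
    then linear decay to zero; the hyperparameters match the ones
    reported for Condenser pre-training."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (total_steps - step) / (total_steps - warmup_steps)
```

In practice one would use a library scheduler (e.g. a linear schedule with warmup from the huggingface transformers package) rather than hand-rolling this function.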

Sentence Similarity
Dataset We use two supervised data sets: the Semantic Textual Similarity Benchmark (STS-b; Cer et al. (2017)) and Wikipedia Section Distinction (Ein Dor et al., 2018), as adopted in Reimers and Gurevych (2019). The former is a standard sentence similarity task from GLUE (Wang et al., 2018) with a small training set (∼6K). The latter is large (∼1.8M) and has an interesting objective: to determine whether a pair of sentences come from the same Wikipedia section, very similar to the BERT NSP task. Lan et al. (2020) argue NSP learns exactly topical consistency on the training corpus, i.e., Wikipedia. In other words, NSP is a close pre-training, if not training, task for Wiki Section Distinction. We report test set Spearman correlation for STS-b and accuracy for Wiki Section Distinction.
Compared Systems We compare with standard BERT and on STS-b, with BERT pre-trained with multiple NLI data sets with a popular carefully crafted 3-way loss (Conneau et al., 2017) from Reimers and Gurevych (2019) 2 . Non-BERT baselines are also borrowed from it.
Implementation We use the sentence transformer software and train STS-b with MSE regression loss and Wiki Section with triplet loss (Reimers and Gurevych, 2019). The training follows the authors' hyper-parameter settings.
Results Table 1 shows performance on STS-b with various train sizes. NLI pre-trained BERT and Condenser consistently outperform BERT, with much larger margins at smaller train sizes. Also, with only 500 training pairs, they outperform the best Universal Sentence Encoder (USE) baseline.
For Wiki Section, in Table 2 we observe almost identical results between BERT and Condenser models, both of which outperform pre-BERT baselines. Meanwhile, even when the training size is as small as 1K, we observe only about a 10% accuracy drop compared to training with all data. Without training with the NSP task, Condenser remains effective.

Retrieval for Open QA
In this section, we test bi-encoders with open QA passage retrieval experiments.

2: These models are referred to as SBERT in the original paper. We use BERT for consistency with later discussions.
Negatives can come from various sources: random sampling, top BM25 results, mined hard negatives, or a sophisticated sampling scheme as in ANCE. We conduct low-data experiments with BM25 negatives to save compute and use mined hard negatives (HN) in full-train experiments.
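Hard negative mining itself is straightforward to sketch: retrieve top-k with a trained retriever and drop the known positives. `mine_hard_negatives` and the retriever callable are hypothetical names for illustration:

```python
def mine_hard_negatives(retrieve_top_k, positives, query, k=100, n_neg=30):
    """Hard-negative mining sketch: rank the corpus with a trained
    retriever, remove known positives, and keep the highest-ranked
    remaining documents as hard negatives.

    `retrieve_top_k` is a hypothetical callable (query, k) -> doc ids
    in rank order; `positives` is the set of gold doc ids."""
    ranked = retrieve_top_k(query, k)
    return [doc for doc in ranked if doc not in positives][:n_neg]

fake_retriever = lambda q, k: list(range(k))   # stand-in ranking
print(mine_hard_negatives(fake_retriever, {0, 2}, "q", k=6, n_neg=3))  # → [1, 3, 4]
```

The catch, discussed later with RocketQA, is that highly ranked non-gold documents may be unlabeled positives, i.e. false negatives.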
Dataset We use two query sets, Natural Questions (NQ; Kwiatkowski et al. (2019)) and Trivia QA (TQA; Joshi et al. (2017)), as well as the Wikipedia corpus cleaned up and released with DPR. NQ contains questions from Google search and TQA contains a set of trivia questions. Both NQ and TQA have about 60K training examples after post-processing; we refer readers to Karpukhin et al. (2020) for details. Some compared systems (Guu et al., 2020) perform multiple rounds of hard negative mining during training. We also compare with RocketQA (Qu et al., 2020), which is trained with an optimized fine-tuning pipeline that combines hard negatives, large (1024) batches, supervision from a cross-encoder, and external data.
Implementation We train Condenser systems using the DPR hyper-parameter setting. We use a single RTX 2080ti and employ the gradient cache technique (Gao et al., 2021c) implemented in the GC-DPR toolkit 4 to perform large batch training with the GPU's limited memory. As DPR only released Natural Question hard negatives, we use theirs on Natural Question and mine our own with a Condenser retriever on TriviaQA.

Results
In Table 3, we record test set performance for NQ and TQA with low data. We observe that ICT and Condenser both outperform vanilla BERT, by an especially large margin at the 1K training size, dropping less than 10% compared to full-size training for Top-20 Hit and less than 5% for Top-100. The improvement is more significant when considering the gain over unsupervised BM25. ICT and Condenser show comparable performance, with ICT slightly better on NQ and Condenser better on TQA. This also agrees with results from Lee et al. (2019) that ICT specializes in NQ. The results suggest the generally LM-trained Condenser can be an effective alternative to the task-specific pre-trained ICT model.

3: A detailed discussion of this choice of ICT is in A.3.
4: https://github.com/luyug/GC-DPR
In Table 4, we compare Condenser trained with full training data against other systems. On NQ, dense retrievers all yield better performance than lexical retrievers, especially those that use hard negatives. We see Condenser performs best for Top-20 and is within 0.1 of RocketQA for Top-100, without requiring the sophisticated and costly training pipeline. On TQA, we see GAR, a lexical system with deep LM query expansion, performs better than all dense systems other than Condenser. This suggests TQA may require granular term-level signals that are hard for dense retrievers to capture. Nevertheless, we find Condenser can still capture these signals and performs better than all other lexical and dense systems.

Retrieval for Web Search
In this section, we examine how Condenser retriever performs on web search tasks. The setup is similar to open QA. One issue with web search data sets is that they are noisier, containing a large number of false negatives (Qu et al., 2020). We investigate if Condenser can help resist such noise.  As passage retrieval is the focus of the paper, we defer discussion of long document retrieval to A.4.

MS-MARCO
Dataset We use the MS-MARCO passage ranking dataset (Bajaj et al., 2018), which is constructed from Bing's search query logs and web documents retrieved by Bing. The training set has about 0.5M queries. We use corpus pre-processed and released with RocketQA. We evaluate on two query sets: MS-MARCO Dev 5 and TREC DL2019 queries. We report on Dev official metrics MRR@10 and Recall@1k, and report on DL2019 NDCG@10.
Implementation We train with the contrastive loss with a learning rate of 5e-6 for 3 epochs on an RTX 2080 Ti. We pair each query with 8 passages as in Luan et al. (2020). Compared systems include lexical retrievers DeepCT (Dai and Callan, 2019) and DocT5Qry (Nogueira and Lin, 2019), and dense systems ANCE, TCT (Lin et al., 2020), and ME-BERT (Luan et al., 2020). TCT, like ANCE, also aims at improving training, but by replacing contrastive-loss fine-tuning with knowledge distillation. ME-BERT uses the BERT large variant as encoder, three times larger than the LMs used in other systems, and represents each passage with multiple vectors. It gains higher encoder and embedding capacity but incurs higher costs in training, inference, and retrieval.
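The contrastive loss used for fine-tuning can be sketched as a cross-entropy over query-passage similarity scores, with each query's gold passage on the diagonal; this is a minimal stand-alone version for illustration, not the training code itself:

```python
import math

def in_batch_contrastive_loss(sim):
    """Negative log-likelihood of each query's own passage against all
    in-batch passages. `sim[i][j]` is the query-i / passage-j inner
    product; sim[i][i] holds the positive pair."""
    loss = 0.0
    for i, row in enumerate(sim):
        log_z = math.log(sum(math.exp(s) for s in row))
        loss += log_z - row[i]
    return loss / len(sim)
```

Every other passage in the batch acts as a negative for every query, which is why batch size directly controls the size of the negative pool.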
Since the full RocketQA system uses data external to MS-MARCO, for a fair comparison, we include the variant without external data in the main result Table 6 and separately compare Condenser with all RocketQA variants in Table 7.

Results
In Table 5, we again find that in low-data setups, ICT- and Condenser-initialized retrievers outperform BERT by big margins. By 10K training examples, 2% of the full training set, all dense retrievers outperform BM25, with ICT and Condenser retaining their margin over BERT. Condenser already shows recall and NDCG comparable to BERT trained on the full training set. We also observe that Condenser can outperform ICT at various train sizes, suggesting that Condenser's general LM pre-training helps it generalize across domains better than task-specific ICT.
In Table 6, we compare the full-train performance of various systems. Various training techniques help significantly improve over vanilla fine-tuning. Condenser further outperforms these models by big margins, showing the benefits brought by pre-training. Without involving complex training techniques, or making the model or retrieval heavy, Condenser already shows slightly better performance than RocketQA. We further compare with the RocketQA variants in Table 7 to understand its more costly strategies: very large batches, denoised hard negatives, and data augmentation. The RocketQA authors find mined hard negatives contain false negatives detrimental to bi-encoder training, as shown in the table, and propose to use a cross-encoder to relabel and denoise them, a process however thousands of times more costly than hard negative mining. They further employ a data augmentation technique, using a cross-encoder to label external data. Here, we see Condenser trained with batch size 64 and BM25 negatives outperforms RocketQA with batch size 8192. More importantly, Condenser is able to resist noise in mined hard negatives, getting a decent boost from training with them, unlike RocketQA, whose performance drops substantially without denoising. Condenser thus removes the need for many sophisticated training techniques: it is only outperformed by the RocketQA variant that uses external data (data augmentation).
Interestingly, our runs of BERT (DPR) + HN show decent performance improvements over BERT on all retrieval tasks, sometimes better than active-mining ANCE on both QA and web search. This contradicts the finding in RocketQA that directly mined hard negatives hurt performance.
Recall that our hard negatives were mined by a Condenser retriever, which we conjecture produced higher-quality hard negatives. The finding suggests that mined hard negatives need not be retriever-dependent: there exist universally better ones, which can be found with a more effective retriever.

Attention Analysis
Condenser is built upon the idea that typical pre-trained LMs lack a proper attention structure. The last section showed that pre-training with Condenser fixes the issue. In this section, we provide a more in-depth attention analysis: we compare attention behaviors among pre-trained/fine-tuned BERT, ICT, and Condenser. We use an analytical method proposed by Clark et al. (2019), characterizing the attention patterns of CLS by measuring its attention entropy. A higher entropy indicates broader attention and a lower entropy more focused attention. Similar to Clark et al. (2019), we show CLS attention entropy at each layer, averaged over all heads and over 1k randomly picked Wikipedia sections.
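The entropy statistic is simple to state in code: given one token's attention distribution over the sequence, it is the Shannon entropy, maximized at log n by uniform attention over n positions and zero for attention focused on a single position:

```python
import math

def attention_entropy(attn_row):
    """Shannon entropy of one token's attention distribution (a list of
    probabilities summing to 1); averaged over heads and examples, this
    is the per-layer statistic plotted for CLS in Figure 2."""
    return -sum(p * math.log(p) for p in attn_row if p > 0)
```

For example, uniform attention over a 4-token sequence gives entropy log 4 ≈ 1.386, while fully peaked attention gives 0.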
In Figure 2, we plot attention from the CLS of various models. We see in Figure 2a that BERT has a drastic change in attention pattern between the pre-trained and fine-tuned models. This again confirms our theory that typical Transformer encoder LMs are not ready to be fine-tuned into bi-encoders, but need to go through big internal structural changes. In comparison, we see in Figures 2b and 2c that task-specific pre-trained ICT and LM pre-trained Condenser undergo only small changes, retaining their general attention structure. In other words, ICT and Condenser both establish structural readiness, but in very different ways. Both have broadening attention (increased entropy) in the later layers, potentially because the actual search task requires aggregating more high-level concepts than pre-training. These results again confirm our theory that a ready-to-use structure is easier to train: their structures need only small changes to work as effective bi-encoders.

Conclusion
Fine-tuning from a pre-trained LM initializer like BERT has become a very common practice in NLP. In this paper, we however question whether models like BERT are the most proper initializers for bi-encoders. We find typical pre-trained LMs do not have an internal attention structure ready for bi-encoders: they cannot effectively condense information into a single dense vector representation. We propose a new architecture, Condenser, which establishes structural readiness through LM pre-training. We show Condenser is effective for a variety of tasks: sentence similarity, question answering retrieval, and web search retrieval. With low data, Condenser shows comparable performance to task-specific pre-trained models. It also provides a new perspective on learning effective retrievers through pre-training rather than fine-tuning strategies. With sufficient training data, Condenser with direct fine-tuning can be a lightweight alternative to many sophisticated training techniques.
Positive results with Condenser show that structural readiness is a fundamental property of easy-to-train bi-encoders. Our attention analysis reveals that both Condenser and the task-specific pre-trained model establish structural readiness, suggesting a task-specific objective may not be necessary. Researchers can use this finding to guide the study of better LMs for bi-encoders, for example, by exploring training Condenser with other LM objectives.
One big advantage of BERT is that after the cumbersome pre-training is done once, fine-tuning with this universal model initializer is easy. This is however not true for BERT bi-encoders, especially retrievers, which need careful and costly training. Condenser extends this benefit of BERT to bi-encoders. Practitioners on a limited budget can replace BERT with our pre-trained Condenser as the initializer to get an instant performance boost. Meanwhile, for those aiming at the best performance, training techniques and Condenser can be combined: as we have demonstrated the combined effect of hard negatives and Condenser, sophisticated but better techniques can be further incorporated to train Condenser.

A.2 Model Size
In our experiments, Condenser during fine-tuning has the same number of parameters as BERT base, about 100 M. Adding the head during pre-training, there are roughly 120 M parameters.

A.3 ICT Model
Our ICT model comes from Chang et al. (2020). It is trained with a batch size of 4096. ICT's effectiveness in low-data setups was verified and thoroughly studied in that work, which also introduces two other pre-training tasks, Body First Selection and Wiki Link Prediction. These heavily depend on Wikipedia-like structure and on knowledge of that structure during pre-training, and therefore do not apply in general situations. Meanwhile, adding them improves over ICT by only around 1%, and model checkpoints with these tasks have not been released. Therefore we chose to use the ICT checkpoint.
Difficulties in reproducing these models come from the large batch requirement of the contrastive loss in ICT. Prior works find it critical to use a large batch: reported batch sizes are 4096 and 8192, both trained with Google's cloud TPUs. In comparison, our GPUs can fit a batch of only 64. The contrastive loss uses the entire batch as the negative pool to learn the embedding space; using gradient accumulation reduces this pool size by several factors, leading to a badly pre-trained model. In comparison, our Condenser is based on an instance-wise MLM loss and can naively use gradient accumulation.
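The arithmetic behind this point: gradient accumulation grows the effective batch for gradient averaging, but the in-batch negative pool seen by the contrastive loss is bounded by the per-step batch. A toy sketch (`negative_pool_size` is our own illustrative helper):

```python
def negative_pool_size(batch_per_step, accumulation_steps):
    """In-batch negatives only span one forward pass, so gradient
    accumulation shrinks the contrastive pool; a per-instance MLM loss
    is unaffected by how a batch is split across steps."""
    effective_batch = batch_per_step * accumulation_steps
    contrastive_pool = batch_per_step - 1   # negatives visible per query
    return effective_batch, contrastive_pool

print(negative_pool_size(64, 64))  # effective batch 4096, but only 63 negatives
```

This is why accumulating 64 steps of 64 does not reproduce a true 4096 contrastive batch, while it is a perfectly valid way to scale MLM pre-training.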
We convert the original Tensorflow Checkpoint into Pytorch with huggingface conversion script. We don't use the linear projection layer that maps the 768 BERT embedding vector to 128 so that the embedding capacity is kept the same as retrievers in Karpukhin et al. (2020).

A.4 Document Retrieval
Recent works (Xiong et al., 2021; Luan et al., 2020) explored retrieving long documents with the MS-MARCO document ranking dataset (Bajaj et al., 2018). There are several issues with this data set. The training set is not directly constructed but synthesized from the passage ranking data set labels. Xiong et al. (2021) find that the judgments in its TREC DL2019 test set are biased towards BM25 and other lexical retrieval systems over dense retrievers. Meanwhile, Luan et al. (2020) find single-vector representations have a capacity issue in encoding long documents. To prevent these confounding factors from affecting our discussion, we opted to defer the experiment to this appendix. Here we use two query sets, MS-MARCO Document Dev and TREC DL2019. We report official metrics, MRR@100 on Dev and NDCG@10 on DL2019. Results are recorded in Table 8. Condenser improves over BERT by a large margin and adding HN further boosts its performance. Condenser + HN performs best on Dev. On the other hand, ANCE is best on DL2019. We conjecture the reason is that the use of BM25 negatives in many systems is unfavorable under DL2019 labels that favor lexical retrievers; the multiple rounds of negative mining help ANCE get rid of the negative effect of BM25 negatives.

A.5 Engineering Detail
We implement Condenser (from BERT) in Pytorch (Paszke et al., 2019) based on the BERT implementation in huggingface transformers package (Wolf et al., 2019). As our adjustments go only into the model architecture and the LM objective is kept unchanged, we only need to modify the modeling file and reuse the pre-training pipeline from huggingface.

A.6 Link To Datasets
Sentence Similarity Cleaned-up versions can be found in the sentence-transformers repo: https://github.com/UKPLab/sentence-transformers.