Less is More: Pretrain a Strong Siamese Encoder for Dense Text Retrieval Using a Weak Decoder

Dense retrieval requires high-quality text sequence embeddings to support effective search in the representation space. Autoencoder-based language models are appealing for dense retrieval because they train the encoder to output high-quality embeddings that can reconstruct the input texts. However, in this paper, we provide theoretical analyses and show empirically that an autoencoder language model with a low reconstruction loss may not provide good sequence representations, because the decoder may take shortcuts by exploiting language patterns. To address this, we propose a new self-learning method that pre-trains the autoencoder using a weak decoder with restricted capacity and attention flexibility, pushing the encoder to provide better text representations. Our experiments on web search, news recommendation, and open-domain question answering show that our pre-trained model significantly boosts the effectiveness and few-shot ability of dense retrieval models. Our code is available at https://github.com/microsoft/SEED-Encoder/.


Introduction
Recently, Dense Retrieval (DR) has taken on more important roles in many language systems, for example, web search (Xiong et al., 2021), question answering (Karpukhin et al., 2020), and news recommendation (Wu et al., 2020b). In the first-stage retrieval of these scenarios, DR models generally employ a Siamese/Dual-Encoder architecture: the encoder separately encodes the user side (query, browsing history, or question) and the corpus side (document or passages) as individual embeddings in a learned representation space, where retrieval with simple similarity metrics can be conducted efficiently (Johnson et al., 2017; Guo et al., 2020).
A popular choice of text encoder in DR is the Transformer network pre-trained by language modeling, e.g., BERT (Reimers and Gurevych, 2019a). Unexpectedly, unlike in other language tasks where pre-trained models simply excel, directly fine-tuning BERT in DR often underperforms unsupervised sparse retrieval such as BM25. Rather complicated procedures are almost necessary to effectively fine-tune pre-trained Transformers in dense retrieval (Karpukhin et al., 2020; Luan et al., 2021; Xiong et al., 2021). One observation is that pre-trained language models are not effective at encoding the semantics of an entire text sequence in one embedding, especially in dense retrieval where text sequences are mostly longer than 128 tokens (Luan et al., 2021).
In other modalities, autoencoders have been widely used to obtain high-quality data representations (Vincent et al., 2010; Kingma and Welling, 2013). They pair a decoder with the encoder and train the decoder to reconstruct the data solely from the encoder's encodings, thus enforcing an information bottleneck on the data encodings for better representation quality. Recently, autoencoders have been brought into language pre-training: the Optimus model stacks a GPT-2 decoder on top of the BERT encoder and trains the autoencoder via a conditional language modeling task. The learned encoder, Optimus, provides better text encodings for GLUE and language generation tasks but, as shown in our empirical study, does not provide better encodings for dense retrieval.
This phenomenon inspires us to investigate why the standard autoencoder setup in language modeling falls short in dense retrieval. We first notice that the auto-regressive decoder takes not only the CLS encoding but also the previous tokens as input. Our mathematical analysis shows that the decoder can exploit natural language patterns through its access to previous tokens and bypass the dependency on the encoder, especially when the sequence is long and the decoder is strong, e.g., GPT-2. As a result, an autoencoder that achieves a low reconstruction loss does not necessarily provide better text sequence encodings.
Our analyses lead to a quite simple solution: we present a new autoencoder pre-training strategy, which pairs the BERT-style encoder with a weak decoder by restricting its parameter capacity and attention flexibility. This way, our SEED-Encoder, "Strong tExt Encoder by training with weak Decoder", creates an information bottleneck in the autoencoder and forces the encoder to provide better text representations. In our experiments on three real-world applications, we confirm that SEED-Encoder produces better pre-trained checkpoints that seed dense retrieval models with higher accuracy and better few-shot ability.

Related work
Pre-training Language Models. Masked Language Modeling (MLM) (Devlin et al., 2018) is one of the most effective ways to learn text representations. It first randomly masks some tokens in a sequence and then pre-trains a Transformer to recover them (Joshi et al., 2020; Liu et al., 2019; Clark et al., 2020). There are also attempts to design sequence-level tasks during pre-training. The next sentence prediction task proposed in Devlin et al. (2018) trains the model to predict whether two sequences are contiguous. Liu et al. (2019) showed this task is not effective and can be removed. Later work develops more sequence-level tasks, such as predicting whether two segments are from the same document. Our learning framework is closest to autoencoder-style pre-training that trains an encoder and a decoder for both language understanding and generation. We discuss it in detail and show how it motivates our work.
Dense Retrieval with Text Encoders. Dense retrieval systems often use the Siamese/Dual-Encoder architecture, where two sequences are encoded by the Transformer separately and their similarity is calculated upon their sequence embeddings. Reimers and Gurevych (2019b) were among the first to study how to use BERT in a Siamese architecture and found that the CLS representation does not perform as well as expected. Recent research (Karpukhin et al., 2020; Xiong et al., 2021) demonstrated that applying pre-trained models in dense text retrieval is not straightforward. Karpukhin et al. (2020) use BM25 to find negative samples to better fine-tune pre-trained models for dense retrieval. Xiong et al. (2021) perform global noise contrastive estimation, using the DR model itself to mine global negatives for its own training.

Method
In this section, we first recap preliminaries in language pre-training and autoencoder. Then we discuss the drawbacks of using strong decoders in autoencoder and address them with SEED-Encoder.

Preliminary
In a standard setup of pre-training language models, e.g., BERT (Devlin et al., 2018), the neural network to be pre-trained is a multi-layer bidirectional Transformer encoder (Vaswani et al., 2017), which takes a sequence of tokens x = (x_1, ..., x_n) from the vocabulary V and produces their contextualized representations:

(h_0, h_1, ..., h_n) = Transformer(CLS, x_1, ..., x_n; θ_enc),  (1)

where CLS is a special token added in the first position; its contextual representation h_0 is often used as the representation of the sequence. The parameters θ_enc of the Transformer are typically pre-trained using Masked Language Modeling (MLM) (Devlin et al., 2018), which masks a fraction of the input sequence and trains the model to predict the original tokens. For ease of reference, we denote this loss as L_MLM(x; θ_enc).
As there is no informative training target at the CLS position in token-level pre-training tasks, it is not formally guaranteed that the contextual representation at CLS contains enough information for sequence-level downstream tasks. The Optimus model introduces the autoencoder setup into language model pre-training, adding a reconstruction loss on top of the CLS token's encoding h_0, where h_0 is viewed as a latent variable. The decoder θ_dec, another deep Transformer model (GPT-2), receives h_0 and generates the original input autoregressively. The (variational) decoder loss is defined as:

L_dec(x; θ_enc, θ_dec) = -Σ_{t=1}^{n} log P(x_t | x_{<t}, h_0; θ_dec),  (2)

where x_{<t} denotes all tokens before position t.
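To make the role of the two conditioning inputs concrete, the decoder loss can be sketched as a toy computation over per-token probabilities. This is a hedged illustration only; `token_probs` is a made-up stand-in for the decoder's predicted probability of each correct token, regardless of whether that confidence came from h_0 or from the previous tokens:

```python
import math

def reconstruction_loss(token_probs):
    """Toy version of the decoder loss L_dec = -sum_t log P(x_t | x_<t, h_0).

    token_probs[t] is the probability the decoder assigns to the correct
    token x_t, given the CLS encoding h_0 and all previous tokens x_<t.
    """
    return -sum(math.log(p) for p in token_probs)

# A decoder that is confident about every token achieves a low loss,
# whether that confidence comes from h_0 or purely from x_<t.
loss_confident = reconstruction_loss([0.9, 0.95, 0.99])
loss_uncertain = reconstruction_loss([0.5, 0.5, 0.5])
```

The point is that the loss only sees the final probabilities: a decoder that becomes confident by modeling x_{<t} alone lowers L_dec just as much as one that actually uses h_0.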

Effects of Using a Strong Decoder
One would expect the autoencoder to provide good representations if the decoder can recover the input well. However, we found that a typical model stacking a standard autoregressive decoder on a standard BERT-style encoder does not work well in dense retrieval tasks. For example, we fine-tune the pre-trained checkpoint of Optimus, which stacks GPT-2 on top of BERT, on MS MARCO and compare it with BERT. We use Mean Reciprocal Rank (MRR) and recall as evaluation metrics. The detailed experimental setting can be found in Section 4.3, and the results are shown in Figure 1 (a). The performance of Optimus on dense retrieval tasks is worse than that of standard BERT, a sharp contrast with Optimus's effectiveness on other language tasks, e.g., the GLUE benchmark. Note that one difference between the data in GLUE and MS MARCO is sequence length. In most GLUE tasks sequences are short, e.g., 14 tokens on average in SST-2, while the average passage length in MS MARCO is more than 450 tokens. Also, recent research shows that long sequences are hard to represent via single embedding vectors from pre-trained models (Luan et al., 2021).
To confirm this, we randomly select sequence pairs of different lengths and calculate the cosine similarity of their CLS embeddings provided by Optimus. The results are shown in Figure 1 (b). The representations of long sequences (256 or 512 tokens) from Optimus are quite similar to each other; the cosine similarities of random long sequence pairs are around 0.8. The model yields cluttered representations for long text sequences. When fine-tuned for dense retrieval on MS MARCO, it does not separate relevant documents for a query from irrelevant ones: all of the representations are similar to each other and require dedicated fine-tuning to realign their encodings.
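This diagnostic is easy to reproduce in outline. Below is a minimal sketch of the pairwise cosine-similarity check, using random vectors as stand-ins for the CLS embeddings an encoder would produce (the function names are ours, not from the paper):

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def mean_pairwise_cosine(embeddings):
    """Average cosine similarity over all distinct pairs of CLS embeddings."""
    sims = [cosine(embeddings[i], embeddings[j])
            for i in range(len(embeddings))
            for j in range(i + 1, len(embeddings))]
    return sum(sims) / len(sims)

random.seed(0)
# Stand-ins for CLS embeddings of random long sequences; in the paper's
# diagnostic these would come from the pre-trained encoder instead.
embs = [[random.gauss(0, 1) for _ in range(64)] for _ in range(10)]
score = mean_pairwise_cosine(embs)
```

For truly diverse embeddings the mean pairwise similarity stays near zero; values near 0.8, as observed for Optimus on long sequences, indicate cluttered representations.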

Theoretical Analysis
Next, we mathematically show why the encoder may fail to learn good sequence representations using a strong decoder.
In Eqn. 2, at each time step t, the prediction of x t not only depends on the CLS encoding h 0 but also the previous tokens x <t . Thus a lower reconstruction loss may not be contributed by more informative h 0 : for a large t in a long text sequence, the model may directly predict x t from x <t if the decoder is strong. The quality of the representation at the CLS is not guaranteed as a low decoding loss may not reflect much about h 0 .
To further understand the requirements for informative sequence representations, we investigate the relationship between the reconstruction loss, h_0, and the language sequence in mathematical form. Let X be a random variable defined on the sequence space X, where each sequence x is sampled from the data distribution P_D; let X_{<t} be the truncation of X at position t, and let P_{θ_dec} be the sequence distribution generated by the decoder. For simplicity, we assume all sequences have length n. Using a standard fact from information theory, the expected reconstruction loss decomposes into a Kullback-Leibler divergence term and a conditional-entropy term:

E_{X~P_D}[L_dec] = Σ_{t=1}^{n} [ D_KL( P_D(X_t | X_{<t}, h_0) || P_{θ_dec}(X_t | X_{<t}, h_0) ) + H_D(X_t | X_{<t}, h_0) ].

The loss thus consists of two terms: a K-L term D_KL(·) describing the difference between the two distributions, and a conditional-entropy term H_D(·) reflecting the strength of language patterns. As we discuss next, both terms can achieve low values even with a random h_0. The K-L term characterizes how P_{θ_dec}(X_t | X_{<t}, h_0), the decoder's generated distribution, aligns with the ground-truth distribution P_D(X_t | X_{<t}, h_0). Even with a meaningless θ_enc, a decoder with sufficient capacity, e.g., a very deep Transformer, can still approximate the ground-truth distribution well and thereby reduce the K-L term. In theory, Transformers with arbitrary width and depth can approximate any sequence-level function and may reach a low K-L loss using little information from h_0 (Yun et al., 2019).
The second term H_D(X_t | X_{<t}, h_0) characterizes the strength of language patterns: the stronger the correlation between X_t and X_{<t}, the lower this term is. In natural language, the correlation becomes stronger for larger t, as there is more information in the previous tokens. There is thus no strong need for a good text encoding h_0, because a strong decoder can capture natural language patterns by itself.
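The decomposition used above is the standard identity that cross-entropy equals K-L divergence plus entropy. A small numeric check on a toy three-token vocabulary (the two distributions are made up for illustration):

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i log q_i."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

def entropy(p):
    """H(p) = -sum_i p_i log p_i."""
    return -sum(pi * math.log(pi) for pi in p)

def kl(p, q):
    """D_KL(p || q) = sum_i p_i log(p_i / q_i)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Ground-truth next-token distribution P_D(X_t | X_<t, h_0) and the
# decoder's distribution P_theta(X_t | X_<t, h_0) over a toy 3-token vocab.
p_data = [0.7, 0.2, 0.1]
p_dec = [0.6, 0.3, 0.1]

lhs = cross_entropy(p_data, p_dec)         # expected reconstruction loss at step t
rhs = kl(p_data, p_dec) + entropy(p_data)  # K-L term + conditional-entropy term
```

Since D_KL is non-negative, the conditional entropy lower-bounds the per-step loss: if language patterns alone make X_t predictable from X_{<t}, the loss is already small without h_0.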

SEED-Encoder
Our analysis shows that to obtain a stronger text encoder and a better h_0, we cannot make the decoder too strong: we need to constrain its capacity and also the language context available to it, reducing the correlation between X_t and X_{<t}, so that the decoder has to rely on the information in the encoder's CLS encoding to reconstruct the text sequence.
In the rest of this section, we introduce SEED-Encoder, which adopts these designs. The model structure is illustrated in Figure 2. Making a language model weaker is easier than making it stronger; we simply modify Eqn. 2 to weaken the decoder: • using a shallower Transformer θ_dec^weak with fewer layers (e.g., three); • restricting its access to previous context, i.e., limiting the model's attention to the previous k tokens.
This leads to the following reconstruction loss:

L_dec^weak(x; θ_enc, θ_dec^weak) = -Σ_{t=1}^{n} log P(x_t | x_{t-k:t-1}, h_0; θ_dec^weak),  (3)

where k is the window size of the restricted attention. Through these modifications, we enforce an information bottleneck between the encoder and the decoder, forcing the decoder to rely on the encoder's CLS representation and pushing the encoder to learn a more informative representation. As in the autoencoder setup above, the pre-training of SEED-Encoder combines the encoder's standard MLM loss with the decoder's reconstruction loss:

L(x; θ_enc, θ_dec^weak) = L_MLM(x; θ_enc) + L_dec^weak(x; θ_enc, θ_dec^weak).  (4)

The encoder and decoder are trained together. After pre-training, the decoder is discarded, and the encoder is used in downstream applications.
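The restricted attention can be pictured as a banded causal mask. Below is a toy sketch of such a mask, with index 0 standing in for the always-visible CLS encoding h_0; this illustrates the attention pattern only and is not the actual implementation:

```python
def restricted_causal_mask(n, k):
    """Attention mask for the weak decoder: position t may attend to the CLS
    slot (index 0 here) and to at most the k previous tokens.

    Returns an n x n matrix of 0/1, where mask[t][s] == 1 means position t
    may attend to position s. Index 0 plays the role of the encoder's CLS
    encoding h_0; indices 1..n-1 are the decoded token positions.
    """
    mask = [[0] * n for _ in range(n)]
    for t in range(n):
        mask[t][0] = 1                       # h_0 is always visible
        for s in range(max(1, t - k), t + 1):
            mask[t][s] = 1                   # the k most recent tokens (and self)
    return mask

m = restricted_causal_mask(6, 2)
```

With a small k, later positions lose direct access to most of the sequence, so the only route to distant context is through h_0, which is exactly the bottleneck SEED-Encoder relies on.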

Experiments
In this section, we present various experimental analyses to evaluate the SEED-Encoder on dense retrieval tasks. More results on other language tasks are in Appendix A.2.

Pre-training Details
All our models are pre-trained from scratch, following the setup of BERT-base (Devlin et al., 2018): pre-training on English Wikipedia and BookCorpus (Zhu et al., 2015) (roughly 16GB of text) for 1M steps, with batch size 256, maximum sequence length 512, and a 15% mask rate. We follow the preprocessing steps of Ke et al. (2020) and use a 32,768 sub-word vocabulary. We remove the next sentence prediction task following Liu et al. (2019). We use Adam (Kingma and Ba, 2014) as the optimizer, setting its hyperparameter ε to 1e-6 and (β1, β2) to (0.9, 0.999). The peak learning rate is set to 1e-4 with a 10k-step warm-up stage; after the warm-up stage, the learning rate decays linearly to zero. We set the dropout probability to 0.1, the gradient clip norm to 1.0, and the weight decay to 0.01. All code is implemented with fairseq in PyTorch (Paszke et al., 2017). All models are run on 8 NVIDIA Tesla V100 GPUs with mixed precision (Micikevicius et al., 2017).
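The learning-rate schedule described above (linear 10k-step warm-up to the peak of 1e-4, then linear decay to zero over the remaining steps) can be sketched as:

```python
def learning_rate(step, peak=1e-4, warmup=10_000, total=1_000_000):
    """Linear warm-up to the peak LR, then linear decay to zero, matching the
    schedule described above (10k warm-up steps out of 1M total)."""
    if step < warmup:
        return peak * step / warmup
    return peak * (total - step) / (total - warmup)
```

The function is only a restatement of the prose schedule; in practice the fairseq trainer applies an equivalent schedule internally.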
Our encoder architecture is the same as BERT-base: 12 Transformer layers, eight attention heads, and 768 hidden dimensions (110M parameters). We use a three-layer Transformer as the decoder, restrict its attention to the previous two tokens (attention span k = 2), and keep everything else the same as the encoder. The decoder is only used in pre-training and is dropped during fine-tuning, so there is no additional cost in fine-tuning or inference.

Fine-tuning Siamese/Dual-Encoders
Fine-tuning SEED-Encoder in the Siamese architecture on dense retrieval tasks works the same way as for other pre-trained models. Here we show how fine-tuning on a typical sentence-pair matching task with binary labels can be done with a triplet loss.
The training data consist of triples of a query x_q and its positively/negatively labeled sequences (x_d+, x_d−). The scoring of sequence pairs s(x_q, x_d) is done by simple similarity functions, such as cosine similarity or dot product, on their CLS encodings. More advanced fine-tuning strategies (Karpukhin et al., 2020; Xiong et al., 2021) can also be used, as SEED-Encoder is a drop-in alternative to other pre-trained encoders.
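A minimal sketch of this triplet objective on CLS encodings, with made-up two-dimensional embeddings for illustration (the margin value is an assumption, not from the paper):

```python
def dot(u, v):
    """Dot-product similarity s(x_q, x_d) on CLS encodings."""
    return sum(a * b for a, b in zip(u, v))

def triplet_loss(q, d_pos, d_neg, margin=1.0):
    """Triplet loss: push s(q, d+) above s(q, d-) by at least the margin.

    Cosine similarity works the same way after normalizing the embeddings.
    """
    return max(0.0, margin - dot(q, d_pos) + dot(q, d_neg))

q = [1.0, 0.0]
d_pos = [0.9, 0.1]   # embedding of the positively labeled sequence
d_neg = [0.1, 0.9]   # embedding of the negatively labeled sequence
loss = triplet_loss(q, d_pos, d_neg)
```

The loss vanishes once the positive outscores the negative by the margin, so gradients focus on the pairs the model still confuses.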

Experiments on Web Search
Our first application, web search, uses the MS MARCO (Bajaj et al., 2016) dataset, the largest public search benchmark to date. It includes two tasks, passage ranking and document ranking. We focus on the first-stage retrieval step, which is to find relevant passages/documents from the entire corpus. We also show results in the reranking setting, where all models rerank a pre-given set of candidate documents (top 100 from BM25), for reference. More details of MARCO are in Appendix A.1.
Our pre-trained encoders are fine-tuned with the ANCE negative sampling strategy (Xiong et al., 2021). In document retrieval, we use ANCE (FirstP), which uses the first 512 tokens of a long document and cuts off the rest. We also evaluate another negative sampling strategy, BM25 Neg, which uses the top 100 BM25-retrieved results as negative samples and performs similarly to DPR (Karpukhin et al., 2020) on MARCO.
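A sketch of how BM25 Neg training triples might be assembled; the function and the document IDs are hypothetical, and real pipelines operate on passage IDs taken from the BM25 run file:

```python
def build_triples(query, positives, bm25_top100):
    """BM25 Neg sampling sketch: treat BM25 top-100 results that are not
    labelled relevant as negatives, and pair each with a labelled positive."""
    positive_set = set(positives)
    negatives = [p for p in bm25_top100 if p not in positive_set]
    return [(query, pos, neg) for pos in positives for neg in negatives]

# Hypothetical IDs: "d2" is the labelled positive, so "d1" and "d3" from
# the BM25 top results become negatives.
triples = build_triples("q1", ["d2"], ["d1", "d2", "d3"])
```

ANCE differs in that the negatives are refreshed during training from the DR model's own retrieval results rather than from a fixed BM25 run.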
Baselines: The main baseline is our run of BERT-base (Devlin et al., 2018; Liu et al., 2019), which we pre-trained and fine-tuned in the exact same setting as SEED-Encoder. We use the permutation test with p < 0.05 as the statistical significance test between SEED-Encoder and BERT (Ours). Besides BERT, we evaluate two other pre-trained language models in the same setting: ELECTRA (Clark et al., 2020) and ERNIE2.0. ELECTRA is one of the most effective pre-trained encoders on the GLUE benchmark (Clark et al., 2019). ERNIE2.0 uses various token-level and sentence-level tasks, including an IR relevance task. We use the MARCO passage benchmark to showcase the performance of these two pre-trained models.
In addition, we also list task-specific first-stage retrieval baselines that were published recently or submitted to the leaderboard, although they barely outperform our vanilla BERT baseline. For passage ranking, the classic sparse retrieval baselines include the standard BM25, Best TREC Sparse Retrieval with tuned query expansion, and Best DeepCT, all from the TREC DL 2019 official evaluation (Craswell et al., 2020). These three approaches represent standard sparse retrieval, the best classical sparse retrieval, and the latest method of using BERT to improve sparse retrieval, respectively. For document ranking, BM25 (Craswell et al., 2020) and the enriched traditional IR baseline are standard sparse retrieval baselines. The enriched traditional IR baseline uses pre-defined IR features, including BM25, to rank the documents. BM25 + doc2query-T5 expansion uses the Doc2query model (Nogueira et al., 2019), expanding the documents with predicted queries that are related to or representative of the documents' content; the queries are predicted by a sequence-to-sequence model taking the document terms as input. Both DE-hybrid and ME-hybrid (Luan et al., 2021) combine dense features from BERT with hand-crafted sparse features. DE-hybrid takes the CLS representations of document and query as dense features and calculates their dot-product similarity, which is further combined with sparse retrieval scores as the final ranking score. Different from DE-hybrid, ME-hybrid uses max-pooling over multiple contextual embeddings as dense features.

Results:
The results of SEED-Encoder and the baselines on MARCO passage retrieval and document retrieval are listed in Table 1 and Table 2. SEED-Encoder outperforms all existing baselines on all benchmarks. By simply switching the fine-tuning starting checkpoint from BERT to SEED-Encoder, without changing any architectures or fine-tuning strategies, accuracy is significantly improved on these two large-scale benchmarks.
In comparison, on MARCO passage retrieval, switching from BERT to ELECTRA or ERNIE2.0 does not improve retrieval accuracy. Pre-trained models optimized for other scenarios are not necessarily better for dense retrieval.
On MARCO document retrieval, ANCE (FirstP) uses only one vector per document, from its first 512 tokens (Xiong et al., 2021).

Experiments on News Recommendation
Our second application is news article recommendation, another important real-world task that connects users with information. We use the recently released MIcrosoft News Dataset (MIND) benchmark (Wu et al., 2020b). The task is to rank a given set of candidate news articles based on the user's previous click history on MSN news articles. The evaluation uses the user's clicks as positive labels. We use the public MIND Dev set and its official metrics: AUC, MRR, NDCG@5, and NDCG@10. More details of MIND are in Appendix A.1. We follow MIND's official setting and use a standard dense retrieval model to rerank the pre-given candidate news articles. Our DR model represents each user's history by concatenating all the titles they clicked on the MSN site, with [SEP] tokens in between, using as many recent titles as fit within the 512-token limit. The candidate articles are represented by the concatenation of their titles and snippets. The model then encodes the user history and candidate articles with SEED-Encoder and matches them with dot products.
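A sketch of the user-side input construction described above; the helper name is ours, and counting whitespace tokens is a simplification standing in for the sub-word tokenizer's actual length budget:

```python
def encode_user_history(clicked_titles, max_len=512):
    """Build the user-side input: recent clicked titles joined by [SEP],
    keeping as many recent titles as fit in the length budget (counted in
    whitespace tokens here; a real tokenizer counts sub-words)."""
    picked = []
    used = 0
    for title in reversed(clicked_titles):  # most recent title first
        cost = len(title.split()) + 1       # +1 for the [SEP] delimiter
        if used + cost > max_len:
            break
        picked.append(title)
        used += cost
    return " [SEP] ".join(picked)

history = encode_user_history(["old news story", "fresh headline"], max_len=8)
```

The resulting string is encoded by SEED-Encoder into a single user embedding, matched against candidate-article embeddings by dot product.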
Baselines: MIND is a relatively new benchmark. The most recent baselines are those in Wu et al. (2020a). Based on the Transformer (Vaswani et al., 2017), Transformer-XL uses relative positional encodings that integrate content-dependent positional scores and a global positional score in the self-attention layer. TENER (Yan et al., 2019) uses direction-aware sinusoidal relative position embeddings in a similar way to Transformer-XL. Different from both, DA-Transformer (Wu et al., 2020a) directly rescales the attention weights based on mapped relative distances instead of using sinusoidal position embeddings. As in the web search experiments, we also compare SEED-Encoder with BERT (Ours).

Results:
The results of SEED-Encoder and the baselines on MIND are listed in Table 3. SEED-Encoder outperforms the recent state-of-the-art DA-Transformer, which employs various architecture improvements specifically designed for recommendation (Wu et al., 2020a). A better self-learning strategy for leveraging unsupervised data can be as effective as, if not better than, task-specific architecture changes, while avoiding all the engineering hassle.

Experiments on Open QA
Our third application is dense retrieval in open-domain question answering. This task often uses a two-stage framework: first, a context retriever selects a small set of passages that may contain the answer to the question; then, a machine reader thoroughly examines the retrieved passages and identifies the correct answer (Karpukhin et al., 2020). We focus on the first stage, i.e., dense retrieval to select relevant passages. We use the Natural Questions query set (Kwiatkowski et al., 2019) and the Wikipedia passages prepared and shared in DPR (Karpukhin et al., 2020). More details of the NQ dataset are in Appendix A.1. We follow the evaluation metrics used in DPR: hit accuracy at Top-20 and Top-100. Models are fine-tuned using the DPR fine-tuning strategy as in Karpukhin et al. (2020), which uses a dual-encoder architecture and samples negatives in the mini-batch. We also experiment with the ANCE fine-tuning strategy as in Xiong et al. (2021), which dynamically samples hard negatives.
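The DPR-style in-batch negative objective can be sketched as a softmax cross-entropy over a batch similarity matrix. This is a toy version; real implementations compute the matrix as a dot product of question and passage embedding matrices:

```python
import math

def in_batch_negative_loss(sim_matrix):
    """DPR-style fine-tuning sketch: sim_matrix[i][j] is the similarity of
    question i with passage j in the batch. Passage i is the positive for
    question i; all other passages in the batch act as negatives."""
    total = 0.0
    for i, row in enumerate(sim_matrix):
        log_z = math.log(sum(math.exp(s) for s in row))  # softmax normalizer
        total += log_z - row[i]                          # -log P(positive)
    return total / len(sim_matrix)

# Diagonal-dominant similarities: each question already prefers its positive.
loss = in_batch_negative_loss([[5.0, 0.0], [0.0, 5.0]])
```

ANCE keeps this objective but replaces the in-batch negatives with hard negatives retrieved by the DR model itself during training.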
Baselines: We take BM25 and BERT as baselines, as in Karpukhin et al. (2020). Consistent with the web search and news recommendation tasks, we also compare SEED-Encoder with BERT (Ours).

Results:
The results of SEED-Encoder and the baselines on the NQ benchmark are in Table 4. Again, SEED-Encoder outperforms all other baselines with either DPR or ANCE negative sampling. We do not change any architectures or fine-tuning strategies and simply switch the BERT checkpoint to SEED-Encoder, yet obtain significant improvements on this large-scale benchmark.

Discussion and Analysis
In this section, we conduct further analysis to understand the advantages of SEED-Encoder. For simplicity, all experiments are run on the MS MARCO document retrieval task.

Ablation study
In the experiments above, we use a three-layer Transformer decoder and restrict the attention span to two tokens. One may wonder whether such constraints are essential for learning good sentence representations. In this section, we try various decoder configurations with different numbers of layers and attention window sizes. From the results in Figure 3, we can see that the SEED-Encoder with the strongest decoder, a five-layer Transformer with full attention (All), performs worse than those with weaker decoders in dense retrieval. The retrieval accuracy correlates inversely with the capacity of the decoder used in pre-training. So, unlike typical multi-task settings where tasks share lower-level representations and correlate in accuracy, in SEED-Encoder the decoder's role is to force the encoder to capture more information in its sequence embeddings: a weak decoder leads to a stronger encoder.
To further understand the relationship between the encoder's CLS embedding and the decoder, in Figure 4 we plot the cosine similarity between the decoder's last-layer token representations and the encoder's CLS. The impact of restricting attention is significant: with full attention (Figure 4(a)), the decoder depends heavily on the encoder's CLS at the beginning but quickly drops this dependency once sufficient context information is available; with restricted access to context, the decoder is forced to attend more to the encoder's CLS representation at all token positions, as shown by the consistent cosine similarity across positions in Figure 4(b). This confirms that when the decoder is weak (restricted attention), it depends more on the encoder's CLS and thus pushes the encoder to learn more informative representations. The results in Figure 4(a) also suggest that with a powerful decoder, the CLS embedding encodes the first several words of the sentence but may ignore the rest. This can be one of the reasons Optimus performs worse than BERT in dense retrieval in Figure 1(a).

Document Representation Quality
In Section 3.2, we empirically showed that with a standard autoencoder learning framework, the similarity between sequence representations grows large for long sequences. In this section, we first study whether SEED-Encoder improves representation diversity. Similar to Figure 1(b), we collect randomly sampled sentence pairs and calculate the cosine similarity of their CLS encodings generated by SEED-Encoder.
The results in Figure 5 show that the CLS embeddings generated by SEED-Encoder are more diverse: the average CLS cosine similarity is only 0.48 even when the sentence length is 512. This shows that SEED-Encoder differentiates sentences well during pre-training.
Few-shot effectiveness Note that diverse representations are not necessarily high-quality ones. To assess the effectiveness of the representations, we conduct few-shot learning experiments with SEED-Encoder. In particular, we record the dev performance during the fine-tuning stage and check how many training iterations and samples are required for the model to achieve reasonably good performance.
In Figures 6(a) and 6(b), we plot the retrieval accuracy at different fine-tuning steps. Starting from SEED-Encoder instead of BERT, both the vanilla Siamese model and ANCE achieve higher retrieval accuracy from the very beginning and maintain their advantage throughout the fine-tuning process. For example, Siamese (BM25 Neg) requires only 30k fine-tuning iterations with SEED-Encoder to reach BERT's best performance at 140k iterations. With ANCE (FirstP), it takes 150k iterations with SEED-Encoder versus 750k with BERT.
In Figures 6(c) and 6(d), we plot the retrieval accuracy with different fractions of training data. Compared with BERT, SEED-Encoder always reaches better accuracy with fewer training labels. When using only 10% of the training labels, SEED-Encoder (MRR 0.318 in Figure 6(c)) is still competitive with BERT using all training labels (MRR 0.32).
These results indicate that the representations learned by SEED-Encoder are better than those learned by BERT. The reduction in fine-tuning cost helps democratize the benefits of pre-trained models, especially in applications where computing resources or task-specific supervision is limited.

Figure 6: MS MARCO passage retrieval accuracy of Siamese (BM25 Neg) and ANCE (FirstP) when fine-tuned from BERT (Ours) and SEED-Encoder. (a) and (b): accuracy at different fine-tuning steps (x-axes, in 100K). (c) and (d): accuracy with a fraction (x-axes) of training labels in the few-shot setting.

Case Study We further showcase some winning examples of SEED-Encoder in Table 5. The errors made by BERT correlate with our observation in Figure 4(a): the encoder's representation is more related to the tokens at the beginning of the text sequence, which are often related to the query. Only when the model captures the information of the entire text can it find the correct documents. For example, in the first case, SEED-Encoder captures "winter hiking" near the end of the document, while BERT only attends to some keywords at the beginning of the document even though the overall semantics do not match; in the second case, BERT misses the "party" part of the query.

Conclusion
In this paper we present SEED-Encoder, a self-training framework dedicated to pre-training language models for dense text retrieval. We pre-train an autoencoder that employs a weak decoder with restricted capacity and attention span, following our mathematical derivation. The weak decoder pushes SEED-Encoder to capture more context information and generate better text representations. In our experiments on web search, news recommendation, and question answering, dense retrieval models initialized from SEED-Encoder achieve state-of-the-art accuracy compared to several strong baselines. Future work along this direction includes exploring more self-learning tasks and network architectures for sequence matching in dense retrieval scenarios.

More Details of MS MARCO Dataset MS MARCO (Bajaj et al., 2016) is the largest available search benchmark to date. It includes two tasks, document ranking and passage ranking. Both are to find and rank relevant documents/passages from a web corpus for a web query from Bing. The dataset statistics are summarized in Table 6.
More Details of MIND Dataset MIcrosoft News Dataset (MIND) (Wu et al., 2020b) is a largescale recommendation dataset that collects about 160k English news articles and more than 15 million user impression logs from MSN news. Each news article contains the title, abstract, body, and category. Each impression log includes the user's click behavior on the page and her historical news click behaviors. The task is to rank a given set of candidate news articles, e.g., those from an early stage of their recommendation pipeline, based on the user's previous click history. The dataset statistics are summarized in Table 7.
More Details of NQ Dataset For the OpenQA experiments we use the Natural Questions query set (Kwiatkowski et al., 2019), in which the queries are mined from real Google search queries and the corresponding answers are spans in Wikipedia articles identified by annotators. We use the Wikipedia passages preprocessed and shared in DPR (Karpukhin et al., 2020), which include 21,015,324 passages. More detailed statistics, such as the number of queries, can be found in Karpukhin et al. (2020).

A.2 GLUE
We also consider the GLUE benchmark (Wang et al., 2018), which contains nine datasets for general language understanding. Here we select MNLI, QQP, QNLI, and SST-2 from the GLUE benchmark, and compare the performance of SEED-Encoder with BERT (Ours) and Optimus on these tasks. We follow the fine-tuning schedule in Devlin et al. (2018), and the results are shown in Table 8. On these GLUE tasks, SEED-Encoder is no worse than BERT or Optimus. This shows that while SEED-Encoder generates higher-quality representations that fit the Siamese network well, its performance on GLUE does not degrade.