RetroMAE: Pre-Training Retrieval-oriented Language Models Via Masked Auto-Encoder

Despite pre-training's progress on many important NLP tasks, effective pre-training strategies for dense retrieval remain to be explored. In this paper, we propose RetroMAE, a new retrieval-oriented pre-training paradigm based on the Masked Auto-Encoder (MAE). RetroMAE is highlighted by three critical designs. 1) A novel MAE workflow, where the input sentence is polluted for the encoder and the decoder with different masks. The sentence embedding is generated from the encoder's masked input; then, the original sentence is recovered from the sentence embedding and the decoder's masked input via masked language modeling. 2) An asymmetric model structure, with a full-scale BERT-like transformer as the encoder and a one-layer transformer as the decoder. 3) Asymmetric masking ratios, with a moderate ratio for the encoder (15~30%) and an aggressive ratio for the decoder (50~70%). Our framework is simple to realize and empirically competitive: the pre-trained models dramatically improve the SOTA performances on a wide range of dense retrieval benchmarks, like BEIR and MS MARCO. The source code and pre-trained models are made publicly available at https://github.com/staoxiao/RetroMAE so as to inspire more interesting research.


Introduction
Dense retrieval is important to many web applications. By representing semantically correlated queries and documents as spatially close embeddings, dense retrieval can be efficiently conducted via approximate nearest neighbour search, such as PQ (Jegou et al., 2010; Xiao et al., 2021, 2022a) and HNSW (Malkov and Yashunin, 2018). Recently, large-scale language models have been widely used as the encoding networks for dense retrieval (Karpukhin et al., 2020; Xiong et al., 2020; Luan et al., 2021). The mainstream models, e.g., BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), T5 (Raffel et al., 2019), are usually pre-trained by token-level tasks, like MLM and Seq2Seq. However, the sentence-level representation capability is not fully developed in these tasks, which restricts their potential for dense retrieval.
(† The two researchers make equal contributions to this work and are designated as co-first authors. * Corresponding author.)
Given the above defect, there has been increasing interest in developing retrieval-oriented pre-trained models. One popular strategy is to leverage self-contrastive learning (Chang et al., 2020; Guu et al., 2020), where the model is trained to discriminate positive samples generated from data augmentation. However, self-contrastive learning can be severely limited by the quality of the data augmentation; besides, it usually calls for massive amounts of negative samples (He et al., 2020a; Chen et al., 2020). Another strategy relies on auto-encoding (Gao and Callan, 2021; Lu et al., 2021; Wang et al., 2021), which is free from the restrictions on data augmentation and negative sampling. The current works differ in how the encoding-decoding workflow is designed, and it remains an open problem to explore more effective auto-encoding frameworks for retrieval-oriented pre-training.
We argue that two factors are critical for auto-encoding based pre-training: 1) the reconstruction task must be demanding enough on encoding quality, and 2) the pre-training data needs to be fully utilized. We propose RetroMAE (Figure 1), which optimizes both aspects with the following designs.
• A novel MAE workflow. The pre-training follows a novel masked auto-encoding workflow. The input sentence is polluted twice with two different masks. One masked input is used by the encoder, where the sentence embedding is generated. The other one is used by the decoder: joined with the sentence embedding, the original sentence is recovered via masked language modeling (MLM).
• Asymmetric structure. RetroMAE adopts an asymmetric model structure. The encoder is a full-scale BERT, which is able to generate a discriminative embedding for the input sentence. In contrast, the decoder follows an extremely simplified structure, i.e., a single-layer transformer, which is learned to reconstruct the input sentence.
• Asymmetric masking ratios. The encoder's input is masked at a moderate ratio of 15∼30%, which is slightly above its traditional value in MLM. However, the decoder's input is masked at a much more aggressive ratio of 50∼70%.
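The dual-masking step above can be sketched as follows. This is a minimal string-level illustration; the helper name `mask_tokens` is our own, and the real implementation masks token ids inside the training collator rather than whitespace-split words:

```python
import random

MASK = "[M]"

def mask_tokens(tokens, ratio, seed=None):
    """Replace a random fraction of tokens with [M].

    Hypothetical helper for illustration only: the actual implementation
    operates on token ids inside the data collator, not on strings.
    """
    rng = random.Random(seed)
    n_mask = max(1, int(len(tokens) * ratio))
    positions = rng.sample(range(len(tokens)), n_mask)
    masked = list(tokens)
    for p in positions:
        masked[p] = MASK
    return masked

sentence = "dense retrieval maps queries and documents to embeddings".split()
x_enc = mask_tokens(sentence, ratio=0.3, seed=1)   # moderate mask for the encoder
x_dec = mask_tokens(sentence, ratio=0.5, seed=2)   # aggressive mask for the decoder
print(x_enc)
print(x_dec)
```

Note that the two masks are sampled independently, so the decoder generally sees (and must reconstruct) tokens the encoder never observed.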
The above designs of RetroMAE are favorable to the pre-training effectiveness for the following reasons. Firstly, the auto-encoding is made more demanding on encoding quality. Conventional auto-regression may attend to the prefix during the decoding process, and conventional MLM only masks a small portion (15%) of the input tokens. By comparison, RetroMAE aggressively masks most of the input for decoding. As such, the reconstruction cannot rely on the decoder's input alone, but must heavily depend on the sentence embedding. This forces the encoder to capture the in-depth semantics of the input. Secondly, it ensures that training signals are fully generated from the input sentence. For conventional MLM-style methods, the training signals may only be generated from 15% of the input tokens, whereas for RetroMAE, the training signals can be derived from the majority of the input. Besides, knowing that the decoder only contains one single layer, we further propose the enhanced decoding on top of two-stream attention (Yang et al., 2019) and a position-specific attention mask (Dong et al., 2019). As such, 100% of the tokens can be used for reconstruction, and each token may sample a unique context for its reconstruction.
RetroMAE is simple to realize and empirically competitive. We merely use a moderate amount of data (Wikipedia, BookCorpus, MS MARCO corpus) for pre-training, where a BERT-base-scale encoder is learned. For the zero-shot setting, it produces an average score of 45.2 on BEIR (Thakur et al., 2021); and for the supervised setting, it may easily reach an MRR@10 of 41.6 on MS MARCO passage retrieval (Nguyen et al., 2016) following standard knowledge distillation procedures. Both values are unprecedented for dense retrievers with the same model size and pre-training conditions. We also carefully evaluate the impact introduced by each of the components, whose results may bring interesting insights to future research.

Related works
Dense retrieval is widely applied to web applications, like search engines (Karpukhin et al., 2020), advertising (Lu et al., 2020; Zhang et al., 2022) and recommender systems (Xiao et al., 2022b). It encodes queries and documents within the same latent space, where documents relevant to the query can be efficiently retrieved via ANN search. The encoding model is critical for retrieval quality. Thanks to the recent development of large-scale language models, e.g., BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), and T5 (Raffel et al., 2019), there has been a major leap forward in dense retrieval's performance (Karpukhin et al., 2020; Luan et al., 2021; Lin et al., 2021).
The large-scale language models are highly differentiated in terms of pre-training tasks. One common task is masked language modeling (MLM), as adopted by BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019), in which the masked tokens are predicted based on their context. The basic MLM is extended in many ways. For example, tasks like entity masking, phrase masking and span masking (Sun et al., 2019; Joshi et al., 2020) may help the pre-trained models to better support sequence labeling applications, such as entity resolution and question answering. Besides, tasks like auto-regression (Radford et al., 2018; Yang et al., 2019) and Seq2Seq (Raffel et al., 2019; Lewis et al., 2019) are also utilized, enabling the pre-trained models to serve NLG-related scenarios. However, most of the generic pre-trained models are based on token-level tasks, where the sentence representation capability is not effectively developed (Chang et al., 2020). Thus, they may call for a great deal of labeled data (Nguyen et al., 2016; Kwiatkowski et al., 2019) and sophisticated fine-tuning methods (Xiong et al., 2020; Qu et al., 2020) to ensure the pre-trained models' performance for dense retrieval.
To mitigate the above problem, recent works propose retrieval-oriented pre-trained models. The existing methods can be divided into those based on self-contrastive learning (SCL) and those based on auto-encoding (AE). The SCL based methods (Chang et al., 2020; Guu et al., 2020; Xu et al., 2022) rely on data augmentation, e.g., the inverse cloze task (ICT), where positive samples are generated for each anchor sentence. Then, the language model is learned to discriminate the positive samples from the negative ones via contrastive learning. However, self-contrastive learning usually calls for huge amounts of negative samples, which is computationally expensive. Besides, the pre-training effect can be severely restricted by the quality of the data augmentation. The AE based methods are free from these restrictions, where the language models are learned to reconstruct the input sentence based on the sentence embedding. The existing methods utilize various reconstruction tasks, such as MLM (Gao and Callan, 2021) and auto-regression (Lu et al., 2021; Wang et al., 2021; Li et al., 2020), which are highly differentiated in terms of how the original sentence is recovered and how the training loss is formulated. For example, auto-regression relies on the sentence embedding and the prefix for reconstruction, while MLM utilizes the sentence embedding and the masked context. Auto-regression derives its training loss from the entire input tokens; however, conventional MLM only learns from the masked positions, which account for 15% of the input tokens. Ideally, we expect the decoding operation to be demanding enough, as this will force the encoder to fully capture the semantics of the input so as to ensure the reconstruction quality. Besides, we also expect high data efficiency, which means the input data can be fully utilized by the pre-training task.

Methodology
We develop a novel masked auto-encoder for retrieval-oriented pre-training. The model contains two modules: a BERT-like encoder Φ_enc(·) to generate the sentence embedding, and a one-layer transformer based decoder Φ_dec(·) for sentence reconstruction. The original sentence X is masked as X̃_enc and encoded as the sentence embedding h_X. The sentence is masked again (with a different mask) as X̃_dec; together with h_X, the original sentence X is reconstructed. Detailed elaborations about RetroMAE are made as follows.

Encoding
The input sentence X is polluted as X̃_enc for the encoding stage, where a small fraction of its tokens are randomly replaced by the special token [M] (Figure 2. A). We apply a moderate masking ratio (15∼30%), which means the majority of the information about the input will be preserved. Then, the encoder Φ_enc(·) is used to transform the polluted input into the sentence embedding h_X:

h_X = Φ_enc(X̃_enc). (1)

We apply a BERT-like encoder with 12 layers and 768 hidden dimensions, which helps to capture the in-depth semantics of the sentence. Following the common practice, we select the [CLS] token's final hidden state as the sentence embedding.
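The encoding stage can be sketched as follows in PyTorch. This is a scaled-down stand-in for Φ_enc: the paper's encoder has 12 layers and 768 hidden dimensions, while this sketch shrinks both so it runs quickly, and it omits padding/attention masks for brevity:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Scaled-down stand-in for the encoder Φ_enc. The [CLS] (first-position)
# final hidden state is returned as the sentence embedding h_X.
VOCAB, DIM, LAYERS, MAX_LEN = 30522, 64, 2, 512

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.tok = nn.Embedding(VOCAB, DIM)
        self.pos = nn.Embedding(MAX_LEN, DIM)
        layer = nn.TransformerEncoderLayer(DIM, nhead=8, batch_first=True)
        self.layers = nn.TransformerEncoder(layer, num_layers=LAYERS)

    def forward(self, ids):                       # ids: [B, L], already masked
        p = self.pos(torch.arange(ids.size(1)))
        h = self.layers(self.tok(ids) + p)        # [B, L, DIM]
        return h[:, 0]                            # [CLS] final state = h_X

enc = Encoder()
ids = torch.randint(0, VOCAB, (2, 16))            # a masked batch, X̃_enc
h_x = enc(ids)
print(h_x.shape)                                  # torch.Size([2, 64])
```

In the actual model, `ids` would come from a BERT tokenizer with 15∼30% of positions replaced by the [M] id before encoding.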

Decoding
The input sentence X is polluted as X̃_dec for the decoding stage (Figure 2. B). The masking ratio is more aggressive than the one used by the encoder: 50∼70% of the input tokens will be masked. The masked input is joined with the sentence embedding, based on which the original sentence is reconstructed by the decoder. Particularly, the sentence embedding and the masked input are combined into the following sequence:

H_X̃dec = [h_X, e_x1 + p_1, ..., e_xN + p_N]. (2)

In the above equation, e_xi denotes the embedding of x_i, to which an extra position embedding p_i is added. Finally, the decoder Φ_dec is learned to reconstruct the original sentence X by optimizing the following objective:

L_dec = Σ_{x_i ∈ masked} CE(x_i | Φ_dec(H_X̃dec)), (3)

where CE is the cross-entropy loss. As mentioned, we use a one-layer transformer based decoder. Given the aggressively masked input and the extremely simplified network, the decoding becomes challenging, which forces the generation of a high-quality sentence embedding so that the original input can be recovered with good fidelity.
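The basic decoding objective above can be sketched roughly as follows. Dimensions are shrunk, and writing h_X into position 0 of the decoder input is a simplification of how the sentence embedding is actually combined; cross-entropy is taken only at the masked positions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Sketch of the basic decoding loss: a single transformer layer tries to
# reconstruct the original ids from the sentence embedding plus the
# aggressively masked input.
VOCAB, DIM = 1000, 64

tok = nn.Embedding(VOCAB, DIM)
pos = nn.Embedding(128, DIM)
decoder = nn.TransformerEncoderLayer(DIM, nhead=8, batch_first=True)
lm_head = nn.Linear(DIM, VOCAB)

def decode_loss(h_x, masked_ids, orig_ids, mask):
    # h_x: [B, DIM]; masked_ids/orig_ids: [B, L]; mask: [B, L], True = masked
    p = pos(torch.arange(masked_ids.size(1)))
    h = tok(masked_ids) + p
    h[:, 0] = h_x + p[0]                    # sentence embedding at position 0
    logits = lm_head(decoder(h))            # [B, L, VOCAB]
    return F.cross_entropy(logits[mask], orig_ids[mask])

B, L = 2, 16
orig = torch.randint(1, VOCAB, (B, L))
mask = torch.rand(B, L) < 0.5               # ~50% aggressive masking
masked = orig.masked_fill(mask, 0)          # id 0 stands in for [M]
loss = decode_loss(torch.randn(B, DIM), masked, orig, mask)
print(loss.item() > 0)                      # True
```

Because the loss is restricted to `mask`, only the masked (majority) positions supply training signals in this basic scheme, which is exactly the limitation the enhanced decoding below addresses.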

Enhanced Decoding
One limitation of the above decoding process is that the training signals, i.e., the cross-entropy losses, can only be derived from the masked tokens. Besides, every masked token is always reconstructed based on the same context, i.e., H_X̃dec. We argue that the pre-training effect can be further enhanced provided that 1) more training signals can be derived from the input sentence, and 2) the reconstruction task can be performed based on diversified contexts. To this end, we propose the enhanced decoding, inspired by two-stream self-attention (Yang et al., 2019) and the position-specific attention mask (Dong et al., 2019). Particularly, we generate two input streams, H1 (query) and H2 (context), for the decoding operation (Figure 2. C):

H1 = [h_X + p_0, ..., h_X + p_N],
H2 = [h_X, e_x1 + p_1, ..., e_xN + p_N], (4)

where h_X is the sentence embedding, e_xi is the token embedding (no token is masked in this place), and p_i is the position embedding. We introduce the position-specific attention mask M ∈ R^{L×L}, with which the self-attention is performed as:

Q = H1·W^Q, K = H2·W^K, V = H2·W^V;
A = softmax(QK^T/√d + M)·V. (5)

The output A, together with H1 (because of the residual connection), is used to reconstruct the original input (other operations, like layer-norm and FFN, are omitted from our discussion). Finally, the following objective will be optimized:

L_dec = Σ_{x_i ∈ X} CE(x_i | A, H1). (6)

Knowing that the decoder only consists of one single transformer layer, each token x_i is reconstructed based on the context which is visible to the i-th row of matrix M.
The following rules are applied to generate the position-specific attention mask matrix M:

M_ij = 0, if x_j ∈ s(X_{≠i}), or j = 0 while i ≠ 0;
M_ij = −∞, otherwise. (7)

That is, the sampled tokens s(X_{≠i}) and the first position (except for the first row) will be visible when reconstructing x_i. The diagonal elements, i.e., position i for the i-th row, will always be excluded, which means they will always be masked; as a result, each token cannot attend to itself during the reconstruction. We summarize the pre-training workflow with the enhanced decoding as Algorithm 1. Note that the original masked language modeling task of BERT is kept in the encoder. The corresponding loss, denoted as L_enc, is added to the decoder's loss, which formulates the final training loss. The following features need to be emphasized for our pre-training workflow. Firstly, the reconstruction task is demanding given the aggressive masking ratio and the extremely simplified network of the decoder. Secondly, we may derive abundant pre-training signals from the unsupervised corpus since all tokens within each input sentence can be used for the reconstruction. Finally, the pre-training is simple to realize: 1) there are no requirements on sophisticated data augmentation and negative sampling, and 2) the computation cost is maintained to be similar to the conventional BERT/RoBERTa style pre-training given the simplicity of the decoder.
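The mask-generation rules above can be sketched as follows. The uniform sampler for the visible context s(X≠i) is our own choice for illustration; the source does not prescribe a particular sampling distribution:

```python
import torch

def build_attention_mask(L, n_visible):
    """Build the position-specific attention mask M described above.

    For row i, a random sample of n_visible other (non-diagonal) positions
    is visible (0), plus position 0 (the sentence embedding) for every row
    but the first; all remaining entries, including the diagonal, are -inf,
    so each token can never attend to itself during reconstruction.
    """
    M = torch.full((L, L), float("-inf"))
    for i in range(L):
        candidates = [j for j in range(1, L) if j != i]   # never self, never col 0
        picked = torch.randperm(len(candidates))[:n_visible]
        for k in picked.tolist():
            M[i, candidates[k]] = 0.0
        if i != 0:
            M[i, 0] = 0.0        # the sentence embedding is always visible
    return M

M = build_attention_mask(L=8, n_visible=3)
print(M)
```

Adding M inside the softmax zeroes out the attention weights at the −∞ positions, so each row i attends to its own randomly sampled context, which is what diversifies the reconstruction contexts across tokens.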

Experimental Studies
We evaluate the retrieval performance of the sentence embeddings generated by RetroMAE's pre-trained encoder, where two major issues are explored. 1) RetroMAE's impact on zero-shot and supervised dense retrieval, in comparison with generic pre-trained models and retrieval-oriented pre-trained models. 2) The impact of the four technical factors in RetroMAE: the enhanced decoding, the decoder's size, the decoder's masking ratio, and the encoder's masking ratio.

Experiment Settings
The following datasets are utilized for the pre-training and evaluation of RetroMAE.
• Pre-training. We reuse the same pre-training corpora as those utilized by BERT (Devlin et al., 2019): English Wikipedia and BookCorpus. These datasets are also frequently used by previous works on retrieval-oriented pre-training, such as SEED (Lu et al., 2021) and Condenser (Gao and Callan, 2021). Following coCondenser (Gao and Callan, 2022), we also use the MS MARCO corpus to analyze the effect of in-domain pre-training for dense retrieval (which we find critical to the performance on MS MARCO but unnecessary for the performance on other datasets).
• Evaluation. We use two datasets to evaluate the retrieval performance after supervision. 1) MS MARCO (Nguyen et al., 2016), which contains queries from Bing Search. We use its passage retrieval task, which contains 502,939 training queries and 6,980 evaluation queries (Dev).
The answer needs to be retrieved from 8.8 million candidate passages. 2) Natural Questions (Kwiatkowski et al., 2019), which consists of queries from Google. There are 79,168 training queries, 8,757 dev queries, and 3,610 testing queries. The answer is retrieved from 21,015,324 Wikipedia passages. We evaluate the zero-shot retrieval performance on the BEIR benchmark (Thakur et al., 2021), which fine-tunes the pre-trained model with MS MARCO queries and evaluates the zero-shot transferability on the other 18 datasets. The evaluation data covers dense retrieval tasks across different domains, such as question answering, fact checking, bio-medical retrieval, news retrieval, and social media retrieval.
We consider three types of baseline methods in our experimental studies.
• Generic models. The following generic pre-trained language models are used in our experiments. 1) BERT (Devlin et al., 2019), which is the most popular pre-trained language model in practice. It is also frequently fine-tuned as the encoding network for dense retrievers (Karpukhin et al., 2020; Xiong et al., 2020). 2) RoBERTa (Liu et al., 2019), which is an enhanced replication of BERT with substantially enlarged training data and improved training settings. 3) DeBERTa (He et al., 2020b), which further improves BERT and RoBERTa with a disentangled attention mechanism and an enhanced mask decoder; it is one of the strongest pre-trained models on NLU benchmarks, such as GLUE and SuperGLUE.
• Retrieval-oriented models. We consider two types of retrieval-oriented pre-trained models in our experiments. One type is based on self-contrastive learning, where the following methods are included. 1) SimCSE (Gao et al., 2021), in which the language model is learned to discriminate different views of the anchor sentence augmented by drop-out. 2) LaPraDoR (Xu et al., 2022), an enhancement of conventional ICT pre-training (Guu et al., 2020; Chang et al., 2020); it proposes to train the query and document encoders iteratively so that the scale of negative samples can be expanded. 3) DiffCSE (Chuang et al., 2022), which enhances SimCSE with the joint utilization of self-contrastive learning and conditional difference prediction. The other type is based on auto-encoding, in which the following methods are included. 4) Condenser (Gao and Callan, 2021), where the sentence embedding is joined with the intermediate hidden states from the encoder for MLM. 5) SEED (Lu et al., 2021), where the sentence embedding is used to recover the original input via auto-regression.
• Implementation details. RetroMAE utilizes bi-directional transformers as its encoder, with 12 layers, 768 hidden dimensions, and a 30,522-token vocabulary (same as BERT base). The decoder is a one-layer transformer. The default masking ratios are 0.3 for the encoder and 0.5 for the decoder. The model is trained for 8 epochs with the AdamW optimizer, a batch size of 32 (per device), and a learning rate of 1e-4. The training is conducted on a machine with 8× Nvidia A100 (40GB) GPUs. The models are implemented with PyTorch 1.8 and HuggingFace transformers 4.16. We adopt the official script from BEIR to prepare the models for the zero-shot evaluation. For supervised evaluations, we use DPR (Karpukhin et al., 2020) and ANCE (Xiong et al., 2020) to fine-tune the pre-trained models. We also evaluate the models' performance on MS MARCO with standard knowledge distillation. Particularly, we train one BERT-base-scale cross-encoder over hard negatives returned by the ANCE-finetuned bi-encoder; then, we further fine-tune the bi-encoder by minimizing the KL-divergence with the cross-encoder.
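The distillation step described above (minimizing the KL-divergence between the bi-encoder's and the cross-encoder's score distributions over each query's candidates) can be sketched as follows. The score tensors are random placeholders standing in for real model outputs, and the temperature parameter is our own addition for generality:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

def distill_loss(student_scores, teacher_scores, temperature=1.0):
    # student_scores, teacher_scores: [B, N] relevance scores over N candidates
    log_p = F.log_softmax(student_scores / temperature, dim=-1)
    q = F.softmax(teacher_scores / temperature, dim=-1)
    return F.kl_div(log_p, q, reduction="batchmean")

student = torch.randn(4, 8, requires_grad=True)   # bi-encoder scores (placeholder)
teacher = torch.randn(4, 8)                       # cross-encoder scores (placeholder)
loss = distill_loss(student, teacher)
loss.backward()                                   # gradients flow to the student only
print(loss.item() >= 0)                           # True: KL-divergence is non-negative
```

In the actual pipeline, the teacher scores come from the fixed cross-encoder over hard negatives, so only the bi-encoder is updated by this loss.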

Main Results
We analyze the zero-shot performances in Table 1, where RetroMAE achieves remarkable advantages: it produces the best empirical results on most of the datasets, and it surpasses the strongest baseline by +4.5% in total average. This leap forward in zero-shot performance is much larger than any of the improvements made by the baseline models.
The improvement is all the more remarkable given that it does not come from increasing the model scale or enriching the pre-training data, but purely from upgrading the pre-training algorithm. We further demonstrate the supervised evaluations in Tables 2 and 3, where the pre-trained models are fine-tuned with DPR and ANCE. The baselines are partitioned into two groups according to whether they are generic pre-trained models or retrieval-oriented ones. It can be observed that RetroMAE maintains notable advantages over the baselines. With DPR fine-tuning, it outperforms the strongest baselines by +1.96% (MRR@10) and +1.48% (Recall@10) on the two datasets; with ANCE fine-tuning, the corresponding advantages become +1.42% (MRR@10) and +1.41% (Recall@10).
As for RetroMAE pre-trained on the MS MARCO corpus: with ANCE fine-tuning (Table 4), our method outperforms its peer approach coCondenser (Gao and Callan, 2022), which has the same model size and pre-training data, by +1.1% on MRR@10. Under the knowledge distillation scenario (Table 5), RetroMAE surpasses a series of strong baselines proposed in the recent period, including models with highly sophisticated distillation methods, such as AR2 (Zhang et al., 2021). To summarize, the above results verify RetroMAE's effectiveness from two aspects. 1) It substantially improves the pre-trained model's transferability, which helps to produce superior zero-shot performances on out-of-domain datasets.
2) It provides a strong initialization for dense retrievers; after being fine-tuned with in-domain data, it gives rise to high-quality supervised retrieval performance in the corresponding scenario. Besides the primary results, we may also make the following interesting observations. Firstly, advanced pre-training methods from generic areas do not necessarily contribute to dense retrieval performance. Particularly, both RoBERTa and DeBERTa are major improvements over BERT; however, neither of them is able to achieve better performances than BERT as they did on general NLU benchmarks. This observation further supports the motivation to develop retrieval-oriented pre-trained models.
Secondly, the auto-encoding style pre-training (adopted by SEED, Condenser, and RetroMAE) is empirically proved to be much more favorable for dense retrieval, given its dominance over the generic and self-contrastive learning based pre-trained models in both zero-shot evaluation and supervised evaluation. Thirdly, the self-contrastive learning based pre-trained models bring very little improvement over the generic ones when fine-tuning is made available, as their performances are close to each other in both supervised and zero-shot evaluations. In fact, there is no surprise about this observation considering the similar results reported by recent studies on dense retrieval (Gao and Callan, 2021) (BERT against ICT) and image processing (El-Nouby et al., 2021) (BEiT (Bao et al., 2021) against MoCo/DINO). That is, pre-trained models from self-contrastive learning tend to have comparatively limited potential for fine-tuning. Therefore, we argue that the retrieval-oriented pre-training algorithm should be properly designed according to the condition of fine-tuning in specific scenarios.

Ablation Studies
We ablate RetroMAE with its supervised performance (DPR fine-tuning) in Table 6, where the following factors are analyzed: 1) the decoding method, 2) the decoder's size, 3) the decoder's masking ratio, and 4) the encoder's masking ratio. We may make the following observations from our results.
First of all, we analyze the impact of the decoding method, i.e., whether the enhanced decoding is used or not. It can be observed that the enhanced decoding (w.) notably outperforms the basic decoding (w.o.). Such an empirical advantage can be attributed to the improved data efficiency of the enhanced decoding. Under the default masking ratio, the basic decoding merely samples 50% of the input tokens for reconstruction, all of which are predicted based on the same context. With the enhanced decoding (Section 3.3), all of the input tokens can be utilized for reconstruction, and each of the tokens is reconstructed based on a unique context sampled as in Eq. 7. As such, the enhanced decoding may obtain more sufficient and diversified training signals from the input data.
Secondly, we analyze the impact of the decoder's size, with the number of transformer layers increased from 1 to 3. Knowing that the enhanced decoding is only applicable to single-layer transformers, it is disabled for these experiments. Despite higher computation costs, the enlarged decoders do not bring any empirical gains. Besides, considering that the enhanced decoding has to be disabled for larger decoders (#layers>1), which severely harms the retrieval performance, the one-layer decoder is proved to be the best option.
Thirdly, we make an analysis of different masking ratios for the decoder, whose value is increased from 0.15 to 0.9. We introduce two sets of experiments: one with the enhanced decoding (w.), and the other without the enhanced decoding (w.o.). For both experiments, we observe substantial improvements in retrieval quality resulting from the aggressive masking ratios. With the enhanced decoding (w.), the optimal performance is achieved at 0.5; without the enhanced decoding (w.o.), the optimal performance is reached at 0.7. This minor difference is probably because, unlike "w." where all input tokens can be reconstructed, "w.o." only reconstructs the masked tokens; thus, it may trade a larger ratio for an increase in training signals.
Lastly, we study the encoder's masking ratio. Interestingly, a slightly increased masking ratio of 0.3 also improves the empirical performance compared with the commonly used value of 0.15. For both the encoder and the decoder, the increased reconstruction difficulty can be the common reason why the increased masking ratios benefit the retrieval quality. However, different from the decoder, a too aggressive ratio for the encoder, e.g., 0.9, will severely harm the retrieval performance. This is because a too large masking ratio will prevent the generation of a high-quality sentence embedding, considering that most of the useful information about the input sentence will be discarded.
The ablation studies are concluded as follows: 1) RetroMAE's performance can be notably improved by the enhanced decoding; 2) the one-layer transformer is the best choice for the decoder; 3) the retrieval quality can be improved by an aggressive masking ratio for the decoder and a moderately increased masking ratio for the encoder.

Conclusion
We propose RetroMAE, a novel masked auto-encoding framework to pre-train retrieval-oriented language models: the input sentence is randomly masked for the encoder and the decoder, and the sentence embedding is joined with the decoder's masked input to reconstruct the original input. We introduce an asymmetric model structure (a full-scale encoder and a single-layer decoder) and asymmetric masking ratios (a moderate ratio for the encoder and an aggressive one for the decoder), which make the reconstruction sufficiently demanding. We also introduce the enhanced decoding, which fully utilizes the pre-training data. Our experiments on BEIR, MS MARCO, and Natural Questions validate RetroMAE's effectiveness, as significant improvements in both zero-shot and supervised evaluations can be achieved over the existing methods.

Limitations
So far, our empirical studies have been performed with BERT-base-scale transformers; meanwhile, merely a moderate amount of pre-training data is used (mainly due to limitations on computation resources). Despite the demonstrated effectiveness, it is still necessary to explore the impact of enlarged networks and increased pre-training data, as both factors were found to be important in recent works (Ni et al., 2021).

Figure 2 :
Figure 2: RetroMAE pre-training workflow. (A) Encoding: the input is moderately masked and encoded as the sentence embedding (the green rectangle). (B) Decoding: the input is aggressively masked and joined with the sentence embedding to reconstruct the masked tokens (the shadowed tokens). (C) Enhanced decoding: all input tokens are reconstructed based on the sentence embedding and the visible context in each row (defined in Eq. 7); the main diagonal positions are filled with −∞ (grey), and the positions of the visible context are filled with 0 (blue).


Table 2 :
Supervised evaluation results based on DPR fine-tuning.

Table 3 :
Supervised evaluation results based on ANCE fine-tuning.

Table 4 :
RetroMAE vs. coCondenser on MS MARCO. Both models are fine-tuned by ANCE.

Table 5 :
RetroMAE vs. the recent dense retrievers; all models are fine-tuned by knowledge distillation.