Should You Mask 15% in Masked Language Modeling?

Masked language models (MLMs) conventionally mask 15% of tokens due to the belief that more masking would leave insufficient context to learn good representations; this masking rate has been widely used, regardless of model sizes or masking strategies. In this work, we revisit this important choice of MLM pre-training. We first establish that 15% is not universally optimal, and larger models should adopt a higher masking rate. Specifically, we find that masking 40% outperforms 15% for BERT-large size models on GLUE and SQuAD. Interestingly, an extremely high masking rate of 80% can still preserve 95% fine-tuning performance and most of the accuracy in linguistic probing, challenging the conventional wisdom about the role of the masking rate. We then examine the interplay between masking rates and masking strategies and find that uniform masking requires a higher masking rate compared to sophisticated masking strategies such as span or PMI masking. Finally, we argue that increasing the masking rate has two distinct effects: it leads to more corruption, which makes the prediction task more difficult; it also enables more predictions, which benefits optimization. Using this framework, we revisit BERT’s 80-10-10 corruption strategy. Together, our results contribute to a better understanding of MLM pre-training.


Introduction
Pre-trained language models have transformed the landscape of natural language processing (Devlin et al., 2019; Raffel et al., 2020; Brown et al., 2020, inter alia). They are trained on vast quantities of text data and acquire rich and versatile language representations. Compared to autoregressive models, which always predict the next token in a sequence, masked language models (MLMs) predict masked tokens given the bidirectional context.* BERT chooses a 15% masking rate, based on the reasoning that models cannot learn good representations when too much text is masked, and that training is inefficient when too little is masked. Surprisingly, this important choice has been underexplored, since 15% masking is used ubiquitously by BERT's successors (Joshi et al., 2020; Lan et al., 2020; He et al., 2021; Levine et al., 2021; Izsak et al., 2021), regardless of model sizes, masking strategies, and optimization recipes.2 In this work, we aim to understand the impact of masking rates. We hypothesize that the optimal masking rate is not universally 15%, but should depend on other factors. First, we consider the impact of model sizes and establish that indeed larger models should adopt higher masking rates (§3). Specifically, we find that under an efficient pre-training recipe (Izsak et al., 2021), 40% outperforms 15% for BERT-large size models when fine-tuning on GLUE and SQuAD.
* The first two authors contributed equally. 1 Our code and pre-trained models are publicly available at https://github.com/princeton-nlp/DinkyTrain.
Interestingly, we observe that large models can still learn good representations even for very high masking rates: if we mask as much as 80% of input tokens and pre-trained models have a perplexity of more than 1000, the learned representations can still preserve more than 95% of fine-tuning performance on downstream tasks, compared to the default 15% masking (Table 1), and show considerable performance in linguistic probing ( §4). This challenges common intuitions about masking rates and what models learn in MLM pre-training.
We then focus on the strategy of which tokens to mask as an additional factor in the optimal masking rate of MLMs (§5). We find that different masking rates should be used with different masking strategies, and that the default uniform masking benefits more from higher masking rates than more sophisticated masking strategies such as span masking (Joshi et al., 2020; Raffel et al., 2020) and PMI masking (Levine et al., 2021); when all methods are considered at their optimal masking rate, uniform masking achieves competitive performance.

Table 1: Masked examples, validation perplexity (calculated in the same way as Devlin et al., 2019) of different masking rates on the one billion word benchmark (Chelba et al., 2013), and downstream task development performance (SQuAD: F1; accuracy for others). Random initialization: 61.5 (↓22.7), 60.9 (↓30.0), 10.8 (↓77.2). All the pre-trained models have a BERT-large architecture and are trained with the efficient pre-training recipe (§2.2). Full results are provided in Table 7.

Finally, we propose to dissect the masking rate into two factors (§6): the corruption rate (how much of the context is corrupted, i.e., masked) and the prediction rate (how many of the tokens the model predicts). In MLMs, both are set to the masking rate. However, these two factors have opposing effects: higher prediction rates generate more training signals and benefit the optimization, while higher corruption rates make the prediction task more challenging by providing less context. To study the two factors independently, we design ablation experiments to disentangle corruption and prediction rates. Thus, we can verify that models benefit from higher prediction rates and suffer from more corruption. Using this framework, we also discuss BERT's practice of predicting on original or random tokens (the 80-10-10 rule), and we find that models usually perform worse under this corruption strategy (§7).
Together, our results demonstrate the overlooked impact of the masking rate in MLM pre-training and our analysis disentangles its opposing effects of corruption and prediction. We conclude by discussing the relation to work in other models and modalities ( §8) and by highlighting several new avenues for efficient MLM in the future ( §9).

Masked Language Modeling
We focus on the widely popular masked language modeling objective (Devlin et al., 2019), a form of denoising autoencoding, where a model is trained to restore a corrupted input sequence. Specifically, masked language models make independent predictions on the subset of masked tokens: one masks a percentage m (the masking rate, typically 15%) of tokens from the original sentence x and predicts the masked token set M given the corrupted context x̃ (the masked version of x):

L(x, x̃) = −(1/|M|) Σ_{i∈M} log p(x_i | x̃).    (1)

Masked positions are typically sampled uniformly at random, but more sophisticated strategies exist: Joshi et al. (2020) and Raffel et al. (2020) mask contiguous spans, and Levine et al. (2021) sample words and spans with high pointwise mutual information (PMI). These advanced sampling strategies are adopted to prevent models from exploiting shallow local cues under uniform masking.
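The objective in Eq. (1) can be sketched in a few lines of Python. The snippet below is only an illustration of uniform masking on a toy word sequence (real implementations operate on subword ids and whole batches); the function name and the use of a string "[MASK]" token are ours, not from the paper's codebase.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, rate=0.15, seed=0):
    """Uniformly sample `rate` of the positions and replace them with [MASK].

    Returns the corrupted sequence x_tilde and the masked position set M;
    an MLM computes the cross-entropy loss only over positions in M.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(rate * len(tokens)))
    masked_positions = sorted(rng.sample(range(len(tokens)), n_mask))
    corrupted = list(tokens)
    for i in masked_positions:
        corrupted[i] = MASK
    return corrupted, masked_positions

tokens = "the quick brown fox jumps over the lazy dog".split()
x_tilde, M = mask_tokens(tokens, rate=0.4)
print(len(M), len(tokens))  # 4 of the 9 positions are masked at a 40% rate
```

Only the positions in M contribute to the loss, which is the source of the efficiency concern discussed below: at a 15% rate, 85% of each sequence yields no training signal.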
MLMs can encode bidirectional context while autoregressive language models can only "look at the past", and thus MLMs have been shown to be more effective at learning contextualized representations for downstream use (Devlin et al., 2019). On the other hand, MLMs incur a significant computational cost because they only learn from 15% of the tokens per sequence, whereas autoregressive LMs predict every token in a sequence. In this work, we focus on MLMs and study the effects of different masking rates on downstream performance.

Experiment Setup
We build most of our experiments on a recent efficient pre-training recipe, the 24hBERT recipe from Izsak et al. (2021), with which models can match BERT-base performance 6× faster (tested on 8×Titan-V). This efficient pre-training recipe allows us to run a large number of experiments in an academic setup. Izsak et al. (2021) make pre-training faster by using a BERT-large architecture, a larger learning rate (2e-3), a larger batch size (4,096), a shorter sequence length (128),4 and fewer training steps. We deviate from 24hBERT with a few simple changes: 1. We adopt RoBERTa's BPE tokenizer (Sennrich et al., 2016) rather than BERT's tokenizer, because it performs better in our preliminary experiments (see Appendix C).
2. Instead of adopting BERT's 80-10-10 token corruption strategy, we simply replace all the masked tokens with [MASK] by default. We find that the 80-10-10 corruption strategy does not perform better for most downstream tasks, as discussed in §7.
Following 24hBERT, we also do not perform next sentence prediction during pre-training, which was shown to hurt performance. We show hyperparameters for the efficient pre-training recipe and a comparison to other recipes (Devlin et al., 2019) in Appendix A. For models of different sizes, masking rates, and masking strategies, we follow the same recipe, as our preliminary experiments show that it still performs the best.
We use fine-tuning performance on downstream tasks to measure the quality of the MLMs, since fine-tuning is the predominant way to use pre-trained MLMs. As evident from Table 1, pre-training metrics like perplexity do not correlate well with downstream performance. We describe our downstream fine-tuning setting and hyperparameters in Appendix A.
4 Izsak et al. (2021) only evaluate on GLUE tasks instead of SQuAD because of the short sequence length. We further train the model with 512 tokens for SQuAD in Table 1. 5 For each task and each model size, normalized performance is calculated as (x − x_15%)/σ, where x_15% is the performance of the 15% masking rate and σ is the standard deviation across all masking rates. Relative F1 is the F1 score minus the 15% model's F1.

Figure 2: Impact of masking rates on different model sizes (large > base > medium).5 We see that larger models favor larger optimal masking rates.

Larger Models Can Benefit From Higher Masking Rates

Devlin et al. (2019) choose the masking rate of 15% based on the belief that masking more leaves insufficient context to decode the masked tokens, while masking less makes training inefficient, and this masking rate has been treated as a constant across different model sizes. In this section, we train models of size large (354M parameters), base (124M parameters), and medium (51M parameters) with masking rates varying from 15% to 50%. The model configurations are listed in Appendix E.
Optimal masking rates depend on model sizes. The impact of the masking rate across model sizes is summarized in Figure 2, with detailed results given in Appendix E. We see that larger models possess higher optimal masking rates: on average, under the efficient pre-training recipe, large models take 40% as the optimal masking rate, base models take 20%, and medium models take 15%. This shows that larger MLMs favor higher masking rates. We hypothesize that the additional capacity allows a large MLM to "handle" the more challenging task of predicting many tokens given less context.

Figure 3: Impact of masking rates on large models with the efficient pre-training recipe. We see that on most tasks, higher masking rates outperform 15%. 40% is the optimal masking rate overall.
40% masking outperforms 15% for the large model. First, we plot how the downstream task performance changes with different training steps in Figure 1. For most tasks, we see that 40% masking consistently outperforms 15% over the course of training; on QNLI and QQP, the 40% model achieves the same performance as the 15% baseline with only half the training time. We also report the test results in Table 2, where again masking 40% outperforms 15% with our efficient pre-training recipe. However, the optimal masking rate can be task-dependent, as SST-2 performs better with 15% masking at the end of training. We acknowledge that the optimal masking rate may also depend on the training recipe.
Since the efficient pre-training recipe uses a relatively small number of training steps, we explore training for over 4× more steps, as well as training with the more expensive RoBERTa recipe, and we find in Appendix D that a 40% masking rate still performs well, achieving performance similar to the 15% masking rate. The experiments in the remaining sections of this paper are all based on large models.

MLMs in High-Masking Regimes
The success of masking 40% over 15% motivates us to explore what happens at even larger masking rates. Therefore, we pre-train additional large models with masking rates of up to 80%. We consider the question of what representations an MLM can learn with input as limited as the last masked sentence in Table 1, which is hard to decipher even for a human. He et al. (2022) recently pioneered such high masking rates in the vision domain; they reason that images are natural signals with heavy redundancy, whereas language is highly semantic and information-dense. To our knowledge, such high masking rates have not been examined in masked language modeling before.
MLMs learn with extreme masking. We first confirm in Table 1 that the validation perplexity under extreme masking is extremely high (>1,000), which suggests that the MLM is unable to reconstruct corrupted inputs with independent token predictions. Our setting therefore differs from vision, where good reconstructions are possible at high masking rates (He et al., 2022). Nevertheless, we find that MLMs can surprisingly still learn good representations: Figure 3 shows the performance of the models fine-tuned on a range of tasks, and we observe that pre-training with an 80% masking rate can retain 95% of fine-tuning performance, which is substantially better than fine-tuning from a random initialization, as reported in Appendix B.
We hypothesize that MLMs at such high masking rates may be understood as a powerful skip-gram model: e.g., masking 80% of a 128-token sequence still learns skip-grams over up to 26 context tokens. Furthermore, compared to the simple word2vec model, our Transformer models have access to positional information for each context token and prediction.
Analysis of linguistic probing. Besides downstream performance, we study the models' linguistic abilities by evaluating them on the BLiMP benchmark (Warstadt et al., 2020). We employ zero-shot pseudo log-likelihood scoring (Salazar et al., 2020), where a score is computed by masking each token individually, which is a greater distributional shift for models trained with higher masking rates. We show our results in Figure 4. We find that most linguistic phenomena are acquired evenly across masking rates from 15% to 60%, but they are still captured well by an MLM trained with 80% masking, which on average preserves 90% of the probing accuracy of the 15% model baseline.

[Table 2 caption, partially recovered: … Phang et al. (2018). For SQuAD v1.1, we take the same setting as Table 1.]

Masking Strategies and the Optimal Masking Rate

Previous work proposes masking strategies beyond uniform sampling, such as span masking (Joshi et al., 2020; Raffel et al., 2020) and PMI masking (Levine et al., 2021). The argument for adopting advanced masking is that uniform masking enables models to exploit shallow local cues (Levine et al., 2021). An example is given by "[MASK] Kong": the model can easily predict "Hong" without using more context. However, all the previous studies used a constant 15% masking rate regardless of masking strategies, which raises the question of whether the conclusions still hold with a higher masking rate. We experiment with multiple masking strategies as an additional factor for the optimal masking rate in large models. Figure 5 shows the results of uniform masking, T5-style span masking (Raffel et al., 2020), and PMI masking (Levine et al., 2021) under masking rates from 15% to 40%. We see that (1) for all masking strategies, the optimal masking rates are higher than 15%; (2) the optimal masking rates for span masking and PMI masking are lower than that of uniform masking; (3) when all strategies adopt their optimal masking rates, uniform masking achieves similar or even better results compared to the advanced strategies. We also remark that, when masking at 15%, simply increasing the masking rate can be a more effective way to increase performance on SQuAD than switching from uniform masking to a more advanced strategy. More fine-grained results with these masking strategies are included in Appendix E.

Figure 6: Higher masking rates increase the probability that an entire PMI span is masked (left) under different masking strategies. Uniform masking with a 40% rate masks as many PMI spans as regular PMI masking at 15%. Masks form longer spans for higher masking rates in uniform sampling, while the average length is fixed at 3 for T5-style span masking (which cannot be enforced for very high masking rates).
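To make the span-masking alternative concrete, the sketch below samples contiguous spans with a mean length of 3 until a masking budget is spent. This is a simplification for illustration, not the exact T5 or SpanBERT sampler (those control span counts and overlaps more carefully and operate on subword ids); all names are ours.

```python
import random

def sample_spans(seq_len, mask_rate=0.15, mean_span_len=3, seed=0):
    """Sample positions to mask as contiguous spans.

    Spans of roughly `mean_span_len` tokens are drawn at random start
    positions until mask_rate * seq_len tokens have been selected.
    Simplified sketch: spans may overlap, and the last span is truncated
    to respect the budget exactly.
    """
    rng = random.Random(seed)
    budget = int(mask_rate * seq_len)
    masked = set()
    while len(masked) < budget:
        length = max(1, round(rng.gauss(mean_span_len, 1)))  # approximate span length
        start = rng.randrange(seq_len)
        for i in range(start, min(start + length, seq_len)):
            if len(masked) < budget:
                masked.add(i)
    return sorted(masked)

positions = sample_spans(128, mask_rate=0.15)
print(len(positions))  # int(0.15 * 128) = 19 positions, grouped into spans
```

Compared to the uniform sampler, the only change is how positions are drawn; the loss is still computed over the selected set, so masking rate and masking strategy vary independently, as in our experiments.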

Table 3: Corruption vs. prediction, reporting MNLI-m/mm, QNLI, QQP, RTE, SST-2, MRPC, CoLA, STS-B, and SQuAD. We take 40% masking as the baseline model (standard deviation reported), disentangle m_corr and m_pred, and manipulate each independently. The trend is clear: more prediction helps and more corruption hurts.

Interestingly, higher masking rates naturally increase the chance of masking neighbouring co-occurring tokens, similar to the effect of the advanced masking strategies. We consider the masked tokens over one epoch of training, and count the number of PMI n-grams (e.g., "Hong Kong") that were completely covered by different masking strategies. Figure 6 shows that raising the masking rate from 15% to 40% results in an 8-fold increase in the chance of masking a PMI n-gram under uniform masking, comparable to PMI masking at a 15% masking rate. Similarly, higher masking rates also make the masked tokens form longer spans. However, at a given masking rate, uniform masking remains an easier task than span masking or PMI masking, so it appears reasonable for uniform masking to admit a higher optimal masking rate for a given model capacity.
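A back-of-envelope calculation illustrates the effect. Under independent uniform masking, an n-token span is fully masked with probability roughly m^n, so raising m from 15% to 40% multiplies the chance for a bigram by about 7; the paper's 8-fold figure is measured empirically over PMI n-grams of mixed lengths, so the exact factor differs. This is our simplification, since real samplers mask a fixed fraction per sequence.

```python
def p_span_fully_masked(mask_rate, span_len):
    """Probability that an n-token span is entirely masked, assuming each
    position is masked independently with probability mask_rate (a
    simplification of fixed-fraction sampling)."""
    return mask_rate ** span_len

low = p_span_fully_masked(0.15, 2)   # bigram at a 15% masking rate
high = p_span_fully_masked(0.40, 2)  # bigram at a 40% masking rate
print(round(high / low, 1))  # ~7.1x more fully-masked bigrams
```

For longer spans the ratio grows further ((0.40/0.15)^n), consistent with uniform masking at 40% covering as many PMI spans as dedicated PMI masking at 15%.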

Understanding Masking As Corruption and Prediction
In this section, we analyze how masking rates affect the pre-training process of MLMs through two distinct perspectives: task difficulty and optimization. We identify that the masking rate m determines two important aspects of the pre-training problem: the corruption rate m_corr and the prediction rate m_pred. m_corr is the proportion of tokens that are erased from the input sequence, typically by substituting them with [MASK]. m_pred is the proportion of tokens that the model predicts, each of which contributes to the cross-entropy loss. In Eq.
(1), m_corr controls how much content is corrupted in x̃ compared to the original sentence x, and m_pred controls the number of predictions in the set M. Usually, both the corruption and prediction rates are tied to the masking rate, i.e., m_corr = m_pred = m, but they may impact representation quality differently.

m_corr controls task difficulty. Masked language modeling learns a conditional probability distribution over the vocabulary given the corrupted context, p(· | x̃), during pre-training. If a larger proportion of the input is corrupted, each token prediction is conditioned on fewer context tokens, making predictions harder and more uncertain.

m_pred affects optimization. Predicting more tokens means the model learns from more training signals, so higher prediction rates boost model performance. From another perspective, the prediction at each masked token yields a loss gradient, and these gradients are averaged to update the model weights. Averaging across more predictions has an effect similar to increasing the batch size, which has been shown to be beneficial for pre-training.
Experiments. In masked language modeling, both m_corr and m_pred are determined by the overall masking rate. To study how m_corr and m_pred affect downstream performance independently, we design a simple ablation experiment to disentangle them: 1. If m_pred < m_corr, we mask m_corr of the tokens and only make predictions on m_pred of them. This can be implemented without additional cost. For example, with m_corr = 40% and m_pred = 20%, we mask 40% of the tokens but only predict on 20%.
2. If m_pred > m_corr, we duplicate each sequence m_pred/m_corr times and mask disjoint sets of m_corr of the tokens in the different copies. For example, with m_corr = 20% and m_pred = 40%, for each sentence we apply 20% masking twice on different tokens and predict on all the masked tokens; this yields a 20% corruption but a 40% prediction rate per sequence. Note that this ablation takes m_pred/m_corr times longer because we do multiple passes over every sequence, and is not efficient in practice.

Table 3 shows the ablation results with disentangled m_corr and m_pred. We see that (1) fixing m_corr at 40%, lowering m_pred from 40% to 20% results in a consistent drop on downstream tasks, showing that more predictions lead to better performance; (2) fixing m_pred at 40%, lowering m_corr leads to consistently better performance, suggesting that lower corruption rates make the pre-training task easier to learn and are better for pre-training. However, the performance gain from lowering m_corr from 10% to 5% is much smaller than that from lowering m_corr from 40% to 20%, suggesting a diminishing marginal return of reducing the corruption rate.

Table 4: Impact of substituting masks with random/same tokens. "+5% same": do an extra 5% same-token predictions. "w/ 5% rand": use [MASK] for 35% of the masked tokens and random tokens for 5%. "w/ 80-10-10": for the 40% masked tokens, 10% are same-token predictions and 10% are random-token corruptions.
(3) Comparing m_corr = 20%, m_pred = 20% with m_corr = 40%, m_pred = 40%, we see that the gain brought by more predictions outweighs the drawback of more corruption, leading to better performance. The ablation shows that when we tune the masking rate, we are tuning the corruption rate and the prediction rate together, and the two have antagonistic effects. The final outcome is decided by which rate weighs more: the model benefits from higher masking rates if the hindrance brought by high corruption is surpassed by the advantage of predicting more. Many factors may affect the balance between the two, for example model sizes and masking strategies, as we discussed in §3 and §5.
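The two branches of the ablation can be sketched as follows. This is an illustrative reimplementation of the position bookkeeping only (function and variable names are ours; real code operates on batched tensors and computes the loss at the returned positions).

```python
import random

def disentangled_masks(n_tokens, m_corr, m_pred, seed=0):
    """Return (masked_positions, predicted_positions) pairs for one sequence,
    decoupling the corruption rate from the prediction rate."""
    rng = random.Random(seed)
    if m_pred <= m_corr:
        # Mask m_corr of the positions but compute the loss on only m_pred of them.
        masked = rng.sample(range(n_tokens), round(m_corr * n_tokens))
        predicted = sorted(rng.sample(masked, round(m_pred * n_tokens)))
        return [(sorted(masked), predicted)]
    # Otherwise duplicate the sequence m_pred / m_corr times with disjoint
    # mask sets, predicting on every masked position of every copy.
    copies = round(m_pred / m_corr)
    per_copy = round(m_corr * n_tokens)
    order = rng.sample(range(n_tokens), copies * per_copy)  # disjoint positions
    chunks = [sorted(order[i * per_copy:(i + 1) * per_copy]) for i in range(copies)]
    return [(c, c) for c in chunks]

# m_corr = 20%, m_pred = 40% on a 10-token sequence: two copies, 2 masks each
pairs = disentangled_masks(10, m_corr=0.2, m_pred=0.4)
print(len(pairs))  # 2
```

The first branch is free; the second multiplies the cost by m_pred/m_corr, which is why the ablation is analytic rather than a practical recipe.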

Revisiting BERT's Corruption Strategy
Devlin et al. (2019) suggest that it is beneficial to replace 10% of [MASK] tokens with the original token (same-token predictions) and 10% with random tokens (random-token corruptions). Since then, this 80-10-10 rule has been widely adopted in almost all MLM pre-training work (Joshi et al., 2020; He et al., 2021). The motivation is that masked tokens create a mismatch between pre-training and downstream fine-tuning, and using original or random tokens as an alternative to [MASK] may mitigate this gap. With our corruption and prediction framework, we revisit the two kinds of mask replacements in the 80-10-10 rule and empirically verify whether they are beneficial to downstream performance.
Same-token predictions. The loss from same-token predictions is very small, and they should be regarded as an auxiliary regularization. Thus, same-token predictions should count towards neither the corruption rate nor the prediction rate: they do not corrupt the input and contribute little to learning.

Random-token corruptions. Replacing with random tokens contributes to both the corruption and prediction rates, as the input is corrupted and the prediction task is non-trivial. In fact, we find that the loss is slightly higher on random tokens than on [MASK], since (1) the model needs to decide for every token whether the information at the input is from a corruption or not, and (2) predictions need to be invariant to large changes in the input embeddings.
Ablation experiments. We adopt the m = 40% model using only [MASK] replacements as the baseline, on top of which we add three variants: 1. "+5% same": we mask 40% of tokens but predict on 45% of tokens. Adding same-token predictions changes neither m_corr nor m_pred.
2. "w/ 5% rand": we keep 40% masked tokens but replace 5% of them with random tokens instead of [MASK] (m_corr = m_pred = 40%).

3. "80-10-10": the original BERT recipe. Due to same-token predictions, m_corr = m_pred = 36%.

As shown in Table 4, we observe that same-token predictions and random-token corruptions deteriorate performance on most downstream tasks. The 80-10-10 rule performs worse than simply using all [MASK], with the exception of SST-2, where same-token predictions are beneficial. Overall, our results suggest that in the fine-tuning paradigm, the model can adapt to full, uncorrupted sentences, regardless of the corruption strategy used in pre-training. Therefore, we suggest using only [MASK] for MLM pre-training. We also present an analysis based on information flow (Voita et al., 2019) in Appendix G.
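For reference, BERT's 80-10-10 replacement can be sketched as below. The toy string vocabulary and function name are illustrative; production implementations draw random token ids from the subword vocabulary.

```python
import random

MASK = "[MASK]"
TOY_VOCAB = ["a", "b", "c", "d"]  # placeholder vocabulary for illustration

def corrupt_80_10_10(tokens, positions, seed=0):
    """Apply BERT's 80-10-10 rule to the selected positions: 80% become
    [MASK], 10% become a random token, 10% keep the original token. The
    model is trained to predict the original token at every selected
    position regardless of which replacement was applied."""
    rng = random.Random(seed)
    out = list(tokens)
    for i in positions:
        r = rng.random()
        if r < 0.8:
            out[i] = MASK
        elif r < 0.9:
            out[i] = rng.choice(TOY_VOCAB)
        # else: leave the token unchanged (a same-token prediction)
    return out

corrupted = corrupt_80_10_10(["w%d" % i for i in range(1000)], range(1000))
frac_mask = corrupted.count(MASK) / 1000
print(round(frac_mask, 2))  # close to 0.80
```

Our default recipe corresponds to dropping the two `elif`/`else` branches and always writing [MASK], which the ablation above finds to be the stronger choice.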

Related Work
Masking rates and masking strategies. A few works study the impact of masking rates; among them, Liao et al. (2020) show that dynamically sampling the masking rate from 0% to 100% for each sequence can improve MLM's downstream performance as well as its ability as a generation model. On the other hand, masking strategies are heavily explored for both pre-training (Joshi et al., 2020; Raffel et al., 2020; Levine et al., 2021) and intermediate pre-training (Ye et al., 2021), without considering the effect of masking rates.
"Unrealistic" MLM training. A recent line of work shows that linguistically implausible MLM objectives can achieve competitive or non-trivial downstream performance, e.g., training with shuffled word order (Sinha et al., 2021), with randomly generated sequences (Krishna et al., 2021), or predicting only the first character of masked tokens (Yamaguchi et al., 2021; Alajrami and Aletras, 2022). These studies echo our finding that even an "unrealistically" high masking rate can still lead to good downstream results.
Masking in other language models. Besides MLMs, there are other pre-training schemes, namely autoregressive language models (Radford et al., 2018; Brown et al., 2020) and sequence-to-sequence (seq2seq) language models (Raffel et al., 2020; Lewis et al., 2020). Similar to MLMs, seq2seq models corrupt text with a masking rate, but they predict with an autoregressive decoder and are fine-tuned in different ways; Song et al. (2019) also point out that masking rates control whether seq2seq models are closer to encoder-only MLMs (masking less) or decoder-only autoregressive LMs (masking more). Thus, we expect masking rate studies in seq2seq models to draw different conclusions from ours (Raffel et al., 2020; Tay et al., 2022b). Besides, Tay et al. (2022a) show that pre-training metrics are not correlated with downstream performance, echoing our finding that perplexity does not correlate with fine-tuning results.
ELECTRA (Clark et al., 2020) uses a smaller MLM to fill in 15% of the blanks and trains a model to distinguish whether a token was generated by the MLM or not. Despite the complicated training procedure, the main motivation of ELECTRA is to improve training efficiency by predicting on 100% of tokens. Interestingly, we find that the corruption rate in ELECTRA becomes very low towards the end of training: the average corruption rate is roughly only 7%, but the replacements are "hard" negatives generated by the smaller MLM. We leave the study of its connection to corruption and prediction rates as future work.
Masking in other modalities. Recently, a number of works extend MLM training to images and videos and demonstrate strong pre-training results (He et al., 2022; Zhou et al., 2022; Feichtenhofer et al., 2022; Tong et al., 2022). They adopt extremely high masking rates (e.g., 75% on images and 90% on videos) compared to their language counterparts, with the argument that images and videos are highly redundant in information. Baevski et al. (2020) propose a similarly masked model for speech and adopt a masking rate of around 50%.

Conclusion & Discussion
In this work, we conduct a comprehensive study of the masking rates of MLMs. We discover that 15% is not universally optimal and that larger models should adopt higher masking rates. We also find that masking strategies should be considered together with masking rates: uniform masking needs a higher masking rate than more sophisticated masking strategies. We gain a better understanding of masking rates by disentangling them into corruption rates and prediction rates, and we analyze the 80-10-10 corruption strategy that is widely used in BERT models. Based on our findings, we discuss the implications of high masking rates and future directions for efficient MLM pre-training: Implications of higher masking rates. A direct takeaway from our findings is that larger models may adopt higher masking rates for better sample efficiency. Figure 1 shows that a large model with 40% masking can achieve comparable results to a 15% baseline on several tasks with half the training time. Larger models also exhibit faster convergence for a given computational budget: Li et al. (2020) suggest it is more efficient to train larger models for fewer steps than to train smaller models for longer. This can be combined with higher masking rates for better sample efficiency.
Separating masked and unmasked tokens. Training efficiency can potentially benefit from encoding masked and unmasked tokens separately, with masked tokens handled by a much lighter-weight module. At a high masking rate, this can significantly reduce the training cost due to the shorter input to the encoder. A similar approach has been explored by masked autoencoders in vision (He et al., 2022), where 75% of the input patches are masked and removed from the input of the heavy encoder to achieve a 4.1× speedup. Recently, Liao et al. (2022) applied these architectural improvements to natural language pre-training, which, together with a high masking rate, saves a third of the pre-training budget.
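A minimal sketch of the separation idea, assuming the simplest possible design: drop the masked positions before the heavy encoder and carry (position, token) pairs so a lighter module can handle the masked slots later. This is our illustration of the general scheme, not the architecture of any specific system.

```python
def encoder_inputs(tokens, masked_positions):
    """Keep only the unmasked (position, token) pairs for the heavy encoder.

    The position index must travel with each token so that a lightweight
    decoder can later re-insert predictions at the masked slots. At an 80%
    masking rate, the encoder sees only ~20% of the sequence.
    """
    masked = set(masked_positions)
    return [(i, t) for i, t in enumerate(tokens) if i not in masked]

tokens = list("abcdefghij")
visible = encoder_inputs(tokens, masked_positions=range(0, 10, 2))
print(len(visible))  # 5 of the 10 tokens reach the encoder
```

Since Transformer cost is quadratic in sequence length, halving the visible input already cuts encoder attention cost by roughly 4×, which is why this pairs naturally with the high masking rates studied here.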
Disentangling corruption and prediction. Models perform better when trained with lower corruption rates and higher prediction rates. However, in standard MLMs, those two factors are always tied to the masking rate. Methods which can encode a sequence once and then efficiently predict many small sets of masks, for example by manipulating the attention, could substantially accelerate masked language modeling pre-training.

Limitations
(1) Our analysis of masking rates applies to a specific type of pre-training method, masked language modeling. We are also interested in studying masking rates in other pre-training methods, e.g., seq2seq models and ELECTRA, and leave this for future work.
(2) While we have shown how the optimal masking rate depends on model size and masking strategy, there may be additional factors, such as the vocabulary size, pre-training corpus, or language family. In particular, our experiments focus on English, but languages with different structural and morphological features may have lower or even higher optimal masking rates, or rely more on advanced masking strategies. (3) We consider a well-established yet relatively small set of downstream tasks, which do not benchmark domain-specific knowledge or more advanced reasoning skills. (4) Due to the expensive nature of our pre-training experiments, we were not able to train multiple pre-trained models over multiple seeds. (5) Finally, our findings point out several promising directions, but the paper primarily aims to study and understand the impact of masking rates with respect to different factors. We leave exploring better architectures and methods for efficient pre-training to future work.

Ethical Considerations
Large language models can exhibit various kinds of stereotypes, as they capture societal biases encoded in the training data. These associations are not detected by standard GLUE or SQuAD evaluation. We do not expect that simple modifications of masking rates can make progress towards solving these problems. Language model pre-training is also computationally expensive, which comes at a significant environmental cost. Furthermore, it makes reproduction and follow-up research difficult within an academic context. We reduce the computational requirements by following and promoting an efficient pre-training recipe, and our findings point to future research on efficient MLM.

A.1 Pre-training
We implement our pre-training based on fairseq. To further speed up pre-training, we integrate the DeepSpeed (Rasley et al., 2020) Transformer kernel. We keep the other settings the same as 24hBERT (Izsak et al., 2021), except that we use the RoBERTa tokenizer and do not adopt the 80-10-10 rule. We train our models on English Wikipedia and BookCorpus (Zhu et al., 2015). We want to emphasize that using pre-layernorm (Shoeybi et al., 2019) is essential for the high learning rate in Izsak et al. (2021) to work. The hyperparameters for the efficient pre-training recipe are shown in Table 5. We train with 8 Nvidia GTX 2080 GPUs and use gradient accumulation to achieve the large batch sizes.

A.2 Downstream Task Evaluation
We fine-tune our model on the GLUE benchmark (Wang et al., 2019), including SST-2 (Socher et al., 2013), CoLA (Warstadt et al., 2019), MNLI (Williams et al., 2018), QNLI (Rajpurkar et al., 2016), RTE (Dagan et al., 2005; Bar Haim et al., 2006; Giampiccolo et al., 2007; Bentivogli et al., 2009), MRPC (Dolan and Brockett, 2005), QQP, and STS-B (Cer et al., 2017), as well as the SQuAD v1.1 (Rajpurkar et al., 2016) dataset. For each dataset we run three random seeds and average the results. We apply grid search for the GLUE datasets, as shown in Table 6. For SQuAD, we use a learning rate of 1e-4, a batch size of 16, and train for 2 epochs. For both GLUE and SQuAD we use a linear learning rate schedule.

For all the results in the paper, we report accuracy for MNLI, QNLI, RTE, and SST-2; F1 score for QQP, MRPC, and SQuAD; Matthews correlation for CoLA; and Spearman's correlation for STS-B. For the SQuAD results in Table 1 and Table 2, we further train the models for 2300 steps (10% of the training) with a sequence length of 512, a learning rate of 5e-4, and a warmup rate of 10%. For the other tables and figures, we present the SQuAD results without further pre-training; the absolute numbers are lower because of the short pre-training sequence length.

For some of the figures in the paper, we only show the results of MNLI, QNLI, QQP, STS-B, SST-2, and SQuAD due to limited space. These tasks are selected because they have larger training sets and their results are more reliable. We always show development results in all our figures and tables except Table 2, where we report the test numbers for GLUE tasks.

Table 7 shows the performance of the 15%, 40%, and 80% masked models on all GLUE tasks and SQuAD. We can see that 80% masking largely preserves the downstream performance and 40% outperforms 15% on most tasks. Table 9 shows the performance of different tokenizers on downstream tasks; on most tasks, the RoBERTa tokenizer is better than the BERT tokenizer.
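The grid search over fine-tuning hyperparameters amounts to an exhaustive loop over all combinations. The sketch below uses hypothetical grid values and a toy scoring function; the actual grids are listed in Table 6, and the score would be dev-set performance averaged over three seeds.

```python
from itertools import product

# Hypothetical search space; the real grids are given in Table 6.
GRID = {"learning_rate": [5e-5, 1e-4], "batch_size": [16, 32], "epochs": [2, 3]}

def grid_search(evaluate, grid):
    """Try every hyperparameter combination and return the best config/score."""
    best_cfg, best_score = None, float("-inf")
    keys = sorted(grid)
    for values in product(*(grid[k] for k in keys)):
        cfg = dict(zip(keys, values))
        score = evaluate(cfg)  # in practice: fine-tune, then dev-set metric
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

# Toy stand-in for fine-tuning + evaluation:
cfg, score = grid_search(lambda c: -abs(c["learning_rate"] - 1e-4), GRID)
print(cfg)
```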
To see how the different masking rates perform with longer training, we modify the efficient pre-training recipe to run for more steps. We also experiment with the recipe used in the RoBERTa paper. Since the final RoBERTa models use more training data, we refer to the recipe used in RoBERTa's ablations (its Table 3). Table 10 shows the hyperparameters for the longer training, as well as a comparison to RoBERTa's recipe. The major differences are that we train with a much larger learning rate and a sequence length of only 128.
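For concreteness, the linear learning rate schedule with warmup used throughout can be written as below; `linear_schedule` is a hypothetical helper, not code from our implementation.

```python
def linear_schedule(step, total_steps, peak_lr, warmup_ratio=0.1):
    """Linearly warm up to peak_lr over warmup_ratio of training,
    then linearly decay to 0 (hypothetical helper)."""
    warmup_steps = int(warmup_ratio * total_steps)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    return peak_lr * (total_steps - step) / max(1, total_steps - warmup_steps)

# E.g. with a 10% warmup, the peak is reached at the end of warmup:
print(linear_schedule(2300, 23000, peak_lr=5e-4))  # -> 5e-4
```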

C Results of Training Longer
We train the models with 15% and 40% masking rates for longer and evaluate them on downstream tasks. Figure 7 shows the results. On most of the tasks, the trend that 40% is better than 15% still holds, though 40% has a larger advantage when the training steps are limited.
We also train models using the RoBERTa recipe and present the results in Table 8. We see that (1) on most tasks, 40% achieves results comparable to 15%; and (2) our "train longer" results, which use shorter sequences and larger learning rates, are comparable to the RoBERTa recipe results despite taking much less time.

E Results of Different Model Sizes and Masking Strategies
We show the configurations of the different model sizes in Table 11. Figure 8 and Figure 9 show the results of the base and medium models, complementing Figure 2. Figure 10 shows the performance of uniform masking, T5-style span masking, and PMI masking on downstream tasks, complementing Figure 5.

F Results on French MLM
To validate our conclusions in a new setting, we conduct MLM experiments on a French corpus. Similar to Izsak et al. (2021), we pre-train on the 2020 French Wikipedia and fine-tune on French XNLI. We report accuracy averaged over 4 seeds and observe that 40% is better than 15%.

              XNLI-fr
              valid   test
Masking 15%   78.3    77.3
Masking 40%   78.9    77.5

Table 12: We pre-train on 2020 French Wikipedia and fine-tune on French XNLI. We report accuracy averaged over 4 seeds. See Table 4 for details on the models.

G Information Flow Analysis
To visualize the effect of these corruption strategies (the 80-10-10 rule), we follow Voita et al. (2019)'s analysis of measuring mutual information between an input token and its intermediate representations. Figure 11 shows that each model initially loses some information about the source token while acquiring information from the surrounding context. Using same-token predictions during pre-training leads to a "reconstruction" stage in the last few layers, as observed by Voita et al. (2019), whereby information about the source token is restored from the context. However, this second stage is not present when same-token predictions are ablated: the [MASK]-only baseline propagates contextual features only, and no reconstruction occurs. This effect is more pronounced with random token corruption, where source information (which was less reliable during pre-training) is lost at a greater rate. One consequence is that information about the input tokens can be more easily extracted when pre-training with same-token predictions. However, the reconstruction of the source tokens does not appear to be as important in the fine-tuning setting, as shown by our experiments in Table 4.
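Voita et al. (2019) estimate mutual information between tokens and discretized layer representations. As a minimal sketch of the underlying estimator, the plug-in estimate of I(X; Y) over discrete samples can be computed from counts; the function name and toy data are illustrative, and the actual analysis first discretizes continuous representations (e.g. by clustering) before counting.

```python
from collections import Counter
from math import log2

def mutual_information(pairs):
    """Plug-in estimate of I(X; Y) in bits from a list of (x, y) samples."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum(c / n * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in joint.items())

# Sanity check: identical binary variables share 1 bit of information,
# while independent variables share none.
dependent = [(0, 0), (1, 1)] * 50
independent = [(0, 0), (0, 1), (1, 0), (1, 1)] * 25
print(mutual_information(dependent), mutual_information(independent))
```

Applied per layer, a drop in this estimate between the input token and a layer's (discretized) representation indicates that source-token information has been lost, which is how the "reconstruction" stage above manifests as a late-layer rebound.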