Learning Better Masking for Better Language Model Pre-training

Masked Language Modeling (MLM) has been widely used as the denoising objective in pre-training language models (PrLMs). Existing PrLMs commonly adopt a Random-Token Masking strategy where a fixed masking ratio is applied and different contents are masked by an equal probability throughout the entire training. However, the model may receive complicated impact from pre-training status, which changes accordingly as training time goes on. In this paper, we show that such time-invariant MLM settings on masking ratio and masked content are unlikely to deliver an optimal outcome, which motivates us to explore the influence of time-variant MLM settings. We propose two scheduled masking approaches that adaptively tune the masking ratio and masked content in different training stages, which improves the pre-training efficiency and effectiveness verified on the downstream tasks. Our work is a pioneer study on time-variant masking strategy on ratio and content and gives a better understanding of how masking ratio and masked content influence the MLM pre-training.


Introduction
Pre-trained language models (PrLMs) have played an essential role in many natural language processing tasks (Radford et al., 2018;Devlin et al., 2019;Bao et al., 2020;Guu et al., 2020;Yu et al., 2021;Zhang et al., 2022). Generally speaking, PrLMs can be seen as an automatic denoising encoder and may be conveniently obtained through a self-supervised learning way. Masked Language Modeling (MLM) pioneered by BERT (Devlin et al., 2019) is a widely used denoising method for language model pre-training (Lan et al., 2020;Clark et al., 2020). In MLM pre-training, a subset of tokens in a sequence is masked with a certain masking ratio, and the masked sequence is fed to the PrLM, which is required to predict the masked tokens.
Masking in MLM is a process of sampling masked tokens from a huge data space to generate training batches, and it is largely controlled by two main factors: masking ratio and masked content. So far, only a few studies have considered optimal settings for MLM, and only from quite limited perspectives. In particular, all known works consider only time-invariant MLM settings despite the large variation in the model's state over a lengthy pre-training. For example, carefully designed masked units such as n-grams, entities and spans (Sun et al., 2019;Joshi et al., 2020;Levine et al., 2021;Li and Zhao, 2021) are adopted throughout the entire pre-training. Another example is the search for a good enough (but still fixed) masking ratio (Wettig et al., 2022). Because the time-invariant masking applied in most MLM does not adapt to the changing process of language model pre-training, a time-invariant setting is unlikely to reach an optimal outcome. This motivates us to explore the influence of masking ratio and masked content in MLM pre-training and to propose time-variant MLM settings to verify our hypothesis for better PrLMs.
• Masking Ratio. Masking ratio controls the ratio between the number of tokens to predict and the left corrupted context. It determines the corruption degree that may affect the difficulty of restoring the masked tokens; that is, the larger the ratio is, the more masked contents model has to predict with less non-masked context. Our hypothesis is that at different training stages, the model may benefit from different masking ratios to balance the training from samples with different difficulties compared to the fixed ratio.
We first explore the influence of different masking ratios on downstream tasks at different stages throughout the entire pre-training instead of the only final stage. We find that a high masking ratio gives better performance for downstream tasks at the early stage, while a low ratio has a faster training speed. Thus we choose a higher ratio as the starting point and decay the masking ratio to a lower value during the pre-training, namely Masking Ratio Decay (MRD), which can significantly outperform the performance of the fixed ratio. MRD indicates that MLM benefits from a time-variant masking ratio at different training stages.
• Masked Content. When all words are given an equal and fixed chance of being masked throughout the entire pre-training, prediction learning may be unnecessary for some 'easy' words and insufficient for some 'difficult' words at the same time. Table 1 shows an intuitive example: a sequence with masked non-function words retains less information and is much harder to predict than one with masked function words. In the very beginning, all words are unfamiliar to the model, but as time goes on, the relative difficulties of words vary as the pre-training status changes. We show that the losses of function words converge much faster than those of non-function words, which means non-function words are much harder for models to learn. Therefore, the high proportion of function words in a sequence leads to inefficiency if Random-Token Masking is applied.
To handle training maturity for different types of words, we propose POS-Tagging Weighted (PTW) Masking to adaptively adjust the masking probabilities of different types of words according to the current training state. PTW masking makes the model have more chance to learn 'difficult' types of words and less chance for 'easy' ones from the perspective of part-of-speech. By introducing this adaptive schedule, we show that MLM benefits from learning mostly non-function words and especially has great improvement in the QA and NLI (Natural Language Inference) tasks.
Our contributions are threefold: 1) We analyze the insufficiency of current masking strategies from the perspectives of masking ratio and masked content and give a better understanding of MLM pre-training in terms of masking. 2) To the best of our knowledge, this is a pioneering study of the impact of time-variant masking, in both masking ratio and masked content, on MLM pre-training. 3) Our analysis shows that the time-variant masking schedules can significantly improve training efficiency and effectiveness. Our sources will be publicly available.

Preliminary Experiments
This section presents our preliminary experiments that motivate us to explore time-variant masking schedules. We train BERT-base (Devlin et al., 2019) with the widely-used English Wikipedia corpus to observe the influence of masking during pre-training by measuring the downstream performance on the SQuAD v1.1 dataset (Rajpurkar et al., 2016) (more experimental details will be given in Section 4). The experiments aim to study how the language model learns from the masked tokens when using conventional Random-Token Masking from the perspectives of the masking ratio and masked content.

Preliminaries of Masked Language Model
In general, Masked Language Modeling (MLM) is a denoising auto-encoding approach, widely used in language model pre-training, that reconstructs corrupted sequences. Specifically, given a sequence $x = \{x_1, x_2, \ldots, x_n\}$, we use a certain masking strategy P to replace p% of the tokens with special mask tokens. Taking the corrupted sequence as input, a language model parameterized by θ is trained to predict the original tokens at the masked positions in x using the pre-training objective (Equation 1), where $\hat{x}$ is the reconstructed sequence that the language model samples from the hidden states, and M denotes the index set of masked tokens over which the loss is calculated.
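The masking step and the restriction of the loss to the index set M can be sketched as follows. This is a minimal illustration, not the training code: the helper names `random_token_masking` and `mlm_loss` are hypothetical, every selected token is replaced by `[MASK]` (no random or kept replacements), and the per-token probabilities passed to `mlm_loss` stand in for the model's predicted probability of the original token.

```python
import math
import random

MASK = "[MASK]"

def random_token_masking(tokens, ratio, rng):
    """Replace round(ratio * n) uniformly chosen tokens with [MASK].

    Returns the corrupted sequence and the sorted index set M.
    """
    n_mask = max(1, round(ratio * len(tokens)))
    masked_idx = sorted(rng.sample(range(len(tokens)), n_mask))
    corrupted = list(tokens)
    for i in masked_idx:
        corrupted[i] = MASK
    return corrupted, masked_idx

def mlm_loss(true_token_probs, masked_idx):
    """Average negative log-likelihood, computed over masked positions only."""
    return -sum(math.log(true_token_probs[i]) for i in masked_idx) / len(masked_idx)

rng = random.Random(0)
tokens = "the quick brown fox jumps over the lazy dog near the quiet river bank today".split()
corrupted, M = random_token_masking(tokens, 0.15, rng)
print(corrupted, M)
```

Positions outside M contribute nothing to the loss, which is why the masking ratio directly sets how many prediction targets each sequence provides.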

Masking Ratio: Influence of Pre-training Masking Ratio
The masking ratio determines the corruption degree of a whole sequence for model training. We first conduct a simple experiment to explore the impact of different masking ratios on downstream tasks at different training stages of entire pre-training. We train BERT-base for 1M steps with a masking ratio of 15%, 25%, and 35% respectively, as shown in Figure 1a. During pre-training, checkpoints are saved every 50k steps, and finetuning is performed on the SQuAD v1.1 to observe changes in downstream performance. We find that the models using masking ratios of 25% and 35% have a gap of more than +1% F1 score in SQuAD compared with the model using a masking ratio of 15% at the beginning of training. However, in the second half training stage, the model with the masking ratio of 15% catches up with models with higher ratios in downstream performance.

Masked Content: Influence of Different Types of Words
In this section, we observe the influence of masked content by finding which kinds of words are more beneficial to pre-training. In terms of part-of-speech, words can be roughly divided into three categories: non-function words, function words, and the others (punctuation, symbols, etc.). If we mask all the function words and punctuation of a sentence, we can still infer roughly what the sentence is about. In contrast, if we mask all non-function words, we can hardly get any information from the sentence, as shown in Table 1.
To further explore the part-of-speech, we can speculate that, for the language model, masking different types of words leads to different difficulties for pre-training.
With the help of POS-tagging tools, we classify the words in the corpus into m categories during pre-processing. In pre-training, for each type of word, we calculate the corresponding cumulative loss $\tilde{\ell}_{k,t}$ at step t as follows:
$$\tilde{\ell}_{k,t} = \beta\,\tilde{\ell}_{k,t-1} + (1 - \beta)\,\ell_{k,t}, \quad (2)$$
where $\ell_{k,t}$ is the loss of word type k at step t, k ∈ C denotes the word type k in the set C of m categories, and β ∈ (0, 1) is a coefficient that balances the exponential weighted average. We use an exponential weighted average to smooth the losses because the instantaneous losses of different batches vary greatly, which would make the corresponding weights jitter (more details will be discussed in Section 3.2).
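The smoothing described above can be sketched as a per-category exponential weighted average. This is a minimal sketch: the function name, the tag names `NOUN` and `DET`, and the toy loss values are illustrative assumptions, and leaving a category unchanged when it does not appear in a batch is also an assumption. The value β = 0.95 and the absence of bias correction follow the settings reported in Section 3.2.

```python
def update_cumulative_losses(cumulative, batch_losses, beta=0.95):
    """One exponential-weighted-average step per POS category.

    No bias correction is applied, so every cumulative loss grows from zero.
    Categories missing from the batch are left unchanged (an assumption).
    """
    updated = dict(cumulative)
    for tag, loss in batch_losses.items():
        updated[tag] = beta * updated.get(tag, 0.0) + (1.0 - beta) * loss
    return updated

cumulative = {"NOUN": 0.0, "DET": 0.0}
toy_batches = [
    {"NOUN": 8.0, "DET": 6.0},
    {"NOUN": 7.0, "DET": 2.0},
    {"NOUN": 6.5, "DET": 0.5},
]
for step_losses in toy_batches:
    cumulative = update_cumulative_losses(cumulative, step_losses)
print(cumulative)
```

With β close to 1 a single noisy batch moves the cumulative loss only slightly, which is the stabilizing effect the smoothing is meant to provide.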
We train BERT-base for 200k steps with a fixed masking ratio of 15%. We record the cumulative losses of different types of words separately every 10 steps and observe the changes in losses. We find that the language model does have higher losses for masked non-function words and lower losses for masked function words. The latter quickly converges to very small values from the start, as shown in Figure 1b.

Analysis
Masking Ratio: Why Is a Time-invariant Masking Ratio Not the Best Choice? The experimental results in Figure 1a suggest an empirical law: with a high masking ratio, downstream performance has a higher starting point but grows relatively slowly and is eventually caught up by the model with a masking ratio of 15%. That is, the model with a masking ratio of 15% has a low starting point but improves faster in the later stage. Given this observation, a relatively high masking ratio gives a better model when training time is limited, whereas a lower masking ratio obtains better downstream performance if we train for long enough. If we use a decaying masking ratio instead of a fixed one, we can combine the advantages of both high and low masking ratios.
Masked Content: Why Is Random-Token Masking Suboptimal? In a sentence, the numbers of non-function words and function words are quite similar. Therefore, with Random-Token Masking, the model pays equal attention to learning from these two kinds of words. However, the experimental results in Figure 1b show that the language model dissipates its effort on modeling some function words whose losses are already very low. Meanwhile, Random-Token Masking makes the model less likely to learn the non-function words that it should learn more, which gives suboptimal pre-training results.
Figure 1: (a) SQuAD v1.1 performance using different masking ratios (x-axis: step; y-axis: SQuAD v1.1 F1 score). (b) x-axis: step; y-axis: loss (log scale); the legend includes ADJ.

Time-variant Masking Strategies
In this section, we present our exploration of time-variant masking on masking ratio and masked content, inspired by our findings above. The overview of the time-variant masking is presented in Figure 2.

Masking Ratio Decay (MRD)
According to the observation in Section 2.2, we design an optimized Masking Ratio Decay (MRD) Schedule. At the beginning of pre-training, we use a high masking ratio and decay the masking ratio using certain strategies, which is similar to learning rate decay without warmup. Assuming that the model generally adopts a fixed masking ratio p% for training, we use a very high masking ratio (about 2p%) as the starting point and a very low masking ratio (nearly zero) as the ending point in MRD.

Implementation of Two Decay Methods
We have tried two kinds of MRD to dynamically adjust the masking ratio, namely linear decay and cosine decay, as follows:
$$M_{\mathrm{linear}}(t) = \left(1 - \frac{t}{T}\right) \cdot 2p\%, \quad (3)$$
$$M_{\mathrm{cosine}}(t) = \left(1 + \cos\left(\frac{\pi}{T}\,t\right)\right) \cdot p\% + 0.02, \quad (4)$$
where M(t) is the current masking ratio at training step t and T is the total number of training steps. Linear decay starts at 2p% and decays to 0, while cosine decay starts at 2p% + 0.02 and decays to 0.02, as shown in Figure 3.
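The two schedules can be written as plain functions of the training step. This is a minimal sketch based on the description above (linear decay from 2p% to 0, cosine decay from 2p% + 0.02 to 0.02); the function names and the choice of T = 200k and p = 0.15 for the printout are illustrative.

```python
import math

def linear_decay(t, total_steps, p):
    """Masking ratio falls linearly from 2p at t = 0 to 0 at t = total_steps."""
    return 2.0 * p * (1.0 - t / total_steps)

def cosine_decay(t, total_steps, p, floor=0.02):
    """Masking ratio follows half a cosine period from 2p + floor down to floor."""
    return (1.0 + math.cos(math.pi * t / total_steps)) * p + floor

T, p = 200_000, 0.15
for t in (0, 50_000, 100_000, 150_000, 200_000):
    print(t, round(linear_decay(t, T, p), 4), round(cosine_decay(t, T, p), 4))
```

In the first half of training the cosine curve stays above the linear one, which is the property discussed later when the two strategies are compared.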

Details of Design Intention
We choose the starting point of 2p% and ending point of 0 because the model using MRD can learn almost the same number of masked tokens as the baseline using a fixed masking ratio due to the central symmetry of linear and cosine functions for fair comparisons. The reason why we add 0.02 to the cosine decay is that the value of cosine function (masking ratio) is nearly 0 in the final 5% steps, which means there are no masked tokens for model to learn (and loss diminishes to 0). Thus we set a small number (0.02) to make the model keep training in the final stage.
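Both claims in this paragraph can be checked numerically. The sketch below, under the assumption of p = 15%, T = 200k steps and a sequence length of 512 (the values used in our pre-training setup), averages each schedule over all steps and evaluates the cosine schedule without the 0.02 offset at the start of the final 5% of training. The function names are illustrative.

```python
import math

def linear_ratio(t, total_steps, p):
    return 2.0 * p * (1.0 - t / total_steps)

def cosine_ratio(t, total_steps, p, floor=0.0):
    return (1.0 + math.cos(math.pi * t / total_steps)) * p + floor

T, p, seq_len = 200_000, 0.15, 512

# Mean masking ratio over the whole run: both should be very close to p.
mean_linear = sum(linear_ratio(t, T, p) for t in range(T)) / T
mean_cosine = sum(cosine_ratio(t, T, p) for t in range(T)) / T

# Cosine schedule with no offset, evaluated where the final 5% of steps begins.
tail_ratio = cosine_ratio(T * 95 // 100, T, p)
tail_tokens = tail_ratio * seq_len

print(mean_linear, mean_cosine, tail_ratio, tail_tokens)
```

Without the offset, fewer than one token per 512-token sequence would be masked on average during the last 5% of steps, which is why a small constant is added to the cosine schedule.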

Analysis for How MRD Works
MRD reminds us of the Simulated Annealing (SA) algorithm (Kirkpatrick et al., 1983), which is a stochastic algorithm for optimization. In the SA algorithm, the degree of acceptance of suboptimal solutions depends on the annealing temperature T according to the Metropolis algorithm (Metropolis et al., 1953). That is, the higher the temperature parameter T is, the larger the solution space that can be explored. Thus, the model can easily jump out of local minima if T is large. As the model converges and the annealing temperature decreases, the tolerance for suboptimal solutions falls and a better local optimum can be found. In MRD, the magnitude of the annealing temperature T is analogous to the masking ratio p%. A high masking ratio means less information in the input sequence, allowing the model to explore more possibilities on a coarse-grained task. On the other hand, a low masking ratio allows the model to focus on finding a better solution close to the global minimum on a fine-grained task.
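For readers unfamiliar with the analogy, the Metropolis acceptance rule used in simulated annealing can be sketched in a few lines. This illustrates only the SA side of the analogy; MRD itself does not compute acceptance probabilities, and the function name and the cost difference of 1.0 are illustrative.

```python
import math

def metropolis_acceptance(delta_cost, temperature):
    """Probability of accepting a move that changes the cost by delta_cost."""
    if delta_cost <= 0:
        return 1.0  # improvements are always accepted
    return math.exp(-delta_cost / temperature)

# The same worsening move is accepted less and less often as the temperature drops.
for temperature in (10.0, 1.0, 0.1):
    print(temperature, metropolis_acceptance(1.0, temperature))
```

A high temperature tolerates worse moves and so explores broadly, in the same way that a high masking ratio poses a coarser task early in pre-training.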

POS-Tagging Weighted (PTW) Masking
In this section, on the basis of the discussion of Section 2.3, we present the POS-Tagging Weighted (PTW) Masking, making the models have more chance to train on the difficult words according to the current training state.
Firstly, in the data pre-processing part, we perform word-level tokenization for the sequences in the corpus and use POS-tagging tools to label the whole words with their corresponding part-of-speech. We then use the WordPiece Tokenizer to perform token-level tokenization and align tokens with their part-of-speech tags while ignoring the special tokens [CLS], [SEP], and [PAD].
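The alignment step can be sketched as follows. This is an illustration under stated assumptions: `toy_wordpiece` is a hypothetical stand-in for a real WordPiece tokenizer (it only splits off a few suffixes), the tag names are illustrative, and special tokens simply receive no tag.

```python
def toy_wordpiece(word):
    """Hypothetical stand-in for a WordPiece tokenizer: splits off a few known suffixes."""
    for suffix in ("ly", "er", "ion"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return [word[: -len(suffix)], "##" + suffix]
    return [word]

def align_pos_tags(words, tags, tokenize=toy_wordpiece):
    """Give every subword the POS tag of the whole word it came from.

    Special tokens get no tag (None), so they are ignored by later weighting.
    """
    tokens, token_tags, word_ids = ["[CLS]"], [None], [None]
    for word_id, (word, tag) in enumerate(zip(words, tags)):
        for piece in tokenize(word):
            tokens.append(piece)
            token_tags.append(tag)
            word_ids.append(word_id)
    tokens.append("[SEP]")
    token_tags.append(None)
    word_ids.append(None)
    return tokens, token_tags, word_ids

words = ["the", "runner", "moved", "quickly"]
tags = ["DET", "NOUN", "VERB", "ADV"]
tokens, token_tags, word_ids = align_pos_tags(words, tags)
print(list(zip(tokens, token_tags, word_ids)))
```

Keeping the word index alongside each token makes it possible to mask all pieces of a selected word together, as whole-word masking requires.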
Before each training batch starts, we first apply PTW Masking to corrupt the sequences, while the masking ratio remains unchanged. When the model has been trained for t steps, according to Equation 2, we obtain the cumulative loss vector $\tilde{L}_{\mathrm{MLM}} = \{\tilde{\ell}_1, \tilde{\ell}_2, \ldots, \tilde{\ell}_k, \ldots, \tilde{\ell}_m\}$, k ∈ C, for the m categories of words. This vector is converted into the corresponding weight vector $W_{\mathrm{POS}} = \{w_1, w_2, \ldots, w_k, \ldots, w_m\}$, $w_k \in (0, 1)$, by a sigmoid function (Equation 5), where µ is a coefficient that adjusts the input to the sigmoid function. We apply this weight vector $W_{\mathrm{POS}}$ to the masking probabilities. Equation 5 is based on Equation 2, whose smoothing keeps the masking weights changing in a relatively stable way. We set β = 0.95 and do not use bias correction in Equation 2, which enables the cumulative losses for each kind of word to grow from zero. That is, in the very beginning, $W_{\mathrm{POS}}$ is initialized with the same value for each type of word and weights the masking probabilities equally.
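A possible form of the loss-to-weight conversion is sketched below. The exact argument of the sigmoid is an assumption made for illustration: here the cumulative losses are centred on their mean and scaled by µ, which is one simple choice that keeps all weights in (0, 1) and gives every category the same weight when all cumulative losses are still zero. The function name and toy values are hypothetical.

```python
import math

def pos_weights(cumulative_losses, mu=1.0):
    """Map per-category cumulative losses to weights in (0, 1) with a sigmoid.

    Centring on the mean loss is an assumed form, chosen only for illustration.
    """
    mean_loss = sum(cumulative_losses.values()) / len(cumulative_losses)
    return {
        tag: 1.0 / (1.0 + math.exp(-mu * (loss - mean_loss)))
        for tag, loss in cumulative_losses.items()
    }

# At the very beginning all cumulative losses are zero, so all weights are equal.
print(pos_weights({"NOUN": 0.0, "DET": 0.0, "PUNCT": 0.0}))
# Later, categories with higher cumulative loss receive larger weights.
print(pos_weights({"NOUN": 6.0, "DET": 0.5, "PUNCT": 0.2}))
```

A larger µ sharpens the contrast between high-loss and low-loss categories, while a smaller µ keeps the weights closer to uniform.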
Specifically, the masking probability of each word is weighted by its corresponding part-of-speech, so that words with higher losses are more likely to be masked. We show that non-function words tend to have much higher losses than function words, so the language model learns to model non-function words most of the time, and function words and punctuation less often, as shown in Figure 1c. As a special case, PTW Masking is similar to Named Entities Masking (Sun et al., 2019) if only proper nouns have a weight of 1 and all the others have a weight of 0.
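One way to turn the weights into a masking decision at a fixed ratio is weighted sampling without replacement over whole words. The sketch below is an assumption about the sampling procedure rather than a description of the actual implementation: it uses Efraimidis-Spirakis keys (u ** (1 / w)) and hypothetical weights and tag names.

```python
import random

def ptw_select_words(word_tags, weights, ratio, rng):
    """Pick round(ratio * n) distinct word positions, favouring high-weight POS tags.

    Each word draws a key u ** (1 / w); the words with the largest keys are masked.
    """
    n_mask = max(1, round(ratio * len(word_tags)))
    keys = [(rng.random() ** (1.0 / weights[tag]), i) for i, tag in enumerate(word_tags)]
    return sorted(i for _, i in sorted(keys, reverse=True)[:n_mask])

weights = {"NOUN": 0.95, "VERB": 0.90, "DET": 0.10, "PUNCT": 0.05}
word_tags = ["DET", "NOUN", "VERB", "DET", "NOUN", "PUNCT"] * 5

rng = random.Random(0)
counts = {tag: 0 for tag in weights}
n_trials = 2000
for _ in range(n_trials):
    for i in ptw_select_words(word_tags, weights, 0.15, rng):
        counts[word_tags[i]] += 1
print(counts)
```

The number of masked words per sequence is unchanged, so the masking ratio stays fixed while the choice of which words to mask shifts toward the high-loss categories.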
In addition, we have tried to apply the PTW Masking to Random-Token Masking, and found no significant increase in the performance of downstream tasks. We suspect that although the subwords in content words are masked more frequently, some common subwords (such as ##ly, ##er, ##ion, etc.) are also overlearned.

Pre-training
For pre-training, we use the BERT-base model as the representative of MLMs for training. For the dataset, we train BERT on English Wikipedia using WordPiece Tokenizer for tokenization. We only use the MLM task as the training objective and discard the Next Sentence Prediction task, as it has been shown to be redundant in previous studies (Liu et al., 2019;Joshi et al., 2020). Following the model configuration of BERT, we train with a sequence length of 512, and a batch size of 256. We use a peak learning rate of 2e-4 and adopt learning rate decay with a warmup of 10k steps. We also train with gradient clipping of 1.0 and weight decay rate of 0.01. We train the models for 200k steps from scratch.

Finetuning
We finetune our models on GLUE and SQuAD v1.1 (Rajpurkar et al., 2016) to evaluate the performance of downstream tasks. Following the common finetuning practice, we do not use any additional training strategies. We run finetuning on both SQuAD v1.1 and GLUE 5 times and report the average scores.
(Nangia et al., 2017) and RTE (Bentivogli et al., 2009). For GLUE, we use a learning rate of 1e-4 and a batch size of 32. We finetune our models on RTE and STS-B for 10 epochs and on the other subtasks for 3 epochs.
SQuAD The Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al., 2016) is a commonly used benchmark for question answering. The task is to predict the text span of an answer from a given passage-question pair. In this work, we use SQuAD v1.1 for finetuning. We finetune with a learning rate of 5e-5, batch size of 128, for 3 epochs.

Implementation Details
For the implementation of MRD, we train the model using Random-Token Masking with a fixed masking ratio of 15% as the baseline of MRD. Then we apply our MRD to Random-Token Masking with an average masking ratio of 15% using linear and cosine decay, respectively.
Because Equations 3 and 4 contain the parameter T, the number of training batches that a model sees under a given masking ratio differs if the total number of training steps T is different. As shown in Figure 4, compared to the model trained for 200k steps, the model trained for 1M steps is trained with a large masking ratio for a longer time at the early stage of pre-training. This issue does not arise with the usual time-invariant masking ratio. In MRD, however, although both masking ratios decay in the same relative way, the absolute difference in masking ratio at the early stage may affect downstream performance. Thus, we want to explore whether training with a large masking ratio for a longer time or decaying faster at the early stage is more beneficial to pre-training.
Figure 4: Different total training steps cause an absolute difference in masking ratio in the early stage of pre-training though the same MRD strategies are applied (x-axis: step; y-axis: masking ratio; curves: cosine 200k, cosine 1M).
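The absolute difference can be made concrete by evaluating the cosine schedule at the same step for the two training lengths. This is a small worked example assuming p = 15% and the 0.02 offset of the cosine schedule; the function name is illustrative.

```python
import math

def cosine_decay(t, total_steps, p=0.15, floor=0.02):
    """Cosine masking ratio decay from 2p + floor down to floor."""
    return (1.0 + math.cos(math.pi * t / total_steps)) * p + floor

for t in (50_000, 100_000, 200_000):
    short_run = cosine_decay(t, 200_000)
    long_run = cosine_decay(t, 1_000_000)
    print(f"step {t}: 200k-step run {short_run:.3f}, 1M-step run {long_run:.3f}")
```

At step 100k, for instance, the 200k-step run has already decayed to a ratio of about 0.17, while the 1M-step run is still at about 0.31, so the longer run spends far more batches at high masking ratios.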
For the implementation of PTW Masking, for equal comparison on word-level masking, the baseline of PTW Masking adopts the Whole-Word Masking with a fixed masking ratio of 15%. We then train the model with our PTW Masking using the same fixed ratio of 15% to weight the probabilities of different types of words to be masked according to their cumulative losses.

Results
We evaluate downstream task performance of MRD and PTW Masking compared to baselines on SQuAD and GLUE in Tables 2-3 and Figure 5. From the results, we have various observations that will be discussed in the following sections.

Results of MRD
The experimental results show that MRD greatly improves the downstream task performance and pre-training efficiency.

Decaying Masking Ratio vs Fixed Ratio
In Figure 5, we show the SQuAD performance for every 50k checkpoint during pre-training. We observe that the large masking ratio gives a better downstream performance at the start and the decaying mechanism continues to boost the downstream performance, which takes the advantages of high masking ratio and low masking ratio discussed in Section 2.4. The model using cosine decay at 650k steps has obtained a competitive SQuAD v1.1 F1 score to the baseline at 1M steps, thus reducing the training time by 35%.
Figure 5: Comparison between fixed ratio and cosine decay strategy on SQuAD performance during pre-training. We evaluate the saved checkpoints for every 50k steps on SQuAD v1.1 dev set following the same experimental setup (x-axis: step; y-axis: SQuAD v1.1 F1 score; curves: fixed 1M, cosine 1M, fixed 200k, cosine 200k).

Influence of Masking Ratio at Early Stage
We further explore the absolute difference in masking ratio (mentioned in Section 4.2) that arises when the same MRD strategies are used with different numbers of training steps. As shown in Table 4, for GLUE tasks, MRD training for 1M steps has more obvious advantages than training for 200k steps. The model trained for 1M steps with MRD performs well above the baseline on all subtasks, with an average increase of 1+ on GLUE, which is a larger gain than that obtained with 200k steps. The comparison shows that models benefit from training for a longer time with a large masking ratio from the start, especially on GLUE. This is because the subtasks in GLUE are mainly sequence-level and focus on global semantics. With a higher masking ratio, the model trains on a coarse-grained task, inferring global semantics from fewer words, which is more suitable for GLUE. Therefore, in the training of 200k steps, the masking ratio decays too fast, resulting in insufficient training on the coarse-grained task. In contrast, the model with 1M steps can maintain the training at a high masking ratio for a longer time and thus performs better on the sequence-level tasks of GLUE.

Comparison Between Different MRD Strategies
Compared with linear decay, cosine decay has better downstream task performance in most subtasks in GLUE. The difference between these two is that cosine decay keeps a higher masking ratio in the early stage and decays more quickly, which is consistent with the analysis mentioned above. To move forward, it is necessary to maintain a high masking ratio in the early stage of pre-training. According to Section 3.1.3's empirical analysis, the model can explore a larger global solution space by using a higher masking ratio in the early training period so as to better converge to the optimal global minima when the masking ratio decreases later.

Results of PTW Masking
Results in Table 3 show that PTW Masking achieves a significant improvement on SQuAD. On the GLUE subtasks (shown in Table 2), compared to the baseline using Whole-Word Masking, PTW Masking performs better on MNLI and QNLI, which are both NLI (Natural Language Inference) tasks, but shows a large decline on MRPC, which is a sequence-level classification task on sentence paraphrasing.

What Skills Have Models Learnt if Trained with Mostly Non-function Words with PTW Masking?
Both SQuAD and the NLI tasks (QNLI, MNLI) can be regarded as the same kind of task: extracting certain information from a given context. As shown in Figure 1c, models with PTW Masking learn function words only in the very beginning and mostly non-function words afterward. We show that PTW Masking makes the model sensitive to the words with more semantic information, which is consistent with the goal of information extraction. On the other hand, most of the other subtasks in GLUE focus on the understanding of a whole sequence. For example, MRPC requires the model to judge whether one sentence is a paraphrase of another. We can see that the model with PTW Masking has comparable or even worse performance on such tasks, which means that it struggles to capture the global semantics. This makes us rethink the role of function words: they help to convey the global semantics of the whole sequence.

Related Work
The pre-processing of the MLM is to replace a subset of the tokens in the input with [MASK] tokens, which has two considerations to optimize: how many tokens to mask (masking ratio) and what tokens to mask (masked content).
• Masking Ratio. Masking ratio is a very important hyperparameter that affects MLM pre-training, yet it has seldom been studied. The masking ratio of 15% used in BERT is the most common value and is also applied in other MLMs. The generator of ELECTRA (Clark et al., 2020) is an MLM that uses 15% for base-sized models and 25% for large-sized models. However, because the generator is trained together with the discriminator, it is difficult to judge the effect of 25% on MLM itself. In a recent study, Wettig et al. (2022) suggest that a masking ratio of 40% performs better than 15% on downstream tasks for the RoBERTa-large (Liu et al., 2019) model. T5 (Raffel et al., 2020) uses an MLM-style pre-training method and also experiments with different masking ratios. They find that the masking ratio has a limited effect on the model's performance except for 50% and use 15% as the final choice. To the best of our knowledge, most studies on masking ratio compare downstream performance at the end of pre-training (Raffel et al., 2020), but few pay attention to the dynamic influence of masking ratio during pre-training. We record the changes in downstream performance under different masking ratios and propose MRD according to the empirical law we observe. Instead of using a fixed masking ratio, we dynamically decay the ratio and find that the performance of MLM can be greatly improved.
• Masked Content. Previous studies have explored strategies for masked content to further improve the Random-Token Masking, though nearly all of them focus on how to select coherent enough masked units. (Devlin et al., 2019) proposes Whole-Word Masking, which forces the model to predict complete words instead of WordPiece tokens. Furthermore, SpanBERT (Joshi et al., 2020), n-gram Masking (Levine et al., 2021;Li and Zhao, 2021) and LIMIT-BERT  take into account the continuous mask of multiple word combinations, making model predict tokens using the context with long dependencies. ERNIE (Sun et al., 2019) improves pre-training performance by especially masking named entities. Different from all existing MLM improvements, our proposed PTW method lets different types of words correspondingly receive the matched learning intensity, which pioneers a new technical line for the concerned MLM.

Conclusion
Masked language model pre-training can be generally defined by two main factors, masking ratio and masked content. The Random-Token Masking scheme adopted by existing studies treats all words equally and maintains a fixed ratio throughout the entire pre-training, which our analysis shows to be suboptimal. To better unleash the strength of MLM, we explore two kinds of time-variant masking strategies, namely, Masking Ratio Decay (MRD) and POS-Tagging Weighted (PTW) Masking. Experimental results verify our hypothesis that MLM benefits from time-variant settings in both masking ratio and masked content that follow the dynamic training state. Our further analysis shows that these two time-variant masking schedules greatly improve pre-training efficiency and the performance of downstream tasks.

A Additional Investigation on MRD
A.1 Magnitude of Masking Ratio in MRD
When using MRD, we explore the influence of much higher masking ratios, which affect the downstream performance of the model. Previous studies (Raffel et al., 2020;Wettig et al., 2022) have shown that a much higher fixed masking ratio ( 40%) will cause significant degradation in the model performance because the model can only infer from a small amount of known information, resulting in quick convergence to local minima. In MRD, we show that the design of the decaying mechanism can mitigate the impact of the high masking ratio. For the BERT-base model, starting from a high ratio (30%) and a much higher ratio (55%), both can outperform the baseline with a similar margin. We show that higher masking ratios in the early pre-training stage help downstream performance, and MRD prevents the high masking ratios from destroying pre-training in the later stage.

A.2 MRD Interacts with Learning Rate
Moreover, we show a subtle relationship between MRD and learning rate decay. When the masking ratio is low, using a relatively high learning rate will cause a huge decline in model performance. Therefore, in MRD, the masking ratio and the learning rate both adopt the same type of decay strategy, except that the learning rate has an additional warmup stage. For example, cosine masking ratio decay uses cosine learning rate decay.
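The pairing of the two schedules can be sketched as follows. The peak learning rate of 2e-4 and the 10k warmup steps come from our pre-training setup; the linear shape of the warmup, the decay of the learning rate to exactly zero, and the function names are assumptions made for illustration.

```python
import math

def cosine_lr(t, total_steps, peak_lr=2e-4, warmup_steps=10_000):
    """Linear warmup to peak_lr, then cosine decay to zero (shapes assumed)."""
    if t < warmup_steps:
        return peak_lr * t / warmup_steps
    progress = (t - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

def cosine_masking_ratio(t, total_steps, p=0.15, floor=0.02):
    """Cosine masking ratio decay, with no warmup stage."""
    return (1.0 + math.cos(math.pi * t / total_steps)) * p + floor

T = 200_000
for t in (0, 10_000, 100_000, 200_000):
    print(t, cosine_lr(t, T), round(cosine_masking_ratio(t, T), 4))
```

After warmup, both quantities fall together, so the low masking ratios at the end of training are always paired with a low learning rate.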

A.3 Other Simple Schedules in MRD
Based on the same experiment setup, we train the models with other simple schedules (shown in Figure 6) for 200k steps using the linear learning rate decay and finetune them on the SQuAD v1.1. The results on SQuAD v1.1 dev set are presented in Table 5. We find that cosine is the best compared with those alternatives.

Figure 6: Other simple schedules of adjusting the masking ratios (x-axis: step; y-axis: masking ratio).