UvA-DARE (Digital Academic Repository) Mask More and Mask Later: Efficient Pre-training of Masked Language Models by Disentangling the [MASK] Token



Introduction
Large-scale pre-trained MLMs, like BERT (Devlin et al., 2019) and its variants (Liu et al., 2019; Lan et al., 2020; Clark et al., 2020; Song et al., 2019; Lewis et al., 2020), have achieved great success in various NLP tasks, such as machine translation (Liu et al., 2020; Zhu et al., 2020), general language understanding (Wang et al., 2019b,a), question answering (Rajpurkar et al., 2016), summarization (Liu, 2019) and claim verification (Soleimani et al., 2020). To generalize well to a wide range of tasks, MLMs tend to have a large number of parameters, even at the billion scale (Shoeybi et al., 2019), and are trained on plenty of data. This is prohibitively expensive and generates significant amounts of CO2 emissions (Strubell et al., 2019; Patterson et al., 2021). How to make the pre-training of MLMs more efficient while retaining their superior performance is a critical research question.

1 Code at https://github.com/BaohaoLiao/3ml
Various attempts at efficient pre-training have produced effective results. Hou et al. (2022) and Wu et al. (2021) applied prior knowledge extracted from the MLM itself to bias the prediction toward rare tokens. Shoeybi et al. (2019) and You et al. (2020) made use of mixed-precision and distributed training to speed up pre-training. Data-efficient pre-training objectives (Clark et al., 2020; Lan et al., 2020) and the progressive stacking technique (Gong et al., 2019) also work quite well. Orthogonal to these directions, we dive deeply into the information flows of the vanilla MLM, trying to split different types of information flows into multiple stages and making the model spend more computation on the complex flow.
The information transferred among tokens can be split into position information and non-positional information (termed token information in this paper). As shown in Figure 1, the information flows for a corrupted sentence during training consist of: (1) position and token information among unmasked tokens; (2) position and token information from unmasked tokens to [MASK]s; (3) position information among [MASK]s; (4) position information from [MASK]s to unmasked tokens. These information flows happen in each Transformer block (Vaswani et al., 2017), more specifically in the self-attention module. In addition, the 4th flow brings no additional information given the 1st one, since the position information from [MASK]s can be inferred implicitly from the positions of the unmasked tokens. We ignore the 4th flow in the following discussion.
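The four flow types can be made concrete as labels over (query, key) pairs in self-attention, where information flows from the key position to the query position. A minimal sketch in plain Python; the labeling function and all names are ours, for illustration only:

```python
def flow_masks(length, masked):
    """Label each (query, key) pair of a self-attention map with one of the
    four flow types. `masked` is the set of [MASK] positions. Information
    flows from the key position into the query position."""
    def kind(q, k):
        if q not in masked and k not in masked:
            return 1  # unmasked -> unmasked (position + token information)
        if q in masked and k not in masked:
            return 2  # unmasked -> [MASK]
        if q in masked and k in masked:
            return 3  # [MASK] -> [MASK] (position information only)
        return 4      # [MASK] -> unmasked (redundant given the 1st flow)
    return [[kind(q, k) for k in range(length)] for q in range(length)]

# Sentence of length 5 with positions 1 and 2 masked, as in Figure 1.
m = flow_masks(5, masked={1, 2})
assert m[0][3] == 1   # unmasked token attending to an unmasked token
assert m[1][0] == 2   # a [MASK] pulling information from an unmasked token
assert m[1][2] == 3   # [MASK] to [MASK]
assert m[0][1] == 4   # unmasked token attending to a [MASK]
```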
Intuitively, the amount of information transferred in each flow is not at the same level. The 3rd flow contains the least information. We make the following assumption about the first two flows.

The Information Flow Assumption. Position and token information among unmasked tokens (1st flow) are more difficult to learn than the transfer of this knowledge to [MASK]s (2nd flow).
Since the difficulty of information transfer varies among the flows, it makes sense to divide them into multiple stages, forcing the model to spend more computation on the more complex one. We propose a two-stage learning method. For the early layers of an MLM, we detach the [MASK]s and only input the embeddings of unmasked tokens, so the model first focuses on the most complex (1st) information flow. At an intermediate layer, we append the embeddings of [MASK]s, with their corresponding position information, back to the sequence. The remaining layers of the MLM then fuse all information. In this way, the sequence length for the early layers becomes shorter because the [MASK]s are excluded. We further reduce the sequence length by increasing the masking rate for higher efficiency. We call our method mask more and mask later (3ML), since [MASK]s are appended later and we use a larger masking rate.
In this work, we introduce two models designed for 3ML (§2), empirically show that two prerequisites for 3ML's efficiency hold (§4), conduct extensive experiments to select an optimal setting balancing performance and efficiency (§5), and finally compare 3ML's results to strong baselines (§6).
Our main contributions are summarized as: (1) we introduce a simple, intuitive but effective method for the efficient pre-training of MLMs; (2) we empirically verify two prerequisites that are important for 3ML; (3) on the GLUE benchmark, 3ML achieves the same performance as RoBERTa-base and RoBERTa-large (Liu et al., 2019) with only 78% and 68% of the original computation budget, respectively, and outperforms them with the same budget.

Model
In this section, we first discuss the information flows in a vanilla MLM, i.e., BERT, then introduce two architectures designed for 3ML (Figure 2).

Vanilla Masked Language Model
Masked language models reconstruct a sequence from corrupted information. Given a sequence of tokens x = {x_t}_{t=1}^T, with t denoting the token's position, the corrupted version x̄ is generated by randomly setting a portion of x to a special symbol [MASK]. The MLM is trained to learn the distribution p(x|x̄) with the loss function

L = − Σ_{t=1}^T δ_{x̄_t ≠ x_t} · log p(x_t | x̄)    (1)

where δ_{x̄_t ≠ x_t} is a Kronecker delta function that equals 1 if x̄_t ≠ x_t and 0 otherwise, so the loss is only computed at corrupted positions.

As shown in Figure 1, position and token information are transferred among different tokens in the model. We can cluster the flows into four types: (1) from unmasked tokens to unmasked tokens (within the blue area): the MLM transfers position and token information among unmasked tokens, generating uncorrupted contextualized information; (2) from unmasked tokens to [MASK]s (from the blue area to the yellow area): the MLM transfers the uncorrupted contextualized information to [MASK]s; (3) from [MASK]s to [MASK]s (within the yellow area): the MLM transfers position information among [MASK]s; since all masked tokens have the same token embedding, there is no transfer of token information; (4) from [MASK]s to unmasked tokens (from the yellow area to the blue area): the MLM transfers position information from [MASK]s to unmasked tokens.
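As a concrete illustration of the corruption process above, here is a minimal sketch of BERT-style masking in plain Python. The token IDs, `MASK_ID`, and the vocabulary are hypothetical placeholders; the loss would then be evaluated only at the returned positions, whether or not the token there was actually replaced:

```python
import random

MASK_ID = 103            # hypothetical [MASK] token id
VOCAB = list(range(1000))  # hypothetical vocabulary

def corrupt(tokens, mask_rate=0.15, seed=0):
    """Return (corrupted sequence, set of corrupted positions).

    Follows BERT's 80-10-10 rule: of the selected positions, 80% become
    [MASK], 10% a random vocabulary token, and 10% stay unchanged."""
    rng = random.Random(seed)
    corrupted = list(tokens)
    n_pick = max(1, round(mask_rate * len(tokens)))
    positions = rng.sample(range(len(tokens)), n_pick)
    for pos in positions:
        u = rng.random()
        if u < 0.8:
            corrupted[pos] = MASK_ID
        elif u < 0.9:
            corrupted[pos] = rng.choice(VOCAB)
        # else: keep the original token; it is still predicted in the loss
    return corrupted, set(positions)

tokens = [11, 22, 33, 44, 55, 66, 77, 88, 99, 110]
x_bar, picked = corrupt(tokens, mask_rate=0.4)
assert len(picked) == 4                     # 40% of 10 positions selected
assert all(x_bar[i] == tokens[i] for i in range(10) if i not in picked)
```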

Mask More and Mask Later
Since different flows of the vanilla MLM contain different amounts of information (§1), we propose a two-stage training method in which more computation is allocated to the most complex flow, the one among unmasked tokens. At a later stage, we fuse all flows together as in the vanilla MLM. In this way, we improve efficiency by discarding [MASK]s and thereby reducing the sequence length for the first stage. Even though at the later stage we still need to fuse all information flows together at the original sequence length, we only need a few layers for that, since the 1st flow is already learned quite well during the first stage. Combining a large masking rate with a small number of layers for the second stage, we achieve efficient pre-training. In short, two prerequisites contribute to our efficient pre-training: we can mask more and mask later (tested in §4).
We design two architectures, 3ML_self and 3ML_cross, that differ from each other only in the decoder. As shown in Figure 2, we first input the unmasked tokens to both models. At an intermediate layer, we input the token embedding of [MASK] and fuse all information flows together. With our method, the token embedding of [MASK] is disentangled from the original word embedding space and lives in a latent space.
3ML_self. This architecture is inspired by a computer vision method (He et al., 2021) designed for two-stage learning. 3ML_self has an encoder and a decoder, both built from self-attention Transformer blocks, with a prediction layer for masked tokens at the end of the decoder. Only unmasked tokens are fed into the encoder, so the encoder only transfers information among unmasked tokens (1st flow). After the encoder, we append the token embedding of [MASK] back to the sequence, together with a new positional embedding, and input the result to the decoder. Since the input sequence to the decoder consists of the representations of unmasked tokens and the token embeddings of [MASK]s, the decoder fuses all information together. In addition, the token embedding of [MASK] lies in the same space as the latent representations of unmasked tokens.
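The two-stage forward pass can be sketched as follows. This is a toy sketch with stand-in identity layers and one-dimensional "embeddings" to show the index bookkeeping only; all names are ours, not from the released code:

```python
def add(a, b):
    return [x + y for x, y in zip(a, b)]

def encode(layers, states):
    for layer in layers:   # each layer maps list[vector] -> list[vector]
        states = layer(states)
    return states

def three_ml_self(embed, enc_layers, dec_layers, mask_embed, pos_embed_dec,
                  token_ids, masked_positions):
    # Stage 1: the encoder only sees the shorter, [MASK]-free sequence.
    unmasked = [i for i in range(len(token_ids)) if i not in masked_positions]
    enc_out = encode(enc_layers, [embed(token_ids[i]) for i in unmasked])
    # Stage 2: re-insert a single shared [MASK] embedding (living in the
    # encoder's latent space) at the masked slots, add fresh decoder position
    # embeddings, and run the small decoder on the full-length sequence.
    full = [None] * len(token_ids)
    for h, i in zip(enc_out, unmasked):
        full[i] = h
    for i in masked_positions:
        full[i] = mask_embed
    return encode(dec_layers, [add(h, pos_embed_dec(i))
                               for i, h in enumerate(full)])

# Toy run: identity layers stand in for the Transformer blocks.
out = three_ml_self(
    embed=lambda t: [float(t)],
    enc_layers=[lambda s: s],                # stand-in for the deep encoder
    dec_layers=[lambda s: s, lambda s: s],   # small two-layer decoder
    mask_embed=[0.0],
    pos_embed_dec=lambda i: [0.01 * i],
    token_ids=[1, 2, 3, 4, 5],
    masked_positions={1, 2},
)
assert len(out) == 5              # decoder output has the full length again
assert out[1] == [0.01] and out[2] == [0.02]   # mask_embed + position
```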
3ML_cross. Both 3ML_self and 3ML_cross share the same encoder, which only receives unmasked tokens as input. In contrast to 3ML_self, we use a Transformer block with both self- and cross-attention as the base layer of the 3ML_cross decoder. Two sequences are given as input to the decoder: the latent representations of unmasked tokens, and a sequence of [MASK] tokens with their position embeddings. The information flow from unmasked tokens to [MASK]s (2nd flow) is handled by the cross-attention module, and the information among [MASK]s (3rd flow) is transferred by the self-attention module. Unlike in 3ML_self, however, there is no further information transfer among unmasked tokens (1st flow) in the decoder. We use the same prediction layer for the randomly replaced, unchanged and masked tokens. Notably, we predict the randomly replaced and unchanged tokens after the encoder.
For both 3ML_self and 3ML_cross, if the hidden dimensions of the encoder and decoder are not identical, one extra fully-connected layer is added in between for projection. The prediction layer of the vanilla MLM contains two fully-connected layers. The last one shares its weight with the embedding layer, which has the same hidden dimension as the encoder, and projects the hidden dimension to the vocabulary size for prediction. If the hidden dimensions of the encoder and decoder differ, the first prediction layer projects the hidden dimension of the decoder back to the encoder dimension, so we can still share the weight of the embedding layer. More details are in Appendix D.
Fine-tuning and Inference. After pre-training, the decoder of both architectures is removed; we only fine-tune the encoder on downstream tasks. We fine-tune in this way because we want to: (1) speed up inference; (2) keep our architecture for downstream tasks the same as standard MLMs, making it convenient to plug into various applications without modifying their frameworks. However, for tasks that use the [MASK] embedding directly, like mask-infilling, it might be beneficial to keep the decoder, since the token embedding of [MASK] is not located in the same space as the other tokens.
3 Experimental Setup
We report accuracy for SST-2, MNLI, QNLI, and RTE; both F1 score and accuracy for MRPC and QQP; Matthew's correlation for CoLA; and both Pearson and Spearman correlation for STS. By default, we use the same calculation as the GLUE leaderboard, i.e., the average of MNLI-m and MNLI-mm for MNLI, the average of F1 and accuracy for MRPC and QQP, and the average of Pearson and Spearman correlation for STS. We finally report the macro average over all tasks. For ablation experiments, we only evaluate the models on MNLI, QNLI, and QQP and report their accuracy, since these three tasks have the largest training and validation sets, resulting in more stable fine-tuning scores than the others. In addition, for ablation studies we show the weighted average accuracy for MNLI over MNLI-m and MNLI-mm rather than the macro average.
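The averaging scheme described above can be written down directly. The scores below are invented, purely to illustrate the computation:

```python
def glue_average(scores):
    """Macro average over tasks, where each task first averages its
    sub-metrics (MNLI: m/mm accuracy; MRPC/QQP: F1 and accuracy;
    STS: Pearson and Spearman correlation)."""
    per_task = {task: sum(vals) / len(vals) for task, vals in scores.items()}
    return sum(per_task.values()) / len(per_task)

scores = {  # hypothetical numbers, for illustration only
    "SST-2": [93.0], "MNLI": [86.0, 85.6], "QNLI": [92.1], "RTE": [70.0],
    "MRPC": [91.0, 87.5], "QQP": [88.0, 91.2], "CoLA": [60.0],
    "STS": [89.5, 89.3],
}
avg = glue_average(scores)
assert abs(avg - 83.64375) < 1e-6   # (93 + 85.8 + 92.1 + 70 + 89.25 + 89.6 + 60 + 89.4) / 8
```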

Baselines
We compare 3MLs to the following baselines:

• Google BERT. The results of Google BERT (Devlin et al., 2019) on the GLUE development set are not shown in the original paper, so we borrow BERT-base's and BERT-large's results from Xu et al. (2020) and Clark et al. (2019), respectively.
• Our RoBERTa. RoBERTa-base's GLUE results (trained on BooksCorpus and English Wikipedia) are not fully shown in the original paper (Liu et al., 2019). We re-implement it with its original hyperparameters.
We don't include encoder-decoder architectures (like BART (Lewis et al., 2020) and MASS (Song et al., 2019)) here because RoBERTa outperforms them (Lewis et al., 2020) on GLUE tasks.

Implementation.
The re-implementation of the baselines and our pre-training methods are conducted on fairseq (Ott et al., 2019). The default masking strategy for 3ML stays the same as BERT's: for all corrupted tokens, 80% are replaced by [MASK], 10% are replaced by random tokens from the vocabulary, and 10% stay unchanged. We borrow the fine-tuning hyperparameters from 24hBERT for all 3MLs (Table 5 in Appendix B).
Architectures. The encoder of our large or base model shares the same settings as Google BERT. By default, we use a two-layer decoder with half of the hidden dimension of the encoder. 3ML_self uses Transformer (Vaswani et al., 2017) encoder layers for its decoder, while 3ML_cross uses Transformer decoder layers with both self-attention and cross-attention, without causal masking, for its decoder. Since the hidden dimensions of the encoder and decoder are not the same, a linear layer in between projects the output of the encoder to the dimension of the decoder. 3MLs have two untied learnable positional embeddings, one for the encoder and one for the decoder. Like 24hBERT, we implement pre-layer normalization (pre-LN) (Shoeybi et al., 2019) for 3ML-large; it makes pre-training more stable and achieves better performance with a large learning rate. For 3ML_self-base, post-LN is slightly better than pre-LN. We still use pre-LN for 3ML_cross-base for stable pre-training, as post-LN doesn't work for 3ML_cross; we leave the investigation of this problem to future work. For fine-tuning on GLUE tasks, the 3ML decoder is removed, so its inference time on downstream tasks stays the same as the original BERT's. More details of the 3ML architectures are in Table 4 (see Appendix A).

Two Prerequisites for Efficiency
In this section, we show that the two prerequisites for efficiency, masking more and masking later, hold for 3ML, which also serves as an empirical test of our information flow assumption.

Mask More
As shown in Figure 2, if masking more is possible, the sequence length of the input to 3ML's encoder becomes shorter. Since most trainable parameters are located in the encoder (the 3ML_self-large encoder contains 98% of the parameters, excluding the embedding layer and the prediction layer) and the computation time of a self-attention module scales quadratically with the sequence length, this significantly reduces the computation time.
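A back-of-the-envelope sketch of this effect, using a generic per-layer cost model (linear projection terms plus a quadratic attention term, constants omitted; this is not the paper's exact FLOPs accounting from Appendix D):

```python
def encoder_length(n, r):
    """Sequence length seen by the 3ML encoder: with masking rate r under the
    80-10-10 strategy, only the 80% of corrupted tokens that become [MASK]
    are dropped, so n_en = (1 - 0.8 r) n."""
    return (1 - 0.8 * r) * n

def self_attention_cost(n, d):
    # Generic cost model: projections scale linearly in n,
    # the attention map quadratically (constant factors omitted).
    return n * d * d + n * n * d

n, d, r = 512, 1024, 0.4
full = self_attention_cost(n, d)
short = self_attention_cost(encoder_length(n, r), d)
# With r = 0.4 the encoder sequence shrinks to 68% of the original length,
# and the quadratic term to 0.68^2 ~ 46% of its original cost.
assert abs(encoder_length(n, r) - 348.16) < 1e-9
assert short < 0.7 * full
```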
However, the motivation of our work is to maintain the MLM's performance at higher efficiency; we don't want to lose performance significantly. Therefore, we conduct an experiment on a vanilla MLM, 24hBERT, to check whether masking more is possible.
As shown in Figure 3, the default masking rate (15%) from BERT is not optimal. With an increasing masking rate, the performance on all three tasks first increases and then decreases. The optimal masking rate is 30% for MNLI, 20% for QNLI, and 20% for QQP. Performance at any masking rate in (15, 45)% is consistently better than at 15%. On average, a masking rate of 40% works best. Wettig et al. (2022) obtained a similar result. In short, masking more is not only possible but also offers higher performance.

Mask Later
Masking later and masking more are complementary in achieving higher efficiency. For the vanilla MLM, masking more offers better performance, but this doesn't guarantee that we can disentangle [MASK]s from the word embedding and append them at an intermediate layer. Instead of directly showing the performance of 3ML, we answer the following two questions to check whether masking later is possible: (1) How good are 3MLs at masked language modeling compared to BERT? (2) How fast do the models recover the identity of the [MASK] tokens?
To answer the first question, we compare the MLM perplexity on [MASK]s between 24hBERT and 3MLs on the validation set. The results in Table 1 show that while 3ML_self and 3ML_cross have comparable perplexities, the perplexity of 24hBERT is significantly lower, indicating that 24hBERT performs better at the MLM task. However, this does not necessarily correlate with better performance on downstream tasks, due to the mismatch between pre-training and fine-tuning (further discussed in §5.2). One could even argue that the increased difficulty of the pre-training task forces 3MLs to learn better hidden representations of the unmasked tokens.
To address the second question, we measure the mutual information between the hidden representations at the masked positions in each layer and the original tokens at those positions. We follow the strategy proposed by Voita et al. (2019) and take hidden representations at masked positions corresponding to the 1000 most frequent tokens. For each layer, we gather 5M hidden representations and cluster them using mini-batch k-means into 10,000 clusters. For 24hBERT, we do this for each of the 24 encoder layers. For the 3ML models, the masked tokens are only fed into the decoder, so values are only calculated for the two decoder layers.
The results for 24hBERT and 3MLs are shown in Figure 4. For the vanilla MLM, the largest share of the information about the identity of masked tokens is restored within the first few layers; the rest is gradually restored over the remaining layers. The results for 3ML_self and 3ML_cross are similar: after the first decoder layer, the information about the masked token identity is already restored to a level similar to that of the last layer of 24hBERT. A possible explanation is that the decoder of 3MLs can already access the high-level representations from the encoder, which facilitates the reconstruction.
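The clustering-based measurement can be sketched as follows: once each hidden representation at a masked position is assigned to a cluster, the mutual information between cluster id and original token is estimated from the joint counts. A minimal plug-in estimator in plain Python; the toy pair lists below stand in for the mini-batch k-means assignments:

```python
from collections import Counter
from math import log

def mutual_information(pairs):
    """Estimate I(C; T) in nats from (cluster_id, token_id) pairs, using
    plug-in estimates of the joint and marginal distributions."""
    n = len(pairs)
    joint, c_marg, t_marg = Counter(pairs), Counter(), Counter()
    for c, t in pairs:
        c_marg[c] += 1
        t_marg[t] += 1
    mi = 0.0
    for (c, t), cnt in joint.items():
        p_ct = cnt / n
        mi += p_ct * log(p_ct / ((c_marg[c] / n) * (t_marg[t] / n)))
    return mi

# If cluster ids perfectly identify the tokens, I(C; T) = H(T);
# for two equiprobable tokens that is log(2) nats. Independent ids give 0.
perfect = [(0, "a"), (0, "a"), (1, "b"), (1, "b")]
useless = [(0, "a"), (1, "a"), (0, "b"), (1, "b")]
assert abs(mutual_information(perfect) - log(2)) < 1e-12
assert abs(mutual_information(useless)) < 1e-12
```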
The results in Figure 4 also empirically test our information flow assumption. The 1st flow contains the most significant amount of information. We can achieve higher efficiency by explicitly allocating more computation to this flow and less to the others, rather than attending to all flows at once like the vanilla MLM.

Efficient Setting
Section 4 shows that the two prerequisites for higher efficiency hold for 3MLs. In this section, we further explore different choices of the 3ML architecture and the masking rate, aiming to select an optimal setting that trades off performance and efficiency.

Decoder Architecture
By default, we give the 3ML encoder the same hyper-parameter settings as Google BERT; we leave the exploration of the encoder architecture to future work. Since both [MASK]s and unmasked tokens are fed into the 3ML decoder, it is necessary to select a small decoder for high efficiency.
We explore 3ML decoders with different numbers of layers and hidden dimensions in Tables 2a and 2b. As shown in Table 2a, both 3ML_self and 3ML_cross yield a similar but surprising finding: a large decoder with 8 layers works worst, while a small decoder with 2 or 4 layers works best. We argue that 3ML with a deeper decoder doesn't work well because of our fine-tuning setting: since 3ML's decoder is removed for fine-tuning, throwing away a deeper decoder means that more pre-trained parameters are discarded. For the following experiments, 3ML with a two-layer decoder is the default setting.
Table 2b shows 3ML's performance with different hidden dimensions. 3ML is less sensitive to the hidden dimension of the decoder: the performance of all settings is very similar. By default, we choose a decoder with half of the encoder's dimension for the following experiments.
Both Tables 2a and 2b show that a small 3ML decoder is enough. This suggests that fusing all information flows after the encoder is easy, again empirically supporting our information flow assumption.

Masking Strategy
The default masking strategy of the vanilla MLM is 80-10-10, i.e., 80% of the corrupted tokens are replaced by [MASK]s, 10% are replaced by random tokens, and 10% are kept unchanged. Ideally, we could achieve higher efficiency with a 100-0-0 setting, since we could further reduce the sequence length for 3ML's encoder at the same masking rate as 80-10-10.
We repeat BERT's experiments on the masking strategy in Table 2c, checking whether the findings differ for our 3ML. The default 80-10-10 strategy works much better than 100-0-0 for both 3MLs, and it is also the best strategy for 3ML_cross; 80-0-20 works about as well as 80-10-10 for 3ML_self. This finding suggests that keeping the prediction of some original (unchanged) tokens is necessary, as it decreases the gap between pre-training and fine-tuning. The original BERT paper (Devlin et al., 2019) reported a similar finding. By default, we apply the 80-10-10 masking strategy.
We can further observe the mismatch between pre-training and fine-tuning in the MLM perplexity scores. The prediction of [MASK]s is the hardest for both 3MLs, with the highest perplexity. The prediction of unchanged tokens is the easiest for 3ML_self. Surprisingly, the prediction of randomly replaced tokens is the easiest for 3ML_cross, which contradicts our intuition that unchanged tokens should be the easiest to predict.

Masking Rate
A large masking rate is a necessary prerequisite for the efficiency of 3ML. However, a too-large masking rate hurts performance, as shown in Figure 3. In this section, we study 3ML's trade-off between efficiency and performance. From Figure 5, we achieve higher speedups with increasing masking rates. With a masking rate of 50%, we obtain nearly a 1.5x speedup, saving a third of the computation budget.
Both 3ML_self and 3ML_cross share a similar trend with 24hBERT: better accuracy is obtained with a moderate masking rate.

Comparison with Previous Work
In this section, we compare 3ML with strong baselines on the development set of GLUE. We list the results of both the efficient and the longer pre-training recipes in Table 3.

Result of Efficient Pre-training Recipe
The first block of Table 3 shows results from the efficient pre-training recipe. We train all models with the same number of updates and an identical learning rate, so the improvements shown here are not obtained by seeing more data or performing more gradient updates. Consistent with Figure 3, 24hBERT with a masking rate of 40% performs consistently better than the one with 15% on all GLUE tasks, achieving a 0.7% absolute improvement on average. This further suggests that masking more is possible and favorable.
3ML_self with a masking rate of 40% or 50% has the same average score as 24hBERT-40%, but it requires less computation: to be precise, only 75% and 68% of the original computation budget, respectively. 3ML_cross achieves a slightly worse result than 24hBERT-40%, losing 0.2% on average, while being slightly more efficient than 3ML_self at the same masking rate.
Compared to the number of trainable parameters of 24hBERT, the parameters 3ML adds with its extra decoder are negligible, accounting for only 2% of all 24hBERT parameters. In addition, 3ML's decoder is discarded for fine-tuning. We therefore believe that the strong performance of 3MLs stems from our two-stage training method rather than from the slightly increased parameter count.

Result of Longer Pre-training Recipe
3ML performs well and efficiently on a limited computation budget. We are also interested in its scaling behavior, which is critical for a language model, as language models are normally trained on large data for better generalization across a wide range of tasks. Due to our limited computation resources, we only run the scaling experiment on the base model with more updates until convergence. We leave scaling experiments on a bigger dataset and a larger model to future work.
The results for the longer pre-training recipe are shown in the second block of Table 3. When seeing the same amount of data (125K updates), 3ML_self performs better than BERT (84.9/85.0 vs. 82.5), and comparably to RoBERTa (84.9/85.0 vs. 84.9) and ELECTRA (84.9/85.0 vs. 85.1). Moreover, 3ML-125K is faster than any baseline (28% faster than RoBERTa with a masking rate of 50%). When using the same computation budget as RoBERTa, 3ML_self-40%-153K achieves the best result (85.3) among all models. We also notice that the pre-training of 3ML_self already converges within 125K updates: as Table 3 also shows, we only obtain 0.4% and 0.5% improvements with an extra 28K and 35K updates for 3ML_self-40% and 3ML_self-50%, respectively. Further improvement is expected if the model is trained on a larger dataset.
Similar to the efficient pre-training recipe, 3ML_cross performs worse than the strong baselines, RoBERTa and ELECTRA, but better than BERT-base. We attribute the weaker performance of 3ML_cross to the fact that the positional embeddings for [MASK]s and unmasked tokens do not lie in the same latent space (Figure 2), which makes it difficult for the model to fuse all information and predict the missing information. In addition, the decoder doesn't further transfer information among unmasked tokens (the most complex flow). We leave further investigation to future work.

(Notes for Table 3: "*" denotes our re-implementation; "•" marks the metric used for computing the average score; models marked "⋄" keep the shown average score when using the same metrics as ELECTRA. More details on the baseline models are in Appendix C.)
In summary, when trained on the same number of samples, 3ML_self performs similarly to strong baselines with less computation, and 3ML_cross performs comparably to the baselines in efficient pre-training. When trained with a similar amount of computation, 3ML_self performs best and 3ML_cross performs better than the standard baseline, BERT.

Related Work
The work most related to this paper is MAE (He et al., 2021) from computer vision. 3ML_self shares almost the same architecture as MAE, but additionally predicts unchanged and randomly replaced tokens, while MAE only reconstructs the masked patches. We also reach a different finding from MAE: a small decoder works better for 3ML_self, whereas MAE applied a deeper decoder for better performance.
3ML's encoder-decoder architecture looks similar to BART (Lewis et al., 2020) and MASS (Song et al., 2019). However, BART and MASS apply a causally masked decoder that is suitable for generation tasks rather than classification tasks, whereas our decoder is bidirectional, like the encoder. In addition, the decoder of BART and MASS is used when fine-tuning downstream tasks: for GLUE tasks, one needs to feed the same sequence to both the encoder and the decoder, which is less efficient. 3ML's decoder is removed for fine-tuning, giving it the same inference speed as the vanilla MLM.
Hou et al. (2022) present concurrent work, dropping the representations of unimportant tokens in some intermediate layers to reduce the sequence length for efficiency. We don't use any prior information and only drop the [MASK] tokens at the very beginning; in addition, 3ML achieves better results. Wettig et al. (2022) and our work share the same finding of better performance at a medium masking rate. Although ALBERT (Lan et al., 2020) and ELECTRA (Clark et al., 2020) also make the pre-training of MLMs more efficient, their studies are orthogonal to ours.

Conclusion
We propose a two-stage learning method for efficient masked language modeling and design two models, 3ML_self and 3ML_cross, for it. Two prerequisites for our efficient method are that we can use a higher masking rate and append [MASK]s at a later layer. Our experiments show that both masking more and masking later are possible and favorable. This allows us to reduce the sequence length of the encoder during pre-training by a factor depending on the masking rate. Through extensive experiments, we observe that 3ML_self performs better on downstream tasks than 3ML_cross. It speeds up pre-training by a factor of 1.5x under our efficient pre-training recipe without any performance degradation. Using roughly the same computation budget, 3ML_self outperforms all of our strong baselines, including ELECTRA and RoBERTa.

Limitations
Our investigation is limited to classification tasks. While 3ML outperforms other models on GLUE tasks, it might not do well on other tasks, especially mask-infilling, where the token embedding of [MASK] is used directly. More tasks need to be evaluated to validate 3ML's robustness.
In addition, we only train 3MLs on BooksCorpus and English Wikipedia. The scaling behavior of 3ML with respect to model size and amount of data remains an open question, and it still needs to be validated whether the results transfer to languages other than English. We leave this to future work. We also don't apply any fine-tuning tricks, like the layer-wise learning rate in ELECTRA, or tune the hyperparameters. Providing a specific optimal recipe for our model would make it more practical.
We use the training recipe from RoBERTa as our longer pre-training recipe.

D Calculation of FLOPs
We borrow a simplified version of the FLOPs calculation from Pan et al. (2021), in which the computation for bias, activation, and dropout is neglected because it only occupies a small share (< 1%) of the total FLOPs. We restate their calculation and make some modifications here. The notation is listed in Table 6 for convenience.

Transformer block with self-attention. Given the sequence length n and the embedding dimension d, the FLOPs of the multi-head self-attention (MSA) layer come from: (1) the projection of the input sequence to key, query, and value, φ_qkv = 6nd²; (2) the attention map from key and query, φ_map = 2n²d; (3) the self-attention operation, φ_attn = 2n²d; (4) the projection of the self-attention output, φ_out = 2nd². The overall FLOPs for an MSA layer are therefore:

φ_MSA(n, d) = 8nd² + 4n²d

A feed-forward (MLP) layer has two fully-connected (FC) layers. The first FC projects the embedding dimension from d to 4d, and the second one projects it back to d. The FLOPs are:

φ_MLP(n, d) = 16nd²

So the overall FLOPs for a Transformer block are:

φ_BLK(n, d) = φ_MSA(n, d) + φ_MLP(n, d) = 24nd² + 4n²d

Embedding layer. We assume all models, including the baselines, implement sparse lookup for the token embedding. This differs from ELECTRA (Clark et al., 2020), which states that RoBERTa and BERT obtain the token embedding by multiplying the embedding layer with one-hot vectors. We make this assumption since any model can easily be re-implemented in this efficient way. Sparse lookup is efficient and can be neglected when calculating FLOPs.

Prediction layer. The prediction layer consists of two FC layers. The first FC layer keeps the input and output in the same dimension, and the second one projects to the vocabulary size |V|. Like typical language models, we share the weight between the embedding layer and the second FC layer. Like RoBERTa, we only input the masked tokens, including unchanged and randomly replaced tokens, to the prediction layer. With a masking rate of r, the FLOPs are:

φ_Pred(n, r, d, |V|) = 2nr(d² + d|V|)

RoBERTa. The total FLOPs for RoBERTa can be computed as

φ_RoBERTa(b, u, l, n, d, r, |V|) = 2bu(l · φ_BLK(n, d) + φ_Pred(n, r, d, |V|))    (2)

where b is the batch size, u is the number of updates, l is the number of Transformer blocks, and the leading 2 accounts for the forward and backward passes, which consume similar amounts of FLOPs.

3ML_self. 3ML_self has two types of Transformer blocks, one for the encoder and one for the decoder; the only difference between them is the hidden dimension. To project the output sequence of the encoder to the hidden dimension of the decoder, we add an FC layer in between. The FLOPs of the prediction layer are modified as:

φ′_Pred(n, r, d_en, d_de, |V|) = 2nr(d_de·d_en + d_en|V|)

where d_en and d_de are the hidden dimensions of the encoder and the decoder, respectively. The overall FLOPs are:

φ_3ML_self(b, u, l_en, l_de, n_en, n, d_en, d_de, r, |V|) = 2bu(2·n_en·d_en·d_de + l_en · φ_BLK(n_en, d_en) + l_de · φ_BLK(n, d_de) + φ′_Pred(n, r, d_en, d_de, |V|))    (3)

where n_en = (1 − 0.8r)n, since only 80% of all masked tokens are replaced by [MASK]s. The first term inside the parentheses is the dimension projection from the encoder to the decoder; l_en and l_de are the numbers of Transformer blocks in the encoder and the decoder, respectively.

Transformer block with self- & cross-attention. The decoder of 3ML_cross contains an MSA layer, a multi-head cross-attention (MCA) layer, and an MLP layer. The FLOPs of both the MSA and the MLP stay the same as above. Similar to the MSA layer, the FLOPs for the components of the MCA layer are:

φ′_qkv = 4·n_en·d_de² + 2·n_de·d_de²
φ′_map = 2·n_en·n_de·d_de
φ′_attn = 2·n_en·n_de·d_de
φ′_out = 2·n_de·d_de²

where n_en and n_de are the sequence lengths of the encoder and decoder inputs, respectively. So the FLOPs for an MCA layer are:

φ_MCA(n_en, n_de, d_de) = 4·n_en·d_de² + 4·n_de·d_de² + 4·n_en·n_de·d_de

Then the overall FLOPs for a cross-attention Transformer block are:

φ_CBLK(n_en, n_de, d_de) = φ_BLK(n_de, d_de) + φ_MCA(n_en, n_de, d_de) = 4·n_en·d_de² + 28·n_de·d_de² + 4·n_en·n_de·d_de + 4·n_de²·d_de

3ML_cross. Similar to 3ML_self, 3ML_cross has an encoder and a decoder with hidden sizes d_en and d_de, respectively. Since their hidden sizes might differ, there is an FC layer projecting from d_en to d_de. The overall FLOPs for training 3ML_cross are:

φ_3ML_cross(b, u, l_en, l_de, n_en, n_de, d_en, d_de, |V|) = 2bu(2·n_en·d_en·d_de + l_en · φ_BLK(n_en, d_en) + l_de · φ_CBLK(n_en, n_de, d_de) + φ_Pred(n_en, 0.2r/(1 − 0.8r), d_en, |V|) + φ′_Pred(n_de, 1, d_en, d_de, |V|))    (4)

where n_de = 0.8nr and n_en = (1 − 0.8r)n, since only 80% of all masked tokens are replaced by [MASK]s. The second-to-last term denotes the prediction layer on the encoder side; both prediction layers share the weight of the embedding layer. The masking rate in the second-to-last term is the ratio between the number of unchanged and randomly replaced tokens (0.2nr) and n_en. The masking rate of the last term is 1 because there are only [MASK]s on the decoder side.

Figure 1: Information flows of the vanilla MLM. A sentence, {x_1, x_2, x_3, x_4, x_5}, is corrupted by replacing x_2 and x_3 with a virtual token m indexed with the corresponding positions.

Figure 2: Overview of 3ML architectures. A sentence, "<s> Let us agree to disagree.</s>", is corrupted to "<s> should us[MASK]to[MASK].</s>". Left: 3ML with only self-attention layers (3ML_self). Right: 3ML with a decoder consisting of both self- and cross-attention layers (3ML_cross). We achieve efficient pre-training by discarding [MASK]s for the encoder and applying a small decoder for the whole sequence length. For fine-tuning on downstream tasks, the decoder is removed.

Figure 4: Mutual information between the hidden representations of [MASK] tokens per layer and the original tokens. Layer 0 corresponds to the token embeddings. All models are pre-trained with a masking rate of 40%.

The efficient pre-training recipe (Liu et al., 2019) is mainly used for ablation studies. It is a computation-friendly recipe that takes 9 hours with 16 Nvidia Tesla V100 GPUs. The longer pre-training recipe from RoBERTa (Liu et al., 2019) takes about 36 hours with 32 Nvidia Tesla V100 GPUs. With our efficient method, we can reduce the training time proportionally to the reduced computation (FLOPs). More hyperparameter details for these two recipes are shown in Table 4 (see Appendix A), and the calculation details of training FLOPs are in Appendix D.
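The FLOPs accounting can be turned into a small calculator; a sketch of eqs. (2) and (3), with illustrative BERT-large-like shapes (batch size, update count, and vocabulary below are placeholders, not the paper's exact numbers):

```python
def phi_blk(n, d):
    # Transformer block: MSA (8nd^2 + 4n^2 d) + MLP (16nd^2).
    return 24 * n * d**2 + 4 * n**2 * d

def phi_pred(n, r, d, vocab):
    # Prediction layer applied only to the n*r corrupted positions.
    return 2 * n * r * (d**2 + d * vocab)

def phi_roberta(b, u, layers, n, d, r, vocab):
    # Eq. (2); the leading 2 accounts for forward + backward passes.
    return 2 * b * u * (layers * phi_blk(n, d) + phi_pred(n, r, d, vocab))

def phi_3ml_self(b, u, l_en, l_de, n, d_en, d_de, r, vocab):
    # Eq. (3): encoder runs on n_en = (1 - 0.8 r) n, small decoder on full n.
    n_en = (1 - 0.8 * r) * n
    proj = 2 * n_en * d_en * d_de                      # encoder->decoder FC
    pred = 2 * n * r * (d_de * d_en + d_en * vocab)    # modified prediction
    return 2 * b * u * (proj + l_en * phi_blk(n_en, d_en)
                        + l_de * phi_blk(n, d_de) + pred)

# Illustrative shapes only (roughly BERT-large-like).
b, u, n, vocab = 256, 125_000, 512, 30_000
base = phi_roberta(b, u, layers=24, n=n, d=1024, r=0.15, vocab=vocab)
ours = phi_3ml_self(b, u, l_en=24, l_de=2, n=n, d_en=1024, d_de=512,
                    r=0.4, vocab=vocab)
# The shorter encoder sequence dominates the savings.
assert 0.6 * base < ours < 0.8 * base
```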

Table 1 :
MLM perplexity of 24hBERT and 3MLs on the validation set. All models are pre-trained with a masking rate of 40%.

Table 2 :
3ML ablation experiments on the large model. The default settings are marked in gray, i.e., 2 decoder layers, a hidden dimension of 512 for 3ML's decoder, and a masking rate of 40% with the 80-10-10 strategy.

Table 3 :
Complete results on the GLUE dev. set. Speedup is computed based on the pre-training FLOPs.

Table 6 :
Notation for the calculation of training FLOPs.