Frustratingly Simple Pretraining Alternatives to Masked Language Modeling

Masked language modeling (MLM), a self-supervised pretraining objective, is widely used in natural language processing for learning text representations. MLM trains a model to predict a random sample of input tokens that have been replaced by a [MASK] placeholder in a multi-class setting over the entire vocabulary. When pretraining, it is common to use other auxiliary objectives alongside MLM on the token or sequence level to improve downstream performance (e.g. next sentence prediction). However, no previous work has examined whether other, simpler objectives, linguistically intuitive or not, can be used standalone as main pretraining objectives. In this paper, we explore five simple pretraining objectives based on token-level classification tasks as replacements of MLM. Empirical results on GLUE and SQUAD show that our proposed methods achieve comparable or better performance to MLM using a BERT-BASE architecture. We further validate our methods using smaller models, showing that pretraining a model with 41% of BERT-BASE's parameters, BERT-MEDIUM, results in only a 1% drop in GLUE scores with our best objective.


Introduction
Masked Language Modeling (MLM) pretraining (Devlin et al., 2019; Liu et al., 2019; Lan et al., 2020; Wang et al., 2020) is widely used in natural language processing (NLP) for self-supervised learning of text representations. MLM trains a model (typically a neural network) to predict a particular token that has been replaced with a [MASK] placeholder given its surrounding context. Devlin et al. (2019) first proposed MLM with an additional next sentence prediction (NSP) task (i.e. predicting whether two segments appear consecutively in the original text) to train BERT.
Recently, several studies have extended MLM by masking a contiguous segment of the input instead of treating each token independently (Song et al., 2019; Sun et al., 2020; Joshi et al., 2020). Yang et al. (2019) reformulated MLM in XLNET to mask out attention weights rather than input tokens, such that the input sequence is auto-regressively generated in a random order. ELECTRIC (Clark et al., 2020a) addressed the expensive softmax issue of MLM using a binary classification task, where the task is to distinguish between words sampled from the original data distribution and a noise distribution, using noise-contrastive estimation. In a different direction, previous work has also developed methods to complement MLM for improving text representation learning. Aroca-Ouellette and Rudzicz (2020) have explored sentence and token-level auxiliary pretraining objectives, showing improvements over NSP. ALBERT (Lan et al., 2020) complemented MLM with a similar task that predicts whether two sentences are in correct order or swapped. ELECTRA (Clark et al., 2020b) introduced a two-stage token-level prediction task, using an MLM generator to replace input tokens and subsequently a discriminator to predict whether a token has been replaced or not.
Despite these advances, simpler auxiliary-style objectives, linguistically motivated or not, have not been explored as primary pretraining objectives that completely replace MLM. Motivated by this, we propose five frustratingly simple pretraining tasks, showing that they result in models that perform competitively with MLM when pretrained for the same duration (e.g. five days) and fine-tuned on downstream tasks in the GLUE (Wang et al., 2019) and SQUAD (Rajpurkar et al., 2016) benchmarks. Contributions: (1) To the best of our knowledge, this study is the first to investigate whether linguistically and non-linguistically intuitive tasks can effectively be used for pretraining (§2). (2) We empirically demonstrate that our proposed objectives are often computationally cheaper and result in better or comparable performance to MLM across different sized models (§4).

Pretraining Tasks
Our methodology is based on two main hypotheses: (1) effective pretraining should be possible with standalone token-level prediction methods that are linguistically intuitive (e.g. predicting whether a token has been shuffled or not should help a model to learn semantic and syntactic relations between words in a sequence); and (2) the deep architecture of transformer models should allow them to learn associations between input tokens even if the pretraining objective is not linguistically intuitive (e.g. predicting the first character of a masked token should not matter for the model to learn that 'cat' and 'sat' usually appear in the same context). Figure 1 illustrates our five linguistically and non-linguistically intuitive pretraining tasks with a comparison to MLM.
Shuffled Word Detection (SHUFFLE): Motivated by the success of ELECTRA, our first pretraining objective is a token-level binary classification task, consisting of identifying whether a token in the input sequence has been shuffled or not. For each sample, we randomly shuffle 15% of the tokens. This task is trained with the token-level binary cross-entropy loss averaged over all input tokens (i.e. shuffled and original). The major difference between our approach and ELECTRA is that we do not rely on MLM to replace tokens. Our intuition is that a model can acquire both syntactic and semantic knowledge by distinguishing shuffled tokens in context.
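The corruption step for SHUFFLE can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `shuffle_tokens` is hypothetical, positions are labeled by selection (so a selected position that happens to receive its own token back is still labeled as shuffled), and at least two positions are always selected so that a permutation is possible.

```python
import random

def shuffle_tokens(tokens, ratio=0.15, rng=None):
    """Permute the tokens at a random `ratio` of positions.

    Returns (new_tokens, labels), where labels[i] is 1 if position i was
    selected for shuffling and 0 otherwise. Labeling by selection is an
    assumption; the paper does not say how fixed points are handled.
    """
    rng = rng or random.Random()
    n = len(tokens)
    # Assumption: select at least two positions so a swap is possible.
    k = max(2, round(n * ratio)) if n >= 2 else 0
    positions = rng.sample(range(n), k)
    permuted = positions[:]
    rng.shuffle(permuted)
    out = list(tokens)
    for src, dst in zip(positions, permuted):
        out[dst] = tokens[src]          # move token from src to dst
    labels = [0] * n
    for p in positions:
        labels[p] = 1
    return out, labels
```

Because only the selected positions are permuted among themselves, the output is always a rearrangement of the input, and every unselected position keeps its original token.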

Random Word Detection (RANDOM): We now consider replacing tokens with out-of-sequence tokens. For this purpose we propose RANDOM, a pretraining objective which replaces 15% of tokens with random ones from the vocabulary. Similar to shuffling tokens in the input, we expect that replacing a token in the input with a random word from the vocabulary "forces" the model to acquire both syntactic and semantic knowledge from the context to base its decision on whether it has been replaced or not.
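A comparable sketch for RANDOM is given here. The name `replace_with_random` and the uniform draw from `vocab` are assumptions made for illustration; in particular, this sketch does not prevent a draw from coinciding with the original token, and the paper does not state whether such cases are filtered.

```python
import random

def replace_with_random(tokens, vocab, ratio=0.15, rng=None):
    """Replace a random `ratio` of tokens with uniform draws from `vocab`.

    Returns (new_tokens, labels) with labels[i] = 1 for replaced positions.
    """
    rng = rng or random.Random()
    n = len(tokens)
    positions = set(rng.sample(range(n), round(n * ratio)))
    out, labels = [], []
    for i, tok in enumerate(tokens):
        if i in positions:
            out.append(rng.choice(vocab))   # out-of-sequence replacement
            labels.append(1)
        else:
            out.append(tok)
            labels.append(0)
    return out, labels
```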
Manipulated Word Detection (SHUFFLE + RANDOM): For our third pretraining objective, we seek to increase the task difficulty and subsequently aim to improve the text representations learned by the model. We therefore propose an extension of SHUFFLE and RANDOM, which is a three-way token-level classification task for predicting whether a token is a shuffled token, a random token, or an original token. For each sample, we replace 10% of tokens with shuffled ones from the same sequence and another 10% of tokens with random ones from the vocabulary. This task can be considered as a more complex one, because the model must recognize the difference between tokens replaced in the same context and tokens replaced outside of the context. For this task we use the cross-entropy loss averaged over all input tokens.
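The three-way labeling can be illustrated with the sketch below. It assumes the two 10% sets are disjoint and drawn in a single sampling step, and it uses hypothetical label ids (0 = shuffled, 1 = random, 2 = original); neither detail is confirmed by the text.

```python
import random

# Hypothetical label ids for the three classes.
SHUFFLED, RANDOM, ORIGINAL = 0, 1, 2

def manipulate(tokens, vocab, ratio=0.10, rng=None):
    """Shuffle `ratio` of positions and randomize another disjoint `ratio`."""
    rng = rng or random.Random()
    n = len(tokens)
    k = round(n * ratio)
    chosen = rng.sample(range(n), 2 * k)    # distinct positions, so the sets are disjoint
    shuf_pos, rand_pos = chosen[:k], chosen[k:]
    out = list(tokens)
    labels = [ORIGINAL] * n
    # Shuffle: permute the tokens among the first k chosen positions.
    permuted = shuf_pos[:]
    rng.shuffle(permuted)
    for src, dst in zip(shuf_pos, permuted):
        out[dst] = tokens[src]
        labels[dst] = SHUFFLED
    # Random: overwrite the remaining k positions with vocabulary draws.
    for p in rand_pos:
        out[p] = rng.choice(vocab)
        labels[p] = RANDOM
    return out, labels
```

With a 20-token input this yields two shuffled, two random, and sixteen original labels, so the classes are heavily imbalanced toward "original", as they are in the binary tasks.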
Masked Token Type Classification (TOKEN TYPE): Our fourth objective is a four-way classification task that predicts whether a token is a stop word, a digit, a punctuation mark, or a content word. The task can therefore be seen as a simplified version of POS tagging. We regard any token not included in the first three categories as a content word. We mask 15% of tokens in each sample with a special [MASK] token and compute the cross-entropy loss over the masked tokens only, so that the task does not become trivial. For example, if we computed the token-level loss over unmasked tokens as well, a model could easily recognize the four categories, as the vocabulary contains only a small number of non-content words.
Implementation Details: We pretrain and fine-tune our models with two NVIDIA Tesla V100 (SXM2-32GB) GPUs with a batch size of 32 for BASE and 64 for MEDIUM and SMALL. We pretrain all our models for up to five days each due to limited access to computational resources and funds for running experiments. We save a checkpoint of each model every 24 hours.
Evaluation: We evaluate our approaches on the GLUE (Wang et al., 2019) and SQUAD (Rajpurkar et al., 2016) benchmarks. To measure performance in downstream tasks, we fine-tune all models five times, each with a different random seed.
Baseline: For comparison, we also pretrain models with MLM. Following BERT and ROBERTA, we select 15% of tokens in each training instance for masking; of these, 80% are replaced with [MASK], 10% are replaced with a random word, and the rest remain unchanged. We compute the cross-entropy loss averaged over the masked tokens only.
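The 80/10/10 corruption used by the MLM baseline can be sketched as below. The function name `mlm_corrupt` and the use of `None` to mark positions excluded from the loss are assumptions for illustration; real implementations typically operate on token ids rather than strings.

```python
import random

MASK = "[MASK]"

def mlm_corrupt(tokens, vocab, ratio=0.15, rng=None):
    """Select `ratio` of positions; mask 80%, randomize 10%, keep 10%.

    Returns (new_tokens, targets). targets[i] holds the original token at
    selected positions and None elsewhere (None = excluded from the loss).
    """
    rng = rng or random.Random()
    n = len(tokens)
    selected = rng.sample(range(n), round(n * ratio))
    out = list(tokens)
    targets = [None] * n
    for p in selected:
        targets[p] = tokens[p]
        r = rng.random()
        if r < 0.8:
            out[p] = MASK                   # 80%: replace with [MASK]
        elif r < 0.9:
            out[p] = rng.choice(vocab)      # 10%: replace with a random word
        # remaining 10%: leave the token unchanged
    return out, targets
```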

Results
Performance Comparison: Table 1 presents results on GLUE and SQUAD for our five pretraining tasks compared to MLM across all model configurations (§3). We also include for reference our replicated downstream performance obtained by fine-tuning BERT-BASE (MLM + NSP) pretrained for 40 epochs (Upper Bound). We first observe that our best objective, Shuffle + Random, outperforms MLM on GLUE Avg. and SQUAD in the majority of model settings (BASE, MEDIUM and SMALL) with five days of pretraining. For example, in GLUE we obtain an average of 79.2 using Shuffle + Random with BERT-BASE compared to 77.6 using MLM. This suggests that Shuffle + Random can be a competitive alternative to MLM (Figure 2 (a)). If we take a closer look, we can also see that Shuffle + Random obtains higher performance than MLM across all model configurations when training for a similar number of epochs, suggesting that our approach is a more data-efficient task. Finally, we can also assume that Shuffle + Random is more challenging than MLM, as in all settings it results in lower GLUE scores after the first day of pretraining (Figure 2 (b)). However, with more iterations it learns better text representations and quickly outperforms MLM. For example, it achieves a performance of 78.2 compared to 76.1 for MLM with MEDIUM on the fifth day. Regarding the remainder of our proposed objectives, we can see that they perform comparably to, and sometimes better than, MLM under the SMALL and MEDIUM model settings. However, MLM on average outperforms them in the BASE setting, where the models are more highly parameterized. Lastly, we observe that for the majority of GLUE tasks, we obtain better or comparable performance to MLM with a maximum of approximately three epochs of training with a BASE model. This suggests that excessively long and computationally inefficient pretraining strategies add little to downstream performance.

Discussion
Based on our results, there are two key elements that should be considered when designing pretraining objectives.
Task Difficulty: A pretraining task should be moderately difficult to learn in order to induce rich text representations. For example, we can assume from the results that Token Type was somewhat easy for a model to learn, as it is a four-way classification of identifying token properties. Besides, in our preliminary experiments, predicting whether a masked token is a stop word or not (Masked Stop Word Detection) also did not exhibit competitive downstream performance to MLM, as the task is a lot simpler than Token Type.
Robustness: A model should always learn useful representations from "every" training sample to solve a pretraining task, regardless of the task difficulty. For instance, Figures 3 to 5 in Appendix D demonstrate that Shuffle needs some time to start converging across all model configurations, which means the model struggled to acquire useful representations at first. In contrast, the loss for Shuffle + Random consistently decreases. Because Shuffle + Random is a multi-class classification, unlike Shuffle or Random, we assume that it can convey richer signals to the model and help stabilize pretraining. Finally, we can also assume that MLM satisfies both elements, as it is a multi-class setting over the entire vocabulary and its loss consistently decreases.

Conclusions
We have proposed five simple self-supervised pretraining objectives and tested their effectiveness against MLM under various model settings. We show that our best-performing task, manipulated word detection, results in comparable performance to MLM in GLUE and SQUAD, whilst also being significantly faster in smaller model settings. We also show that our tasks result in higher performance when trained for the same number of epochs as MLM, suggesting higher data efficiency. For future work, we are interested in exploring which has the greater impact on pretraining: the data or the pretraining objective.

A Task Details
Here, we detail our frustratingly simple pretraining objectives, which are based on token-level classification tasks and can be used on any unlabeled corpora without laborious preprocessing to obtain labels for self-supervision.
Shuffled Word Detection (SHUFFLE): Our first pretraining task is a token-level binary classification task, which consists of identifying whether a token in the input sequence has been shuffled or not. For each sample, we randomly shuffle 15% of the tokens. This task is trained with the token-level binary cross-entropy loss averaged over all input tokens:
$$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log p(x_i) + (1 - y_i) \log\left(1 - p(x_i)\right) \right] \quad (1)$$
where N is the number of tokens in a sample, p(x_i) represents the probability of the i-th input token x_i predicted as shuffled by a model, and y_i is the corresponding target label. This task is motivated by the success of ELECTRA, whose pretraining task is to let a discriminator predict whether a given token is original or replaced (replaced word detection) in addition to MLM. The major difference between our approach and ELECTRA is that we do not rely on MLM, whereas ELECTRA utilizes it as its generator. Here, our intuition is that a model should acquire both syntactic and semantic knowledge to detect shuffled tokens in context.
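As a worked illustration of the averaged binary cross-entropy described above, the following sketch computes the loss from per-token probabilities in plain Python. The helper name `binary_cross_entropy` and the clipping constant `eps` are assumptions added for numerical safety; they are not part of the paper.

```python
import math

def binary_cross_entropy(probs, labels, eps=1e-12):
    """Average binary cross-entropy over all N tokens of one sample.

    probs[i]  : predicted probability that token i is shuffled, p(x_i)
    labels[i] : target label y_i (1 = shuffled, 0 = original)
    """
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)     # avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(probs)
```

A model that outputs 0.5 for every token incurs a loss of log 2 regardless of the labels, which is a useful reference point when reading the early part of a loss curve.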

Random Word Detection (RANDOM): We also consider replacing tokens with out-of-sequence tokens. For this purpose we propose RANDOM, a pretraining objective which replaces 15% of tokens with random ones from the vocabulary. Similar to shuffling tokens in the input, we expect that replacing a token in the input with a random word from the vocabulary "forces" the model to acquire both syntactic and semantic knowledge from the context to base its decision on whether it has been replaced or not. This task is trained with the token-level binary cross-entropy loss averaged over all input tokens (Eq. (1)).

Manipulated Word Detection (SHUFFLE + RANDOM): Our third task is a three-way token-level classification of whether a token is a shuffled token, a random token, or an original token. For each sample, we replace 10% of tokens with shuffled ones and another 10% of tokens with random ones. This task is an extension of SHUFFLE and RANDOM and can be regarded as a more complex one because the model must recognize the difference between a token replaced in the same context and a token replaced outside of the context. For this task we employ the cross-entropy loss averaged over all input tokens:
$$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{3} y_{ij} \log p_{ij}(x_i) \quad (2)$$
where p_ij(x_i) represents the probability of the i-th input token x_i predicted as shuffled (j = 1), randomized (j = 2), or original (j = 3) by a model, and y_ij is the corresponding target label.
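The multi-class loss for this task can be sketched in the same style. Since y_ij is one-hot, the inner sum over classes reduces to the log-probability of the true class, which is what the code below computes. The helper name `cross_entropy`, the 0-based class indices, and the clipping constant are assumptions for illustration.

```python
import math

def cross_entropy(probs, targets, eps=1e-12):
    """Average multi-class cross-entropy over all N tokens of one sample.

    probs[i][j] : predicted probability of class j for token i, p_ij(x_i)
    targets[i]  : 0-based index of the true class for token i
    """
    total = 0.0
    for row, t in zip(probs, targets):
        total += math.log(max(row[t], eps))     # one-hot y_ij picks out class t
    return -total / len(probs)
```

A uniform prediction over the three classes gives a loss of log 3, the corresponding chance-level reference for this task.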
Masked Token Type Classification (TOKEN TYPE): Our fourth task is a four-way classification task that identifies whether a token is a stop word (see footnote 7), a digit, a punctuation mark, or a content word. We regard any token not included in the first three categories as a content word. We mask 15% of tokens in each sample with a special [MASK] token and compute the cross-entropy loss over the masked tokens only, so that the task does not become trivial: if we computed the token-level loss over unmasked tokens as well, a model could easily recognize the four categories, as there are only a small number of non-content-word tokens. In this task, a model should be able to identify the distinction between different types of tokens; therefore, the task can be seen as a simplified version of POS tagging.
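Label generation for TOKEN TYPE might look like the sketch below. The tiny `STOP_WORDS` set is a hypothetical stand-in for the NLTK stop word list, the label ids and the order in which categories are tested are assumptions, and the sketch treats tokens as whole words, ignoring subword segmentation.

```python
import string

# Hypothetical miniature stop word list; stands in for the NLTK list.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}

# Hypothetical label ids for the four classes.
STOP, DIGIT, PUNCT, CONTENT = 0, 1, 2, 3

def token_type(token):
    """Map a token to one of the four TOKEN TYPE classes."""
    if token.lower() in STOP_WORDS:
        return STOP
    if token.isdigit():
        return DIGIT
    if token and all(ch in string.punctuation for ch in token):
        return PUNCT
    return CONTENT      # anything not in the first three categories

def masked_targets(tokens, masked_positions):
    """Keep class labels at masked positions only; None = excluded from the loss."""
    masked = set(masked_positions)
    return [token_type(t) if i in masked else None for i, t in enumerate(tokens)]
```

Restricting targets to masked positions reflects the point made above: the class of a visible token is a fixed property of that token, so including unmasked positions would reduce the task to a lookup.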

Masked First Character Prediction (FIRST CHAR): Our last task is a 29-way classification task, where a model needs to predict the first character of a masked token. The 29 categories include the English alphabet (0 to 25), a digit (26), a punctuation mark (27), or any other character (28). We mask 15% of tokens in each sample and compute the cross-entropy loss over the masked ones only. This task can be seen as a simplified version of MLM, as the model just needs to predict the first character of each masked token. Besides, it is also similar to masked character-level language modeling, in that the output of both tasks is in characters.
Footnote 7: The stop word category is based on the Natural Language Toolkit's stop word list: https://www.nltk.org/.
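The 29-way target for FIRST CHAR can be derived from a token as sketched below. Case folding (treating 'C' and 'c' as the same class), restricting letters and digits to ASCII, and assigning empty tokens to the "other" class are assumptions; the text only specifies the class ranges.

```python
import string

def first_char_class(token):
    """Map a token to its FIRST CHAR class id in the range 0..28."""
    ch = token[0].lower() if token else ""
    if ch and ch in string.ascii_lowercase:
        return ord(ch) - ord("a")       # English alphabet: 0 to 25
    if ch and ch in string.digits:
        return 26                       # digit
    if ch and ch in string.punctuation:
        return 27                       # punctuation mark
    return 28                           # any other character
```

For the example used in the paper, 'cat' maps to class 2 and 'sat' to class 18, so the two words receive different targets even though they tend to appear in similar contexts.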

B Non-linguistically Intuitive Task
As described in Section 2, a non-linguistically intuitive task is one whose solution is not "explicitly" related to the input sequence, unlike linguistically intuitive tasks such as Shuffle and Random. For example, predicting the first character of a masked token should not matter for a model to learn that 'cat' and 'sat' usually appear in the same context. However, because accurately predicting the first character requires the model to "implicitly" guess the whole word from its surrounding tokens, the first character of each masked token is in fact related to the context. The deep architecture of transformer-based models should allow them to learn such "implicit" associations between input tokens by solving the non-linguistically intuitive task, which in turn helps them learn syntactic and semantic relations between tokens.

C.2 Data
Following Devlin et al. (2019), we use the English Wikipedia and BookCorpus (Zhu et al., 2015) data (WikiBooks) downloaded from the datasets library (see footnote 8). We remove headers for the English Wikipedia and extract training samples with a maximum length of 512. For the BookCorpus, we concatenate sentences such that the total number of tokens is less than 512. For the English Wikipedia, we extract one sample from articles whose length is less than 512. We tokenize text using byte-level Byte-Pair-Encoding (Sennrich et al., 2016). The resulting corpus consists of 8.1 million samples and 2.7 billion tokens in total.
Footnote 8: https://github.com/huggingface/datasets
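The BookCorpus concatenation step can be sketched as a greedy packing routine. This is one plausible reading of the description, not the authors' preprocessing code: the name `pack_sentences`, the greedy strategy, and the truncation of any single over-long sentence are assumptions, and special tokens are not counted toward the limit.

```python
def pack_sentences(sentences, max_len=512):
    """Greedily concatenate tokenized sentences into samples shorter than max_len.

    sentences : list of token lists, in document order
    Returns a list of samples, each a token list with len(sample) < max_len.
    """
    samples, current = [], []
    for sent in sentences:
        sent = sent[: max_len - 1]      # assumption: truncate over-long sentences
        # Start a new sample if appending would reach or exceed the limit.
        if current and len(current) + len(sent) >= max_len:
            samples.append(current)
            current = []
        current = current + sent
    if current:
        samples.append(current)
    return samples
```

For instance, three sentences of 300, 300, and 100 tokens are packed into two samples of 300 and 400 tokens.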
We pretrain our models with two NVIDIA Tesla V100 (SXM2 -32GB) and use one for finetuning.
Pretraining: We set the batch size to 32 for the BASE models and 64 for the MEDIUM and SMALL models. We pretrain models for five days and optimize them with the Adam optimizer (Kingma and Ba, 2014). We apply automatic mixed precision and distributed training during pretraining. Note that we generate labels dynamically during pretraining.
Finetuning: We fine-tune models for up to 10 and 20 epochs with early stopping for SQUAD and GLUE, respectively. To minimize the effect of random seeds, we test five different random seeds for each task. We omitted the problematic WNLI task for GLUE, following Aroca-Ouellette and Rudzicz (2020).

C.4 Hyperparameter Details
As explained in Section 3, we followed the BERT architecture entirely and only modified its output layer depending on the task employed. Table 2 shows the hyperparameter settings for pretraining and fine-tuning. Note that we used neither parameter-sharing tricks nor any technique that did not appear in Devlin et al. (2019).