Training ELECTRA Augmented with Multi-word Selection

Pre-trained text encoders such as BERT and its variants have recently achieved state-of-the-art performance on many NLP tasks. While effective, these pre-training methods typically demand massive computation resources. To accelerate pre-training, ELECTRA trains a discriminator that predicts whether each input token has been replaced by a generator. However, this new task, being a binary classification, is less semantically informative. In this study, we present a new text encoder pre-training method that improves ELECTRA based on multi-task learning. Specifically, we train the discriminator to simultaneously detect replaced tokens and select original tokens from candidate sets. We further develop two techniques to effectively combine all pre-training tasks: (1) using attention-based networks for task-specific heads, and (2) sharing bottom layers of the generator and the discriminator. Extensive experiments on GLUE and SQuAD datasets demonstrate both the effectiveness and the efficiency of our proposed method.


Introduction
Contextualized representations from pre-trained text encoders have shown great power for improving many NLP tasks (Rajpurkar et al., 2016; Wang et al., 2019b,a; Liu and Lapata, 2019). Most pre-trained encoders, despite their variety, follow BERT (Devlin et al., 2019) and adopt the masked language modeling (MLM) pre-training task, which trains the model to recover the identities of a small subset of masked tokens. Although more effective than conventional left-to-right language model pre-training (Peters et al., 2018; Radford et al., 2018) because they capture bidirectional information, MLM-based approaches (Liu et al., 2019b; Joshi et al., 2019) can only learn from the masked tokens, which are typically just 15% of all tokens in the input sentences.
To address this low sample efficiency, ELECTRA (Clark et al., 2020a) proposes a new pre-training task. Specifically, it corrupts a sentence by replacing some tokens with plausible alternatives sampled from a generator and trains a discriminator to predict whether each token in the corrupted sentence is replaced or not. After pre-training ends, it throws away the generator and exports the discriminator for downstream applications. As the discriminator can learn from all input tokens, ELECTRA is more sample efficient than previous MLM-based methods. However, follow-up studies (Xu et al., 2020; Aroca-Ouellette and Rudzicz, 2020) find that this new replaced token detection task, being a binary classification, is often too simple. As a result, the discriminator output representations are insufficiently trained and encode inadequate semantic information.
In this work, we propose a novel text encoder pre-training method, TEAMS, which stands for "Training ELECTRA Augmented with Multi-word Selection". Like ELECTRA, our method consists of a generator and a discriminator, but they are equipped with different pre-training tasks. For each masked position in the input sentence, the generator replaces the original token with an alternative token and samples a candidate set that consists of the original token and K other non-original ones. Then, we train the discriminator to simultaneously perform two tasks: (1) a multi-word selection task in which the discriminator learns to select the original token from the sampled candidate set, and (2) a replaced token detection task defined similarly to the one in ELECTRA. The first task, as a (K + 1)-way classification on the masked positions, pushes the discriminator to differentiate ground truth tokens from the negative non-original ones. At the same time, the second task, with reduced task complexity, lets the discriminator retain the same sample efficiency as ELECTRA.
To further improve the performance and efficiency of our method, we introduce two techniques. The first is to use attention-based task-specific heads for discriminator multi-task pre-training. Different from previous studies (Liu et al., 2019a; Sun et al., 2020) that pass the last encoder layer outputs to different task heads, our method directly incorporates task-specific attention layers into the discriminator encoder. Such a design offers higher flexibility in capturing task-specific token dependencies in a sequence and leads to a significant performance boost. The second technique is to share the bottom layers of the generator and the discriminator. This technique reduces the number of parameters, saves computation, and serves as a form of regularization that stabilizes training and helps generalization.
Combining these techniques, we train models of various sizes and test their performance on the GLUE natural language understanding benchmark (Wang et al., 2019b) and the SQuAD question answering benchmark (Rajpurkar et al., 2016). We show that TEAMS substantially outperforms previous MLM-based methods and ELECTRA, given the same model size and pre-training data. For example, our base-sized model, achieving an 84.51 SQuAD 2.0 F1 score, outperforms BERT and ELECTRA by 8.34 and 2.99, respectively. Moreover, TEAMS-Base can outperform ELECTRA-Base++ using a fraction of the computation.
Contributions. The major contributions of this paper are summarized as follows: (1) We propose a new text encoder pre-training method, TEAMS, that simultaneously learns a generator and a discriminator using multi-task learning. (2) We develop two techniques, attention-based task-specific heads and partial layer sharing, to further improve TEAMS performance. (3) We conduct extensive experiments to verify the effectiveness of TEAMS on the GLUE and SQuAD benchmarks.¹

Background
In this section, we first discuss some related studies on pre-training text encoders. Then, we introduce our notation and describe ELECTRA in detail.

¹ Code and pre-trained model weights are available at https://github.com/tensorflow/models/tree/master/official/nlp/projects/teams.

Text Encoder Pre-training
Current state-of-the-art natural language processing systems often rely on a text encoder to generate contextualized representations. This text encoder is commonly pre-trained on massive unlabeled corpora using different self-supervised tasks. Peters et al. (2018) and Radford et al. (2018) pre-train either an LSTM or a Transformer (Vaswani et al., 2017) using the standard language modeling task. To further improve pre-trained models, more effective pre-training objectives have been developed, including masked language modeling and next sentence prediction in BERT (Devlin et al., 2019), permutation language modeling in XLNet, masked span prediction in SpanBERT (Joshi et al., 2019), sentence order prediction in StructBERT, and more.
Most pre-training methods demand massive amounts of computation, which limits their accessibility and raises concerns about their environmental costs. To alleviate this issue, Gong et al. (2019) and Yang et al. (2020) propose to accelerate BERT training by progressively stacking a shallow model into a deep model. Gu et al. (2020) extend this idea by growing a low-cost model in different dimensions. Along another line of work, Clark et al. (2020a) propose a new pre-training task, named replaced token detection, that trains a text encoder to distinguish real input tokens from synthetically generated replacements. Compared to BERT-style MLM pre-training, in which only 15% of tokens are utilized, ELECTRA can leverage all tokens in the input sentences and thus achieves better sample efficiency. Following this idea, Xu et al. (2020) propose a new pre-training task based on the multi-choice cloze test with a rejection option, and Clark et al. (2020b) connect ELECTRA with cloze modeling and pre-train the text encoder as an energy-based cloze model. As our method is built upon ELECTRA, we discuss it in more detail below.

ELECTRA
ELECTRA jointly trains two models, a generator G and a discriminator D. Both models adopt the Transformer architecture as their backbones and map a sentence of n tokens $x = [x_1, \ldots, x_n]$ to the corresponding contextualized representations $h(x) = [h_1, \ldots, h_n]$.

The generator G is trained using the masked language modeling (MLM) task. Specifically, given an input sequence $x$, it first randomly selects a few masked positions and replaces the tokens at these positions with a special mask symbol [MASK]. We denote this masked sequence as $x^M$. Then, the generator learns to predict the original identities of the masked-out tokens by minimizing the following MLM loss:

$$\mathcal{L}_{\text{MLM}} = \mathbb{E}\left( \sum_{i:\, x^M_i = \text{[MASK]}} -\log P_G(x_i \mid x^M) \right), \tag{1}$$

where $P_G(x_i \mid x^M)$ is the probability that G predicts token $x_i$ appears in the masked position $i$ in $x^M$, and the expectation is taken over the random draw of masked positions. More specifically, the generator calculates $P_G(x_i \mid x^M)$ by passing the contextualized representations $h_G(x^M)$ into a softmax layer as follows:

$$P_G(x_i \mid x^M) = \frac{\exp\left(e(x_i)^\top h_G(x^M)_i\right)}{\sum_{x' \in V} \exp\left(e(x')^\top h_G(x^M)_i\right)}, \tag{2}$$

where $e(x_i)$ is the embedding of token $x_i$ and $V$ denotes the vocabulary of all tokens. Finally, for each masked position $i$, the generator samples one token $\hat{x}_i \sim P_G(\cdot \mid x^M)$ and replaces the original token $x_i$ with $\hat{x}_i$. We use $x^R$ to denote this corrupted sentence with replaced tokens.

The discriminator D learns to perform the replaced token detection (RTD) task, which requires a model to predict whether each token in $x^R$ is replaced or not. In particular, ELECTRA adopts a sigmoid layer, on top of the discriminator output contextualized representations $h_D(x^R)$, to decide the probability that token $x^R_i$ matches the original token $x_i$ as follows:

$$P_D(x^R_i = x_i) = \text{sigmoid}\left(w^\top h_D(x^R)_i\right), \tag{3}$$

where $w$ is a learnable parameter. The loss on D is then defined as follows:

$$\mathcal{L}_{\text{RTD}} = \mathbb{E}\left( \sum_{i=1}^{n} -\mathbb{1}(x^R_i = x_i) \log P_D(x^R_i = x_i) - \mathbb{1}(x^R_i \neq x_i) \log\left(1 - P_D(x^R_i = x_i)\right) \right). \tag{4}$$

Finally, the generator and discriminator are jointly learned based on the losses in Eq. (1) and Eq. (4). After pre-training, ELECTRA throws out the generator and keeps only the discriminator for fine-tuning on downstream tasks.
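To make the replaced token detection objective described above concrete, the following minimal Python sketch derives per-token labels from an original and a corrupted sentence and evaluates the binary cross-entropy for that one sentence. The helper names (`rtd_labels`, `rtd_loss`), the toy sentence, and the discriminator scores are illustrative assumptions and are not taken from the ELECTRA implementation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def rtd_labels(original, corrupted):
    """Label 1 if the token at a position equals the original token, else 0."""
    return [1 if o == c else 0 for o, c in zip(original, corrupted)]

def rtd_loss(logits, labels):
    """Binary cross-entropy summed over all positions of one sentence.

    logits[i] plays the role of w^T h_D(x^R)_i, so sigmoid(logits[i]) is the
    predicted probability that token i is the original one.
    """
    loss = 0.0
    for z, y in zip(logits, labels):
        p = sigmoid(z)
        loss -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return loss

# Hypothetical sentence; assume positions 2, 4, and 6 were masked and the
# generator happened to re-sample the original token "sold" at position 4.
original = ["the", "famous", "artist", "sold", "the", "painting"]
corrupted = ["the", "old", "artist", "sold", "the", "car"]
labels = rtd_labels(original, corrupted)

logits = [3.0, -2.0, 2.5, 1.0, 3.0, -1.5]  # hypothetical discriminator scores
loss = rtd_loss(logits, labels)
```

Note that labels are derived by comparing tokens, so a position where the generator samples the original token counts as not replaced. Every position contributes to the loss, which is the source of the sample efficiency discussed above.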

The TEAMS Method
In this section, we first introduce a new pre-training task named "multi-word selection". Then, we present our TEAMS method with two techniques for performance improvements.

Multi-word Selection Task
To train a model on an input sequence $x = [x_1, \ldots, x_n]$ using the multi-word selection task, we first choose a random set of positions in this sequence, denoted as $\{i_1, \ldots, i_m\}$, where m is an integer between 1 and n. Then, for each chosen position $i_j$, $j \in \{1, \ldots, m\}$, we replace token $x_{i_j}$ with another token $\hat{x}_{i_j}$ and create a candidate set $S_{i_j}$ that includes the original token $x_{i_j}$ and K non-original ones. Following ELECTRA, we use $x^R$ to denote the corrupted sentence with all tokens in the chosen positions replaced. Finally, the model takes the corrupted sentence as input and outputs a probability for selecting the original token $x_{i_j}$ from the candidate set $S_{i_j}$ as follows:

$$P(x_{i_j} \mid x^R, S_{i_j}) = \frac{\exp\left(e(x_{i_j})^\top h(x^R)_{i_j}\right)}{\sum_{x' \in S_{i_j}} \exp\left(e(x')^\top h(x^R)_{i_j}\right)}, \tag{5}$$

where $h(x^R)_{i_j}$ is the contextualized representation of token $\hat{x}_{i_j}$ from the model outputs.

Figure 1 shows a concrete example wherein a sequence of 6 tokens is given and its 2nd, 4th, and 6th positions are chosen to be masked. Taking the 2nd position as an example, the generator replaces the original token $x_{i_1}$ = "famous" with another token $\hat{x}_{i_1}$ = "old" and generates the candidate set $S_{i_1}$ = {"top", "young", "french", "famous"}, which includes the original token $x_{i_1}$ and K = 3 non-original alternatives.
We may view the multi-word selection task as a simplification of masked language modeling and a generalization of replaced token detection. Because the model selects the correct word from a candidate set rather than from the entire vocabulary, it requires less computation than masked language modeling. At the same time, being essentially a (K + 1)-way classification problem, the multi-word selection task is more challenging than the replaced token detection task (which is a binary classification problem) and thus pushes the model to learn more semantic representations. We describe how to generate the candidate set and present our entire method below.
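As a rough illustration of the selection probability described above, the sketch below applies a softmax over the dot products between candidate embeddings and the contextualized representation at one chosen position, restricted to the candidate set. The function name `mws_probs`, the 3-dimensional embeddings, and the hidden vector are hypothetical values chosen only for the example; the candidate tokens mirror the Figure 1 example.

```python
import math

def mws_probs(hidden, embeddings, candidates):
    """Softmax over e(x')^T h, restricted to the tokens in the candidate set."""
    scores = [sum(e * h for e, h in zip(embeddings[tok], hidden))
              for tok in candidates]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return {tok: v / total for tok, v in zip(candidates, exps)}

# Hypothetical token embeddings and hidden state (assumptions for illustration).
embeddings = {
    "famous": [0.9, 0.1, 0.3],
    "top": [0.2, 0.8, 0.1],
    "young": [0.1, 0.2, 0.7],
    "french": [0.4, 0.4, 0.2],
}
hidden = [1.0, 0.0, 0.5]

# K = 3 non-original tokens plus the original token "famous".
candidates = ["top", "young", "french", "famous"]
probs = mws_probs(hidden, embeddings, candidates)

# Per-position multi-word selection loss: negative log-probability of the original.
mws_loss = -math.log(probs["famous"])
```

The normalization runs over K + 1 entries instead of the full vocabulary, which is where the computational saving relative to masked language modeling comes from.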

Multi-task Learning in TEAMS
In TEAMS, we jointly train two transformer encoders, one as the generator network G and the other as the discriminator network D. Given a masked sequence $x^M$, we use the generator G to perform two tasks for each masked position $i_j$ in this sequence. First, similar to ELECTRA, we sample one token $\hat{x}_{i_j} \sim P_G(\cdot \mid x^M)$ (c.f. Eq. (2)) and replace the original token $x_{i_j}$ with it. Second, we sample K non-original tokens from the generator output distribution and combine them with the original token $x_{i_j}$ to form the candidate set $S_{i_j}$.² Finally, we learn the generator G using the standard masked language modeling task (c.f. Eq. (1)).

Figure 1: An overview of the TEAMS framework. For each masked position, the generator replaces its original token with a new one and outputs a candidate set consisting of the original token and K other possible alternatives. The discriminator takes the corrupted sentence as input and learns to (1) predict for every token whether it is replaced or not and (2) select the original token from the candidate set for each masked position.
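On the discriminator side, we train the discriminator network D using two tasks: the replaced token detection (RTD) task and the multi-word selection (MWS) task. Given a corrupted sentence $x^R$ of length n, the discriminator generates two sets of contextualized representations, $\{h^{\text{RTD}}_D(x^R)_i\}_{i=1}^{n}$ and $\{h^{\text{MWS}}_D(x^R)_i\}_{i=1}^{n}$, one for each pre-training task. For each position $i \in \{1, \ldots, n\}$, we use $h^{\text{RTD}}_D(x^R)_i$ to calculate the probability that the token $x^R_i$ matches the original token $x_i$ (i.e., is not replaced), in the same form as Eq. (3):

$$P_D(x^R_i = x_i) = \text{sigmoid}\left(w^\top h^{\text{RTD}}_D(x^R)_i\right), \tag{6}$$

and optimize the same RTD loss defined in Eq. (4). For each masked position $i_j$, $j \in \{1, \ldots, m\}$, we obtain the candidate set $S_{i_j}$ from the generator outputs and use $h^{\text{MWS}}_D(x^R)_{i_j}$ to compute the probability of selecting the correct original token $x_{i_j}$ from this candidate set as follows:

$$P_D(x_{i_j} \mid x^R, S_{i_j}) = \frac{\exp\left(e(x_{i_j})^\top h^{\text{MWS}}_D(x^R)_{i_j}\right)}{\sum_{x' \in S_{i_j}} \exp\left(e(x')^\top h^{\text{MWS}}_D(x^R)_{i_j}\right)}. \tag{7}$$

As the multi-word selection task is a multi-class classification problem, we define its loss function as follows:

$$\mathcal{L}_{\text{MWS}} = \mathbb{E}\left( \sum_{j=1}^{m} -\log P_D(x_{i_j} \mid x^R, S_{i_j}) \right), \tag{8}$$

where $\{S_{i_j}\}_{j=1}^{m}$ is the collection of candidate sets at all masked positions. Finally, we learn TEAMS by optimizing a combined loss as follows:

$$\mathcal{L} = \mathcal{L}_{\text{MLM}} + \lambda_1 \mathcal{L}_{\text{RTD}} + \lambda_2 \mathcal{L}_{\text{MWS}}, \tag{9}$$

where $\lambda_1$ and $\lambda_2$ are two loss balancing hyper-parameters. For the example sequence in Figure 1, the discriminator needs to predict that the tokens in the 1st, 3rd, and 5th positions are not replaced and that the tokens in the 2nd, 4th, and 6th positions are replaced, and to select the tokens "famous", "sold", and "painting" in the 2nd, 4th, and 6th positions, respectively.

² More discussion of other possible negative sampling strategies is presented in the experiment section.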
After pre-training, we keep the discriminator network and fine-tune it for downstream applications.³

Attention-based Task-specific Heads. One remaining question is how to generate two sets of task-specific representations on the discriminator side. Previous studies (Liu et al., 2019a; Sun et al., 2020; Aroca-Ouellette and Rudzicz, 2020) achieve this goal by adding task-specific layers on top of each individual token, as shown in Figure 2 (Left). However, this approach does not model token dependencies within the task-specific layers.
In this work, we propose to use attention-based task-specific heads to capture global dependencies in sequences. In particular, we design this attention head to be one transformer layer (i.e., a self-attention block followed by a fully connected network with one hidden layer). Since our discriminator also uses a transformer model to obtain each token's task-agnostic representation, we can merge one task head into the discriminator backbone. From this perspective, we can generate different sets of task-specific representations as follows. First, we input the sequence to a transformer with L layers and retrieve the final layer output representations for one task. Then, we feed the output of an intermediate layer (e.g., the (L − 1)-th layer) into another transformer layer to obtain token representations for the second task.
Partial Layer Sharing. ELECTRA has shown that tying the embedding layers of the generator and the discriminator can help improve pre-training effectiveness. Our study confirms this observation and finds that sharing some transformer layers of the generator and the discriminator can further boost model performance. More specifically, we design the generator to have the same "width" (i.e., hidden size, intermediate size, and number of heads) as the discriminator and share the bottom half of all transformer layers between the generator and the discriminator.
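The routing described above can be sketched in a few lines of Python: the first L − 1 layers are shared, and their output is fed into two separate layers, one per task. The layer used here is a toy self-attention with identity projections and an averaged residual, chosen only so the example runs without external libraries; it is an assumption for illustration and not a full transformer layer (no learned projections, no feed-forward sublayer). The names `toy_attention_layer` and `two_head_forward` are hypothetical.

```python
import math

def toy_attention_layer(scale):
    """Build a toy self-attention layer; `scale` just makes layers differ."""
    def layer(tokens):
        dim = len(tokens[0])
        out = []
        for query in tokens:
            # Scaled dot-product scores of this token against every token.
            scores = [scale * sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                      for key in tokens]
            top = max(scores)
            weights = [math.exp(s - top) for s in scores]
            total = sum(weights)
            # Attention-weighted mix over the whole sequence.
            mixed = [sum(w * value[d] for w, value in zip(weights, tokens)) / total
                     for d in range(dim)]
            out.append([(q + m) / 2.0 for q, m in zip(query, mixed)])
        return out
    return layer

def two_head_forward(tokens, backbone, head_a, head_b):
    """Run shared layers 1..L-1, then fork into two task-specific layers."""
    hidden = tokens
    for layer in backbone:
        hidden = layer(hidden)
    return head_a(hidden), head_b(hidden)

L = 4  # hypothetical encoder depth
backbone = [toy_attention_layer(1.0) for _ in range(L - 1)]
head_a = toy_attention_layer(1.0)  # plays the role of layer L (first task)
head_b = toy_attention_layer(0.5)  # separate layer fed by layer L-1 (second task)

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
rep_a, rep_b = two_head_forward(tokens, backbone, head_a, head_b)
```

Because each head attends over the whole sequence, both sets of representations depend on every token, unlike a per-token feed-forward head. The total depth seen by either task remains L layers.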
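One way to picture partial layer sharing is as two layer stacks whose bottom entries are the same objects, so that any update to a shared layer affects both networks. The sketch below is a structural illustration only: the `Layer` class is a hypothetical stand-in for a transformer layer, and the depths (12 and 6) and the number of shared layers (3) are assumptions chosen for the example rather than the configuration used in the experiments.

```python
class Layer:
    """Hypothetical stand-in for a transformer layer that owns its parameters."""
    def __init__(self, name):
        self.name = name

def build_encoders(disc_depth, gen_depth, num_shared):
    """Return generator and discriminator stacks sharing their bottom layers."""
    shared = [Layer(f"shared_{i}") for i in range(num_shared)]
    generator = shared + [Layer(f"gen_{i}") for i in range(num_shared, gen_depth)]
    discriminator = shared + [Layer(f"disc_{i}") for i in range(num_shared, disc_depth)]
    return generator, discriminator

generator, discriminator = build_encoders(disc_depth=12, gen_depth=6, num_shared=3)

# Count distinct layer objects: 3 shared + 3 generator-only + 9 discriminator-only.
unique_layers = {id(layer) for layer in generator + discriminator}
```

With these assumed sizes, the two stacks hold 15 distinct layers instead of 18, which reflects the parameter saving mentioned above.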

Experiment Setups
Pre-training Datasets. We use two datasets for model pre-training: (1) WikiBooks, which consists of 3.3 billion tokens from English Wikipedia and BooksCorpus (Zhu et al., 2015) and is the same dataset used in BERT (Devlin et al., 2019), and (2) WikiBooks++, which is used for the large-sized models and the models with the "++" suffix.
Evaluation Datasets and Metrics. We evaluate all pre-trained models on the General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2019b) and the Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al., 2016). The GLUE benchmark includes various tasks formatted as either single sentence classification (SST, CoLA) or sentence pair classification (e.g., RTE, MNLI, QNLI, MRPC, QQP, STS). More details of each task are available in Appendix Section A. The SQuAD dataset requires models to select a text span from a given passage that answers a question. In SQuAD v1.1, the answers can always be located in the passage, while SQuAD v2.0 contains some questions that are unanswerable from the given passage. We compute Spearman correlation for STS, Matthews correlation for CoLA, and accuracy for all other GLUE tasks, and report the GLUE score as the average over all 8 tasks. For SQuAD, we use the standard evaluation metrics of Exact Match (EM) and F1 scores. Since different random seeds may significantly affect fine-tuned model performance, we report the median of 15 fine-tuning runs from the same pre-trained model checkpoint for each result. Unless stated otherwise, results are on the GLUE and SQuAD development sets.

We design the generator network size to be 1/2 of the discriminator network size for models of all sizes (c.f. Clark et al., 2020a). For TEAMS, we set the loss balancing parameters $\lambda_1 = 5$ and $\lambda_2 = 2$ (c.f. Eq. (9)), and the number of sampled non-original tokens K = 5 (c.f. Section 3.1). During pre-training, we set the batch size to 256 and the input sequence length to 512 for both small-sized and base-sized models. We update small-sized models for 500K steps and base-sized models for 1M steps on the WikiBooks dataset. Moreover, we test the performance of each model when it is pre-trained for a longer time with a larger batch size using the WikiBooks++ dataset. We use the suffix "small++" to denote a small-sized model pre-trained for 2M steps with batch size 256, and the suffix "base++" to denote a base-sized model pre-trained for 1M steps with batch size 1024. Finally, for large-sized models, we use batch size 2048 and pre-train the model for 1.76M steps. All large-sized models and models with the suffix "++" are trained using the WikiBooks++ dataset. More pre-training and fine-tuning details are included in Appendix Sections B and C.
Model Implementations. For fair comparison, we implement all compared methods in TensorFlow 2 and evaluate their performance using the official pipeline in TensorFlow Model Garden.⁴ In addition to our own implementations, we also report the performance of the publicly released ELECTRA checkpoints.⁵ All models are trained on TPU v3.
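The reporting protocol described in the evaluation paragraph above (a median over fine-tuning runs per task, then an unweighted average over tasks) can be written down as a short sketch. The function names and all scores below are hypothetical; the example uses three runs and two tasks purely to keep it small, whereas the protocol above uses 15 runs and 8 GLUE tasks.

```python
from statistics import mean, median

def reported_score(run_scores):
    """Median over fine-tuning runs started from the same checkpoint."""
    return median(run_scores)

def glue_average(runs_per_task):
    """Unweighted mean of the per-task reported scores."""
    return mean(reported_score(runs) for runs in runs_per_task.values())

# Hypothetical development-set scores, one list entry per fine-tuning run.
runs_per_task = {
    "CoLA": [60.0, 62.0, 61.0],
    "STS": [88.0, 90.0, 86.0],
}

per_task = {task: reported_score(runs) for task, runs in runs_per_task.items()}
average = glue_average(runs_per_task)
```

Using the median rather than the mean makes the reported number less sensitive to an occasional diverged fine-tuning run.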

Experiment Results
We validate the advantages of our proposed TEAMS method by comparing it with BERT (Devlin et al., 2019) and ELECTRA (Clark et al., 2020a). Table 1 shows the comparison results on GLUE and SQuAD datasets. We find that TEAMS can consistently outperform baseline models of the same size. For example, compared to ELECTRA-Base, our TEAMS-Base improves SQuAD 2.0 performance from 78.59 to 81.59 and from 81.52 to 84.51 in terms of EM and F1 score, respectively.
To further verify that the performance improvements do not come from consuming more computation, we draw the learning curves of TEAMS-Small/Base and ELECTRA-Small/Base in Figure 3. We observe that for both small-sized and base-sized models, our method can consistently outperform ELECTRA when trained for the same amount of time. Moreover, on the SQuAD datasets, TEAMS-Base can even outperform the ELECTRA-Base++ model, which requires much more computation.

Ablation Studies
We continue to evaluate the design of each component within TEAMS and test its sensitivity to some critical hyper-parameters.
Effectiveness of Pre-training Tasks. We report the results of small-sized models learned using different pre-training tasks in Table 2. First, we can see that the model trained with the multi-word selection (MWS) task can outperform the one learned using the masked language modeling (MLM) task. Second, on the SQuAD datasets, we find that pre-training on only the 15% of masked tokens using the MWS task is comparable with pre-training on all tokens using the replaced token detection (RTD) task. These observations demonstrate the effectiveness of our proposed MWS task. Finally, we show that a text encoder pre-trained using both the MWS and RTD tasks can outperform those learned using only a single task.

Table 2: Effectiveness of multi-task pre-training for small++ models. "MLM", "RTD", and "MWS" stand for "masked language modeling", "replaced token detection", and "multi-word selection", respectively.

Task-specific Layer Designs. In TEAMS, we pre-train the discriminator network using multi-task learning and introduce attention-based task-specific heads. To verify the effectiveness of these attention-based task-specific heads, we train another model that uses a traditional feed-forward network (FFN) as the task-specific head. Table 3 shows the results. We can see that our model achieves better performance because the attention-based heads can effectively model the token dependencies in sequences.

We continue to study where to add these task-specific heads. Currently, given a transformer with 12 layers, we use its last layer output for one task and feed the 11th layer output to a separate transformer layer to obtain representations for the second task.⁶ An alternative design is to add two separate transformer layers (as two task-specific heads) directly on top of the last layer (i.e., the 12th layer). As shown in Table 3, we find that the latter design can slightly improve the model performance on the SQuAD datasets but leads to a larger discriminator network with effectively 13 transformer layers and thus requires more computation during both the pre-training and fine-tuning stages.

Finally, as our discriminator network outputs two sets of contextualized representations, one for the MWS task and the other for the RTD task, we need to decide which set of representations to use in the fine-tuning stage. Empirically, we find that the representations for the MWS task have better fine-tuning performance than the ones for the RTD task, especially on the SQuAD datasets (c.f. Table 3). This observation also confirms the effectiveness of our proposed MWS task, as it produces representations capturing more fine-grained semantic information compared to the RTD task.

Table 3: Analysis of task-specific layers and exported representations for small++ models. Please refer to Section 4.3 for detailed descriptions of each method.

Partial Layer Sharing. Table 4 reports the results of our models with different levels of parameter sharing between the generator and the discriminator. First, we can see that tying all generator layers with discriminator layers results in significant performance drops, as such a binding restricts the model representation power. Second, we find that compared to no weight sharing, our design of partial layer tying can improve the model performance. One possible explanation is that such layer tying serves as an implicit form of regularization and forces the shared transformer layers to capture useful information for both the generator pre-training task (i.e., MLM) and the discriminator pre-training tasks (i.e., RTD and MWS).

Table 4: Effect of sharing generator and discriminator bottom layers for small++ and base++ models. "Full Tie" and "No Tie" stand for tying all or none of the generator layers with the discriminator, respectively.

Sampling Strategy and Negative Sample Size.
To use the multi-word selection task for pre-training, we first need to obtain a set of negative samples (i.e., non-original tokens) for each masked position in a sequence. In this study, we test two strategies to generate K negative samples for each masked position. Given the generator output probability distribution for a target position, we can either sample from this distribution K times without replacement or directly select the K non-original tokens with the highest probabilities. We denote these two approaches as "Sampled" and "Hardest", respectively, and report the results in Figure 5. First, we can see that performing repeated sampling is a better strategy than always selecting the hardest samples. One possible reason is that the "Sampled" strategy can generate more diverse negative samples and thus helps the model generalize.⁷ Second, we notice that increasing K beyond 5 somewhat hurts the model performance. One reason is that a larger K leads to a higher probability of including false negative examples. Finally, we find that for a wide range of K from 3 to 50, our method can outperform ELECTRA, which further demonstrates the effectiveness of the multi-word selection task.
Generator Size. We test how the size of the generator affects the model performance by varying the number of transformer layers in the generator. For all tested models, we tie the bottom half of the generator with the discriminator. Figure 4 reports the results. We find that the performance first increases as the generator size increases until it reaches about half of the discriminator size and then starts to decrease when we further increase the generator size. The same results also hold for base-sized models.

⁷ A similar result is also observed in Shen et al. (2019), and thus we adopt the "Sampled" approach in this study.
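The two negative sampling strategies compared above can be sketched as follows. The sketch assumes that the original token is excluded from the pool before drawing and that "without replacement" means a drawn token is removed from the pool before the next draw; both are assumptions about details the text does not spell out. The function names and the toy generator distribution are hypothetical.

```python
import random

def sampled_negatives(probs, original, k, rng):
    """Draw k distinct non-original tokens, weighted by generator probability."""
    pool = {tok: p for tok, p in probs.items() if tok != original}
    negatives = []
    for _ in range(min(k, len(pool))):
        tokens = list(pool)
        weights = [pool[tok] for tok in tokens]
        pick = rng.choices(tokens, weights=weights, k=1)[0]
        negatives.append(pick)
        del pool[pick]  # remove the pick so it cannot be drawn again
    return negatives

def hardest_negatives(probs, original, k):
    """Take the k non-original tokens with the highest generator probability."""
    ranked = sorted((tok for tok in probs if tok != original),
                    key=probs.get, reverse=True)
    return ranked[:k]

# Hypothetical generator output distribution at one masked position.
probs = {"famous": 0.40, "old": 0.20, "top": 0.15,
         "young": 0.12, "french": 0.08, "blue": 0.05}

rng = random.Random(0)
sampled = sampled_negatives(probs, original="famous", k=3, rng=rng)
hardest = hardest_negatives(probs, original="famous", k=3)

# The candidate set is the original token plus the K negatives.
candidate_set = ["famous"] + sampled
```

The "Hardest" strategy is deterministic for a given distribution, whereas the "Sampled" strategy can return lower-probability tokens, which is consistent with the diversity argument made above.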

Related Work
Besides the general language pre-training work we discussed in Section 2.1, this study is particularly related to methods that apply multi-task learning (Caruana, 1997; Ruder, 2017; Shen et al., 2018) to language representation learning. An early study (Liu et al., 2019a) proposes to simultaneously fine-tune a pre-trained BERT model to perform multiple natural language understanding tasks and achieves promising results on the GLUE dataset. Sun et al. (2020) continue this line of work and propose to push multi-task learning to the model pre-training stage. Specifically, they use a continual multi-task learning framework that incrementally builds and inserts seven auxiliary tasks (e.g., masked entity prediction, sentence distance prediction, etc.) into the text encoder. More recently, Aroca-Ouellette and Rudzicz (2020) extend this idea to incorporate fourteen auxiliary tasks and identify six tasks that are particularly useful. While achieving inspiring performance, these studies all assume that the MLM pre-training task must be present and simply combine MLM with additional tasks. In this paper, we relax this assumption and combine our new multi-word selection task with the replaced token detection task for effective pre-training.
This work presents a new text encoder pre-training method that simultaneously learns a generator and a discriminator using multi-task learning. We propose a new pre-training task, multi-word selection, and combine it with previous pre-training tasks for efficient encoder pre-training. We also develop two techniques, attention-based task-specific heads and partial layer sharing, to further improve pre-training effectiveness. Extensive experiments on GLUE and SQuAD datasets demonstrate that our TEAMS method can consistently outperform previous state-of-the-art methods. In the future, we plan to explore how other auxiliary pre-training tasks can be integrated into our framework. Moreover, we consider extending our pre-training method to text encoders with other architectures, such as those based on dynamic convolution and sparse attention. Finally, being orthogonal to this study, distillation techniques could be applied to further compress our pre-trained encoders into smaller models for faster inference.

Broader Impact Statement
Recent years have witnessed the great success of pre-trained text encoders in many NLP applications such as text classification, question answering, text retrieval, and dialogue systems. This paper presents a new pre-training method, TEAMS, that learns a text encoder with better performance at a lower training cost. Therefore, on the positive side, our work has the potential to benefit all downstream applications that leverage a pre-trained text encoder, especially those with limited computation resources. On the negative side, TEAMS, as one specific pre-training method, could still face the generic issues of all language pre-training work. For example, the large pre-training corpora, collected from the internet, may include abusive language and fail to capture cultures that have smaller linguistic footprints online.

A GLUE Details
The original GLUE benchmark (Wang et al., 2019b) contains 9 natural language understanding datasets. We describe them below:
• MNLI: The Multi-genre Natural Language Inference Corpus (Williams et al., 2018) contains 393K training sentence pairs with textual entailment annotations. Given a premise sentence and a hypothesis sentence, a model needs to predict whether the premise entails the hypothesis, contradicts the hypothesis, or neither.
• QNLI: Question Natural Language Inference dataset is a binary sentence pair classification dataset constructed from SQuAD (Rajpurkar et al., 2016). It contains 108K training sentence pairs and requires a model to predict whether a context sentence contains the answer to a question sentence.
• CoLA: Corpus of Linguistic Acceptability (Warstadt et al., 2018). This dataset includes 8.5K training sentences, each annotated with whether it is a grammatical English sentence.
• SST: Stanford Sentiment Treebank (Socher et al., 2013) dataset contains 67K sentences from movie reviews and their corresponding binary sentiment annotations.
• MRPC: Microsoft Research Paraphrase Corpus (Dolan and Brockett, 2005) includes 3.7K sentence pairs from online news sources. The task is to predict whether two sentences are semantically equivalent or not.
• STS: Semantic Textual Similarity (Cer et al., 2017) benchmark contains 5.8K training sentence pairs. The task is to predict the similarity score of two sentences from 1 to 5.
• QQP: Quora Question Pairs (Iyer et al., 2017) dataset includes 364K question pairs sampled from the community question-answering website Quora. Models are trained to predict whether a pair of questions are semantically equivalent.
• WNLI: Winograd NLI (Levesque et al., 2011) is a small natural language inference dataset. However, as the official GLUE website⁸ indicates that there are some issues with its construction process, we follow previous studies (e.g., Clark et al., 2020a) and exclude this dataset for fair comparisons.

B Pre-training Details
For the pre-training architecture configurations, we mostly use the same hyper-parameters as BERT and ELECTRA. To generate masked positions, we follow BERT and duplicate the training data 40 times so that each sequence is masked in 40 different ways. We find that this static masking strategy performs similarly to the dynamic masking strategy in ELECTRA, while being easier to implement and incurring less computation overhead. Besides, we notice that a mask percentage of 15 works well for all models and thus do not increase it to 25 for large-sized models as suggested in ELECTRA. We set the loss balancing parameters $\lambda_1$ and $\lambda_2$ to 5 and 2, respectively, to ensure that the different loss terms are of the same scale. For small-sized and base-sized models, we search for the learning rate out of {1e-4, 2e-4, 3e-4, 5e-4}, the batch size from {128, 256, 512, 1024}, and the number of training steps from {500K, 1M, 1.5M, 2M}. For large-sized models, we search for the learning rate out of {1e-4, 2e-4, 3e-4, 5e-4} and the batch size from {1024, 2048}. Also, we select the generator size out of {1/4, 1/3, 1/2} in early experiments. The best configurations are reported in the main text and we perform no other hyper-parameter tuning. The full set of hyper-parameters is listed in Table 5.
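The static masking strategy mentioned above (duplicating each sequence so that it is masked in several fixed ways before training starts) can be sketched as below. This is a simplified illustration under stated assumptions: every chosen position is replaced by the mask symbol, the number of masked positions is the rounded mask rate times the sequence length, and the function name `static_masks` and the toy tokens are hypothetical. It does not reproduce the exact data pipeline of BERT or of this work.

```python
import random

MASK = "[MASK]"

def static_masks(tokens, num_copies, mask_rate, seed):
    """Precompute `num_copies` masked versions of one token sequence."""
    rng = random.Random(seed)
    num_masked = max(1, round(len(tokens) * mask_rate))
    copies = []
    for _ in range(num_copies):
        # Choose distinct positions to mask for this copy.
        positions = sorted(rng.sample(range(len(tokens)), num_masked))
        masked = [MASK if i in positions else tok
                  for i, tok in enumerate(tokens)]
        copies.append((masked, positions))
    return copies

tokens = [f"tok{i}" for i in range(20)]  # hypothetical 20-token sequence
copies = static_masks(tokens, num_copies=40, mask_rate=0.15, seed=0)
```

Because the masked copies are generated once and stored, no masking logic has to run inside the training loop, which matches the lower overhead noted above.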

C Fine-tuning Details
For fair comparisons, we fine-tune all pre-trained checkpoints using the official pipeline in TensorFlow Model Garden⁹ and report the median of 15 fine-tuning runs. We do not include layer-wise learning-rate decay. We search for the learning rate from {1e-5, 3e-5, 5e-5, 8e-5, 1e-4}, the batch size from {32, 48}, and the number of training epochs from {2, 3, 5}. For GLUE tasks, the best evaluation scores during fine-tuning are reported. For SQuAD, scores at the end of fine-tuning are reported. The full set of hyper-parameters is listed in