PEER: Pre-training ELECTRA Extended by Ranking



Introduction
Language model pre-training has achieved great success in many natural language processing (NLP) downstream tasks by designing and using effective pre-training tasks. A milestone is the BERT model, which conducts a masked language model (MLM) task by randomly masking a proportion (typically 15%) of the tokens in the input sentence and then recovering the original sentence. Since the success of the BERT, many variant models (Liu et al., 2019; Joshi et al., 2019; Wang et al., 2019; Yang et al., 2019; Dong et al., 2019; Wu et al., 2020) have been proposed to further improve its performance by refining or adding pre-training tasks.
One common issue of these MLM-based models, however, is pre-training efficiency: they all require highly expensive pre-training computation to achieve good performance. To address this issue, the ELECTRA model is proposed by Clark et al. (2020b). The ELECTRA uses an auxiliary generator network to provide plausible tokens to replace a proportion (typically 15%) of the original tokens according to the input context, and then utilizes a main discriminator network to perform the replaced token detection (RTD) task, that is, to classify whether each token is original or replaced by the generator. In order to prevent the generator from producing replaced tokens that are too challenging for the training of the discriminator, the ELECTRA makes the generator relatively weaker than the discriminator by decreasing the hidden size of the generator. After pre-training, the generator is discarded and the discriminator is further fine-tuned for downstream NLP tasks. The ELECTRA has shown impressive advantages over MLM-based models in various downstream tasks under similar computation cost, especially when the model size is small.
After the success of the ELECTRA, researchers have proposed quite a few models, each of which has an auxiliary generator network. Because the RTD task performed by the ELECTRA has accelerated pre-training so substantially, however, it is very challenging to advance the efficiency frontier established by the ELECTRA by using or adding other pre-training tasks, as the recent comprehensive study of Bajaj et al. (2022) summarizes. Thus, to further improve the pre-training efficiency, we propose to extend the RTD task in the ELECTRA into a token quality ranking (TQR) task, a task of ranking input tokens according to K different quality levels. Besides determining whether each input token is replaced by a generator or not, the TQR task also needs to distinguish replaced tokens by ranking them according to their replacement quality. We call our method PEER, Pre-training ELECTRA Extended by Ranking. Please refer to Figure 1 for a demonstration.
Our proposal is based on the key observation that the quality of replaced tokens is not even. While some replaced tokens fit the context nearly as well as the corresponding original tokens, others do not. Thus, our PEER generalizes the binary classifier in the ELECTRA into a K-level ranker to perform a more precise task. We design a scheme capable of retrieving rank labels for a majority of replaced tokens from the relatively weak generator, which serves as the basis for the TQR task. The extension from the ELECTRA to the PEER also adds negligible computation cost, because the TQR task largely re-uses the computation already performed by the original ELECTRA. Additionally, our PEER adopts a partial transformer-layer sharing technique between the generator and the ranker to further reduce the computation cost, as its advantage has been demonstrated in the TEAMS model (Shen et al., 2021), a recent model proposed to improve the ELECTRA. Our extensive experiments on small and base scale models show that the PEER is able to outperform both the ELECTRA and the TEAMS in downstream GLUE tasks using the same or less computation cost.

Related Work
As introduced in Section 1, since ELECTRA (Clark et al., 2020b) greatly boosts the pre-training efficiency, a few models have been proposed in order to further advance this pre-training efficiency frontier.
The Electric model is proposed by Clark et al. (2020a) as an energy-based model to perform the cloze task (Taylor, 1953) using noise-contrastive estimation (Gutmann and Hyvärinen, 2012). It is particularly effective at producing likelihood scores for text but slightly under-performs ELECTRA on the GLUE tasks.
The MC-BERT model is proposed by Xu et al. (2020) to replace the RTD binary classification task in ELECTRA with a multi-choice cloze test with a reject option (which is essentially a multi-class classification task). The MC-BERT consists of a meta controller network and a generator network. The meta controller corrupts the original input sentence by replacing a proportion of tokens with sampled tokens, just as ELECTRA's generator does. Meanwhile, the meta controller also generates a set of k candidate tokens for each token in the input sentence. The generator uses the corrupted sentence as the input and learns to correct each token by choosing the correct answer among its k candidates. Xu et al. (2020) empirically show that the overall performance of the MC-BERT is similar to that of the ELECTRA in GLUE tasks, since the MC-BERT outperforms the ELECTRA in GLUE semantic tasks but is worse than the ELECTRA in the GLUE syntactic task CoLA.
COCO-LM (Meng et al., 2021) is proposed to improve ELECTRA by using two new pre-training tasks, called the corrective language modeling (CLM) task and the sequence contrastive learning (SCL) task. While ELECTRA's main network (discriminator) conducts only the RTD task for each token position, COCO-LM's main network undertakes the CLM task by jointly performing both the RTD task and the MLM task for each token position in the corrupted input. Additionally, COCO-LM's main network also performs the SCL task, which is to find the pair consisting of the MLM-replaced sentence and the cropped sentence originating from the same source sentence among all other sentences in the same training batch.
The DeBERTaV3 (He et al., 2021) is proposed to combine the advantages of the DeBERTa model (He et al., 2020) with those of the ELECTRA. The DeBERTa (He et al., 2020) introduces two novel mechanisms to improve the effectiveness of the MLM task: disentangled attention and an enhanced mask decoder. The disentangled attention computes the attention weights among tokens using disentangled matrices on two separate vectors (a content vector and a relative position vector) of each token, while the enhanced mask decoder includes absolute positions in the decoding layer to predict the masked tokens. The DeBERTaV3 keeps these mechanisms but replaces the MLM task (used in the DeBERTa) with ELECTRA's RTD task, and shows that the new combination outperforms both the original DeBERTa and the ELECTRA. Additionally, the DeBERTaV3 introduces a gradient-disentangled embedding sharing method as a better alternative to the vanilla token embedding sharing used in the ELECTRA.
The SAS (self-augmentation strategy) is proposed by Xu et al. (2021) in order to improve ELECTRA's pre-training efficiency from the perspective of data augmentation. The SAS uses a single network to jointly conduct the MLM and RTD tasks in order to reduce computation cost and regularize the model parameters for training balance. Essentially, the generator and the discriminator share all their transformer layers in the SAS, and only two separate light-weight heads (the MLM and RTD heads) are built on top of the common heavy-weight transformer layers. The MLM head also samples one token in each selected position in order to generate the corrupted input used for the next epoch of the pre-training. The SAS is empirically shown by Xu et al. (2021) to outperform the ELECTRA in small models in GLUE tasks given the same computation cost, but this advantage vanishes in larger models.
The TEAMS is proposed by Shen et al. (2021) to improve the ELECTRA by adding a multi-word selection (MWS) task along with the original RTD task. Similar to the MC-BERT, the MWS task, which is a multi-choice cloze test, is conducted to choose one correct answer from a candidate set of tokens provided by the generator. Different from the MC-BERT, however, the candidate set in the TEAMS does not contain a reject option, since the MWS task is only performed at the masked positions (instead of all positions). Besides adding the MWS task, the TEAMS introduces two refinements to the model structure. One is to share the bottom transformer layers of the generator and the discriminator; the other is to use separate top transformer layers for the RTD head and the MWS head. Both refinements have been empirically shown to further improve the performance of the TEAMS.
Recently, Bajaj et al. (2022) conduct a comprehensive empirical study of ELECTRA-style pre-training techniques, and propose a corresponding pre-training recipe for Model-generated dEnoising TRaining Objective (METRO). Their pre-training recipe incorporates a set of techniques to improve the efficiency and stability of large scale model pre-training, such as the ZeRO optimizer (Rajbhandari et al., 2020), scaled initialization techniques, and customized fused operations in mixed-precision training. In terms of pre-training tasks, however, the empirical study by Bajaj et al. (2022) shows that many previously proposed tasks, such as the multi-choice cloze test (Xu et al., 2020), CLM and SCL (Meng et al., 2021), do not provide much improvement over the RTD task in GLUE and SQuAD tasks.

Method
In this section, we describe our PEER method, which extends the binary discriminator of the ELECTRA into a ranker. Our PEER method jointly trains two neural networks, an auxiliary generator network G and a main ranker network R. Each network is mainly a Transformer encoder (Vaswani et al., 2017), which transforms an input token sequence $x = [x_1, \cdots, x_n]$ into a sequence of contextualized representation vectors $h(x) = [h_1, \cdots, h_n]$.

Generator in PEER
The generator G in the PEER works exactly the same as the generator in the ELECTRA. It first randomly selects a proportion (typically 15%) of the position indexes $\{1, \cdots, n\}$ to produce a masked position set M. It then generates a masked token sequence $x^M$ by replacing $x_i$ in $x$ with a special mask token [MASK] for each $i \in M$. Afterwards, the generator G transforms the input $x^M$ into $h_G(x^M)$ through transformer layers. For position i, the generating probability of any token $x_v$ given the context $x^M$ is produced by a softmax function as follows:
$$p_G^{(i)}(x_v \mid x^M) = \frac{\exp\big(e(x_v)^T h_G(x^M)_i\big)}{\sum_{x' \in V} \exp\big(e(x')^T h_G(x^M)_i\big)}, \tag{1}$$
where $e(x_v)$ is the embedding of token $x_v$, and V is the vocabulary. The inner product $e(x_v)^T h_G(x^M)_i$ in Eq. (1) is essentially the logit of token $x_v$ at position i given the context $x^M$, denoted as
$$\mathrm{logit}^{(i)}(x_v \mid x^M) = e(x_v)^T h_G(x^M)_i, \tag{2}$$
which will also be used in our ranker.
The loss of the MLM task $L_{MLM}(x; \theta_G)$ is a cross entropy loss (i.e., negative log likelihood):
$$L_{MLM}(x; \theta_G) = \mathbb{E}\Big[\sum_{i \in M} -\log p_G^{(i)}(x_i \mid x^M)\Big].$$
For each position i in M, the generator also samples one token $\hat{x}_i$ from the token generating probability $p_G^{(i)}(\cdot \mid x^M)$, and then replaces the original token $x_i$ with the sampled $\hat{x}_i$ to produce a corrupted token sequence $x^C$.
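The masking and sampling steps of the generator can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper names `softmax` and `corrupt`, the callback `logits_at` (standing in for the generator's transformer and output embedding), and the toy vocabulary are all assumptions made for the example.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits (the form of Eq. 1)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def corrupt(tokens, logits_at, mask_rate=0.15, rng=random):
    """Select masked positions and resample them from the generator.

    tokens    : list of token ids (the original sequence x)
    logits_at : hypothetical callback i -> list of vocabulary logits at
                position i, assumed to be computed from the masked sequence
    Returns (sorted masked positions M, corrupted sequence x^C).
    """
    n = len(tokens)
    num_masked = max(1, round(mask_rate * n))
    masked = sorted(rng.sample(range(n), num_masked))
    corrupted = list(tokens)
    for i in masked:
        probs = softmax(logits_at(i))
        # Sample one replacement token from the generating distribution.
        # The sample may coincide with the original token.
        corrupted[i] = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return masked, corrupted

rng = random.Random(0)
original = [3, 1, 2, 0, 1, 2, 3, 0, 1, 2]
toy_logits = lambda i: [0.0, 1.0, 2.0, 0.5]  # toy 4-token vocabulary
masked, x_c = corrupt(original, toy_logits, rng=rng)
```

Positions outside the masked set are copied unchanged, so the ranker sees a sequence that differs from the original in at most the masked positions.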

Ranker in PEER
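Given the corrupted token sequence $x^C$, the K-level ranker performs the token quality ranking (TQR) task, that is, it assigns each token in $x^C$ a rank value $r \in \{1, 2, \ldots, K\}$.
Assuming that the rank label $R_i$ at position i of the corrupted token sequence $x^C$ is given, for rank value $r \in \{1, 2, \ldots, K-1\}$, the probability that $R_i \le r$ is given by
$$P(R_i \le r \mid x^C) = \sigma\big(\xi_r - w^T h_R(x^C)_i\big), \tag{3}$$
where $\sigma$ is the sigmoid function, $h_R(x^C)_i$ is the contextualized representation vector at position i out of the ranker transformer, $w$ is the to-be-learned weight vector, and $\{\xi_1, \xi_2, \ldots, \xi_{K-1}\}$ is a set of to-be-learned threshold parameters with the property $\xi_1 \le \xi_2 \le \cdots \le \xi_{K-1}$.
The binary discriminator in the ELECTRA can be viewed as a ranker with K = 2, where the rank label $R_i$ is naturally given by
$$R_i = \begin{cases} 2, & \text{if } x_i^c = x_i, \\ 1, & \text{otherwise}, \end{cases}$$
where $x_i^c$ is the token at position i in $x^C$. Accordingly, the loss of the TQR task with K = 2 levels is a binary cross entropy loss:
$$L_{TQR}(x; \theta_R) = \mathbb{E}\Big[\sum_{i=1}^{n} -I[R_i \le 1] \log P(R_i \le 1 \mid x^C) - I[R_i > 1] \log\big(1 - P(R_i \le 1 \mid x^C)\big)\Big], \tag{4}$$
where $I[\cdot]$ is an indicator function.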
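To make the cumulative formulation concrete, the sketch below computes the probabilities P(R_i ≤ r) for one token and differences them into per-level probabilities. It assumes the standard cumulative-logit form sigmoid(ξ_r − score) with score = w^T h_R(x^C)_i and non-decreasing thresholds; the function names and the numeric values are illustrative only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cumulative_probs(score, thresholds):
    """P(R_i <= r) for r = 1..K-1, given score = w^T h_R(x^C)_i."""
    return [sigmoid(xi - score) for xi in thresholds]

def level_probs(score, thresholds):
    """P(R_i = r) for r = 1..K, by differencing the cumulative probabilities."""
    cum = [0.0] + cumulative_probs(score, thresholds) + [1.0]
    return [cum[r + 1] - cum[r] for r in range(len(cum) - 1)]

thresholds = [-1.0, 1.0]   # xi_1 <= xi_2, hence K = 3 levels
probs = level_probs(0.3, thresholds)
```

Because all levels share the same score and differ only in the threshold, a higher score shifts probability mass toward higher ranks; with a single threshold the construction reduces to a binary original-versus-replaced classifier.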

Rank Label Retrieving Scheme
In order to use a ranker with K > 2, we need to assign a rank label $R_i$ to each $x_i^c$, the token at position i in $x^C$. Thus, we design a label retrieving scheme to obtain rank labels from the generator. For notational convenience, we use $p_o^{(i)}$ to denote $p_G^{(i)}(x_i \mid x^M)$, the generating probability of the original token $x_i$ at position i in the context; and use $p_c^{(i)}$ to denote $p_G^{(i)}(x_i^c \mid x^M)$, the generating probability of any token $x_i^c$ at position i in the context. We use $\mathrm{rank}(p_o^{(i)})$ to denote the rank of $p_o^{(i)}$ among the generating probabilities of all the vocabulary tokens at position i (rank 1 being the largest probability), and compare it with a threshold T, where T is a hyperparameter with a small value.¹
Our rank label retrieving scheme is shown in Table 1, where $\{\tau_1, \tau_2, \cdots, \tau_{K-2}\}$ is a set of probability partitioning hyperparameters with the property $\tau_1 < \tau_2 < \cdots < \tau_{K-2}$. We set the rank label of the original token to the highest value K. For each replaced token $x_i^c$ (which differs from $x_i$), we set up the levels (buckets) based on $p_o^{(i)}$ and these hyperparameters, and its rank label $R_i$ is set according to the bucket which $p_c^{(i)}$ falls into. Note, however, that just as in the ELECTRA, the generator in our PEER is set to be weak (small) relative to the ranker in order to prevent it from generating replaced tokens that are too challenging. Therefore we use the condition $\mathrm{rank}(p_o^{(i)}) \le T$ to identify every position i where the generator can provide a well-estimated probability $p_G^{(i)}(\cdot \mid x^M)$ for the tokens $x_i$ and $x_i^c$. For every replaced token $x_i^c$ at a position i where $\mathrm{rank}(p_o^{(i)}) > T$, we just set its rank label to a special value −1 to indicate that its rank is less than K but the exact rank value is unknown.
Table 1: Rank label retrieving scheme for a K-level ranker in PEER.
Table 2: Rank label retrieving scheme internally implemented for a K-level ranker in PEER, along with an additional buffer option associated with hyperparameter δ.
For the purpose of numerical stability, determining which bucket $p_c^{(i)}$ falls into (in Table 1) is actually implemented using its equivalent form on the basis of the logit difference $\mathrm{logit}^{(i)}(x_i^c \mid x^M) - \mathrm{logit}^{(i)}(x_i \mid x^M)$, where both logit terms are defined in Eq. (2). Additionally, in Table 2 we also introduce a buffer option between level k and level k + 1 for each $k \in \{1, \cdots, K-2\}$ to further safeguard against the relative weakness of the generator. All the data points in the buffer are regarded as being in a grey (potentially noisy) area and are excluded from the binary classification between level k and level k + 1. If we want to use the buffer option, we add a positive hyperparameter δ inside the relevant buckets⁴ to set up the buffers and add a small fixed value ∆ ∈ (0, 1) to the corresponding rank label.⁵ A larger value of δ leads to a smaller number of training data points, but adds confidence in removing potentially noisy data points. If we do not want to use the buffers, we set the hyperparameter δ to 0 so that these buffers disappear.
⁴ The hyperparameter δ needs to satisfy the condition that
⁵ We internally set ∆ to 0.1 though its exact value does not matter.
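The labeling logic can be sketched in code as follows. Since Tables 1 and 2 are not reproduced here, the bucket boundaries are an assumption: `cuts` is a hypothetical list of increasing cut points in logit-difference space, and the buffer is assumed to be the band of width δ just below each cut point. Only the overall structure (label K for original tokens, −1 when rank(p_o) > T, and k + ∆ for buffer points) follows the description above.

```python
def rank_label(d, rank_o, is_original, cuts, K, T=3, delta=0.0, Delta=0.1):
    """Return the rank label of one token in the corrupted sequence.

    d           : logit(x_i^c) - logit(x_i), computed by the generator
    rank_o      : rank of the original token's probability (1 = most probable)
    is_original : True if the token equals the original token
    cuts        : hypothetical increasing cut points (K - 2 values) that
                  stand in for the bucket boundaries of Table 1
    """
    if is_original:
        return K                   # original tokens get the highest rank
    if rank_o > T:
        return -1                  # replaced, but exact rank unknown
    for k, c in enumerate(cuts, start=1):
        if d < c - delta:
            return k               # clearly inside bucket k
        if d < c:
            return k + Delta       # buffer between level k and k + 1 (assumed placement)
    return K - 1                   # best bucket for a replaced token

# Toy usage with K = 3 and a single assumed cut point at -2.0.
labels = [
    rank_label(0.0, 1, True, [-2.0], K=3),               # original token
    rank_label(-5.0, 10, False, [-2.0], K=3),            # generator unreliable here
    rank_label(-5.0, 1, False, [-2.0], K=3, delta=1.0),  # low-quality replacement
    rank_label(-2.5, 1, False, [-2.0], K=3, delta=1.0),  # falls in the buffer
    rank_label(0.0, 1, False, [-2.0], K=3, delta=1.0),   # high-quality replacement
]
```

Setting `delta=0.0` makes the buffer branch unreachable, which mirrors the statement that the buffers disappear when δ is 0.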

Loss of TQR Task
Because some replaced tokens have their exact rank labels unknown (represented by the special value −1), the loss of the TQR task cannot be directly formulated as the loss of standard ordinal regression (McCullagh and Nelder, 1989). To address this challenge, we set the loss of the TQR task with K levels to be the weighted summation of K − 1 binary cross entropy losses:
$$L_{TQR}(x; \theta_R) = \sum_{r=1}^{K-1} \gamma_r L_r(x; \theta_R), \tag{5}$$
where $L_r$ denotes the binary cross entropy loss of classifying whether $R_i \le r$ at level r, and $\gamma_r$ is a positive relative weight hyperparameter for the binary cross entropy loss at level r.⁶ Essentially, $L_{TQR}$ contains both the loss of the RTD task stated in Eq. (4) and a binary cross entropy loss at each level $r \in \{1, \cdots, K-2\}$.
We set the loss of the TQR task to the summation of K − 1 binary cross entropy losses in Eq. (5) during the entire pre-training process except for an initial warm-up phase. In the warm-up phase,⁷ we still use only the one binary cross entropy loss stated in Eq. (4) as the loss of the TQR task, in order to ensure that the generator gets some basic training so that its token generating probability is generally reliable for the rank labeling purpose.
Overall, we train the PEER by minimizing a combined loss over the pre-training corpus X:
$$\min_{\theta_G, \theta_R} \sum_{x \in X} L_{MLM}(x; \theta_G) + \lambda L_{TQR}(x; \theta_R),$$
where λ is the relative weight for the loss of the TQR task.⁸ After pre-training, we discard the generator and fine-tune the ranker for downstream NLP tasks.
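A sketch of how the K − 1 binary cross entropy losses might be combined for one sequence is given below. It is an illustration under stated assumptions rather than the authors' code: it assumes P(R_i ≤ r) = sigmoid(ξ_r − score_i), treats the label −1 as informative only at the top (RTD) level, and skips a token at any level where its label does not decide the comparison (for example a buffer label k + ∆ at level k). The name `tqr_loss` and the toy inputs are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def tqr_loss(scores, labels, thresholds, gammas):
    """Weighted sum of K-1 binary cross entropy losses over one sequence.

    scores     : w^T h_R(x^C)_i for each position i
    labels     : rank labels in {1..K}, k + Delta for buffer points, or -1
    thresholds : xi_1..xi_{K-1}
    gammas     : gamma_1..gamma_{K-1}
    """
    K = len(thresholds) + 1
    total = 0.0
    for r in range(1, K):
        for s, R in zip(scores, labels):
            p = sigmoid(thresholds[r - 1] - s)     # P(R_i <= r)
            if r == K - 1:
                # Top level is the RTD task: every replaced token counts,
                # including those labeled -1.
                below, above = R < K, R == K
            else:
                below, above = 1 <= R <= r, R >= r + 1
            if below:
                total += -gammas[r - 1] * math.log(p)
            elif above:
                total += -gammas[r - 1] * math.log(1.0 - p)
            # Otherwise the label is uninformative at this level; skip it.
    return total

# With K = 2 the loss reduces to the plain RTD binary cross entropy.
rtd_only = tqr_loss([0.0], [2], [0.0], [1.0])
# With K = 3, a token labeled -1 contributes only at the top level.
unknown_rank = tqr_loss([0.0], [-1], [0.0, 0.0], [1.0, 1.0])
```

The combined objective would then add this quantity, scaled by λ, to the MLM loss of the generator.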
As an additional note, extending the ELECTRA to the PEER requires a negligible increase in computation cost. The only added parameters in our PEER are the K − 1 threshold parameters $\{\xi_1, \xi_2, \ldots, \xi_{K-1}\}$. The same sequence of contextualized representation vectors $h_R(x^C)$ out of the ranker transformer is re-used for the different levels, along with the shared weight parameter vector $w$ in Eq. (3). The $R_i$ labeling is also based on the logit information in Eq. (2), which has already been computed for the token generating probability in Eq. (1).

Details: We implement the PEER within the Huggingface Transformers framework (Wolf et al., 2020). We include ELECTRA, TEAMS as well as BERT for comparison. Under the current constraints of computation resources, we focus on the small and base models, which have been extensively studied and compared by Clark et al. (2020b), and we set architectures and hyperparameters largely aligned with ELECTRA. Please refer to Appendix B for the detailed model architecture and pre-training hyperparameter values. We implement each model by largely re-using the corresponding code from Huggingface (Wolf et al., 2020) if a pre-trained checkpoint has not been publicly released by its authors. We use the same pre-training data as BERT, ELECTRA-Small and ELECTRA-Base, which consists of 3.3 billion tokens from the Wikipedia and BooksCorpus datasets. For fair comparison, we follow Clark et al. (2020b) in using FLOPs (floating point operations) to measure computation usage, since FLOPs is a measure agnostic to the particular hardware and low-level optimizations. We reuse the FLOPs computation code⁹ released by Clark et al. (2020b), so that we make exactly the same assumptions as Clark et al. (2020b). Some details of the experimented models are as follows.
• ELECTRA: We pre-train ELECTRA-Small and ELECTRA-Base using exactly the same hyperparameter values as Clark et al. (2020b), except for a larger batch size and learning rate for ELECTRA-Small to reduce the pre-training time (which is not reflected in the FLOPs calculation). For the ELECTRA-Small model as well as all other small models, we use batch size 512 and 250K pre-training steps, instead of batch size 128 and 1M steps as in Clark et al. (2020b). Accordingly, we increase the learning rate by 100% for ELECTRA-Small and BERT-Small, and by 50% for TEAMS-Small and PEER-Small.¹⁰ We observe that the change in batch size and learning rate significantly reduces the pre-training time without degrading the model performance. As a reference point, we also include ELECTRA-Small++, whose pre-trained model checkpoint is publicly released by Clark et al. (2020b). Note that ELECTRA-Small++ uses 18x the training FLOPs of ELECTRA-Small, because it is pre-trained much longer with much larger data and its input sequence length is also quadrupled (Clark et al., 2020b).
• BERT: For BERT-Base, we use its model checkpoint publicly released by Devlin et al. (2018). We implement our BERT-Small and set its embedding size the same as its hidden size,¹¹ according to the convention of the BERT models. Please refer to the appendix for the details about the hyperparameters. Our BERT-Small setting makes its FLOPs similar to that of ELECTRA-Small when the training steps are the same, so that a fair comparison of their performance can be made directly.
• TEAMS: We pre-train TEAMS-Small and TEAMS-Base using the same hyperparameter values described by Shen et al. (2021), except for the aforementioned larger batch size and learning rate. The model structures of TEAMS-Small and TEAMS-Base are also the same as the ones used by Shen et al. (2021). Specifically, the discriminator in TEAMS-Small has 12 transformer layers with hidden size 256, and the discriminator in TEAMS-Base has 12 transformer layers with hidden size 768. The generator has 6 transformer layers and the same hidden size as the corresponding discriminator. The generator and the discriminator share three layers at the bottom, and the discriminator also has one additional separate transformer layer on top for its MWS task.
• PEER: We pre-train PEER using the hyperparameter values the same as TEAMS.With respect to hyperparameter δ, we set it to 3 in PEER-Small and set it to 9 in PEER-Base for model comparison, and discuss the effect of δ in Appendix A.2 due to space constraint.We focus on the PEER with 3-level ranker, and discuss the effect of the number of levels in Appendix A.1.The model structures of the PEER are the same as the TEAMS, except that there is no additional transformer layer for MWS task in the PEER.This difference makes FLOPs per training step in the PEER smaller than the ones of the corresponding TEAMS.However, FLOPs per training step in the PEER are still larger than the ones of the corresponding ELECTRA.This is largely because the generator in the ELECTRA decreases its hidden size (instead of its number of transformer layers), which in turn leads to the decrease in the intermediate size in every fully connected feed-forward network (FFN).
We will clearly record these training FLOPs in our experimental results and ensure that our PEER uses FLOPs no more than other models during the performance comparison.
The evaluation metrics are the average of MNLI-match accuracy and MNLI-mismatch accuracy for MNLI, the average of Spearman correlation and Pearson correlation for STS-B, Matthews correlation for CoLA, and accuracy for the other GLUE tasks. We also take the average of the metrics of these eight GLUE tasks, denoted by G-AVG, as the overall performance metric on these tasks. All the evaluation is based on the Dev dataset.
Fine-tuning Procedure: For the fine-tuning of GLUE tasks, we add simple linear classifiers on top of the encoder of a pre-trained model.Because we observe a large performance variance in the GLUE tasks with small data sizes (including CoLA, MRPC, STS-B and RTE), we adopt the following two methods to reduce the variance.First, we follow the strategy proposed in the papers (Mosbach et al., 2020;Zhang et al., 2020;Dodge et al., 2020) to train more epochs with small learning rates for these small tasks.Second, we fine-tune these small tasks by using multiple random seeds and obtain the average score across the seeds.Please refer to Appendix C for the details in fine-tuning hyperparameter settings.
For base models, we pre-train each model once and then use the above fine-tuning strategy to obtain the score of each GLUE task.Since for some small models we still observe non-negligible variance of the resulting scores, we pre-train each small model using five different random seeds.The finally reported score of each task is the average across the five pre-trained model checkpoints.

Overall Comparison Results
Table 3 shows the performance comparison among the small models.In the table, the second column lists the training FLOPs of each model, and the third column shows the mean and the standard deviation of the G-AVG for each model across five independently pre-trained checkpoints.We report the performance of each small model pre-trained through 250K steps (i.e., 5 epochs).Additionally, we report the performance of PEER-Small pretrained exactly after 212.5K steps to ensure that its computation cost is no more than that of any other competitor.
Note that the G-AVG of ELECTRA-Small implemented by us is about 98.44% of that of ELECTRA-Small++ released by Clark et al. (2020b) (80.77 vs. 82.05), which is higher than the 97.87% in Table 8 of the original paper (Clark et al., 2020b). This verifies the correctness of our ELECTRA implementation. As for TEAMS-Small, its G-AVG is slightly higher than that of ELECTRA-Small when they go through the same number of pre-training steps, which is consistent with the comparison results shown by Shen et al. (2021). While Shen et al. (2021) do not report the performance of each individual task, our result shows that TEAMS-Small performs much better in the MNLI task but much worse in the CoLA task compared with ELECTRA-Small.
With respect to our PEER, Table 3 clearly demonstrates its advantages over all the other competitors in small models. Using less computation cost, PEER-Small (212.5K) outperforms both ELECTRA-Small and TEAMS-Small in six out of eight GLUE tasks, with the SST-2 and RTE tasks being the only two exceptions. The G-AVG of PEER-Small (212.5K) is 0.63 point higher than that of ELECTRA-Small and 0.56 point higher than that of TEAMS-Small. Because we have independently run the whole (pre-training and fine-tuning) process five times for each small model, we can use the two-sample t test with unequal variances to conclude with strong evidence (at the significance level 0.005) that the true mean of the G-AVG of our PEER-Small (212.5K) is larger than that of ELECTRA-Small. Similarly, based on the two-sample t test with unequal variances, we can conclude with strong evidence (at the significance level 0.005) that the true mean of the G-AVG of our PEER-Small (212.5K) is larger than that of TEAMS-Small.
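For reference, the two-sample t test with unequal variances (Welch's test) used above can be sketched with the standard library alone. The scores below are made up purely for illustration and are not the paper's results; the function name `welch_t` is likewise an assumption.

```python
import math
import statistics

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va = statistics.variance(a) / len(a)   # squared standard error of mean(a)
    vb = statistics.variance(b) / len(b)   # squared standard error of mean(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Hypothetical G-AVG scores from five independent runs per model.
model_a = [81.5, 81.3, 81.6, 81.4, 81.2]
model_b = [80.8, 80.6, 80.9, 80.7, 80.5]
t_stat, dof = welch_t(model_a, model_b)
```

For a one-sided test, the statistic is compared with the upper critical value of Student's t distribution with `dof` degrees of freedom at the chosen significance level.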
Table 4 shows the comparison results on the base models.In the first column of the table, we show the pre-training steps of each model and have ensured that PEER-Base takes FLOPs no more than other models.Using less computation cost, PEER-Base achieves the best performance among all the investigated models in six out of eight GLUE tasks, while two exceptions are SST-2 and RTE tasks (just as in small models).Overall, PEER-Base has the highest G-AVG, which is 0.35 point higher than that of ELECTRA-Base and is 0.66 point higher than that of TEAMS-Base.

Pre-Training Efficiency
To further investigate the pre-training efficiency, in Figures 2 and 3 we plot the G-AVG and the MNLI accuracy score with respect to the number of pre-training epochs for PEER-Small, ELECTRA-Small and TEAMS-Small. For each model, we select the median run, that is, the run whose pre-training random seed achieves the median G-AVG among the five random seeds. Then, for the selected median run of each model, we save a checkpoint every epoch (i.e., 50K pre-training steps), fine-tune it on every GLUE task and finally report the scores across the tasks. Note that the ratio of the training FLOPs per epoch among PEER-Small, ELECTRA-Small and TEAMS-Small is 1.50 : 1.29 : 1.55, which has also been shown in Table 3. Figure 2 shows that PEER-Small starts to significantly outperform its competitors in G-AVG from the second epoch, and its G-AVG at the end of the third epoch is already higher than the G-AVG of both ELECTRA-Small and TEAMS-Small at the end of the whole pre-training. Figure 3 shows that both PEER-Small and TEAMS-Small perform considerably better than ELECTRA-Small in the MNLI task, and PEER-Small performs better than TEAMS-Small (while using less computation cost) from the third epoch.
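The per-epoch FLOPs ratio also explains why PEER-Small is evaluated at 212.5K steps. The short calculation below works in relative units taken from the ratio 1.50 : 1.29 : 1.55; the dictionary and function names are illustrative.

```python
# Relative training FLOPs per epoch (50K steps), from the quoted ratio.
flops_per_epoch = {"PEER-Small": 1.50, "ELECTRA-Small": 1.29, "TEAMS-Small": 1.55}
steps_per_epoch = 50_000

def relative_cost(model, steps):
    """Relative training FLOPs after `steps` pre-training steps."""
    return flops_per_epoch[model] * steps / steps_per_epoch

peer_cost = relative_cost("PEER-Small", 212_500)       # 4.25 epochs
electra_cost = relative_cost("ELECTRA-Small", 250_000)  # 5 epochs
teams_cost = relative_cost("TEAMS-Small", 250_000)      # 5 epochs
```

In these units PEER-Small at 212.5K steps costs 1.50 × 4.25 = 6.375, which is below 1.29 × 5 = 6.45 for ELECTRA-Small and 1.55 × 5 = 7.75 for TEAMS-Small at 250K steps.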

Conclusion and Future Work
We propose the PEER by extending ELECTRA's RTD task to a token quality ranking (TQR) task in order to further improve the pre-training efficiency. Besides detecting whether every token is replaced or not, the TQR task also needs to rank replaced tokens into different levels according to their quality given the context. We design a scheme to retrieve rank label information from the generator so that the complete TQR task can be performed for a majority of replaced tokens. We empirically show that our proposed PEER outperforms the state-of-the-art efficient pre-training competitors in small and base scale models using the same or less computation cost. In the future, we will validate the advantages of our PEER in larger scale models when sufficient computation resources are available. We also plan to improve our rank label retrieving scheme so that an even larger proportion of replaced tokens can be involved in the complete TQR task.

Limitations
There are several limitations in our paper. First, we have not validated the advantages of our proposed PEER at model scales larger than the base model, due to the constraints on our computation resources. We plan to experiment with the PEER in larger scale models when more computation resources are available. Second, in order to filter out potential noise from the relatively weak generator, our current rank label retrieving scheme uses a strict condition T = 3, which means that a significant proportion of tokens have rank label −1 and essentially are involved only in the original RTD task. Please refer to the details in Appendix A.3. We intend to design a label retrieving scheme which applies a softer criterion so that more tokens can be fully or partially involved in the complete TQR task. Finally, our PEER currently does not have the ability to automatically search for an optimal value of δ.

A Supplementary Experimental Results

A.1 Number of Levels K
We vary K (the number of levels used in the ranker) from 3 to 5 in PEER-Small models to see its impact.Table 5 shows the corresponding results.Each model is pre-trained 212.5K steps and has nearly the same computation cost.The table shows that increasing K from 3 does not lead to further improvement in the performance of GLUE tasks.The G-AVG of the PEER-Small with 4 or 5 levels actually decreases slightly, though it is still larger than that of its competitors shown in Table 3 by using less computation cost.We conjecture that the main reason is that increasing K leads to the smaller number of tokens staying in low levels, which in turn brings difficulty in the learning process.We will further investigate the impact of K in our future work.

A.2 Buffer Hyperparameter δ
We test the impact of buffer hyperparameter δ by using a set of three different values {0, 3, 9}, where value 0 leads to no buffer and value 9 leads to a large buffer.By its design, a larger buffer leads to the smaller number of the training data points, but adds confidence in removing potentially noisy data points due to the relative weakness of the generator.Tables 6 and 7 show the results in the PEER-Small models and PEER-Base models respectively.Since the value of δ has a negligible effect in the training FLOPs, we do not list the training FLOPs here as they have already been shown in Table 3 and 4. Both tables show that the G-AVG decreases slightly when δ decreases to 0, though it is still no worse than that of any its competing model by using less computation cost.The PEER-Small achieves the highest G-AVG and MNLI scores when δ is set to 3. The PEER-Base achieves the highest G-AVG when δ is set to 9, and achieves the highest MNLI score when δ is set to 3. In the future we will investigate how to let the PEER automatically search for an optimal value of δ during its pre-training to further boost its performance.

A.3 Proportion of Tokens with Rank Label −1
Figures 4 and 5 demonstrate the proportion of tokens with rank label −1 in the masked positions during the pre-training for PEER-Small and PEER-Base. With respect to PEER-Small, the proportion decreases from 44.82% at 40K steps (i.e., the end

B Pre-training Details
The following pre-training details apply to our PEER and its competing methods, including the BERT, the ELECTRA and the TEAMS. We always use Adam as the optimizer with weight decay. We mostly use the same hyperparameters as BERT and ELECTRA. Our own implementation does not include the next sentence prediction (NSP) task proposed in the original BERT, as recent works such as Liu et al. (2019) have suggested that it does not improve the performance. We searched for the best learning rate for Small models out of [1e-3, 7.5e-4, 5e-4]. Otherwise, we did no hyperparameter tuning beyond the experiments. The full set of hyperparameters is listed in Table 8.

C Fine-tuning Details
We originally fine-tuned all the pre-trained models for 4 epochs. However, because we observed a large variance in the small tasks in GLUE, following the advice from Mosbach et al. (2020), we increase the fine-tuning process to 20 epochs and select the best epoch for the four small tasks, namely CoLA, MRPC, STS-B and RTE. For Small models, we searched for the best learning rate out of [1e-4, 7.5e-5]. For Base models, we searched for a learning rate out of [5e-5, 3e-5] without the layer-wise learning-rate decay proposed by ELECTRA, but otherwise used the same hyperparameters as for small models. Due to limited computation resources, we adjust the number of independent fine-tuning runs (with different random seeds) so that we fine-tune more times for the tasks with smaller data sizes (i.e., with more variability). The full set of hyperparameters is listed in Table 9. Following the BERT and the ELECTRA, we do not show results on the WNLI GLUE task for the Dev set results.

C.1 Details about GLUE
We provide further details about the GLUE benchmark tasks as follows.
CoLA: Corpus of Linguistic Acceptability (Warstadt et al., 2019). The task is to determine whether a given sentence is linguistically acceptable or not. The dataset contains 8.5k train examples from books and journal articles on linguistic theory.
SST-2: Stanford Sentiment Treebank (Socher et al., 2013). The task is to determine if the sentence is positive or negative in sentiment. The dataset contains 67k train examples from movie reviews.
MRPC: Microsoft Research Paraphrase Corpus (Dolan and Brockett, 2005). The task is to predict whether two sentences are semantically equivalent or not. The dataset contains 3.7k train examples from online news sources.
STS-B: Semantic Textual Similarity (Cer et al., 2017). The task is to predict how semantically similar two sentences are on a 1-5 scale. The dataset contains 5.8k train examples drawn from news headlines, video and image captions, and natural language inference data.
QQP: Quora Question Pairs (Iyer et al., 2017). The task is to determine whether a pair of questions are semantically equivalent. The dataset contains 364k train examples from the community question-answering website Quora.
MNLI: Multi-genre Natural Language Inference (Williams et al., 2017). Given a premise sentence and a hypothesis sentence, the task is to predict whether the premise entails the hypothesis, contradicts the hypothesis, or neither. The dataset contains 393k train examples drawn from ten different sources.
QNLI: Question Natural Language Inference; constructed from SQuAD (Rajpurkar et al., 2016). The task is to predict whether a context sentence contains the answer to a question sentence. The dataset contains 108k train examples from Wikipedia.
RTE: Recognizing Textual Entailment (Giampiccolo et al., 2007). Given a premise sentence and a hypothesis sentence, the task is to predict whether the premise entails the hypothesis or not. The dataset contains 2.5k train examples from a series of annual textual entailment challenges.
Both refinements have been empirically shown to be able to further improve the performance of the TEAMS. Recently, Bajaj et al. (2022) conduct a comprehensive empirical study of ELECTRA-style pre-training techniques, and propose a corresponding pre-training recipe for Model-generated dEnoising TRaining Objective (METRO). Their pre-training recipe incorporates a set of techniques to improve the efficiency and stability of large scale model pre-training, such as the ZeRO optimizer Details: We implement the PEER within Huggingface Transformers framework

Figure 2: G-AVG for PEER-Small model and its competitors with respect to the number of pre-training epochs on the GLUE dev set.

Figure 3: MNLI's average accuracy for PEER-Small model and its competitors with respect to the number of pre-training epochs on the GLUE dev set.

Figure 4: Proportion of tokens with rank label −1 in the masked positions for PEER-Small during the pre-training.

Figure 5: Proportion of tokens with rank label −1 in the masked positions for PEER-Base during the pre-training.

Table 4: Comparison of base models on the GLUE dev set.

Table 5: Comparison of PEER-Small models with different K levels (under 212.5K pre-training steps) on the GLUE dev set.

Table 6: Comparison of PEER-Small models with different δ values on the GLUE dev set. Each PEER-Small model has 3 levels and is pre-trained 212.5K steps.

Table 8: Pre-training hyperparameters for all the models pre-trained by us.