Pre-training Language Model as a Multi-perspective Course Learner

ELECTRA, the generator-discriminator pre-training framework, has achieved impressive semantic construction capability across various downstream tasks. Despite its convincing performance, ELECTRA still faces the challenges of monotonous training and deficient interaction. A generator trained with only masked language modeling (MLM) leads to biased learning and label imbalance for the discriminator, decreasing learning efficiency; the absence of an explicit feedback loop from discriminator to generator leaves a chasm between these two components, underutilizing the course-like learning. In this study, a multi-perspective course learning (MCL) method is proposed to fetch many degrees and visual angles for sample-efficient pre-training, and to fully leverage the relationship between generator and discriminator. Concretely, three self-supervision courses are designed to alleviate the inherent flaws of MLM and balance the labels in a multi-perspective way. Besides, two self-correction courses are proposed to bridge the chasm between the two encoders by creating a "correction notebook" for secondary supervision. Moreover, a course soups trial is conducted to solve the "tug-of-war" dynamics problem of MCL, evolving a stronger pre-trained model. Experimental results show that our method significantly improves ELECTRA's average performance by 2.8% and 3.2% absolute points respectively on the GLUE and SQuAD 2.0 benchmarks, and overshadows recent advanced ELECTRA-style models under the same settings. The pre-trained MCL model is available at https://huggingface.co/McmanusChen/MCL-base.


Introduction
Language model pre-training (Radford et al., 2018, 2019; Devlin et al., 2019; Liu et al., 2019) has shown great success in endowing machines with the ability to understand and process various downstream NLP tasks. A wide range of pre-training strategies have been proposed, among which the most prevailing objective is masked language modeling (MLM) (Devlin et al., 2019). Such autoencoding language modeling objectives (Vincent et al., 2008) typically first corrupt a certain percentage of the training corpus with masked tokens at random, and then encourage encoders to restore the original corpus. To reduce the randomness of pre-training and produce a sample-efficient method, ELECTRA-style frameworks (Clark et al., 2020) leverage a Transformer-based (Vaswani et al., 2017) generator trained with MLM to build challenging ennoising sentences for a discriminator of similar structure, which carries out the denoising procedure.
Typically in ELECTRA-style training, the generator first constructs its semantic representations through MLM training and clozes the masked sentences with pseudo words; meanwhile, the discriminator inherits the information from the former and distinguishes the originality of every token, which resembles a step-by-step course learning. However, training the generator only with MLM may lead to monotonous learning of the data, which results in incomprehensive generation and imbalanced labels of corrupted sentences for the discriminator (Hao et al., 2021). Besides, interaction between the two encoders stops abruptly except for the sharing of embedding layers (Xu et al., 2020; Meng et al., 2021), since there is no direct feedback loop from discriminator to generator.
To enhance the efficiency of training data and adequately utilize the relationship between generator and discriminator, in this work we propose a sample-efficient method named multi-perspective course learning (MCL). In the first phase of MCL, to fetch many degrees and visual angles that impel initial semantic construction, three self-supervision courses are designed: cloze test, word rearrangement and slot detection. These courses instruct language models to deconstruct and dissect the exact same corpus from multiple perspectives under the ELECTRA-style framework. In the second phase, two self-correction courses are tasked to refine both generator and discriminator. A confusion matrix regarding the discriminator's recognition of each sentence is analyzed and applied to the construction of revision corpora. Secondary learning is carried out for the two components in response to the deficiencies in the previous course learning. At last, the model mines the same batch of data from multiple perspectives, and implements progressive semantic learning through the self-correction courses.
Experiments on the most widely accepted benchmarks, GLUE (Wang et al., 2019) and SQuAD 2.0 (Rajpurkar et al., 2018), demonstrate the effectiveness of the proposed MCL. Compared with previous advanced systems, MCL achieves a robust advantage across various downstream tasks. Abundant ablation studies confirm that multi-perspective courses encourage models to learn the data in a sample-efficient way. Besides, a course soups trial is conducted to further interpret and dissect the core of multi-perspective learning, providing a novel approach to enhance pre-training efficiency and performance.

Preliminary
In this work, we build our system based on the ELECTRA-style framework; thus, the ELECTRA framework is reviewed here. Unlike BERT (Devlin et al., 2019), which uses only one Transformer encoder trained with MLM, ELECTRA is trained with two Transformer encoders: a generator G and a discriminator D. G is trained with MLM and used to generate ambiguous tokens to replace masked tokens in the input sequence. The modified input sequence is then fed to D, which needs to determine whether a corresponding token is an original token or a token replaced by the generator.
Generator Training. Formally, given an input sequence X = [x_1, x_2, ..., x_n], a mask operation is conducted to randomly replace its tokens with [MASK] at a position set r.¹ The masked sentence X_mask = [x_1, x_2, ..., [MASK]_i, ..., x_n] is then fed into the generator to produce the contextualized representations {h_i}_{i=1}^n. G is trained via the following loss L_MLM to predict the original tokens from the vocabulary V at the masked positions:

  L_MLM = E[ −∑_{i∈r} log p_G(x_i | X_mask) ],   p_G(x_i | X_mask) = exp(e_{x_i}^⊤ h_i) / ∑_{t=1}^{|V|} exp(e_t^⊤ h_i),

where {e_t}_{t=1}^{|V|} are the embeddings of the tokens in V. Masked language modeling is conducted only at the masked positions.

¹ Typically the proportion is set as 15%, which means 15% of the tokens are masked out for each sentence.
Discriminator Training. G tends to predict the original identities of the masked-out tokens, and X_rtd is created by replacing the masked-out tokens with generator samples:

  x_i^rtd ∼ p_G(x | X_mask) if i ∈ r;   x_i^rtd = x_i otherwise.   (3)

D is trained to distinguish whether the tokens in X_rtd have been replaced by G via the replaced token detection (RTD) loss L_RTD:

  L_RTD = E[ −∑_{i=1}^n ( 1(x_i^rtd = x_i) log σ(w^⊤ h_i) + 1(x_i^rtd ≠ x_i) log(1 − σ(w^⊤ h_i)) ) ],

where w is a learnable weight vector. This optimization is conducted on all tokens. The overall pre-training objective is defined as:

  L = L_MLM + λ L_RTD,

where λ (typically 50) is a hyperparameter used to balance the training pace of G and D. Only D is fine-tuned on downstream tasks after pre-training.
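The interplay of the two losses can be sketched numerically. The following is a minimal NumPy sketch under our own toy shapes and function names, not the paper's fairseq implementation:

```python
import numpy as np

def mlm_loss(logits, targets, masked_positions):
    """Generator cross-entropy, evaluated only at the masked positions."""
    losses = []
    for i in masked_positions:
        z = logits[i] - logits[i].max()          # stabilized softmax
        probs = np.exp(z) / np.exp(z).sum()
        losses.append(-np.log(probs[targets[i]]))
    return float(np.mean(losses))

def rtd_loss(scores, is_replaced):
    """Discriminator binary cross-entropy over ALL positions."""
    probs = 1.0 / (1.0 + np.exp(-scores))        # sigma(w^T h_i)
    y = is_replaced.astype(float)
    return float(np.mean(-(y * np.log(probs) + (1.0 - y) * np.log(1.0 - probs))))

def electra_objective(l_mlm, l_rtd, lam=50.0):
    """Overall objective L = L_MLM + lambda * L_RTD."""
    return l_mlm + lam * l_rtd
```

Note how λ = 50 makes the dense, easier RTD signal dominate the gradient scale, which is why the discriminator trains on every position while the generator only sees the masked ones.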

Challenges
Biased Learning. Though the ELECTRA training method is simple and effective, treating corpora from a single perspective could cause biased learning. In the process of MLM and RTD, there exists an inherent flaw: G might predict an appropriate but not original token at the [MASK] position, since the original information is absent there, and D is then trained to label such a semantically reasonable token as replaced, which introduces noisy supervision.

Methodology
In this section, we first formulate three self-supervision courses which encourage models to treat data in a multi-perspective way. Then two self-correction courses are elaborated, derived from the course-like relationship between G and D. These various courses are woven into the entirety of the multi-perspective course learning method.

Self-supervision Course
The essence of large-scale pre-training, undoubtedly, is to find a way to take full advantage of massive raw corpora. ELECTRA has provided an applicable paradigm for models to construct semantic representations through ennoising and denoising. Based on this framework, we extend the perspectives from which models look at sequences, and propose three binary classification tasks in order to improve training efficiency, alleviate biased learning, and balance label distributions.

Replaced Token Detection
On account of the compelling performance of pre-training language models with masked language modeling, we retain the replaced token detection task from ELECTRA. Following the previous symbol settings, given an original input sequence X = [x_1, x_2, ..., x_n], we first mask it into X_mask = [x_1, x_2, ..., [MASK]_i, ..., x_n], which is then fed into G to get the filled-out sequence X_rtd = [x_1, x_2, ..., x_i^rtd, ..., x_n] via generator samples. Finally, D is tasked to figure out which tokens are original or replaced. As illustrated in Section 2, G and D are trained with L_MLM and L_RTD respectively. MLM endows G with fundamental contextual semantic construction through a cloze test, and RTD is a higher-level course for D that makes the model drill down into context to seek out dissonance in the pseudo sequence X_rtd.
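The RTD corruption pipeline described above can be illustrated roughly as follows. This is our own toy sketch: `sample_fn` is a stand-in for the real generator, and all names are illustrative:

```python
import random

def corrupt_for_rtd(tokens, sample_fn, mask_rate=0.15, mask_token="[MASK]",
                    rng=None):
    """Build (X_mask, X_rtd, labels): mask a random position set r, let the
    generator fill the blanks, then label each token original/replaced."""
    rng = rng or random.Random(0)
    n = len(tokens)
    r = set(rng.sample(range(n), max(1, int(n * mask_rate))))
    x_mask = [mask_token if i in r else t for i, t in enumerate(tokens)]
    x_rtd = list(tokens)
    for i in sorted(r):
        x_rtd[i] = sample_fn(x_mask, i)          # generator sample at slot i
    labels = [x_rtd[i] != tokens[i] for i in range(n)]
    return x_mask, x_rtd, labels
```

When the generator happens to restore the original token, the label reverts to "original", which is exactly why the replaced labels grow scarce as G improves.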

Swapped Token Detection
Intuitively, recombination tasks contribute to sequence-related learning. As mentioned in Section 3, information absence at the [MASK] position causes unreliability of the generated pseudo words. Whether the filled sample is appropriate or not, biased learning occurs and interferes with the training of D. Thus, to preserve the primeval message for precise prediction without slashing the difficulty of the task, we present the swapped token detection (STD) course to sharpen the model's structure perception capability through a word rearrangement task.
For an input sentence X = [x_1, x_2, ..., x_n], a random position set s is chosen for the swapping operation.² Precisely, tokens at the chosen positions are extracted, reordered and filled back into the sentence. G is required to restore the swapped sentence X_swap to X, and the adjacent D is tasked to discriminate which tokens are swapped in X_std via generator samples. Denoting the contextualized representations from G as {h_i}_{i=1}^n, the training process of swapped language modeling (SLM) is formulated below:

  L_SLM = E[ −∑_{i∈s} log p_G(x_i | X_swap) ],   p_G(x_i | X_swap) = exp(e_{x_i}^⊤ h_i) / ∑_{t=1}^{|V|} exp(e_t^⊤ h_i),

where {e_t}_{t=1}^{|V|} are the embeddings of the tokens in V. Note that the vocabulary V stays the same across all courses, because this keeps the generation of G in a consistent and natural environment, even though the correct answer lies in the pending sequence during SLM. SLM is conducted only on tokens at the swapped positions.
SLM brings G to make reasonable, even original, predictions at the swapped positions, shifting the attention of training from guessing a single word to comprehensively understanding the structure and logic of the whole sequence. The swapped token detection (STD) course of D is naturally formed as a déjà vu binary classification. X_std is created by replacing the swapped positions with generator samples:

  x_i^std ∼ p_G(x | X_swap) if i ∈ s;   x_i^std = x_i otherwise.   (9)

D is trained to distinguish whether the tokens in X_std are original or not via the swapped token detection (STD) loss:

  L_STD = E[ −∑_{i=1}^n ( 1(x_i^std = x_i) log σ(w_s^⊤ h_i) + 1(x_i^std ≠ x_i) log(1 − σ(w_s^⊤ h_i)) ) ],

where w_s is a trainable parameter independent from w, since each course uses its own binary classification head.
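The word-rearrangement corruption that produces X_swap can be sketched as follows (a toy illustration under our own names; the real corruption operates on subword ids inside the training batch):

```python
import random

def corrupt_for_std(tokens, swap_rate=0.15, rng=None):
    """Build X_swap: extract the tokens at a random position set s, permute
    them, and refill the sentence.  Labels mark positions whose token
    changed; the token multiset of the sentence is preserved."""
    rng = rng or random.Random(0)
    n = len(tokens)
    s = sorted(rng.sample(range(n), max(2, int(n * swap_rate))))
    shuffled = [tokens[i] for i in s]
    rng.shuffle(shuffled)
    x_swap = list(tokens)
    for pos, tok in zip(s, shuffled):
        x_swap[pos] = tok
    labels = [x_swap[i] != tokens[i] for i in range(n)]
    return x_swap, labels
```

Unlike masking, every original token remains visible somewhere in X_swap, which is what preserves the "primeval message" the course relies on.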

Inserted Token Detection
With the pace of pre-training with MLM and SLM, G inevitably becomes able to produce much more harmonious sequences as its semantic learning consummates. Meanwhile, the label distribution of corrupted sentences provided by G becomes strikingly imbalanced, since almost all tokens exactly match the words in the original sentence. Thus, the training of D faces serious interference and a lack of efficiency. The propensity of the training labels leads to the propensity of D's judgment.
To alleviate the issue of label imbalance, and to seek another perspective on the data, we propose the inserted token detection (ITD) course. For a given sentence X = [x_1, x_2, ..., x_n], [MASK] tokens are randomly inserted into the sequence at an inserted position set i. The extended sentence X_in contains several illusory vacancies waiting for the prediction of G. Subsequently, D has to figure out which tokens should not be present in the generated sentence X_itd, trained with the following loss:

  L_ITD = E[ −∑_j ( 1(j ∈ i) log σ(w_i^⊤ h_j) + 1(j ∉ i) log(1 − σ(w_i^⊤ h_j)) ) ],

where w_i is the binary classification head of the ITD course. On the one hand, the ratio of real and inserted words is fixed, solving the label imbalance to some extent. On the other hand, training on void locations tones up the generation capability of the model.
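The insertion step can be sketched as below (again a toy version with our own names). The key property is that the number of inserted slots is fixed by the insertion rate, so the positive/negative label ratio seen by D cannot collapse no matter how accurate G becomes:

```python
import random

def corrupt_for_itd(tokens, insert_rate=0.15, mask_token="[MASK]", rng=None):
    """Build X_in by inserting [MASK] slots at random gaps.  G later fills
    the slots; D must flag every filled slot as 'inserted'."""
    rng = rng or random.Random(0)
    k = max(1, int(len(tokens) * insert_rate))
    # Sample k gaps (0..n inclusive); insert right-to-left so earlier
    # gap indices stay valid.
    gaps = sorted(rng.sample(range(len(tokens) + 1), k), reverse=True)
    x_in, inserted = list(tokens), [False] * len(tokens)
    for g in gaps:
        x_in.insert(g, mask_token)
        inserted.insert(g, True)
    return x_in, inserted
```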
The overall structure of the proposed self-supervision courses is presented in Figure 1. All courses are jointly conducted on the same data and within the same computing steps.

Self-correction Course
According to the above self-supervision courses, a competition mechanism between G and D seems to take shape. Facing the same piece of data, G tries to reform the sequence in many ways, while D yearns to figure out all the jugglery caused previously. However, the shared embedding layer of these two encoders is their only bridge of communication, which is apparently insufficient.
To strengthen the link between the two components, and to provide more supervisory information for pre-training, we conduct an intimate dissection of the relationship between G and D. Take the procedure of RTD for example. For each token x_i^rtd in the corrupted sentence X_rtd, which is thereafter fed into D, we identify and document its label by comparing it with the token x_i at the corresponding position in X. After the discrimination process of D, this token is binary-classified as original or replaced. As shown in Table 1, there exist four situations of discrimination results for x_i. pos_1: G predicts the correct answer at the [MASK] position and D successfully makes a good judgment; no additional operation needs to be conducted for this kind of token. pos_2: G fills in an alternative to replace the original token and D inaccurately views it as original; this means G produces an appropriate expression that forms a harmonious context, as mentioned in Section 3, which makes it difficult for D to distinguish. pos_3: D makes a clumsy mistake of incorrectly annotating an original token as replaced. pos_4: G fills in an impertinent token at the [MASK] position and D easily figures it out.
To sum up, on the one hand, G needs to re-generate tokens at pos_4, since the initial alternatives are inappropriate and unchallenging for D. As shown in Figure 2, too many [MASK] tokens are placed at important locations rich in information, leading to the erratic generation "thanked". Considering that other [MASK] tokens in the same sequence may interfere with the generation of tokens at pos_4, we restore the other [MASK] tokens to the original tokens for the convenience of the re-generation process. On the other hand, D is expected to re-discriminate tokens at pos_2 and pos_3. When there exist tokens at pos_4 in a sequence, these inappropriate tokens may seriously disturb D's decisions on other tokens, leading to the consequences of pos_2 and pos_3. Take the sentence in Figure 2 for example: the serious distraction "thanked" makes D falsely judge "meal" as replaced. So we replace the tokens at pos_4 in X_rtd with the original tokens to alleviate this kind of interference, and conduct re-discrimination training on D at pos_2 and pos_3. By sorting out and analyzing errors, a "correction notebook" for G and D is built, guiding the re-generation and re-discrimination training. Note that this is not just redoing the problem; rather, we redesign the context for each kind of issue. Thus, L_re-MLM and L_re-RTD are designed as the learning objectives for the self-correction course of RTD. Likewise, L_re-SLM and L_re-STD present the training losses of the self-correction course of STD.³ Since there are no original tokens at the inserted [MASK] positions, no revision is conducted for ITD. The two proposed self-correction courses bridge the chasm between G and D through introspection and melioration, and provide sample-efficient secondary supervision for the same piece of data.
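The bucketing of Table 1 and the two revision corpora it induces can be sketched as follows. This is a simplified toy version under our own names; D's predictions are given as a boolean list, and the example in the test is hypothetical:

```python
def correction_notebook(x, x_rtd, d_says_replaced, mask_token="[MASK]"):
    """Bucket every token into pos1..pos4 from the (actually-replaced,
    D-flagged-replaced) confusion matrix, then build the re-generation
    and re-discrimination inputs described in the text."""
    buckets = {1: [], 2: [], 3: [], 4: []}
    case = {(False, False): 1, (True, False): 2,
            (False, True): 3, (True, True): 4}
    for i in range(len(x)):
        buckets[case[(x_rtd[i] != x[i], d_says_replaced[i])]].append(i)
    # Re-generation input for G: [MASK] only at pos4, originals elsewhere.
    regen = [mask_token if i in buckets[4] else x[i] for i in range(len(x))]
    # Re-discrimination input for D: pos4 tokens restored to originals so
    # they stop distracting D at pos2/pos3.
    redisc = [x[i] if i in buckets[4] else x_rtd[i] for i in range(len(x))]
    return buckets, regen, redisc
```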
Finally, G and D are co-trained with the three self-supervision courses as well as the two self-correction courses. The proposed MCL dissects one and the same sequence profoundly and comprehensively, without incurring any additional inference or memory costs.

Setup
Pre-training Settings. We implement the experiments in two settings: base and tiny. Base is the standard training configuration of BERT Base (Devlin et al., 2019). The model is pre-trained on English Wikipedia and BookCorpus (Zhu et al., 2015), containing 16 GB of text with 256 million samples. We set the maximum length of the input sequence to 512 and the learning rate to 5e-4. Training lasts 125K steps with a 2048 batch size. We use the same corpus as CoCo-LM (Meng et al., 2021) and a 64K cased SentencePiece vocabulary (Kudo and Richardson, 2018). The details of the pre-training hyperparameters are listed in Appendix A. Tiny conducts the ablation experiments on the same corpora with the same configuration as the base setting, except that the batch size is 512.

Model Architecture
The layout of our model architecture remains the same as (Meng et al., 2021) in both the base and tiny settings. D consists of a 12-layer Transformer with 768 hidden size, plus T5 relative position encoding (Raffel et al., 2020). G is a shallow 4-layer Transformer with the same hidden size and position encoding. After pre-training, we discard G and use D in the same way as BERT, with a classification layer for downstream tasks.

Downstream Tasks
To verify the effectiveness of the proposed methods, we conduct evaluation experiments on various downstream tasks. We evaluate on the General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2019) and the Stanford Question Answering 2.0 (SQuAD 2.0) dataset (Rajpurkar et al., 2018). As for the evaluation metrics of the GLUE tasks, we adopt Spearman correlation for STS, Matthews correlation for CoLA, and accuracy for the others. For SQuAD 2.0, in which some questions are unanswerable by the passage, the standard evaluation metrics of Exact Match (EM) and F1 scores are adopted. More details of the GLUE and SQuAD 2.0 benchmarks are listed in Appendix B.
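For reference, the CoLA metric mentioned above can be computed without external dependencies; a minimal sketch for binary labels:

```python
import math

def matthews_corr(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (the CoLA metric).
    Returns a value in [-1, 1]; 0 when any confusion-matrix margin is zero."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

Unlike accuracy, MCC stays informative on CoLA's skewed acceptable/unacceptable split, which is why the benchmark prescribes it.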
Baselines. Various pre-trained models are listed and compared in the base setting. All numbers are from reported results in recent research. When multiple papers report different scores for the same method, we use the highest of them for comparison.
Implementation Details. Our implementation builds upon the open-source implementation of fairseq (Ott et al., 2019). With 128 A100 GPUs (40 GB memory), one pre-training run takes about 24 hours in the base setting. The fine-tuning costs are the same as BERT plus relative position encodings. More details of fine-tuning are listed in Appendix C.

Evaluation Results
We first pre-trained our model with the proposed MCL method, and then fine-tuned it with the training sets of the 8 single tasks in the GLUE benchmark. We conducted a hyperparameter search for all downstream tasks and report the results in Table 2.

Model (base setting)                  EM     F1
BERT (Devlin et al., 2019)            73.7   76.3
XLNet (Yang et al., 2019)             78.5   81.3
RoBERTa (Liu et al., 2019)            77.7   80.5
DeBERTa (He et al., 2021)             79.3   82.5
ELECTRA (Clark et al., 2020)          79.7   82.6
+HP Loss +Focal (Hao et al., 2021)

We also evaluated the proposed MCL on the SQuAD 2.0 dataset, an important reading comprehension dataset that requires the machine to extract the answer span given a document along with a question. The results of Exact Match (EM) and F1 score (F1) are displayed in the top half of Table 3. Consistently, our model significantly improves over the ELECTRA baseline and achieves a banner score compared with other same-size models. Specifically, under the base setting, the proposed MCL improves absolute performance over ELECTRA by 3.2 points (EM) and 3.3 points (F1). Our model also outperforms all other previous models by an overt margin.
These compelling results demonstrate the effectiveness of the proposed MCL. With an equal amount of training corpus, plus the slight computing cost of extra forward propagation, MCL tremendously advances the ELECTRA baseline, showing its sample efficiency. In other words, multi-perspective course learning gives the model a deeper and more comprehensive insight into the underlying meaning of corpora, which provides more valuable information for the pre-training process.

Ablation Study
In order to dive into the role of each component of the proposed MCL, an ablation study is conducted under the tiny setting. Both the GLUE and SQuAD 2.0 datasets are utilized for evaluation, and the ablation results are listed in the bottom halves of Table 2 and Table 3. Bolstered by several curve graphs of loss and accuracy during pre-training, every course is discussed below.⁴

RTD The most basic component, which also represents ELECTRA itself. Its performance is employed as the baseline against which the other additions are compared. Not only the scores but also the curves are taken as important references.
STD This course helps the model obtain better structure perception capability through a more harmonious contextual understanding. STD improves ELECTRA on all tasks in the GLUE and SQuAD 2.0 datasets. It is worth noting that the scores on the CoLA task surprisingly stand out from the crowd. The Corpus of Linguistic Acceptability (CoLA) is used to predict whether an English sentence is linguistically acceptable or not. Apparently, pre-training on word rearrangement indeed lifts the global intellection of the model, making it focus more on structure and logic rather than word prediction. Even the best CoLA result of 71.25 comes from the re-STD course, which further embodies the effectiveness of STD.
ITD This course is tasked to alleviate label imbalance. As shown in Figure 5, the replace rate reflects the prediction accuracy of G. Accompanied by MLM and SLM, G predicts more correct words at the [MASK] positions, causing the "replaced" labels to become scarce for the training of D. By adding inserted [MASK] tokens, the replace rate has a fixed lower limit corresponding to the inserted proportion, leading to a balanced label distribution. Besides, ITD shows great improvements over ELECTRA, especially on the SST-2 dataset. The Stanford Sentiment Treebank (SST-2) provides a dataset for sentiment classification that needs to determine whether the sentiment of a sentence extracted from movie reviews is positive or negative. Predicting the illusory [MASK] makes the model focus more on content comprehension, which may be helpful for sentiment classification.
Self-correction Course Revision always acts as a difficult assignment because of the challenge of reversing stereotypes. As shown in Figure 5, the losses of G and D during self-correction training generally exceed those during self-supervision training, demonstrating the difficulty. However, the replace accuracy of the re-RTD course goes higher than the baseline, certifying its effectiveness. Although self-correction training outperforms the other components on all downstream tasks, the phenomenon of "tug-of-war" dynamics is worth exploring. The scores listed in the last three rows of Table 2 almost touch each other, and the optimal results for single tasks do not always appear under the same model. This means the multi-perspective courses may interfere with each other in attempts to pull parameters in different directions, which seems even more apparent under the self-correction courses, where secondary samples are well designed for bootstrapping. To alleviate this situation and further improve the effectiveness of training, we found a feasible solution, elaborated in Section 5.5.

Sample Efficiency Comparison
To demonstrate that the proposed MCL is sample-efficient, we conduct a comparative trial between MCL and ELECTRA. As shown in Figure 3, the prevalent task MNLI is chosen for evaluation. For every 25K steps of pre-training, we saved the model and fine-tuned it with the same configuration mentioned in Section 5.1. Obviously, MCL preponderates over the ELECTRA baseline at every training node, obtaining 87.8 points at 25K steps and demonstrating its enormous learning efficiency even on small pieces of corpora.

Course Soups Trial
Inspired by model soups (Wortsman et al., 2022), which average many models from a hyperparameter sweep during fine-tuning, we find similarities and bring this idea to MCL in a task sweep during pre-training. Different courses lead the model to lie in different low-error basins, and co-training multiple courses may create the "tug-of-war" dynamics. To solve the training conflicts, and to further improve the learning efficiency of models in the later pre-training stage, we conduct a "course soups" trial.
For the ingredients in the soups, we arrange all combinations of the 4 losses in the self-correction courses, training them into 14 single models while retaining the structure of the self-supervision courses. Then all ingredients are merged through uniform and weighted integration. The results lie in Figure 4. The optimal result is obtained by the weighted soup, which improves the average GLUE score by 0.19 absolute points over our best MCL model. The results show that course soups suggest an effective way to guide the later training of the model by separating multiple objectives and combining them at last. More detailed scores are listed in Appendix E.
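The uniform and weighted merges can be sketched generically over parameter dictionaries. This is a schematic of checkpoint weight averaging in the spirit of model soups, not the paper's code; names are ours:

```python
def uniform_soup(state_dicts):
    """Average the parameters of several course-combination checkpoints."""
    return {k: sum(sd[k] for sd in state_dicts) / len(state_dicts)
            for k in state_dicts[0]}

def weighted_soup(state_dicts, weights):
    """Weighted merge, e.g. with weights proportional to each ingredient's
    dev-set score."""
    total = sum(weights)
    return {k: sum(w * sd[k] for sd, w in zip(state_dicts, weights)) / total
            for k in state_dicts[0]}
```

Since all ingredients start from the same self-supervision backbone, their parameters stay close enough for plain averaging to land in a shared low-error basin, which is the premise the soups trial tests.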

Conclusion
This paper proposes the multi-perspective course learning method, containing three self-supervision courses to improve learning efficiency and balance label distributions, as well as two self-correction courses to create a "correction notebook" for revision training. Besides, the course soups method is designed as a novel approach for efficient pre-training. Experiments show that our method significantly improves ELECTRA's performance and overshadows multiple advanced models under the same settings, verifying the effectiveness of MCL.

Limitations
Although the proposed method has shown great performance in alleviating the issues of biased learning and deficient interaction, which are common problems among ELECTRA-style pre-training models, we should realize that the proposed method can still be further improved. For example, the inherent flaw of RTD mentioned in Section 3 could only be relieved rather than solved. Further task design regarding this issue is worth studying. Besides, although the results show great performance, more effort is required to explore the hidden impact of each course, which will help the application of the proposed model in the future.

A Hyperparameters for Pre-training
As shown in Table 4, we present the hyperparameters used for pre-training MCL in the base setting. We follow the optimization hyperparameters of CoCo-LM (Meng et al., 2021).
B Downstream Tasks

Semantic similarity tasks aim to predict whether two sentences are semantically equivalent or not. The challenge lies in recognizing rephrasings of concepts, understanding negation, and handling syntactic ambiguity.

Classification The Corpus of Linguistic Acceptability (CoLA) is used to predict whether an English sentence is linguistically acceptable or not. The Stanford Sentiment Treebank (SST-2) provides a dataset for sentiment classification that needs to determine whether the sentiment of a sentence extracted from movie reviews is positive or negative.
As a widely used MRC benchmark dataset, SQuAD 2.0 is a reading comprehension dataset that requires the machine to extract the answer span given a document along with a question. We select the v2.0 version to keep the focus on pure span extraction performance. Two official metrics are used to evaluate model performance: Exact Match (EM) and the softer F1 score, which measures the average overlap between the prediction and the ground-truth answer at the token level.
The summary of the General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2019) is shown in Table 5.

C Hyperparameters for Fine-tuning
Table 6 presents the hyperparameters used for fine-tuning on SQuAD 2.0 (Rajpurkar et al., 2018) and the GLUE benchmark (Wang et al., 2019), following CoCo-LM for fair comparison. On the development sets, the hyperparameters are searched based on the average performance of five runs.

Figure 1 :
Figure 1: The overall structure of the self-supervision courses. <m> denotes [MASK]. A capital letter stands for a token, and letters in red indicate operated positions.

Figure 2 :
Figure 2: An example for self-correction course of RTD.

Figure 4 :
Figure 4: Average GLUE results of the course soups.

Table 1 :
The confusion matrix of output tokens from D. ✓ denotes that D makes a correct judgment, while ✗ presents the situation of wrong discrimination.

Table 2 :
All evaluation results on the GLUE datasets for comparison. Acc, MCC, PCC denote accuracy, Matthews correlation, and Spearman correlation respectively. Reported results are medians over five random seeds.

Table 3 :
All evaluation results on SQuAD 2.0 datasets.

Table 4 :
Hyperparameters for pre-training, listed for comparison. Note that all losses conducted on D are multiplied by λ (set as 50), which is a hyperparameter used to balance the training pace of G and D.