Instance Regularization for Discriminative Language Model Pre-training

Discriminative pre-trained language models (PrLMs) can be generalized as denoising auto-encoders that work with two procedures, ennoising and denoising. First, an ennoising process corrupts texts with arbitrary noising functions to construct training instances. Then, a denoising language model is trained to restore the corrupted tokens. Existing studies have made progress by optimizing independent strategies of either ennoising or denoising. They treat training instances equally throughout the training process, with little attention to the individual contribution of those instances. To model explicit signals of instance contribution, this work proposes to estimate the complexity of restoring the original sentences from corrupted ones in language model pre-training. The estimations involve the corruption degree in the ennoising data construction process and the prediction confidence in the denoising counterpart. Experimental results on natural language understanding and reading comprehension benchmarks show that our approach improves pre-training efficiency, effectiveness, and robustness. Code is publicly available at https://github.com/cooelf/InstanceReg.


Introduction
Leveraging self-supervised objectives to pre-train language models (PrLMs) on massive unlabeled data has shown success in natural language processing (NLP) (Peters et al., 2018; Radford et al., 2018; Devlin et al., 2019; Dong et al., 2019; Lan et al., 2020; Clark et al., 2020; Luo et al., 2021; Zhu et al., 2022). A wide landscape of pre-training objectives has been produced, such as autoregressive (Radford et al., 2018; Yang et al., 2019) and autoencoding (Devlin et al., 2019; Joshi et al., 2020) language modeling objectives, which serve as the principled mechanisms to teach language models general-purpose knowledge through pre-training; the resulting PrLMs can then be fine-tuned for downstream tasks. Based on these unsupervised functions, three classes of PrLMs have been proposed: autoregressive language models (e.g. GPT (Radford et al., 2018)), autoencoding models (e.g. BERT (Devlin et al., 2019)), and encoder-decoder models (e.g. BART (Lewis et al., 2020a) and T5 (Raffel et al., 2020)). In this work, we focus on the research line of autoencoding models, also known as discriminative PrLMs, which have achieved impressive performance on natural language understanding (NLU).
Although the discriminative PrLMs may vary in language modeling functions or architectures as discussed above, they can be generalized as denoising auto-encoders, which contain two procedures, ennoising and denoising. The pre-training procedure is illustrated in Figure 1.
2) Denoising enables a language model to predict missing or otherwise corrupted tokens in the input sequences. Recent studies focus on designing improved language modeling functions to mitigate discrepancies between the pre-training phase and the fine-tuning phase. Yang et al. (2019) reformulate MLM in XLNet by restoring the permuted tokens in factorization order, such that the input sequence is autoregressively generated after permutation. In addition, using synonyms for the masking purpose (Cui et al., 2020) and simple pre-training objectives based on token-level classification tasks (Yamaguchi et al., 2021) have also proved effective as MLM alternatives.
Most of the existing studies of PrLMs fall into the scope of either investigating better ennoising operations or more effective denoising strategies. They treat training instances equally throughout the training process. Little attention is paid to the individual contribution of those instances. In standard MLM ennoising, randomly masking different tokens leads to different degrees of corruption, which in turn cause different levels of difficulty in sentence restoration (as shown in Figure 1) and increase the uncertainty in restoring the original sentence structure during the denoising process. For example, if "not" is masked, the corrupted sentence tends to have a contrary meaning.
In this work, we are motivated to estimate the complexity of restoring the original sentences from corrupted ones in language model pre-training, to provide explicit regularization signals that encourage more effective and robust pre-training. Our approach includes two penalties: 1) an ennoising corruption penalty that measures the distribution disparity between the corrupted sentence and the original sentence, to capture the corruption degree in the ennoising process; 2) a denoising prediction penalty that measures the distribution difference between the restored sequence and the original sentence, to capture the sentence-level prediction confidence in the denoising counterpart. Experiments show that language models trained with our regularization terms can yield better performance and become more robust against adversarial attacks.

Related Work
Training powerful large-scale language models on a large unlabeled corpus with self-supervised objectives has attracted much attention; such training commonly works in two procedures, ennoising and denoising. The most representative task for pre-training is MLM, which is introduced in Devlin et al. (2019) to pre-train a bidirectional BERT. A spectrum of ennoising extensions has been proposed to enhance MLM further and alleviate its potential drawbacks. These fall into two categories: 1) mask units and 2) noising scheme. Mask units correspond to the language modeling units that serve as knowledge carriers at different granularities. The variants focusing on mask units include the standard subword masking (Devlin et al., 2019), span masking (Joshi et al., 2020), and n-gram masking (Levine et al., 2021; Li and Zhao, 2021). For the noising scheme, BART (Lewis et al., 2020a) corrupts text with arbitrary noising functions, including token deletion, text infilling, and sentence permutation, in conjunction with MLM. UniLM (Dong et al., 2019) extends the mask prediction to generation tasks by adding auto-regressive objectives. XLNet (Yang et al., 2019) proposes permuted language modeling to learn the dependencies among the masked tokens. MacBERT (Cui et al., 2020) suggests using similar words for the masking purpose. Yamaguchi et al. (2021) also investigate simple pre-training objectives based on token-level classification tasks as replacements for MLM, which are often computationally cheaper and result in performance comparable to MLM. In addition, ELECTRA (Clark et al., 2020) proposes a novel training objective called replaced token detection, which is defined over all input tokens.
Although the above studies have adequately investigated how to reduce the mismatch between pre-training and fine-tuning tasks, an essential problem of the common denoising mechanism lacks attention. Constructing training examples with ennoising operations breaks the sentence structure, whether the noising function is based on replacement, addition, or deletion. In extreme cases, the destruction leads to completely different sentences, making it difficult for the model to predict the corrupted tokens. Therefore, in this work, we propose to enhance the pre-training quality by using instance regularization (IR) terms to estimate the restoration complexity from both the ennoising and the denoising side.
The proposed approach is partially related to some prior studies of hardness measurement in training deep learning models (Lin et al., 2017; Kalantidis et al., 2020; Hao et al., 2021), whose focus is to guide the model to pay special attention to hard examples and prevent the vast number of easy negatives from overwhelming the training process. In contrast to optimizing the training process by heuristically finding the hard negatives, this work does not need to distinguish hard examples from ordinary ones. Instead, it measures the corruption degree between the masked sentence and the original sentence and uses that degree as an explicit training signal.

Methodology
This section will start by formulating the ennoising and denoising processes for building PrLMs and then introduce our instance regularization approach to estimate the restoration complexity in both ennoising and denoising views.

Preliminary: Denoising Auto-Encoders
The training procedure for discriminative language models includes ennoising and denoising processes, as described below. For the sake of simplicity, we take the widely-used MLM as a typical example to describe the ennoising process.
Ennoising Given a sentence $W = \{w_1, w_2, \dots, w_n\}$ with $n$ tokens, we randomly mask some percentage of the input tokens with a special mask symbol [MASK] and then predict those masked tokens. Suppose that there are $m$ tokens replaced by the mask symbol. Let $D = \{k_1, k_2, \dots, k_m\}$ denote the set of masked positions; we then have $W'$ as the masked sentence and $M = \{w_{k_1}, w_{k_2}, \dots, w_{k_m}\}$ as the masked tokens. In the following part, we use $w_k$ to denote each masked token for simplicity.
Denoising In the denoising process, a language model is trained to predict the masked token based on the context. $W'$ is fed into the PrLM to obtain the contextual representations from the last Transformer layer, which is denoted as $H$.
Training The training objective is to maximize the following objective: (1)
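For reference, the standard MLM log-likelihood can be written in the notation above as the sketch below. The parameter symbol $\theta$ and the absence of any normalization over the $m$ masked positions are assumptions made for illustration and may differ from the exact form of Eq. 1.

```latex
\mathcal{L}_{mlm} = \sum_{k \in D} \log p_{\theta}\left(w_k \mid W'\right)
```

Here each term is the log-probability that the model, given the masked sentence $W'$, assigns to the original token $w_k$ at a masked position $k \in D$.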

Instance Regularization
In this part, we will introduce our instance regularization approach, which involves two sides: corruption degree in the ennoising data construction process and the sequence-level prediction confidence in the denoising counterpart.
During denoising, the PrLM trained by MLM is required to predict the original masked tokens $w_k$ given the hidden states $H$ of the corrupted input $W'$. Let $w'_k$ denote the predicted tokens. We replace the mask symbols by filling $w'_k$ back into $W'$. As a result, we have the predicted sequence, denoted as $P = \{p_1, p_2, \dots, p_n\}$, where the tokens at the positions in $D$ are the predicted ones and all other tokens are the same as the original ones in $W$.
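The construction of the predicted sequence can be illustrated with a minimal sketch. The function name `build_predicted_sequence`, the toy sentence, and the 0-based positions are illustrative assumptions, not part of the released code.

```python
def build_predicted_sequence(masked_tokens, masked_positions, predicted_tokens):
    """Fill the predicted tokens back into the masked sentence W' to obtain P.

    Positions are 0-based here (an assumption made for this sketch).
    """
    if len(masked_positions) != len(predicted_tokens):
        raise ValueError("need exactly one prediction per masked position")
    predicted_sequence = list(masked_tokens)  # copy so W' is left untouched
    for position, token in zip(masked_positions, predicted_tokens):
        predicted_sequence[position] = token
    return predicted_sequence


# Toy example: W is the original sentence, W' the masked one.
original = ["the", "movie", "is", "not", "good"]
masked = ["the", "[MASK]", "is", "[MASK]", "good"]
positions = [1, 3]              # the set D of masked positions
predictions = ["film", "not"]   # hypothetical model predictions w'_k

predicted = build_predicted_sequence(masked, positions, predictions)
```

In this toy case the model restores "not" exactly but proposes "film" for "movie", so P differs from W at one position even though the prediction is a plausible alternative.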
The corruption breaks the sentence structure and easily causes semantic deviation in the sentence representations. According to our observation, the hidden states vary dramatically before and after the token corruption. Similar findings were reported by Wang et al. (2021), who show that small disturbances can inveigle PrLMs into making false predictions. From a more general perspective, replacing a modest percentage of tokens may result in a totally different sentence, let alone the imperceptible disturbances used for textual attacks.
Therefore, we propose two approaches called ennoising corruption penalty (ECP) and denoising prediction penalty (DPP) as the regularization terms in the training process to alleviate the issue. Figure 2 overviews the overall procedure. ECP measures the semantic shift from the original sentence to the corrupted one as an explicit signal to help the model distinguish easy and hard examples and learn with different weights, which can be seen as instance weighting compared with MLM. As the complement, DPP measures the sequence-level semantic distance between the predicted and the original sentence to supplement the rough token-level matching of MLM, thus transforming the token prediction task into sequence matching that pays more attention to sentence-level semantics.
Both methods are used for estimating the difficulty of restoring the whole sequence from the corrupted one. Here we go back to the formulation in MLM. As shown in Figure 2, given the original sequence $W$, the masked sequence $W'$, and the predicted sequence $P$, we obtain the contextualized representations from the PrLM. Note that we already have the contextualized representation $H$ for the input sequence $W'$ in the vanilla MLM training. Similarly, we feed $W$ and $P$ to the PrLM, and the corresponding hidden states are written as $\hat{H}$ and $\tilde{H}$, respectively. Then, $H$, $\hat{H}$ and $\tilde{H}$ are leveraged as the elements for the corruption agreement and semantic agreement.
Ennoising Corruption Penalty After we get the distributions $H$ and $\hat{H}$, we measure the extent of the corruption degree after ennoising by calculating the distribution difference between the masked and the original representations after normalization: where KL refers to Kullback-Leibler (KL) divergence. Concretely, we apply softmax on the two matrices along the hidden dimensions to have two distributions. Then, we calculate KL divergence between the two distributions for each position in each sentence. Intuitively, higher $L_{ECP}$ means the corruption is more severe, and so is the gap between the ennoised instance and the denoised prediction. Therefore, the model is supposed to update the gradient more significantly for those "harder" training instances.
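The computation described above (softmax along the hidden dimension, then a per-position KL divergence) can be sketched in plain Python. The KL direction (masked against original), the mean over positions, and the function names are assumptions for illustration; the authors' PyTorch implementation may differ.

```python
import math


def softmax(values):
    """Normalize one hidden-state vector into a probability distribution."""
    peak = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length.

    Assumes q has no zero entries where p is positive, which holds for
    softmax outputs unless an exponential underflows.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def corruption_penalty(masked_states, original_states):
    """Mean per-position KL between softmax-normalized hidden states.

    masked_states   -- H, one vector per token of the masked sentence W'
    original_states -- H-hat, one vector per token of the original sentence W
    """
    per_position = [
        kl_divergence(softmax(h), softmax(h_hat))
        for h, h_hat in zip(masked_states, original_states)
    ]
    return sum(per_position) / len(per_position)


# Toy hidden states with two positions and three hidden dimensions.
H = [[0.1, 0.2, 0.3], [1.0, -1.0, 0.0]]
H_hat = [[0.1, 0.2, 0.3], [0.0, 0.5, 2.0]]
penalty = corruption_penalty(H, H_hat)
```

The first position is identical in both inputs and contributes zero, so the penalty here is driven entirely by the second, corrupted position.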
Denoising Prediction Penalty In denoising language modeling, the model may yield reasonable predictions that are nonetheless counted as wrong, because they do not match the single gold token for each training case under token-level cross-entropy. Therefore, we estimate the semantic agreement between the predicted sequence and the original gold sequence by guiding the probability distribution of model predictions $\tilde{H}$ to match the expected probability distribution $\hat{H}$. We have: where $L_{DPP}$ is applied as the degree to reflect the sentence-level semantic mismatch.
The semantic agreement method works as a means of soft regularization to capture the sequence-level similarity as a supplement to the standard hard token-level matching in cross-entropy.
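One possible reading of the denoising prediction penalty, by analogy with the per-position KL computation described for ECP, is sketched below. Here $\tilde{H}_i$ and $\hat{H}_i$ denote the hidden states at position $i$ of the predicted sequence $P$ and the original sequence $W$. The KL direction and the averaging over the $n$ positions are assumptions for illustration and are not confirmed by the text.

```latex
\mathcal{L}_{DPP} = \frac{1}{n} \sum_{i=1}^{n}
  \mathrm{KL}\left( \mathrm{softmax}(\tilde{H}_i) \,\middle\|\, \mathrm{softmax}(\hat{H}_i) \right)
```

Under this reading, a prediction that differs from the gold token but keeps the sentence representation close to that of $W$ incurs only a small penalty, which matches the stated role of DPP as a soft complement to token-level cross-entropy.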
In language model pre-training, we minimize $L_{ECP}$ and $L_{DPP}$. Thus, the loss function is written as

Experiments

Setup
To verify the effectiveness of the proposed methods, we conduct pre-training experiments and fine-tune the pre-trained models on downstream tasks. All code is implemented using PyTorch (Paszke et al., 2017). The experiments are run on 8 NVIDIA GeForce RTX 3090 GPUs.

Pre-training
We employ BERT and ELECTRA as the backbone PrLMs and implement our methods during the pre-training. For the pre-training corpus, we use the English Wikipedia corpus and BookCorpus (Zhu et al., 2015) following BERT (Devlin et al., 2019). As suggested in Liu et al. (2019), we do not use the next sentence prediction (NSP) objective as used in Devlin et al. (2019), but only use MLM as the baseline language modeling objective, with a masking ratio of 15%. After masking, 80% of the masked positions are replaced with [MASK], 10% are replaced by randomly sampled words, and the remaining 10% are kept unchanged. We set the maximum length of the input sequence to 512, and the learning rate is 3e-5. We pre-train the base and large models for 100k steps using the pre-trained weights of the public BERT and ELECTRA models as initialization. The baselines are trained for the same number of steps for a fair comparison. To keep the training as simple as that of BERT, following Li et al. (2020), we discard the generator in ELECTRA models and use the discriminator in the same way as BERT, with a classification layer to predict the corrupted tokens.
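The masking scheme above (15% of positions selected, then an 80/10/10 split) can be sketched as follows. The function name `ennoise`, the toy vocabulary, and the choice to select a fixed number of positions per sentence are assumptions for illustration; real implementations typically operate on token ids and may sample each position independently.

```python
import random

MASK = "[MASK]"


def ennoise(tokens, vocab, mask_ratio=0.15, rng=None):
    """Corrupt a token list with BERT-style masking.

    Returns the corrupted tokens and the sorted list of selected positions.
    Of the selected positions, roughly 80% become [MASK], 10% become a
    random vocabulary word, and 10% are left unchanged.
    """
    rng = rng or random.Random()
    n_to_mask = max(1, round(len(tokens) * mask_ratio))
    positions = sorted(rng.sample(range(len(tokens)), n_to_mask))
    corrupted = list(tokens)
    for pos in positions:
        roll = rng.random()
        if roll < 0.8:
            corrupted[pos] = MASK
        elif roll < 0.9:
            corrupted[pos] = rng.choice(vocab)
        # otherwise the original token is kept, but still has to be predicted
    return corrupted, positions


# Toy usage with a hypothetical vocabulary and a seeded generator.
vocab = ["cat", "dog", "runs", "blue", "quickly"]
sentence = [f"tok{i}" for i in range(20)]
corrupted, positions = ennoise(sentence, vocab, rng=random.Random(0))
```

Because different draws select different positions, two corruptions of the same sentence can differ markedly in how much information they remove, which is the variation the instance regularization terms are meant to capture.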

Fine-tuning
We use an initial learning rate in {8e-6, 1e-5, 2e-5, 3e-5} with a warm-up rate of 0.1 and L2 weight decay of 0.01. The batch size is selected in {16, 24, 32}. The maximum number of epochs is set in [2, 5] depending on tasks. Texts are tokenized with a maximum length of 384 for SQuAD and 512 for other tasks. Hyper-parameters were selected using the development set.

Tasks and Datasets
For evaluation, we fine-tune the pre-trained models on GLUE (General Language Understanding Evaluation) (Wang et al., 2019) and the popular Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al., 2016). The concerned tasks involve natural language inference, semantic similarity, text classification, and machine reading comprehension (MRC).
Natural Language Inference Natural Language Inference involves reading a pair of sentences and judging the relationship between their meanings, such as entailment, neutral and contradiction. We evaluate on three diverse datasets, including Multi-Genre Natural Language Inference (MNLI) (Nangia et al., 2017), Question Natural Language Inference (QNLI) (Rajpurkar et al., 2016) and Recognizing Textual Entailment (RTE) (Bentivogli et al., 2009).
Semantic Similarity Semantic similarity tasks aim to predict whether two sentences are semantically equivalent or not. The challenge lies in recognizing rephrasing of concepts, understanding negation, and handling syntactic ambiguity. Three datasets are used, including the Microsoft Paraphrase corpus (MRPC) (Dolan and Brockett, 2005), the Quora Question Pairs (QQP) dataset (Chen et al., 2018) and the Semantic Textual Similarity benchmark (STS-B) (Cer et al., 2017).
Classification The Corpus of Linguistic Acceptability (CoLA) (Warstadt et al., 2019) is used to predict whether an English sentence is linguistically acceptable or not. The Stanford Sentiment Treebank (SST-2) (Socher et al., 2013) provides a dataset for sentiment classification, where the task is to determine whether the sentiment of a sentence extracted from movie reviews is positive or negative.
Reading Comprehension As a widely used MRC benchmark, SQuAD (Rajpurkar et al., 2016) is a reading comprehension dataset that requires the machine to extract the answer span given a document along with a question. We select the v1.1 version to keep the focus on pure span extraction performance. Two official metrics are used to evaluate the model performance: Exact Match (EM) and a softer metric, the F1 score, which measures the average overlap between the prediction and the ground truth answer at the token level.
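The two metrics can be illustrated with a simplified sketch. It only lowercases and splits on whitespace; the official SQuAD evaluation additionally strips punctuation and articles and takes the maximum over several reference answers, all of which are omitted here. The function names are illustrative.

```python
from collections import Counter


def exact_match(prediction, truth):
    """1.0 if the two answers are identical after lowercasing and trimming."""
    return float(prediction.strip().lower() == truth.strip().lower())


def token_f1(prediction, truth):
    """Token-level F1 between a predicted and a gold answer span."""
    pred_tokens = prediction.lower().split()
    true_tokens = truth.lower().split()
    if not pred_tokens or not true_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == true_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


em = exact_match("the Denver Broncos", "Denver Broncos")
f1 = token_f1("the Denver Broncos", "Denver Broncos")
```

In this example the prediction contains one extra token, so EM is 0 while F1 still credits the overlap (precision 2/3, recall 1), which shows why F1 is the softer of the two metrics.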

Main Results
Table 1 presents the results of our methods and baselines under the same pre-training settings on the GLUE development sets. We see that our method achieves consistent performance gains over both BERT and ELECTRA baselines under the base and large settings, i.e., with increased average scores of +1.20%/0.57% on BERT (base/large) and +1.01%/0.83% on ELECTRA (base/large). The results indicate that the instance regularization approach is effective for improving the general language understanding capacity of PrLMs.
We also show the comparisons with public methods on the GLUE test sets. For a fair comparison, we only compare with the related single models fine-tuned for a single task, without model ensembling and task-specific tricks.
According to the results, we observe that our models yield consistent advances on most of the tasks compared with public BERT and ELECTRA models under both base and large sizes. We further evaluate the performance of our models on the challenging SQuAD MRC task.
Table 2 shows the results, which indicate modest performance gains in the reading comprehension task. The results show that our method is not only effective for the sentence-level classification or regression tasks of NLU but also beneficial for passage-level reading comprehension.

Ablation Study
To investigate the contribution of the internal components of the proposed IR objective, we conduct an ablation study under BERT base and ELECTRA base on the GLUE development set. Table 3 shows the performance when removing each one of the methods. We observe that removing either the ECP or the DPP objective generally results in a performance drop, which verifies the effectiveness of both methods.

Comparison with other distance measures
We apply KL divergence to measure the distance between distributions. We compare the performance for different distance measures by using mean-square error (MSE) loss. The average

Performance in Different Training Steps
To interpret the training effectiveness of our proposed method, we illustrate the performance at different training steps of BERT base and BERT-IR base on the development sets of the small-scale CoLA and the large-scale MNLI tasks, as shown in Figure 3. We see that the accuracy of the baselines improves only slightly as the training steps increase. In contrast, our models can still yield obvious gains, which indicates that our PrLMs could absorb extra beneficial signals via the newly proposed instance regularization approach.

Convergence Speed
Figure 4 shows the training curve of the BERT base and BERT-IR base models when training from scratch. We observe that the absolute loss values of our approach are relatively higher than those of the baseline at the very beginning. The reason is that our loss function is composed of three elements as formalized in Eq. 4. However, our model converges quickly. The loss of BERT-IR base falls below the baseline as the training goes on, and the slope of our curve is clearly steeper than that of the baseline. In addition, we also evaluate the baseline and our model trained from scratch (Table 4), which achieve average accuracies of 71.74% and 73.08% (+1.3%) on the GLUE datasets, respectively. The analysis above indicates that the PrLM trained with our approach could absorb extra knowledge via the newly proposed instance regularization approach, and that the approach benefits the training of vanilla masked language modeling as well.

Training Cost
Since the calculation of the regularization terms involves two forward passes of input

Robustness Against Synonym-Based Adversarial Attacks
The semantic agreement in IR measures the consistency between similar sentences, which may improve our model's robustness. To verify the hypothesis, we evaluate our models in Table 2 with synonym-based adversarial examples derived from the SQuAD v1.1 development set. The examples are generated by a robustness evaluation platform TextFlint (Wang et al., 2021), using SwapSynEmbedding and SwapSynWordNet, which transform an input by replacing its words with synonyms provided by GloVe embeddings (Pennington et al., 2014) or WordNet (Miller, 1998), respectively. The results are shown in Table 6, from which we observe that the adversarial attacks can lead to an obvious performance drop of the baseline models, i.e., 3.67% (EM)/3.63% (F1) for BERT base on SwapSynWordNet. In contrast, our models are less sensitive to the adversarial examples, and our BERT-IR base even yields an increase of scores under the SwapSynEmbedding attack. The results indicate that the regularization helps the model to resist synonym-based adversarial attacks with less performance degradation.
denoising processes in discriminative language model pre-training, motivated by the observation that the quality of denoising has to be subject to the complexity of the constructed training data from ennoising.
The estimation is decomposed into ennoising corruption penalty and denoising prediction penalty, which are used as regularization terms for language model pre-training. Experiments show that language models trained with our regularization terms can yield improved performance on downstream tasks, with better robustness against adversarial attacks. In addition, the training efficiency can be improved as well, without severe costs of computation resources and training speed. We hope our work could facilitate related studies to improve training quality while keeping a lightweight model size.

Limitations
We acknowledge that the major limitation of the proposed method is additional computation compared with the vanilla language models because the calculation of the regularization terms involves two forward passes of input sequences. As discussed in Section 5.5, the training time of the BERT base baseline and BERT-IR base for 200K steps is 67h/74h (about 10% increase) with the same hyper-parameters on base models. A more efficient instance regularization method without additional training passes could be future work.

Figure 1: Overview of AutoDecoders. As the two examples show, the random sampling operation during ennoising would result in training instances of different degrees of difficulty, e.g., the variety of valid alternatives.

Figure 2: Overview of the procedure for the instance regularization approach, which estimates the corruption degree in the ennoising data construction process and the prediction confidence in the denoising counterpart.
time of the BERT base baseline and BERT-IR base for 200K steps is 67h/74h (only 10% increase) with the same hyper-parameters on base models. The memory cost also stays on basically the same scale as the baseline since the regularization does not necessarily require extra gradient backpropagation.

Figure 4: Training curve of the BERT-based models.

Table 1: Comparisons between our proposed methods and the baseline models on the GLUE development sets. STS-B is reported by Spearman correlation, CoLA is reported by Matthews correlation, and the other tasks are reported by accuracy. Only one decimal place is reserved for the test results, which are from the online GLUE server.

Table 2: Results on the SQuAD development set. The evaluation metrics are Exact-Match (EM) and F1 scores.

Table 3: Ablation study of the proposed methods under BERT-base and ELECTRA-base on the GLUE development set. STS-B is reported by Spearman correlation, CoLA is reported by Matthews correlation, and other tasks are reported by accuracy.

Table 4: Results of training BERT base and BERT-IR base from scratch.
Figure 3: The performance (accuracy) of different training steps of BERT base and BERT-IR base on the CoLA and MNLI development sets.

Table 6: (Wang et al., 2021) on the SQuAD dataset. Original means the results of the original dataset sampled from the SQuAD v1.1 development set by TextFlint (Wang et al., 2021), and Swap. indicates the transformed one. The assessed models are the base models from Table 2. In this analysis, a lower performance drop is better.
Table 2 with synonym-based adversarial examples derived from the SQuAD v1.1 development set.