Uncertainty-aware Parameter-Efficient Self-training for Semi-supervised Language Understanding

The recent success of large pre-trained language models (PLMs) heavily hinges on massive labeled data, and PLMs typically produce inferior performance in low-resource scenarios. To remedy this dilemma, we study self-training, one of the predominant semi-supervised learning (SSL) approaches, which utilizes large-scale unlabeled data to generate synthetic examples. However, too many noisy labels hurt model performance, and the self-training procedure requires multiple training iterations, making it expensive if all the model parameters of the PLM are updated. This paper presents UPET, a novel Uncertainty-aware Parameter-Efficient self-Training framework that effectively and efficiently addresses the labeled data scarcity issue. Specifically, we incorporate Monte Carlo (MC) dropout in Bayesian neural networks (BNNs) to perform uncertainty estimation for the teacher model and then judiciously select reliable pseudo-labeled examples based on confidence and certainty. During student training, we introduce multiple parameter-efficient learning (PEL) paradigms that allow the optimization of only a small percentage of parameters. We also propose a novel Easy-Hard Contrastive Tuning to enhance robustness and generalization. Extensive experiments over multiple downstream tasks demonstrate that UPET achieves substantial improvements in terms of performance and efficiency. Our code and data are released at https://github.com/wjn1996/UPET.


Introduction
Pre-trained language models (PLMs) have become the imperative infrastructure for a series of downstream natural language understanding (NLU) tasks (Devlin et al., 2019; Liu et al., 2019; Yang et al., 2019), aiming at capturing prior knowledge by pre-training over large-scale unsupervised corpora and fine-tuning on the target tasks. However, conventional fine-tuning approaches heavily depend on the time-consuming and labor-intensive process of data annotation, which can be even more troublesome in some real-world scenarios and typically produces inferior performance in few-shot settings (Liu et al., 2021b; Kojima et al., 2022).
Recently, self-training (Chawla and Karakoulas, 2005; Amini et al., 2022) has been proposed to address the labeled data scarcity issue by leveraging large-scale unlabeled data in addition to labeled data, and it is one of the mature paradigms in semi-supervised learning (Qi and Luo, 2022; Yang et al., 2021a; Chawla and Karakoulas, 2005; van Engelen and Hoos, 2020; Yang et al., 2021b). A teacher model is fine-tuned on the few-shot labeled data, and then a pseudo label is generated for each unlabeled example. After that, a student model learns from the large-scale pseudo-labeled data, yielding performance close to that of fully-supervised learning. Previous works typically use self-training in conjunction with large PLMs to endow the model with the ability of few-shot learning. Despite the big success, we observe two remaining challenges: 1) the pseudo-labeled data contains many noisy labels, which inevitably degrade model performance due to confirmation bias (Wang et al., 2021); 2) the self-training procedure is expensive when all parameters of the large PLM are updated (Wang et al., 2022).
In this paper, we develop a novel Uncertainty-aware Parameter-Efficient self-Training framework (UPET) for improving self-training from two perspectives, i.e., effectiveness and efficiency. To reach these goals, we present two novel techniques, Reliable Example Sampling (RES) and Efficient Robust Tuning (ERT). The goal of RES is to explicitly mitigate the effect of label noise. Concretely, we obtain the prediction probability distribution over all unlabeled data from the teacher model. Then, we utilize the Monte Carlo (MC) dropout technique in Bayesian neural networks (BNNs) (Gal and Ghahramani, 2016; Wang and Yeung, 2016) to estimate the uncertainty of each unlabeled example. In this way, examples with higher confidence and certainty are judiciously selected as reliable pseudo-labeled data. In ERT, we aim to leverage PEL paradigms to train a robust student model over the reliable pseudo-labeled data. We design multiple PEL-based model architectures for the student model that update only a small fraction of tunable parameters in the PLM during iterative self-training. Additionally, we introduce Easy-Hard Contrastive Tuning to improve the robustness of the parameter-efficient model, which can be viewed as a regularization in the semantic space that keeps noisy labels away from reliable examples.
We conduct extensive experiments over multiple NLU tasks. Results show that UPET outperforms strong baselines in terms of both effectiveness and efficiency. The improvement is consistent across settings with different PEL methods and numbers of labeled examples. Our key contributions are summarized as follows: 1) we use parameter-efficient learning of PLMs in conjunction with uncertainty estimation to form an efficient and effective self-training framework; 2) to further improve the robustness of the parameter-efficient model, we introduce Easy-Hard Contrastive Tuning; 3) extensive experiments across a wide range of tasks demonstrate that our proposed framework outperforms prevailing strong baselines.

Related Work
Semi-supervised Learning and Self-training. SSL aims to effectively utilize unlabeled data in addition to labeled data and has been widely used in the NLP community (Yang et al., 2017; Gururangan et al., 2019; Xie et al., 2020; Chen et al., 2020). For instance, Yang et al. (2017) and Gururangan et al. (2019) utilize variational autoencoders (VAEs) for sequence classification and labeling. Chen et al. (2020) proposes MixText to mix labeled, unlabeled, and augmented data, and performs consistency training similar to UDA (Xie et al., 2020). Self-training is one of the mature SSL approaches that uses a teacher-student architecture to augment data (Hu and Khan, 2021; Mukherjee and Awadallah, 2020; Amini et al., 2022; Wang et al., 2021; Tsai et al., 2022). For example, Hu and Khan (2021) presents uncertainty estimation for denoising self-training. Tsai et al. (2022) introduces graph-based contrastive learning to preserve consistency regularization. Wang et al. (2021) incorporates self-training into sequence labeling tasks via an automatic weighting strategy.

Parameter-Efficient Learning. PEL optimizes a small portion of parameters while keeping the model backbone frozen, aiming to improve training efficiency while preserving the model's effectiveness (He et al., 2022). Houlsby et al. (2019) integrates task-specific neural modules called adapters into PLMs, and only these adapters are updated during fine-tuning. P-tuning (Liu et al., 2021b) and Prefix-Tuning (Li and Liang, 2021) respectively introduce a lightweight prefix module into the input layer and each transformer layer, enabling efficient training over these prefix modules. Notable PEL-based models also include BitFit (Zaken et al., 2022), LoRA, etc. This paper integrates PEL into self-training to improve its efficiency.

UPET: The Proposed Method
We first formulate the problem. We are given a few-shot labeled set D_l = {(X_i, Y_i)}_{i=1}^{N_l} and a large-scale unlabeled set D_u = {X̃_i}_{i=1}^{N_u}, where N_l and N_u respectively denote the sizes of the labeled set and the unlabeled set (N_l ≪ N_u). X_i and X̃_i denote input sentences in the labeled set and the unlabeled set, respectively, and Y_i ∈ Y is the corresponding label of X_i. The task is to train a neural model f_W and generate a pseudo label for each unlabeled example X̃_i, where f_W : X → Y is a function with parameters W that maps the input space X to the label space Y. We aim to answer the following research questions:
• RQ1: How can we mitigate the problem of noisy pseudo labels by judiciously selecting reliable pseudo-labeled examples?
• RQ2: How can we reduce the training cost of iterative self-training over large PLMs while preserving effectiveness?
We thus propose the UPET framework, which consists of two novel techniques, i.e., Reliable Example Sampling (RES) and Efficient Robust Tuning (ERT). The framework overview is illustrated in Figure 1 and the detailed algorithm is shown in Appendix B.

Fine-Tuning and Pseudo Annotation
We start with a fine-tuning stage over the few-shot labeled data D_l to form a teacher model f_tea^W. After that, the pseudo label Ỹ_i of each unlabeled example X̃_i can be generated by the teacher model:

Ỹ_i = argmax_{y ∈ Y} p(y | X̃_i; W),   (1)

where p(⋅) is the probability distribution. However, the generated labels may be wrong due to the model confirmation bias problem. That means we need to explicitly reduce the noise by designing a suitable sample selection strategy.
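To make the annotation step concrete, a minimal PyTorch-style sketch of Eq. 1 is given below. The Hugging Face-style model interface (outputs carrying a .logits field) and the data loader are assumptions for illustration, not the released implementation.

```python
import torch

@torch.no_grad()
def pseudo_annotate(teacher, unlabeled_loader):
    """Generate hard pseudo labels for every unlabeled example (Eq. 1)."""
    teacher.eval()
    pseudo_labels, probs = [], []
    for batch in unlabeled_loader:
        logits = teacher(**batch).logits        # [B, num_classes]
        p = torch.softmax(logits, dim=-1)       # p(y | x; W)
        pseudo_labels.append(p.argmax(dim=-1))  # hard label = argmax_y p(y | x)
        probs.append(p)
    return torch.cat(pseudo_labels), torch.cat(probs)
```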

Reliable Example Sampling
To reach this goal, we follow Tsai et al. (2022); Mukherjee and Awadallah (2020); Hu and Khan (2021) and leverage uncertainty estimation from BNNs to measure which unlabeled examples are reliable enough to be selected for training. Specifically, we follow (Houlsby et al., 2011; Gal et al., 2017; Tsai et al., 2022) and use the information gain of the model parameters to quantify how certain the model is about the pseudo-labeled examples w.r.t. the true labels. The information gain can be defined as:

B(Ỹ_i, W | X̃_i, D_u) = H(Ỹ_i | X̃_i, D_u) − E_{p(W|D_u)}[ H(Ỹ_i | X̃_i, W) ],   (2)

where W denotes the parameters of the teacher. B(Ỹ_i, W | X̃_i, D_u) is the difference between H(Ỹ_i | X̃_i, D_u) (the final entropy after seeing all examples from the unlabeled sentences) and the expectation of H(Ỹ_i | X̃_i, W) (the current entropy for the example X̃_i), and p(W | D_u) is the posterior distribution. As the calculation of Eq. 2 is intractable, we utilize MC dropout in BNN to approximate it. Specifically, we assume that the posterior distribution p(W | D_u) can be replaced by the dropout distribution q_θ(W). Thus, we can sample T masked model weights {Ŵ_t}_{t=1}^{T} ∼ q_θ(W) and calculate the approximation:

B(Ỹ_i, W | X̃_i, D_u) ≈ H( (1/T) Σ_{t=1}^{T} p_t ) − (1/T) Σ_{t=1}^{T} H(p_t),   (3)

where p_t = p(Ỹ_i | X̃_i; Ŵ_t) is the predicted probability of X̃_i derived from the t-th masked model f_tea^{Ŵ_t}. Thus, a lower B(Ỹ_i, W | X̃_i, D_u) value means that the model is more certain about the prediction, as higher certainty corresponds to lower information gain (Tsai et al., 2022). Formally, we design a certainty score cert_i for each example that decreases monotonically as its information gain grows. To this end, we can obtain the final sampling weight for each example by considering both model confidence and certainty:

s_i = α ⋅ conf_i + (1 − α) ⋅ cert_i,   (5)

where conf_i = (1/T) Σ_{t=1}^{T} p(Ỹ_i | X̃_i; Ŵ_t) is the model confidence derived from the average approximate posterior of the T masked models w.r.t. the pseudo label Ỹ_i, and α (0 ≤ α ≤ 1) denotes the balancing factor. Hence, a number of N_r reliable examples can be sampled by these weights to form a new subset D_r ⊂ D_u.
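For clarity, a minimal sketch of the MC dropout estimation and the weight computation follows. Keeping the model in train mode to retain dropout is the standard trick; the normalization of the information gain into a [0, 1] certainty score is one natural instantiation and an assumption here, since the score only needs to decrease with the information gain.

```python
import torch

def mc_dropout_stats(teacher, batch, T=10):
    """Approximate the BALD information gain (Eqs. 2-3) with T stochastic
    forward passes; dropout stays active to sample masked weights W_t."""
    teacher.train()  # keep dropout on, i.e., sample W_t ~ q_theta(W)
    with torch.no_grad():
        probs = torch.stack([
            torch.softmax(teacher(**batch).logits, dim=-1) for _ in range(T)
        ])                                            # [T, B, C]
    mean_p = probs.mean(dim=0)                        # average posterior
    entropy_of_mean = -(mean_p * mean_p.clamp_min(1e-12).log()).sum(-1)
    mean_of_entropy = -(probs * probs.clamp_min(1e-12).log()).sum(-1).mean(0)
    bald = entropy_of_mean - mean_of_entropy          # lower => more certain
    conf, pseudo = mean_p.max(dim=-1)                 # confidence & hard label
    return pseudo, conf, bald

def sampling_weight(conf, bald, alpha=0.2):
    """Combine confidence and certainty into a sampling weight (Eq. 5).
    Normalizing BALD into a [0, 1] certainty score is an assumption."""
    cert = 1.0 - bald / bald.max().clamp_min(1e-12)
    return alpha * conf + (1.0 - alpha) * cert
```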

Parameter-Efficient Tuning
After the annotation and selection of unlabeled examples, we need to train a student model to elicit knowledge from the teacher. Yet, the training process of the self-training paradigm is inefficient. To remedy this dilemma, we introduce PEL into self-training. We initialize a student model f_stu^{W*} in which only a few designated parameters in W* can be tuned, enabling efficient training on large amounts of pseudo-labeled data. To meet our desiderata, we introduce two prediction paradigms with three PEL methods. The architecture is shown in Figure 2.
Head-Tuning. Head-Tuning leverages a CLS head to generate the probability distribution of the given example. Formally, we have:

p(y | X̃_i) = H_cls(F_{W*}(X̃_i)),   (6)

where F_{W*}(⋅) denotes the output representation produced by the student model f_stu and H_cls(⋅) denotes a CLS head with a softmax classification layer.
Prompt-Tuning. Prompt-Tuning aims at reusing the Masked Language Modeling (MLM) head to make predictions. Specifically, a well-designed template T containing a masked token ("[MASK]") is concatenated with the original input sentence. In addition, we define a verbalizer V that maps the probability distribution over the whole vocabulary to the label set Y. The probability can be calculated as:

p(y | X̃_i) = V_y(H_mlm(F_{W*}(X̃_i || T))),   (7)

where H_mlm denotes the MLM head derived from the PLM and ⋅||⋅ is the concatenation operation. V_y(⋅) maps the label word's probability at the masked position to the corresponding class y. Hence, we can integrate P-tuning (Liu et al., 2021b), Prefix-tuning (Li and Liang, 2021) and Adapter-tuning (Houlsby et al., 2019) to unify PEL with arbitrary PLMs and prediction paradigms, yielding Head-Ptuning, Head-Prefix, Head-Adapter, Prompt-Ptuning, Prompt-Prefix and Prompt-Adapter. More details are shown in Appendix A.1. During the optimization, we compute the following cross-entropy objective:

L = −(1/N_r) Σ_{(X̃_i, Ỹ_i) ∈ D_r} log p(Ỹ_i | X̃_i).   (8)

Yet, it is still possible that the subset D_r contains some wrong labels. During the parameter-efficient training stage, since the scale of trainable parameters in W* is small, the student model is fragile and its robustness may not be preserved under the negative effect of these noises during backpropagation. Therefore, we follow (Tsai et al., 2022) and adopt the partially huberised cross-entropy (PHCE) loss, a variant of cross-entropy with gradient clipping. Hence, the loss function in Eq. 8 can be modified as:

L = (1/N_r) Σ_{(X̃_i, Ỹ_i) ∈ D_r} ϕ_τ(Ỹ_i | X̃_i),   (9)

where ϕ_τ(y|x) is the PHCE loss function with a hyper-parameter τ (τ > 1). The details of the PHCE loss function are shown in Appendix A.3.
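As an illustration, a minimal sketch of verbalizer-based prediction (Eq. 7) is shown below. The template "It was [MASK]." and the label words are hypothetical choices for a sentiment task such as SST-2, not the templates used in our experiments.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("roberta-large")
mlm = AutoModelForMaskedLM.from_pretrained("roberta-large")

# Hypothetical verbalizer: one label word per class.
verbalizer = {0: " terrible", 1: " great"}
label_ids = [tok.encode(w, add_special_tokens=False)[0] for w in verbalizer.values()]

def prompt_predict(sentence):
    """p(y | x) = V_y(H_mlm(F(x || T))): read class probabilities off the
    MLM head at the masked position of a template."""
    text = f"{sentence} It was {tok.mask_token}."    # template T (an assumption)
    enc = tok(text, return_tensors="pt")
    with torch.no_grad():
        logits = mlm(**enc).logits                   # [1, L, |V|]
    mask_pos = (enc.input_ids == tok.mask_token_id).nonzero()[0, 1]
    word_logits = logits[0, mask_pos, label_ids]     # logits of the label words
    return torch.softmax(word_logits, dim=-1)        # distribution over classes
```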

Easy-Hard Contrastive Tuning
As mentioned above, the selected examples in D_r have higher model certainty and thus might be too easy to contribute additional information. Moreover, this inevitably leads the student model to over-fit these frequently selected samples (Mukherjee and Awadallah, 2020). Intuitively, an example not selected into D_r is more likely to be noisy and to cause semantic drift. Thus, a natural idea is to exploit some hard examples (those not selected into D_r) as negatives to keep them away from the easy (reliable) examples, which can be viewed as a regularization in the semantic space.
To reach this goal, we present Easy-Hard Contrastive Tuning. We denote D_h as the difference between D_u and D_r, so the examples in D_h represent the hard ones. During the optimization of the student model, given one example (X̃_i, Ỹ_i) ∈ D_r, we choose another example (X̃_i^+, Ỹ_i^+) from D_r as the positive and some examples {(X̃_j^−, Ỹ_j^−)} from D_h as the negatives. Hence, the contrastive regularization term can be computed as:

R(f) = − Σ_{(X̃_i, Ỹ_i) ∈ D_r} log [ exp(g(X̃_i, X̃_i^+)) / ( exp(g(X̃_i, X̃_i^+)) + Σ_j exp(g(X̃_i, X̃_j^−)) ) ],   (10)

where g(⋅, ⋅) is the score function that measures the similarity of two examples in the semantic space. Finally, the whole training objective is designed as:

L_total = L + λ R(f),   (11)

where λ > 0 is the hyper-parameter.
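A minimal sketch of this regularizer for one anchor is given below. Instantiating g(⋅, ⋅) as temperature-scaled cosine similarity is an assumption; Eq. 10 only requires some similarity score function.

```python
import torch
import torch.nn.functional as F

def easy_hard_contrastive(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style term of R(f): pull a reliable (easy) pair together and
    push hard examples from D_h away in the semantic space.

    anchor, positive: [H] representations of two examples in D_r.
    negatives:        [K, H] representations of hard examples from D_h.
    g is temperature-scaled cosine similarity (our assumption)."""
    def g(a, b):
        return F.cosine_similarity(a, b, dim=-1) / temperature
    pos = g(anchor, positive)                    # scalar similarity
    neg = g(anchor.unsqueeze(0), negatives)      # [K] similarities
    logits = torch.cat([pos.view(1), neg])       # positive sits at index 0
    return F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))
```

In training, this term would be averaged over the anchors in D_r and added to the PHCE objective with weight λ, as in Eq. 11.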

Dataset and Implementation Details
We perform extensive experiments over seven language understanding tasks to evaluate our UPET framework. We choose a series of tasks from the GLUE benchmark (Wang et al., 2018), including SST-2 (Socher et al., 2013) for sentiment analysis, MNLI (Williams et al., 2018) for language inference, QNLI (Rajpurkar et al., 2016) for question answering, MRPC (Dolan and Brockett, 2005) for semantic paraphrasing, and RTE (Dagan et al., 2005) for textual entailment. We also choose CB (De Marneffe et al., 2019) from SuperGLUE (Wang et al., 2019) for linguistic entailment and AGNews (Zhang et al., 2015) for topic classification. For each dataset, the number of labeled examples per class is set as N_l ∈ {16, 32, 64}. We repeatedly sample few-shot labeled instances five times with different seeds from {12, 21, 42, 87, 100} and report the average performance with standard deviation.
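As a sketch, the per-class few-shot sampling can be implemented as follows; the (sentence, label) example format is an assumption for illustration.

```python
import random
from collections import defaultdict

def sample_few_shot(examples, n_per_class=16, seed=42):
    """Draw N_l labeled examples per class; repeated over the seeds
    {12, 21, 42, 87, 100} to report the mean and standard deviation."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for ex in examples:                      # ex = (sentence, label)
        by_class[ex[1]].append(ex)
    subset = []
    for label, items in by_class.items():
        subset.extend(rng.sample(items, n_per_class))
    return subset
```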
For the implementation details, we choose RoBERTa-large (Liu et al., 2019) as the backbone PLM; the hyper-parameters for each task are selected via grid search (Appendix D).

Baselines
We consider several strong baselines for comparison, including UST (Mukherjee and Awadallah, 2020), CEST (Tsai et al., 2022) and LiST (Wang et al., 2022). UST and CEST leverage uncertainty estimation for self-training. LiST integrates Adapter-tuning (Houlsby et al., 2019) into prompt-based learning for parameter-efficient self-training, which is similar to our Prompt-Adapter paradigm. In addition, we design two semi-supervised learning baselines: 1) Head ST uses classic fine-tuning with a CLS head to exploit unlabeled data through standard self-training; 2) Prompt ST reuses the MLM head with a well-designed task-specific template and verbalizer to perform pseudo-labeling in standard self-training. We also include Head FT and Prompt FT, which fine-tune over few-shot or full training data.

Main Results
Table 1 illustrates the main results over seven NLU tasks under different settings. RoBERTa-large trained on fully labeled examples provides the ceiling performance for the few-shot and semi-supervised settings. We make the following observations. 1) According to the overall results, all the methods with self-training outperform conventional few-shot learning (i.e., Head FT and Prompt FT). This demonstrates the impact of self-training with unlabeled data. 2) We obtain the best overall performance of 78.2% with the fewest tunable parameters (i.e., Prompt-Ptuning) and improve over Head ST, Prompt ST, UST, CEST, and LiST by 7.0%, 3.6%, 6.1%, 5.6%, and 2.0%, respectively, over seven tasks, which indicates that UPET outperforms the state of the art in terms of both effectiveness and efficiency. 3) Compared to the strong baseline Prompt ST, which uses the PEL-based approach, we obtain a 3.6% absolute improvement, demonstrating the substantial contributions of the well-designed reliable example selection and contrastive regularization. 4) We also list the performance of all 6 PEL paradigms of UPET. We observe that Prompt-Tuning outperforms Head-Tuning, indicating that reusing the MLM pre-training objective with a task-oriented template and verbalizer is more effective for self-training. In addition, more tunable parameters may enhance the student model's ability to learn semantic knowledge from the teacher.

Further Analysis
Impact of Self-training Iterations. To validate the effectiveness of self-training, we choose MNLI and RTE and plot the performance of different PEL paradigms at each iteration in Figure 3. From the figure, we find that performance increases as training continues up to the 4th iteration, indicating the convergence of our framework. Additionally, the student model with Prompt-Tuning (including Prompt-Ptuning, Prompt-Prefix, and Prompt-Adapter) consistently outperforms Head-Tuning (including Head-Ptuning, Head-Prefix, and Head-Adapter). This shows that prompt-based methods can better utilize PEL to make self-training both effective and efficient.
Labeled Data Efficiency. To investigate the influence of the number of labeled examples, we vary the number of examples per class over 16, 32, and 64. We choose LiST as the strong baseline. For a fair comparison, the PEL method we select is Prompt-Adapter, which is the same as LiST and tunes only the adapter module in the PLM. Results in Table 3 illustrate that performance gradually improves as the number of labeled examples increases, as expected. In addition, we find that our UPET consistently outperforms LiST across all labeled data budgets.

Effectiveness of Reliable Example Sampling.
To validate the effectiveness of RES, we investigate the effect of the balance factor α in Eq. 5 in terms of average performance. Table 4 shows that it is necessary to perform sample selection to obtain cleaner data. The results also illustrate that both model confidence and certainty contribute substantially to performance. We find the best value of α is around 0.2, which means certainty plays an important role in the selection.

Visualization of the Contrastive Regularization. To investigate how the proposed Easy-Hard Contrastive Tuning contributes to the final performance, we use the t-SNE (Van der Maaten and Hinton, 2008) tool on the AGNews task in Figure 4. Specifically, we randomly sample 1k test examples and draw their representations in the semantic space. The results demonstrate that the model trained with contrastive regularization forms clearer boundaries between classes, corroborating that it avoids over-fitting and yields better generalization.
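A minimal sketch of this visualization with scikit-learn and matplotlib is given below; the output file name and plotting details are illustrative assumptions.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_semantic_space(features, labels, path="tsne_agnews.png"):
    """Project sampled test representations to 2-D with t-SNE and color
    them by class to inspect the class boundaries."""
    coords = TSNE(n_components=2, random_state=42).fit_transform(features)
    for c in np.unique(labels):
        mask = labels == c
        plt.scatter(coords[mask, 0], coords[mask, 1], s=4, label=str(c))
    plt.legend()
    plt.savefig(path, dpi=200)
```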

Ablation Study
In this section, we conduct an ablation study to demonstrate the impact of different variants of UPET with the designed techniques removed. From Table 5, we make the following observations. 1) The performance of w/o Reliable Example Sampling (RES) decreases considerably (by more than 2%). In addition, we find that the sampling weight that considers both certainty and confidence makes consistent contributions in RES. These phenomena demonstrate the effectiveness of the de-noising approach based on both model confidence and certainty. 2) Removing the PHCE loss from UPET results in a 1.1% average performance drop, which indicates the importance of the PHCE loss in robust student training. 3) Comparing UPET with UPET w/o Easy-Hard Contrastive Tuning, the average performance of the student model improves by about 1.6%, demonstrating the effectiveness of the contrastive regularization design.

Comparison to Non-BERT Approaches
We end this section with an additional comparison between UPET and non-BERT semi-supervised learning approaches that use different numbers of labeled examples for tuning the teacher model. Table 6 shows that our framework achieves a large performance gain with only 64 labeled examples; in particular, UPET (best) leads by at least 7%.

Conclusion
In this paper, we introduce a novel uncertainty-aware parameter-efficient self-training framework (UPET) to improve the effectiveness and efficiency of self-training. In UPET, we use uncertainty estimation to judiciously select reliable pseudo-labeled examples, explicitly alleviating the noisy label problem. To make self-training more efficient, we integrate multiple parameter-efficient paradigms into self-training. To further improve performance, we also present Easy-Hard Contrastive Tuning to enhance robustness and reduce over-fitting. In the future, we will extend our framework to other complex tasks, such as sequence labeling and question answering.

Limitations
Our limitations are shown below:
• We only focus on sequence-classification-style NLU tasks. However, we think our framework can be easily extended to other tasks, such as sequence labeling, question answering, etc.
• Our work focuses on PLMs without Transformer decoders. We think it is possible to extend our method to natural language generation (NLG) tasks, and we leave this as future work.

A.3 PHCE Loss

The PHCE loss mitigates the effect of noisy labels via a simple variant of gradient clipping for the classification loss (e.g., cross-entropy). Given one example (x, y), the PHCE loss ϕ_τ(x, y) is defined as:

ϕ_τ(x, y) = −τ ⋅ p(y|x) + log τ + 1, if p(y|x) ≤ 1/τ;  ϕ_τ(x, y) = −log p(y|x), otherwise.
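A minimal PyTorch sketch of this loss is shown below, assuming the standard partially Huberised cross-entropy form above; the linear branch caps the gradient magnitude at τ for low-confidence (likely noisy) labels.

```python
import math
import torch

def phce_loss(logits, targets, tau=10.0):
    """Partially huberised cross-entropy: linearize -log p below p = 1/tau,
    so the per-example gradient magnitude is clipped at tau."""
    p = torch.softmax(logits, dim=-1).gather(1, targets.unsqueeze(1)).squeeze(1)
    linear = -tau * p + math.log(tau) + 1.0        # branch for p <= 1/tau
    logloss = -torch.log(p.clamp_min(1e-12))       # branch for p  > 1/tau
    return torch.where(p <= 1.0 / tau, linear, logloss).mean()
```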

B Self-training Procedure
We show the whole training procedure in Algorithm 1. Specifically, we first use the original PLM f^{W_0} to initialize a teacher model f_tea^W (Algorithm 1, Line 1) and fine-tune the teacher model over the few-shot labeled data D_l (Algorithm 1, Line 2). During the iteration process, we sample a subset D'_u of the unlabeled set D_u and pseudo-annotate each example with the teacher (Algorithm 1, Lines 4-5). We then obtain the model confidence and certainty for each unlabeled example X̃, calculate its sampling weight, and sample reliable examples to form an easy set D_r, while the rest forms a hard set D_h (Algorithm 1, Lines 7-9). During student learning, we use the original PLM f^{W_0} to initialize a parameter-efficient student model and use the PHCE loss and Easy-Hard Contrastive Tuning to train it over the pseudo-labeled examples (Algorithm 1, Lines 6, 10-12). At last, we copy the parameters of the student model to the teacher and repeat until convergence (Algorithm 1, Line 13).

C Details of NLU task
We list the statistics of each task in Table 7.

D Searching Scope of Grid Search
We use grid search to select the best hyper-parameters for each task; the searching scope is shown in Table 8.

Figure 2: Overview of different PEL paradigms. (a)-(c) represent Head-Tuning, which uses the CLS head for prediction. (d)-(f) denote Prompt-Tuning, which makes predictions via a well-designed template and verbalizer. We unify three classic PEL methods for both Head-Tuning and Prompt-Tuning. The blocks in light yellow and blue denote trainable and frozen parameters, respectively. The blocks with sketches denote the adapter modules. (Best viewed in color.)

Figure 3: The performance (%) of different self-training iterations over MNLI and RTE.

Algorithm 1 Self-training Procedure of UPET
Require: Neural model f^{W_0}, labeled data D_l, unlabeled data D_u.
1: Initialize a teacher model f_tea^W = f^{W_0};
2: Fine-tune the teacher model f_tea^W over the labeled data D_l (all parameters are updated);
3: while not converged do
4:   Sample an unlabeled data subset D'_u ⊂ D_u;
5:   Pseudo-annotate each unlabeled example X̃_i ∈ D'_u by f_tea^W via Eq. 1 to obtain the hard label Ỹ_i;
6:   Initialize a parameter-efficient student model f_stu^{W*} from the original PLM f^{W_0};
7:   Compute the model confidence and certainty for each example in D'_u via MC dropout;
8:   Compute the sampling weight for each example by Eq. 5;
9:   Sample reliable examples to form a subset D_r; the examples not sampled form D_h;
10:  Calculate the PHCE loss l(D_r, f_stu^{W*}) over the reliable examples;
11:  Calculate the contrastive regularization R(f) with positives from D_r and negatives from D_h;
12:  Update the tunable parameters W* of the student by minimizing the joint objective;
13:  Copy the parameters of the student to the teacher: f_tea^W ← f_stu^{W*};
14: end while
15: return The teacher model f_tea^W.
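The control flow of Algorithm 1 can be summarized by the following Python skeleton. The injected callables form a hypothetical interface standing in for the components described above, not a released API.

```python
from typing import Any, Callable, Tuple

def upet_self_training(
    init_teacher: Callable[[], Any],
    init_student: Callable[[], Any],
    annotate_and_score: Callable[[Any], Tuple[Any, Any]],
    select_reliable: Callable[[Any, Any], Tuple[Any, Any]],
    train_student: Callable[[Any, Any, Any], None],
    n_iters: int = 4,
):
    """Skeleton of Algorithm 1 with stage-specific components injected
    as callables (an illustrative interface)."""
    teacher = init_teacher()                           # Lines 1-2: full fine-tuning on D_l
    for _ in range(n_iters):                           # Line 3: until converged
        pseudo, weights = annotate_and_score(teacher)  # Lines 4-8: Eq. 1 and Eq. 5
        easy, hard = select_reliable(pseudo, weights)  # Line 9: D_r and D_h
        student = init_student()                       # Line 6: PEL params only
        train_student(student, easy, hard)             # Lines 10-12: PHCE + contrastive
        teacher = student                              # Line 13: copy student -> teacher
    return teacher                                     # Line 15
```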

Table 1: The performance comparison of accuracy or F1 scores (%) with standard deviations on seven tasks. All methods (except fine-tuning with full data) are trained with 16-shot labeled samples for each class, and overall results are aggregated over five runs with different random seeds. In UPET, the first three variants belong to the Head-Tuning paradigm, while the others are Prompt-Tuning.

Table 2: The average performance (%) over all tasks with different combinations of PEL paradigms.

Combination of Different Parameter-Efficient Learning Paradigms in Self-training. We aim to explore how PEL performs in the self-training procedure. We integrate the PEL paradigm into the teacher or the student model to show the performance of different combinations of PEL. As shown in Table 2, we choose Head-Adapter and Prompt-Adapter. We find that the setting in which all parameters of both the teacher and the student are updated attains the best average performance, indicating the ceiling performance of each paradigm. Yet, it costs about 11 hours, which makes the self-training procedure inefficient. In addition, the time impact of whether the teacher model uses PEL is smaller than for the student, because the teacher is trained only once while the student is updated for 100 epochs in each self-training iteration. Correspondingly, this motivates us to apply PEL to the student model to improve the efficiency of self-training while preserving its effectiveness.

Table 5: The 16-shot performance (%) of different variants of UPET with Prompt-Ptuning.

Table 4: The average performance (%) of UPET (Prompt-Ptuning) with different selection strategies (varying by α). "None" equals Prompt ST, which trains the student model on all pseudo-labeled data.

Table 7: The statistics of the multiple language understanding tasks. Since the original test data is unavailable, we use the development sets as our test sets.

Table 8: The searching scope for each hyper-parameter.