Revisiting Self-training for Few-shot Learning of Language Model

As unlabeled data carry rich task-relevant information, they have proven useful for few-shot learning of language models. The question is how to make effective use of such data. In this work, we revisit the self-training technique for language model fine-tuning and present a state-of-the-art prompt-based few-shot learner, SFLM. Given two views of a text sample obtained via weak and strong augmentation, SFLM generates a pseudo label on the weakly augmented version, then trains the model to predict the same pseudo label when fine-tuned on the strongly augmented version. This simple approach outperforms other state-of-the-art supervised and semi-supervised counterparts on six sentence classification and six sentence-pair classification benchmark tasks. Moreover, SFLM relies on only a small amount of in-domain unlabeled data. We conduct a comprehensive analysis to demonstrate the robustness of our approach under various settings, including augmentation techniques, model scale, and few-shot knowledge transfer across tasks.


Introduction
Pre-trained language models (Devlin et al., 2019; Liu et al., 2019; Radford et al., 2019; Yang et al., 2019; Lan et al., 2020; Clark et al., 2020) have set new state-of-the-art performance on many downstream NLP tasks. However, such performance often relies on large-scale, high-quality supervision. Unfortunately, labeled data are not always available in practice.
Recently, Brown et al. (2020) studied how to facilitate few-shot learning of language models via the GPT-3 model. By incorporating task-specific prompts into the text and reformulating downstream tasks as language modeling problems, GPT-3 achieves remarkable performance on many NLP datasets without any gradient updates. However, GPT-3 has 175B parameters, a footprint too large for many real-world applications. Gao et al. (2020) apply the prompting strategies of GPT-3 to small-footprint language models, such as BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019). After being fine-tuned on a few annotated samples, these small-footprint models exhibit performance comparable to that of the large-footprint GPT-3 model. However, their performance still lags behind that of models fine-tuned under full supervision. Intuitively, unlabeled data also carry rich information about downstream tasks and are more readily available than labeled data. In this paper, we focus on few-shot learning of language models with a small amount of labeled and unlabeled data.
Semi-supervised learning benefits from partially labeled datasets. A common implementation of semi-supervised learning is self-training, which leverages the supervision signal offered by labeled data to create pseudo-labels for unlabeled data. These pseudo-labels then serve as additional supervision to refine the model (Yarowsky, 1995; Blum and Mitchell, 1998; Zhu, 2005; Qiu et al., 2019; Zoph et al., 2020). Recent works (Schick and Schütze, 2020, 2021) apply self-training to language model few-shot learning in an iterative manner, whereby multiple generations of models are trained on data pseudo-labeled by the ensemble of previous generations. However, the amount of in-domain unlabeled data required by these methods is quite large, which limits their applicability, especially for low-resource downstream tasks. Du et al. (2020) retrieve additional task-relevant unlabeled data from an open-domain corpus, but the method depends on a high-quality sentence encoder.
To better address the above issue, we revisit the Self-training technique and introduce a data-efficient Few-shot learner of Language Model (SFLM). Inspired by recent advances in semi-supervised representation learning for images (Sohn et al., 2020), SFLM combines pseudo-labeling with consistency regularization.
Next, we briefly describe the workflow. Given each unlabeled sentence, we construct two views through weak augmentation (random dropout) and strong augmentation (token masking), respectively. The weakly-augmented view is first passed to a prompt-based language model (Gao et al., 2020) to derive a pseudo-label, while the strongly-augmented view is passed through the model to predict a probability distribution over classes, which is compared against the pseudo-label to derive a cross-entropy loss. This learning procedure encourages the model to make consistent predictions under perturbations of the input, leading to effective utilization of the limited data.
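The core of this procedure can be sketched in a few lines of Python; the probability vectors below stand in for the prompt-based model's class predictions, and the confidence threshold is illustrative:

```python
import math

def cross_entropy(target_label, probs):
    """Cross-entropy between a hard (pseudo) label and a predicted distribution."""
    return -math.log(probs[target_label])

def self_training_step(weak_probs, strong_probs, threshold=0.9):
    """One unlabeled example: pseudo-label the weak view, supervise the strong view.

    Returns None when the pseudo-label is not confident enough to be retained.
    """
    confidence = max(weak_probs)
    if confidence < threshold:
        return None  # low-confidence pseudo-labels are discarded
    pseudo_label = weak_probs.index(confidence)
    return cross_entropy(pseudo_label, strong_probs)

# A confident weak view yields a loss term; an uncertain one is skipped.
loss = self_training_step([0.95, 0.05], [0.7, 0.3])
skipped = self_training_step([0.55, 0.45], [0.7, 0.3])
```

The thresholding step mirrors the confidence gating used later in the self-training loss: only unlabeled examples whose weak view the model is already confident about contribute gradient signal.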
We evaluate SFLM on two groups of tasks: sentence classification and sentence-pair classification. Experiments show that our model outperforms other supervised and semi-supervised baselines. We also conduct a detailed analysis of the data efficiency of our model by examining its performance under various ratios of unlabeled to labeled data, and find that the performance gain diminishes as more unlabeled data are used. We further extend our method to a more challenging scenario, few-shot transfer across tasks, where the model is first trained on the labeled data of a source task and the unlabeled data of a target task, then evaluated on the target task. We analyze the factors that affect model performance to motivate future research.

Few-Shot Learning of Language Model
It is desirable to reduce the amount of labeled data for language model fine-tuning, a.k.a. language model few-shot learning. Popular methods usually address this problem with meta-learning (Vinyals et al., 2016; Snell et al., 2017; Finn et al., 2017), which first pre-trains a model on a set of auxiliary tasks and then fine-tunes it on the task of interest (Han et al., 2018; Bao et al., 2020; Bansal et al., 2020).
Recently, Brown et al. (2020) proposed GPT-3 and demonstrated that a language model itself has great potential for few-shot learning through task demonstrations and prompts. However, the extremely large footprint of GPT-3 (Brown et al., 2020) limits its scope of applications.
More recent studies explore few-shot learning with pre-trained language models of smaller size (Gunel et al., 2020; Schick and Schütze, 2020; Gao et al., 2020). A representative example is LM-BFF (Gao et al., 2020), which explores automatic prompt generation and prompt-based fine-tuning with a RoBERTa-large (Liu et al., 2019) language model in a few-shot setup. LM-BFF achieves results comparable to methods fine-tuned on the full annotated dataset.
We are motivated to study few-shot learning of language models with prompt-based fine-tuning. We exploit the rich information in unlabeled data with semi-supervised learning. Furthermore, we adopt an even smaller RoBERTa-base model as the backbone of our framework.

Self-Training
Self-training refers to the process of creating pseudo-labels on unlabeled data with a pre-trained teacher model, then using these pseudo-labeled data to train a student model. It is a simple and effective semi-supervised approach, which has benefited a wide range of tasks, such as image classification (Xie et al., 2020b), neural sequence generation (He et al., 2019), and parsing (McClosky et al., 2006). Generally, sophisticated learning algorithms (Sohn et al., 2020) and a large corpus of task-relevant data (Xie et al., 2020a) are required for self-training to work well.
From the algorithm perspective, FixMatch (Sohn et al., 2020) is a simple and effective self-training framework for image classification, which unifies consistency regularization and pseudo-labeling. In our work, we transfer this useful framework to language model few-shot learning by exploring various text augmentation techniques for fine-tuning the pre-trained language model. From the data perspective, several recent works have shown the effectiveness of self-training for language model fine-tuning (Du et al., 2020; Schick and Schütze, 2020, 2021) by leveraging a large amount of unlabeled data. PET (Schick and Schütze, 2020) adopts prompt-based fine-tuning and self-training for language model few-shot learning. This approach assumes the presence of a large number of unlabeled in-domain data (roughly 10,000 examples per class). In addition, Du et al. (2020) propose to retrieve task-relevant unlabeled data from a large-scale open-domain sentence bank. A paraphrase-based universal sentence encoder is designed to output sentence-level vectors for computing the cosine similarity between labeled sentences and unlabeled ones in the sentence bank. Unlike these prior studies, which rely on a large amount of unlabeled data and expensive computational resources, we do not assume the availability of abundant in-domain unlabeled data. Instead, we tackle the in-domain data constraint by improving data efficiency, i.e., proposing a scalable and effective self-training framework that leverages only a few unlabeled samples.

Figure 1: The learning process of SFLM on both labeled and unlabeled samples with three loss terms. For the supervised loss term L_s, a pre-trained language model with an MLM head predicts the word at the masked position of the template; the predicted word is then mapped to the corresponding label via a manually defined task-specific word-to-label mapping M. Two loss terms are computed on the unlabeled data: (1) masked language modeling yields the self-supervised loss; (2) a weakly augmented (dropout) view of the sentence produces the pseudo-label, and the prediction for the strongly augmented (random mask) view is trained to match this pseudo-label via the self-training loss.

Methodology
Problem setup: Our goal is to adapt pre-trained language models to downstream tasks in a few-shot setting. The model m should correctly classify unseen examples leveraging very few labeled data points per class. Let X denote a small labeled training set with N samples per class, and let U denote an unlabeled dataset from the same task domain as X. We assume that this unlabeled dataset also has very limited size, µN samples per class, where µ is the ratio of the size of U to that of X.
During training, each batch consists of B labeled data points, X_B, and µB unlabeled data points, U_B. Figure 1 illustrates the learning process with an example containing one labeled and three unlabeled data samples. SFLM is optimized with the following loss function:

L = L_s + λ_1 L_st + λ_2 L_ssl

where L_s is the prompt-based supervised loss applied to the labeled data (Gao et al., 2020), L_st and L_ssl are the self-training loss and self-supervised loss applied to the unlabeled data respectively, and λ_1 and λ_2 are fixed scalar hyper-parameters controlling the relative weight of the unlabeled loss terms.
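A minimal sketch of combining the three loss terms; the loss values and weights below are placeholders, not values from the paper:

```python
def total_loss(l_s, l_st, l_ssl, lambda1=1.0, lambda2=1.0):
    """Overall objective: L = L_s + lambda1 * L_st + lambda2 * L_ssl.

    lambda1 and lambda2 are fixed scalar weights on the unlabeled loss terms.
    """
    return l_s + lambda1 * l_st + lambda2 * l_ssl

# e.g. supervised loss 0.8, self-training loss 0.5, MLM loss 2.0 (toy values)
loss = total_loss(0.8, 0.5, 2.0, lambda1=1.0, lambda2=0.5)
```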
Prompt-based supervised loss: The prompt-based supervised loss is motivated by LM-BFF (Gao et al., 2020). Classification is reformulated as a language modeling task, in which the probability of class prediction y_i ∈ Y is

p(y_i | x_i) = p([MASK] = M(y_i) | x_prompt_i)

where M refers to a mapping from task labels to the corresponding words, and x_prompt_i is the input sentence reconstructed with a task-specific template. For instance, in a sentence-level binary classification task, the input sentence x_i is reconstructed as

x_prompt_i = x_i ∘ "It was [MASK]."

where ∘ denotes the string concatenation operation. Instead of using an additional classifier, the pre-trained masked language modeling head decides which word fills the masked position.
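As an illustration, prompt construction for a binary sentiment task might look as follows; the template string and label-word mapping here are hypothetical examples, not necessarily those used in the paper:

```python
# Hypothetical template and word-to-label mapping M for binary sentiment.
MASK = "[MASK]"
LABEL_WORDS = {"positive": "great", "negative": "terrible"}  # mapping M

def build_prompt(sentence: str, template: str = "It was {mask}.") -> str:
    """Reconstruct the input as x_i ∘ template, leaving a [MASK] slot
    for the pre-trained MLM head to fill."""
    return sentence + " " + template.format(mask=MASK)

prompt = build_prompt("No reason to watch.")
# The MLM head's probability for "great" vs. "terrible" at the [MASK]
# position then serves directly as the class probability.
```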
Then we can fine-tune the model with the standard cross-entropy loss:

L_s = -(1/B) Σ_{x_i ∈ X_B} log p(y_i | x_i)

Self-training loss: For each unlabeled sentence u_i, we obtain the weakly-augmented version α(u_i) and the strongly-augmented version A(u_i), where α and A refer to different augmentation strategies. The self-training process consists of two stages. First, we assign a pseudo-label to each unlabeled sentence in the batch by computing the output probability distribution for the weakly-augmented input sentence α(u_i), defined as q_i = p_m(y_i | α(u_i)). The pseudo-label q̂_i is obtained as q̂_i = argmax(q_i). Second, we compute the prompt-based cross-entropy loss between the pseudo-label and the prediction for the strongly-augmented input sentence A(u_i). The self-training loss is defined as

L_st = (1/(µB)) Σ_{u_i ∈ U_B} 1(max(q_i) ≥ τ) · H(q̂_i, p_m(y | A(u_i)))

where τ is the threshold above which we retain a pseudo-label, 1(·) is the indicator function, and H(·, ·) denotes cross-entropy. Sohn et al. (2020) adopt AutoAugment (Cubuk et al., 2018) for image augmentation and highlight the importance of applying proper augmentation techniques in self-training. Text augmentation can be tricky due to the discrete nature of text data. Recent successes in representation learning (Devlin et al., 2019; Gao et al., 2021) motivate us to rely purely on dropout for our weak augmentation and on random token masking for our strong augmentation.
Specifically, the surface form of a weakly-augmented sentence remains unchanged: α(u_i) = u_i. For strong augmentation, we randomly replace 15% of the tokens in A(u_i) with the special mask token, [MASK]. Then we input α(u_i) and A(u_i) to the language model separately, so the two input sentences undergo independent dropout operations (0.1 dropout rate by default), which can be considered part of the data augmentation process (the green arrow in Figure 1). We empirically show in Section 4 that our proposed augmentation techniques are superior to other common text augmentation techniques.
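A token-level sketch of the two augmentation views (assuming whitespace tokenization for simplicity; the real model operates on subword tokens, and dropout is applied inside the model rather than to the text):

```python
import random

MASK = "[MASK]"

def weak_augment(tokens):
    """Weak view: surface form unchanged; the model's internal dropout
    provides the perturbation."""
    return list(tokens)

def strong_augment(tokens, ratio=0.15, rng=None):
    """Strong view: replace ~15% of the tokens with [MASK]."""
    rng = rng or random.Random(0)
    n_mask = max(1, round(len(tokens) * ratio))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    return [MASK if i in positions else tok for i, tok in enumerate(tokens)]

tokens = "a quietly moving portrait of small town life".split()
weak = weak_augment(tokens)      # identical to the input
strong = strong_augment(tokens)  # one or more tokens replaced by [MASK]
```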

Self-supervised loss:
We also include an auxiliary self-supervised loss term, L_ssl, for regularization purposes. The masked language modeling loss is used for its simplicity and efficiency.
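A minimal sketch of this MLM objective: the loss is the average negative log-probability of the original tokens at the masked positions, with toy probabilities standing in for the MLM head's outputs:

```python
import math

def mlm_loss(original_tokens, masked_tokens, predicted_probs):
    """Average -log p(original token) over masked positions only.

    predicted_probs[i] maps candidate tokens to probabilities at position i.
    """
    losses = [
        -math.log(predicted_probs[i][tok])
        for i, (tok, masked) in enumerate(zip(original_tokens, masked_tokens))
        if masked == "[MASK]"
    ]
    return sum(losses) / len(losses)

# Toy distributions at each position; only masked positions contribute.
probs = [{"the": 0.9}, {"movie": 0.5}, {"rocks": 0.25}]
loss = mlm_loss(["the", "movie", "rocks"],
                ["the", "[MASK]", "[MASK]"], probs)
```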

Baselines
We consider three baselines: standard fine-tuning (FT), supervised prompt-based fine-tuning (LM-BFF) (Gao et al., 2020), and semi-supervised learning (PET) (Schick and Schütze, 2021). We use RoBERTa-base (Liu et al., 2019), which has 125M parameters, and the same task-specific manual prompt from Gao et al. (2020), including template and word-to-label mapping, for prompt-based fine-tuning.

LM-BFF: We adopt LM-BFF (Gao et al., 2020) as our supervised baseline. We use the prompt-with-demonstration implementation (Gao et al., 2020) across all tasks for fair comparison, and retrain the model with the official code (https://github.com/princeton-nlp/LM-BFF).

PET: For fair comparison, we re-implement our own version of PET based on LM-BFF (Gao et al., 2020), since LM-BFF benefits largely from prompt-based fine-tuning. Specifically, we remove the knowledge distillation on the standard sequence classifier in the original implementation. Instead, we fine-tune the prompt-based language model (Gao et al., 2020) with a mixed training set of labeled and pseudo-labeled data, iteratively increasing the amount of pseudo-labeled data in the training set. Through extensive experiments, we find that our implementation outperforms the official one (https://github.com/timoschick/pet) across various tasks. In addition, we evaluate PET under two settings: (1) using a reduced unlabeled dataset, the same as for SFLM; (2) using the full training set for self-training, while limiting the number of unlabeled samples to 10,000 per class so that the amount of unlabeled samples across different tasks is kept in the same range.

Main Results

Table 1 presents the performance of SFLM against the baselines across various benchmark tasks. Overall, our proposed SFLM consistently outperforms the supervised and semi-supervised methods by 2% on average with the same amount of data, and by 1.2% with 45 times less unlabeled data.

Next, we summarize our observations over the experimental results. First, we find that self-training can greatly improve the performance of vanilla prompt-based fine-tuning under the few-shot setting, whether using PET or SFLM. With a large amount of in-domain unlabeled data, even a simple iterative self-training approach boosts performance by 3.13% on Subj and by 0.88% on average (PET-full vs. LM-BFF). This demonstrates the effectiveness of exploiting the rich information carried in the unlabeled data.

Second, in-domain unlabeled data are crucial to the success of semi-supervised methods. As we down-sample the unlabeled dataset to 64 samples per class, the performance of PET barely improves over LM-BFF, and even degrades on some tasks. For example, on tasks such as RTE, CR, and MRPC, whose full unlabeled datasets contain fewer than 5,000 sentences, PET does not perform as well as LM-BFF even when using the full unlabeled dataset.
Third, unlike PET-few, SFLM brings a significant improvement over LM-BFF even when the size of the unlabeled dataset is limited. This observation implies that our approach utilizes the unlabeled data much more efficiently. SFLM also outperforms PET on 8 out of 12 tasks. The exceptions are the four natural language inference tasks, whose unlabeled datasets contain 10,000 samples. However, the performance difference between PET-full, which uses all 10,000 unlabeled samples, and SFLM on these four tasks is insignificant (less than 0.4%). In addition, SFLM generally has lower variance than the baselines.
The major difference between SFLM and LM-BFF is that we utilize the rich information carried in the unlabeled data. The experimental results confirm our hypothesis about the usefulness of semi-supervised learning for language model few-shot learning. Furthermore, the major difference between SFLM and PET lies in the self-training algorithm: we include a strong data augmentation technique for consistency regularization. The results confirm that our proposed text augmentation techniques are crucial to the success of SFLM.

Analysis of Data Efficiency
One of the key questions in this study is how many labeled and unlabeled data are required. We provide an answer in Figure 2, which illustrates the performance of SFLM with different combinations of N and µ. The error rate generally reduces as µ increases, which suggests SFLM benefits from more unlabeled data, with the exception of SST-5, where the best performance is achieved at µ = 2. A similar trend is also observed for PET (see Table 1).
We note that SST-5 is the most difficult of the six tasks. We observe that, with a relatively weak teacher model, self-training does not benefit from more unlabeled data. We speculate that increasing the unlabeled data introduces more noise into the training process when the starting point is weak. In Figure 2, we also observe that, for the simpler tasks, e.g., SST-2 and Subj, the gain from more unlabeled data saturates rapidly once four times the unlabeled data are given, which is consistent with the finding in (Sohn et al., 2020). We are encouraged to see that SFLM continues to improve as µ increases for the other three tasks and saturates later. In general, the trend of performance improvement with varying µ is consistent across different values of N. For instance, the performance gain is ∼0.65% for increasing µ from 2 to 4 across different N.
To better understand what SFLM actually improves over the baselines, we further analyze the distribution of incorrectly labeled examples corrected by SFLM, motivated by (Wei et al., 2020). First, we partition incorrectly labeled examples into five bins based on the cosine similarity of their sentence embeddings, given by SimCSE (Gao et al., 2021), to the average embedding of the training set. Next, we show the percentage of examples in each bin whose labels are corrected by SFLM. As shown in Figure 3, SFLM is more likely to correct examples within the gold-labeled data's neighborhood (the region surrounding a data point in the embedding space) than those away from it. In other words, the performance gain of SFLM is related not only to the unlabeled data size but also to the data distribution.

SFLM is also scalable with different amounts of labeled data, consistently outperforming LM-BFF by 3.2% for N = 4 and by 2% for N = 32 on average. In particular, SFLM exhibits significant improvement over LM-BFF under the extremely few-shot scenario. When N = 32, the performance of LM-BFF saturates on simple tasks, e.g., SST-2 and CR. However, SFLM continues to narrow the gap between few-shot learning and fine-tuning on the entire labeled dataset, which further validates the benefit of self-training to supervised learning.
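The binning analysis described above can be sketched as follows, with toy 2-d vectors standing in for SimCSE sentence embeddings; five bins partition the cosine-similarity range [-1, 1]:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mean_vector(vectors):
    """Centroid of a set of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def bin_by_similarity(embeddings, centroid, n_bins=5):
    """Assign each embedding a bin index in [0, n_bins) by cosine similarity
    to the training-set centroid, mapping [-1, 1] onto the bins."""
    bins = []
    for e in embeddings:
        sim = cosine(e, centroid)
        idx = min(int((sim + 1) / 2 * n_bins), n_bins - 1)
        bins.append(idx)
    return bins

train = [[1.0, 0.0], [0.9, 0.1]]          # toy training embeddings
centroid = mean_vector(train)
bins = bin_by_similarity([[1.0, 0.05], [-1.0, 0.0]], centroid)
# nearby example lands in the top bin; a distant one in the bottom bin
```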

Augmentation Techniques
It has been shown that strong data augmentation plays a crucial role in semi-supervised visual representation learning (Zoph et al., 2020; Xie et al., 2020b; Sohn et al., 2020). Images can be augmented easily by cutout, flipping, and cropping (Zoph et al., 2020). However, little work has been done on augmentation techniques for text (Xie et al., 2020b).
Here, we study how different augmentation techniques affect model performance. We fix Dropout as the weak augmentation and present the results of alternative strong augmentation approaches, including Dropout, Crop, Swap, and Deletion, in Table 2. Specifically, Dropout is the same as the weak augmentation: we directly forward the original sentence into the language model. Crop refers to randomly cropping the original sentence into a continuous span of 85% of the original length. For Swap, we randomly swap two tokens in the sentence and repeat the procedure 3 times. For Deletion, we randomly delete 15% of the tokens in a sentence. Mask refers to randomly replacing 15% of the tokens in a sentence with the special [MASK] token. In this experiment, we keep N = 16 and µ = 4.
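For concreteness, the alternative strong augmentations can be sketched at the token level (whitespace tokenization and the stated percentages are assumed; the paper's exact implementation may differ):

```python
import random

def crop(tokens, keep=0.85, rng=None):
    """Crop: keep a random contiguous span covering ~85% of the length."""
    rng = rng or random.Random(0)
    span = max(1, round(len(tokens) * keep))
    start = rng.randrange(len(tokens) - span + 1)
    return tokens[start:start + span]

def swap(tokens, n_swaps=3, rng=None):
    """Swap: randomly exchange two tokens, repeated n_swaps times."""
    rng = rng or random.Random(0)
    out = list(tokens)
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out

def deletion(tokens, ratio=0.15, rng=None):
    """Deletion: randomly remove ~15% of the tokens."""
    rng = rng or random.Random(0)
    drop = set(rng.sample(range(len(tokens)), max(1, round(len(tokens) * ratio))))
    return [t for i, t in enumerate(tokens) if i not in drop]

tokens = "the plot is not predictable at all".split()
```

Note how Deletion can silently remove a pivotal token such as "not", which is the semantic hazard discussed below; Swap preserves the multiset of tokens, and Crop preserves local order.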
We observe that the SFLM framework also works with Dropout and Swap, as they still outperform PET on average by 0.4 and 0.2 points respectively, although they are less effective than Mask. Another interesting finding is that Deletion, which is similar to Mask, yields poor performance. We hypothesize that Deletion and Crop may adversely affect the original semantics; for example, deleting the word "not" may reverse the meaning of the original sentence. In contrast, Mask keeps the structure of sentences, making it easier to maintain the semantics under the consistency regularization and MLM objectives.
Furthermore, we empirically study the effect of different masking ratios on SST-2: an accuracy of 90.62% is obtained with 10% masking, 90.14% with 20% masking, and the best performance of 91.0% with 15% masking.

Model Scale
To put the SFLM framework under a stress test, we further apply SFLM to a smaller language model, DistilRoBERTa with 84M parameters, as reported in Table 3. While DistilRoBERTa shows performance similar to RoBERTa under full supervision, it performs much worse under few-shot learning scenarios; e.g., LM-BFF suffers a significant performance drop of 5% on CR and MPQA. We hypothesize that a robust language model of reasonable footprint is required for effective self-training.
While SFLM consistently outperforms LM-BFF with a smaller language model, the overall performance of SFLM based on DistilRoBERTa-base also degrades sharply compared with that based on RoBERTa-base. This suggests the crucial role of the backbone language model in our approach.

Table 3: Accuracy (%) for systems with language models of different size. We use N = 16 (# labeled examples per class) and µ = 4 (ratio of unlabeled to labeled data) for few-shot experiments. ♠, ♥ follow the same definitions as in Table 1. DistilRoBERTa-base: 6-layer RoBERTa distilled from RoBERTa-base with 2 times speedup.
Meanwhile, we find that PET better suits the small language model setting. For instance, the performance gain of PET-full over LM-BFF increases from 0.8% to 1.1% when moving from the RoBERTa-base to the DistilRoBERTa-base setting. We hypothesize that more unlabeled data benefit few-shot learning of smaller language models. However, the performance of PET is still worse than that of SFLM under the DistilRoBERTa-base setting.
Overall, the above observations confirm the effectiveness of our approach for language model few-shot learning regardless of the backbone language model.

Zero-shot Transfer Across Tasks
Lastly, we show that our proposed method can be easily extended to zero-shot transfer across tasks. Specifically, we assume that before few-shot learning on the target unlabeled dataset U, the learner has access to an annotated dataset D of a source task, known as the base dataset. Accordingly, we modify the learning objective as:

L = L_s^D + λ_1 L_st^U + λ_2 L_ssl^U

where L_s^D is the prompt-based supervised loss on the base dataset, and the last two terms, computed on the target unlabeled data, are intended to encourage the model to capture knowledge specific to the target task. We evaluate SFLM on three binary classification tasks, covering sentiment analysis (MR and SST) and product reviews (CR). In the experiments, we use one of the three tasks as the source task and test on the other two as target tasks. The SFLM model is based on RoBERTa-base and trained on 64 samples per class for D and U, respectively. We compare SFLM with a baseline language model fine-tuned directly on the source dataset with the single prompt-based loss L_s^D. The results are reported in Table 4. SFLM, which uses a small number of unlabeled target data, generally outperforms supervised fine-tuning on the source dataset alone. In particular, with MR as the source and CR as the target, SFLM scores 1.3 points higher than the baseline model, which demonstrates that our proposed method has the flexibility to be applied to the cross-task zero-shot learning scenario.

Conclusion and Future Work
In this paper, we present SFLM, a simple and effective self-training framework for few-shot learning of language models. Our approach addresses the few-shot language model fine-tuning problem with very limited labeled and unlabeled data, using a self-training loss term that unifies pseudo-labeling and consistency regularization. Through comprehensive experiments, we show that our approach outperforms previous state-of-the-art methods across tasks, data amounts, and model scales. Despite its efficiency, SFLM also has several limitations. Compared to standard fine-tuning, SFLM requires more computational resources to process unlabeled data. In addition, the performance gain from self-training is not proportional to the amount of unlabeled data. We leave to future study how to utilize large amounts of unlabeled data efficiently through self-training.

Table 4: Accuracy (%) for zero-shot task transfer between three sentiment classification tasks with RoBERTa-base. Transfer: a language model fine-tuned on the source dataset with a single prompt-based loss. We refer SST to