Towards Unified Prompt Tuning for Few-shot Text Classification

Prompt-based fine-tuning has boosted the performance of Pre-trained Language Models (PLMs) on few-shot text classification by employing task-specific prompts. Yet, PLMs are unfamiliar with prompt-style expressions during pre-training, which limits few-shot learning performance on downstream tasks. It would be desirable if models could acquire some prompting knowledge before adaptation to specific NLP tasks. We present the Unified Prompt Tuning (UPT) framework, leading to better few-shot text classification for BERT-style models by explicitly capturing prompting semantics from non-target NLP datasets. In UPT, a novel paradigm named Prompt-Options-Verbalizer is proposed for joint prompt learning across different NLP tasks, forcing PLMs to capture task-invariant prompting knowledge. We further design a self-supervised task named Knowledge-enhanced Selective Masked Language Modeling to improve the PLM's generalization abilities for accurate adaptation to previously unseen tasks. After multi-task learning across multiple tasks, the PLM can be better prompt-tuned towards any dissimilar target tasks in low-resourced settings. Experiments over a variety of NLP tasks show that UPT consistently outperforms state-of-the-art approaches for prompt-based fine-tuning.


Introduction
The emergence of Pre-trained Language Models (PLMs) has boosted the performance of a variety of NLP tasks (Qiu et al., 2020; Han et al., 2021a). However, during fine-tuning, PLMs can perform poorly with few training samples due to model over-fitting (Gao et al., 2021).
To alleviate this problem in low-resourced scenarios, natural language prompts have been applied to enable few-shot or zero-shot learning with PLMs (Liu et al., 2021a). To make prompts more flexible and task-adaptive, prompt tuning freezes the PLM backbone and adjusts the representations of prompts (Lester et al., 2021). This type of method is especially suitable for ultra-large PLMs that are difficult to tune. For BERT-style PLMs, prompt-based fine-tuning has been proposed, transforming text classification tasks into cloze-style problems (Schick and Schütze, 2021a,b; Gao et al., 2021). Specifically, task-specific discrete templates with masked language tokens are added to input texts. The tokens predicted at the masked positions by the Masked Language Modeling (MLM) head are used for class label prediction. Therefore, the pre-trained knowledge acquired by PLMs can be better utilized by "re-using" the MLM training objective. Witnessing the successful usage of prompts for few-shot learning, various follow-up works have been conducted, such as continuous prompt encoding (Liu et al., 2021c), knowledgeable prompt learning (Hu et al., 2021), and prompt generation (Shin et al., 2020).
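As a concrete illustration, prompt-based fine-tuning wraps the input in a cloze template and reads the class off the word predicted at the masked position. The sketch below is a minimal, PLM-free rendering of that idea: `toy_scorer` is a stand-in for a real MLM head, and the template and label words follow the sentiment example used later in this paper.

```python
# Sketch: recasting sentiment classification as a cloze task.
# `score_label_words` is a stand-in for an MLM head that returns a score
# for each candidate label word at the [MASK] position.

TEMPLATE = "{text} It is [MASK]."
VERBALIZER = {"great": "positive", "terrible": "negative"}

def build_cloze_input(text: str) -> str:
    """Wrap the raw input with a task-specific cloze template."""
    return TEMPLATE.format(text=text)

def predict_label(text: str, score_label_words) -> str:
    """Score each label word at [MASK] and map the best one to a class."""
    cloze = build_cloze_input(text)
    scores = {w: score_label_words(cloze, w) for w in VERBALIZER}
    best_word = max(scores, key=scores.get)
    return VERBALIZER[best_word]

# Toy scorer: counts crude sentiment cues instead of running a PLM.
def toy_scorer(cloze: str, word: str) -> float:
    cues = {"great": ["wonderful", "good"], "terrible": ["awful", "bad"]}
    return sum(cloze.lower().count(c) for c in cues[word])
```

With a real model, `toy_scorer` would be replaced by the MLM head's logits at the masked position; the verbalizer mapping stays the same.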
Figure 1: UPT is a unified framework that learns prompting knowledge from non-target NLP datasets to improve the performance on target tasks, in the format of Prompt-Options-Verbalizer (Sect. 2.2). Figures a) and b) show examples of supervised and self-supervised learning tasks (i.e., Knowledge-enhanced Selective MLM, Sect. 2.3).

Recently, a few works (Wei et al., 2021; Zhong et al., 2021a; Mishra et al., 2021) focus on multi-task prompt tuning on ultra-large PLMs. Specifically, they tune PLMs on full training samples from different tasks to force PLMs to learn more prompting knowledge, and directly make predictions over the target task by zero-shot learning. Yet, we observe that for BERT-style PLMs, the performance is not satisfactory for two reasons. 1) These PLMs are sensitive to different designs of prompt templates and verbalizers (Liu et al., 2021c), and hence fail to adapt to target tasks with new prompts and verbalizers. 2) There are word distribution differences between prompt-style texts and sentences in pre-training corpora. It would be better if BERT-style PLMs could acquire some prompting knowledge before they are adapted to downstream tasks. Therefore, a natural question arises: how can we make BERT-style PLMs adapt to target NLP tasks accurately with more prompting knowledge?
To address these issues, we introduce a novel framework named Unified Prompt Tuning (UPT), facilitating better few-shot text classification performance for BERT-style models by explicitly capturing general prompting semantics from non-target datasets. Specifically, we propose a unified paradigm named Prompt-Options-Verbalizer (POV), which enables mixture prompt-tuning over a series of non-target NLP tasks of varied types. To further improve the model's generalization abilities on previously unseen tasks, we propose a novel auxiliary task named Knowledge-enhanced Selective MLM (KSMLM), which mimics the behavior of MLM with explicit usage of prompts following the POV paradigm. After multi-task training is completed, the underlying PLM can be fine-tuned to fit any few-shot tasks using the same prompting paradigm.
In the experiments, we verify the effectiveness of UPT over public NLP datasets of various tasks.
Experimental results show that UPT consistently outperforms state-of-the-art approaches for prompt-based few-shot fine-tuning. In summary, we make the following major contributions:

• We introduce the novel UPT framework to improve prompt-based fine-tuning for BERT-style models, which captures unified prompting semantics from multiple source tasks of various types for few-shot text classification on new target tasks.
• In UPT, a new paradigm POV is proposed for joint prompt tuning across different NLP tasks. We further design the self-supervised KSMLM task to improve the PLM's generalization abilities for accurate task adaptation.
• Extensive experiments over various NLP datasets show that UPT consistently outperforms state-of-the-art methods for prompt-based few-shot fine-tuning by a relatively large margin.

UPT: The Proposed Framework
We start with a brief overview of the UPT framework, followed by its detailed techniques. In UPT, the model is firstly trained over all the source tasks T^(1), …, T^(M), aiming to learn the semantics of prompts and the general methodology of solving downstream tasks by prompting. After that, it is prompt-tuned over a specific target task T* in the low-resourced scenario. To unify the learning process, each training sample i in all different tasks (either source or target) is associated with a triple (P_i, O_i, V_i). Here, P_i is the prompt. O_i is the expression containing all possible options of the masked language token appearing in the prompt P_i (i.e., the collection of label words). V_i is the verbalizer that maps the target token predicted by the MLM head of the PLM to the class label. Readers can also refer to the examples of supervised learning tasks in Figure 1.
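A minimal sketch of the unified sample format may help: every training sample, whatever its source task, is serialized as input text plus options plus prompt, and the verbalizer maps the word predicted at [MASK] back to a class label. The exact concatenation order below is an illustrative assumption, not the paper's specification.

```python
# Sketch of the unified POV format: one serialization for all tasks.

def build_pov_input(text: str, options: str, prompt: str) -> str:
    """Serialize one training sample as: text + options O_i + prompt P_i."""
    return f"{text} {options} {prompt}"

def verbalize(predicted_word: str, verbalizer: dict) -> str:
    """V_i: map the MLM-head prediction at [MASK] to a class label."""
    return verbalizer[predicted_word]

# Two samples from different task types share one format (cf. Figure 1).
sentiment = build_pov_input(
    "It is a wonderful movie.", "Is it great or terrible?", "It is [MASK].")
nli = build_pov_input(
    "It is sunny today. [SEP] There is no rain today.",
    "Is it entailment, neutral or contradictory?", "It is [MASK].")
```

The point of the shared format is that a single MLM head serves every task; only the (P_i, O_i, V_i) triple changes per task.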
In addition, we observe that the diversity of label words in the original labeled tasks T^(1), …, T^(M) is limited. For previously unseen tasks, optimizing over these tasks alone often leads to a poorly generalized model that is biased towards them. Therefore, we further introduce the self-supervised Knowledge-enhanced Selective MLM (KSMLM) task as an auxiliary task. Specifically, we take the sentences from the source task training sets D^(1), …, D^(M) as inputs. These sentences are selectively masked, with options generated by rich knowledge mined from a massive corpus. An example is also shown in Figure 1. Hence, the model has better generalization abilities and avoids catastrophic forgetting of pre-training knowledge.

The Unified Prompting Paradigm
A fundamental challenge for prompt-based training across D^(1), …, D^(M) for BERT-style models is that different NLP tasks have diverse sets of label words w.r.t. masked language tokens. When dealing with a mixture of training samples, a naive solution is to build a unified output prediction space consisting of candidate label words from all tasks. However, the enlarged output space makes it challenging for the PLM to optimize. Additionally, the output prediction space may not cover the label words of all possible unseen NLP tasks.
Here, we propose a unified prompting paradigm that augments each sample i with a Prompt-Options-Verbalizer (POV) triple (P_i, O_i, V_i). P_i is the prompt that provides task guidance (in line with PET (Schick and Schütze, 2021a,b)). O_i is a fixed expression that explicitly provides a selection for the model over all its candidate label words. To facilitate fast adaptation to arbitrary tasks, the verbalizer V_i maps the output of the masked language token to the entire vocabulary V. We can see that the options are crucial as they give strong indications of the possible outputs of the PLM (i.e., the candidates). Overall, the output probability q(v|i, P_i, O_i, Θ) of the token v ∈ V w.r.t. the training sample i is computed as follows:

q(v|i, P_i, O_i, Θ) = exp(s(v|i, P_i, O_i, Θ)) / Σ_{v′∈V} exp(s(v′|i, P_i, O_i, Θ)),

where s(v|i, P_i, O_i, Θ) is the un-normalized score of the MLM head (before the softmax function) for generating token v at the position of the masked language token with i, P_i and O_i as inputs. Denote the entire prediction vector (of length |V|) as Q(V|i, P_i, O_i, Θ). The multi-task prompting loss (denoted as L_MP) can be written as follows:

L_MP = − Σ_{i∈D} P(V|i, P_i, O_i, Θ) · log Q(V|i, P_i, O_i, Θ),

where D = ∪_{k=1}^{M} D^(k), and P(V|i, P_i, O_i, Θ) is the one-hot ground-truth prediction vector.
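The softmax-plus-cross-entropy computation above can be written out in a few lines; this is a stdlib-only sketch of the reconstructed formulas, not the paper's PyTorch implementation, and operates on plain score lists rather than tensors.

```python
import math

def softmax(scores):
    """q(v | i, P_i, O_i, Θ): normalize the MLM head's un-normalized
    scores s(v | ...) over the vocabulary."""
    m = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def multi_task_prompting_loss(score_vectors, gold_indices):
    """L_MP: cross-entropy between the one-hot ground-truth vector P and
    the prediction Q, summed over training samples."""
    total = 0.0
    for scores, gold in zip(score_vectors, gold_indices):
        q = softmax(scores)
        total += -math.log(q[gold])      # the one-hot gold picks a single term
    return total
```

For instance, a sample with two equal scores contributes exactly log 2 to the loss, since the model assigns probability 1/2 to the gold token.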
In addition, we notice that D^(1), …, D^(M) can be arbitrary labeled datasets with varied sizes. Optimizing L_MP directly on the original datasets would make the few-shot learner more likely to be biased towards larger datasets. In our work, we perform stratified sampling to form each batch, where a training sample i from D^(1), …, D^(M) is picked with a probability w_i proportional to a smoothed version of its own dataset size, i.e.,

w_i = |D^(k)|^γ / Σ_{k′=1}^{M} |D^(k′)|^γ,

where γ > 0 is a smoothing factor and i ∈ D^(k). Hence, we re-formulate L_MP as the weighted multi-task prompting (WMP) loss L_WMP:

L_WMP = − Σ_{i∈D} w_i · P(V|i, P_i, O_i, Θ) · log Q(V|i, P_i, O_i, Θ).
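The smoothed sampling rule can be sketched in a few lines. This follows one plausible reading of the formula above (sizes raised to the power γ and renormalized); note how the default γ = 0.001 used in the experiments flattens the distribution so large datasets do not dominate.

```python
def sampling_probs(dataset_sizes, gamma=0.001):
    """Smoothed stratified sampling: w ∝ |D^(k)|^γ over dataset sizes.
    gamma=1 recovers size-proportional sampling; gamma→0 approaches
    uniform sampling across datasets."""
    weights = [n ** gamma for n in dataset_sizes]
    z = sum(weights)
    return [w / z for w in weights]
```

For example, with sizes [1000, 10] and γ = 0.001, the two datasets are sampled almost uniformly, whereas γ = 1 would pick the larger dataset 99% of the time.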

Extending Unified Prompting to Self-supervised Learning

One drawback of the above approach is that the diversity of label words in these supervised learning tasks is usually limited, covering a narrow spectrum of the vocabulary V. The model would not generalize well to tasks with new label words. Hence, we leverage the idea of MLM pre-training, formulated by the POV paradigm. As a naive approach, given a sentence, we can randomly mask a word, generate options consisting of the correct word and a randomly selected word, and then ask the model to make the prediction. Unfortunately, this seemingly feasible approach may ruin the training process, because not all words are suitable label words. For example, stop words and a large number of verbs and adverbs are not used in any verbalizers of downstream tasks. The alternatives used in options should be reasonable, in order to make the model learn truly useful knowledge. To address the issue, we present the self-supervised KSMLM task, with an example shown in Figure 2. In the following, we describe the POV construction process for KSMLM. After that, the loss function of the task is given.

P-Generation. This process aims to generate a template with a [MASK] token for each sentence, which is fixed to be "It is [MASK]." during the multi-task training stage. In the task-specific fine-tuning stage, we follow LM-BFF (Gao et al., 2021) to automatically generate templates for each task.
During training, the PLM is asked to predict the actual word at the masked position.

O-Generation. From Gao et al. (2021), we can see that most label words for language understanding tasks are adjectives (such as "great" and "terrible" for sentiment analysis). Thus, in our work, we detect all adjectives in the corpus with part-of-speech tagging models and filter out low-frequency adjectives. The adjectives are then clustered by K-Means, with their token representations generated from the underlying PLM as features. Formally, we construct a knowledge repository named Options Knowledge Repository (OKR), in the form of triples R = {(v, ⃗v, c_v)}, where v is a candidate label word, and ⃗v and c_v denote the representation vector and the cluster membership of v, respectively. The cluster centroids are also stored. We do not use existing lexicons such as WordNet (Miller, 1995) because they may have limited coverage of label words. Additionally, the automatic process enables the extension of our algorithm to arbitrary languages and domains.
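The OKR construction can be sketched with a toy, stdlib-only K-Means over a handful of 2-d vectors; the paper's actual repository uses PLM token representations and, presumably, a standard K-Means implementation, so treat the function names and data layout below as illustrative assumptions.

```python
import random

def kmeans(vectors, k, iters=20, seed=0):
    """Toy K-Means (stdlib only): returns (centroids, cluster assignments)."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    assign = [0] * len(vectors)
    for _ in range(iters):
        # Assign each vector to its nearest centroid (squared Euclidean).
        for i, v in enumerate(vectors):
            assign[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])))
        # Recompute each centroid as the mean of its members.
        for c in range(k):
            members = [vectors[i] for i in range(len(vectors)) if assign[i] == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assign

def build_okr(words, vectors, k=2):
    """Options Knowledge Repository: triples (word, vector, cluster id),
    plus the cluster centroids, as described for R = {(v, v⃗, c_v)}."""
    centroids, assign = kmeans(vectors, k)
    repo = [(w, v, c) for w, v, c in zip(words, vectors, assign)]
    return repo, centroids
```

With real embeddings, adjectives of similar polarity would land in the same cluster, which is what the option-generation step below relies on.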
With the availability of R, we can generate knowledge-induced options. Given a sentence with the masked word v, we query v against R for the most dissimilar cluster w.r.t. v, denoted as c̃_v, where the cosine similarity between the vector representation ⃗v and the cluster centroid is employed as the similarity measure. Finally, we randomly select one adjective from c̃_v as the alternative label word to generate the knowledge-induced options. The text expression of the options is fixed, i.e., "Is it [x1] or [x2]?". Readers can further refer to the example in Figure 2.

V-Generation. For verbalizers, we map the true and the generated label words in the options to two classes, namely Class: Correct and Class: Incorrect. For instance, the verbalizers of the sample sentence in Figure 2 are:

It is "effective". → "Class: Correct"
It is "ineffective". → "Class: Incorrect"

Loss Function. The KSMLM loss is significantly different from the auxiliary MLM loss used in Schick and Schütze (2021a,b). In D, each training sample i can be directly extended to a training example for KSMLM by the POV construction process, with exactly one masked token, the knowledge-induced options O_i and the prompt P_i. The PLM is trained to predict the correct masked word in the sentence, with the loss function:

L_KSMLM = − Σ_{i∈D} P(V|i, P_i, O_i, Θ) · log Q(V|i, P_i, O_i, Θ).

Overall, the loss function of UPT, L, is defined as the weighted summation of the WMP and KSMLM losses:

L = L_WMP + λ · L_KSMLM,

where λ ≥ 0 is the balancing hyper-parameter.

Discussion. To our knowledge, external knowledge has also been applied in other prompt-based methods, such as KPT (Hu et al., 2021). The major difference between KPT and ours is that UPT uses the knowledge for options creation in the proposed self-supervised KSMLM task, in order to improve the model's generalization abilities for accurate adaptation to new tasks. In contrast, previous works consider the expansion of verbalizers for specific downstream NLP tasks.
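The option-generation step described above — query the OKR for the cluster least similar to the masked word, pick an alternative from it, and fill the fixed options expression — can be sketched as follows. The repository layout matches the `(word, vector, cluster)` triples above; the function names are ours.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def knowledge_induced_options(masked_word, repo, centroids, rng=None):
    """Find the cluster least similar (by cosine to its centroid) to the
    masked word, pick one of its adjectives as the alternative label word,
    and fill the fixed options expression "Is it [x1] or [x2]?"."""
    rng = rng or random.Random(0)
    vec = next(v for w, v, _ in repo if w == masked_word)
    far = min(range(len(centroids)), key=lambda c: cosine(vec, centroids[c]))
    candidates = [w for w, _, c in repo if c == far and w != masked_word]
    alt = rng.choice(candidates)
    return f"Is it {masked_word} or {alt}?", alt

def ksmlm_verbalizer(true_word, alt_word):
    """V-Generation: map the true/alternative words to the two classes."""
    return {true_word: "Class: Correct", alt_word: "Class: Incorrect"}
```

On the COVID-19 example from Figure 1, masking "effective" would yield the options "Is it effective or ineffective?" with "effective" mapped to Class: Correct.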

Few-shot Fine-tuning
For a specific downstream task T*, the samples in the target few-shot training set D* can be processed and computed in the same way as those of the supervised tasks used during UPT. The learning consistency between the two stages ensures that the underlying PLM has already acquired prompting knowledge for T*. In addition, one can prompt-tune a single PLM over various tasks and use it for fine-tuning over any target tasks, making it computationally efficient to produce models for these applications.
Experiments

As mentioned above, during UPT, we only leverage full training data from dissimilar task groups, and then prompt-tune the model on the target task in the low-resourced setting. For example, when the target task is SST-2, the training data during UPT comes from NLI and Paraphrase. The underlying PLM is the RoBERTa-large model (with 355M parameters) (Liu et al., 2019), unless otherwise specified. The baselines include standard fine-tuning, and four recently proposed few-shot learning algorithms: PET (Schick and Schütze, 2021a), LM-BFF (Gao et al., 2021), P-tuning (Liu et al., 2021c) and PPT (Gu et al., 2021). To make a fair comparison with these single-task baselines, a variant of our approach (denoted as UPT-Single) is also implemented by only fine-tuning over the few-shot target task based on POV, without the usage of dissimilar supervised source datasets.
As we use other dissimilar datasets to train our model, we also include two multi-task methods that are meta-tuned on the same dissimilar datasets as strong baselines, namely MT (Zero-shot) and MT (Few-shot) (Zhong et al., 2021a). We also implement the zero-shot version of UPT, denoted as UPT (Zero-shot). In addition, given a supervised NLP task, multiple prompts can be manually crafted. By augmenting one training sample with these prompts, we can automatically realize self-ensemble learning. For the self-ensemble version of UPT, we employ five different prompts. For each input sample, we randomly select one expression of options and one set of verbalizers. We denote this method as UPT-SE. The designed prompts, options, and verbalizers are listed in the Appendix. All the results of these models are evaluated in terms of averaged accuracy and its standard deviation over 5 random seeds.
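The per-sample random selection used by UPT-SE can be sketched as below; the function name, the serialization order, and the shape of a "(prompt, options, verbalizer) set" are illustrative assumptions rather than the paper's implementation.

```python
import random

def self_ensemble_input(text, pov_sets, seed=None):
    """UPT-SE sketch: for each input sample, randomly pick one of several
    manually crafted (prompt, options, verbalizer) sets and serialize it."""
    rng = random.Random(seed)
    prompt, options, verbalizer = rng.choice(pov_sets)
    return f"{text} {options} {prompt}", verbalizer
```

Over many epochs, each sample is thus seen under several prompt formulations, which is what realizes the self-ensemble effect.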
Our UPT framework is implemented in PyTorch and run on NVIDIA V100 GPUs. Specifically, we train our model with the Adam optimizer. The learning rate for all training stages is fixed to 1e-5. We set the default hyper-parameters as γ = 0.001 and λ = 0.1, which are also tuned over the development sets. The parameter regularizers are the same as in Gao et al. (2021).

Main Results
In Table 1, we report the general experimental results of UPT and all the baselines. The results show that: 1) Prompt-learning based methods (i.e., PET (Schick and Schütze, 2021a), LM-BFF (Gao et al., 2021), P-tuning (Liu et al., 2021c) and PPT (Gu et al., 2021)) achieve large improvements over standard fine-tuning. 2) UPT-Single outperforms previous few-shot learning models on average, which indicates that the utilization of POV is better than vanilla prompts (Schick and Schütze, 2021a). 3) UPT (both the vanilla and the ensemble version) consistently outperforms all baselines on all tasks, which demonstrates that our framework possesses better generalization by learning from dissimilar groups of tasks. 4) MT (Zero-shot) (Zhong et al., 2021a) and UPT (Zero-shot) do not yield satisfactory results on BERT-style models. Different from ultra-large models, we suggest that few-shot prompt-tuning is necessary for BERT-style models to produce good results over these tasks. 5) By comparing UPT against MT (Few-shot), we can see that the proposed POV paradigm and the self-supervised KSMLM task are more effective for few-shot learning. 6) Generally, UPT-SE improves the averaged accuracy over all tasks by 1.2% compared with UPT. This means that self-ensemble learning can enhance model generalization, but the improvement is not consistent across all tasks. A possible cause is that some prompts and options are not optimal for the target task.

Model Analysis
Parameter Analysis. We conduct parameter analysis to investigate the best choice of the balancing coefficient λ. Results over SST-2 and RTE are shown in Figure 3. We obtain the best performance when λ = 0.1, which indicates that our proposed UPT generalizes well when it is jointly trained with the self-supervised KSMLM task. We also observe that the performance decreases when λ becomes larger. This means KSMLM is a suitable regularization task, but it may also introduce many prompts and options that are irrelevant to downstream tasks. This opens up new opportunities for model improvement.
Ablation Study.To clearly verify the contributions of each component in UPT, we conduct an ablation study over all groups and report the mean accuracy.
As shown in Table 3, w/o. POV denotes the method with manually designed prompts without the usage of any options. w/o. KSMLM equals the setting with λ = 0, which is the same as UPT-Single. w/o. OKR means that we randomly choose the alternative label words in the options without knowledge guidance when optimizing the KSMLM task. w/o. POV & KSMLM denotes the method without any options and without the auxiliary KSMLM task.
The results show that no matter which module is removed, the model performance is affected. Particularly, when we remove both POV and KSMLM, the performance decreases by 1.4%, 1.5%, and 4.4%, respectively. The accuracy values of this setting are lower than those of w/o. POV and w/o. KSMLM, which suggests that both components contribute substantially to the high performance of our framework. We also find that w/o. POV and w/o. KSMLM both outperform MT (Few-shot) over all groups. Additionally, we find that if we use KSMLM but remove OKR, the results decrease over all these tasks, but are still higher than w/o. KSMLM. This means that the options knowledge we mine from the corpus is suitable for the self-supervised learning task.

Sample Efficiency. We further explore the model effects with different numbers of training samples per class (K) from 16 to 512. We also use standard fine-tuning as the reference. As shown in Figure 4, each point refers to the averaged score across 5 randomly sampled datasets. We observe that UPT consistently achieves higher scores regardless of the number of training samples. In addition, the variance of UPT is lower than that of fine-tuning, meaning that the stability of our method is better. This is different from other prompt-based methods (Schick and Schütze, 2021a,b; Gao et al., 2021).
Model Scale Analysis. To further show that UPT can improve model performance regardless of scale, we use multiple small-scale BERT models as backbones. Due to space limitations, we only report the results over SST-2, MR, and CR in Table 2. To make a fair comparison, we also test the performance without the usage of dissimilar NLP datasets and show the relative improvements.
The results demonstrate that the model scale plays an important role in the model's generalization ability. We also find that UPT, which uses dissimilar datasets, can greatly improve the effectiveness, especially for small-scale PLMs. Therefore, our method is well suited for producing high-performing small PLMs for online applications.

Adaptation Efficiency of Task Groups. Since we focus on multi-task training before prompt-tuning over the target task in low-resourced settings, it is worth exploring which (and how many) groups of tasks have a better effect on the adaptation improvement. Specifically, when given a target task (e.g., MNLI), we only choose one group of tasks (e.g., MRPC and QQP of Group 3 (Paraphrase)) for multi-task prompt-tuning, and then fine-tune the model on the target task. As shown in Figure 5, the cell in the i-th row and j-th column denotes the relative improvement from single-task learning over the j-th task to the setting where the i-th group is added for multi-task prompt learning.
For visualization, we normalize the values of each column to show the percentage of influence of each group. The results show that the performance of a target task improves the most when we add data samples from other datasets within the same task group. However, in low-resourced scenarios, similar datasets are often not available. By using UPT, we can even transfer knowledge from dissimilar tasks to the target task. Specifically, taking NLI as the source group, we randomly choose M dataset(s) from the group as our source tasks and then prompt-tune the model on each target task. The results in Figure 6 demonstrate that the accuracy is further improved when we increase M. We also find that the improvements over MRPC and QQP are more obvious. We suggest that NLI is easier to adapt to paraphrase tasks because both model the relations between sentence pairs.
Related Work

Prompt-based Learning. Fine-tuning PLMs directly by learning the [CLS] head may perform poorly with few training samples (Liu et al., 2021a). Recently, the huge GPT-3 model (Brown et al., 2020) has been proposed to enable in-context learning, which introduces handcrafted prompts and demonstrations. Schick and Schütze (2021a) apply handcrafted prompts to prompt-based fine-tuning for BERT-style models. To facilitate automatic prompt generation, Gao et al. (2021) present LM-BFF to generate discrete templates (Raffel et al., 2020). Other works (Shin et al., 2020; Han et al., 2021b; Scao and Rush, 2021; Utama et al., 2021) mine prompts from the training corpus based on heuristic rules or semantic relations. However, these methods are time-consuming for mining optimized prompts for target tasks. A series of methods have been proposed to learn continuous/soft prompt embeddings, such as P-tuning (Liu et al., 2021c), P-tuning-V2 (Liu et al., 2021b), OptiPrompt (Zhong et al., 2021b), and Prefix-tuning (Li and Liang, 2021). Zhao and Schütze (2021) and Gu et al. (2021) focus on hybrid training with both discrete and continuous prompts. Hu et al. (2021) consider the automatic expansion of label words and present Knowledgeable Prompt-tuning (KPT) to utilize knowledge for the construction of verbalizers. Sun et al. (2021) and Wang et al. (2021b) prompt PLMs to make language inference in zero-shot learning. In addition, Wang et al. tune PLMs over multiple tasks; the resulting PLMs are then utilized to solve unseen tasks by zero-shot learning. These methods work well for large PLMs such as GPT-3 (Brown et al., 2020) and T5 (Raffel et al., 2020), but consume a large amount of computation resources. We further leverage data from non-target NLP tasks to make prompt-tuned PLMs better able to adapt to unseen NLP tasks.

Conclusion and Future Work
In this paper, we present the Unified Prompt Tuning (UPT) framework, which enables better few-shot text classification for BERT-style models by explicitly capturing prompting semantics from non-target datasets. We introduce a novel POV paradigm to unify the task format, and then extend the unified prompting to self-supervised learning with the knowledge-enhanced selective MLM task. Experiments show that UPT consistently outperforms state-of-the-art approaches for prompt-based fine-tuning over multiple text classification scenarios. As for future work, we seek to extend UPT to other tasks such as named entity recognition, text generation, and machine translation. In addition, we will explore continuous prompt-tuning for UPT.

Limitations
Our work focuses on prompt-based fine-tuning for text classification. It is also possible to extend our work to other NLP tasks (such as question answering, relation extraction, text generation, etc.), which will be addressed in future work.

Ethical Considerations
Our contribution in this work is fully methodological, namely a unified prompt tuning (UPT) framework to boost prompt-tuned PLMs. Hence, there are no direct negative social impacts of this contribution. However, as PLMs may have negative impacts, such as the existence of social and gender biases, the tuned models produced by UPT would unavoidably suffer from these issues. We suggest that users carefully deal with the potential risks before deploying the models online.

C Ablation Analysis of the KSMLM Task

1) When KSMLM is replaced with vanilla MLM, the results indicate that KSMLM is an irreplaceable task for improving the model's generalization power. 2) We also find that if we ignore the verbalizer construction, the results decrease to a large degree, and are lower than UPT w/o. KSMLM. This means that verbalizers are crucial for template-based prompt-tuning. 3) When OKR or options are removed, the results also decline, indicating the effectiveness of these techniques.

D Comparing POV with Other Paradigms
To compare the proposed POV paradigm with other paradigms, we perform experiments over the SST-2, MR, and CR tasks. The alternative paradigms are as follows:

• Multiple-choice. It uses a unified template to list all the candidate results. For example, an input can be "The Disney cartoons are very interesting for children to enrich their extracurricular life. A. great; B. terrible. It is [MASK].". This paradigm is closely in line with PPT (Gu et al., 2021).
• Yes/No. We can reformulate multi-class classification tasks into a series of binary classifications. Take NLI for example: we can design one template per class, i.e., "Are these descriptions entailment?", "Are these descriptions neutral?", and "Are these descriptions contradictory?". We follow Zhong et al. (2021a) to add an MLP layer on top of the PLM, which takes the output of the [MASK] token and classifies the answer as "Yes" or "No".
Experimental results in Table 8 show that, on average, POV outperforms all baselines. For Multiple-choice, we find that the results decline considerably. We conjecture that it is hard for the PLM to understand and generate item letters such as "A, B, C, D". In addition, we find that the Yes/No paradigm has a performance similar to POV. Overall, the experiments prove the effectiveness of POV, which is easy to implement and avoids the transformation into multiple binary classification tasks for tasks with multiple classes.
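The three serializations compared above can be sketched as plain string builders. The wording follows the examples given in this section; the function names and the exact punctuation are our illustrative assumptions.

```python
def multiple_choice_input(text, candidates):
    """Multiple-choice paradigm: enumerate candidates as lettered items."""
    items = " ".join(f"{chr(65 + i)}. {c};" for i, c in enumerate(candidates))
    return f"{text} {items} It is [MASK]."

def yes_no_inputs(text, class_names):
    """Yes/No paradigm: one binary query per class (one template each)."""
    return [f"{text} Are these descriptions {c}?" for c in class_names]

def pov_input(text, candidates):
    """POV paradigm: a single query listing all candidate label words."""
    return f"{text} Is it {' or '.join(candidates)}? It is [MASK]."
```

Note the structural difference the section discusses: Yes/No needs one query per class, while Multiple-choice and POV handle all classes in a single cloze query.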

E Additional Evaluation Results over Other Tasks
In this part, we further present additional experiments over other tasks from GLUE (Wang et al., 2019c) and SuperGLUE (Wang et al., 2019a), including AX-b, AX-g, BoolQ, CB, SST-5, TREC and Subj. The data statistics can be found in the original papers. We choose standard fine-tuning, PET (Schick and Schütze, 2021a) and LM-BFF (Gao et al., 2021) as our baselines for comparison.
In this experiment, we only conduct task-specific single-task learning to evaluate the effectiveness of the POV paradigm. We also set K = 16. As shown in Table 6, we can draw the following conclusions. 1) Our UPT framework outperforms strong baselines over these tasks. 2) SST-5 and TREC are challenging tasks with many labels, consisting of 5 and 6 classes, respectively. Experiments show that our POV paradigm achieves the best performance.

2.1 A Brief Overview of UPT

For clarity, we introduce some basic notations. Let D* be the N-way-K-shot training set of a target NLP task T*. The underlying PLM is parameterized by Θ. The basic goal of few-shot learning is to obtain a high-performing model for T* based on D*, with parameters initialized from Θ. As the size of D* is only N × K, the model performance would be highly limited. Here, we assume that there are M other NLP tasks that are dissimilar to T*, i.e., T^(1), …, T^(M), with their (usually non few-shot) training sets denoted as D^(1), …, D^(M), respectively. The UPT framework seeks to explore how to employ D^(1), …, D^(M) to enhance the performance of the PLM on a new task (such as T*) based on its own few-shot training set D*.
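The N-way-K-shot setup described above can be sketched as a small sampling routine; the `(text, label)` data layout is an assumption for illustration.

```python
import random
from collections import defaultdict

def sample_few_shot(dataset, k, seed=0):
    """Build the N-way-K-shot training set D*: K examples per class.
    `dataset` is a list of (text, label) pairs; N is the number of
    distinct labels observed in the data."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in dataset:
        by_label[label].append((text, label))
    shots = []
    for label in sorted(by_label):
        shots.extend(rng.sample(by_label[label], k))
    return shots
```

With K = 16 (the setting used in the experiments) and a 3-class task, D* would contain exactly 48 examples.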

Figure 2: An illustrated example of the POV generation process for the KSMLM task.

Figure 4: Results of sample efficiency analysis. We compare UPT with standard fine-tuning with different numbers of training samples K over two tasks.

Figure 5: Adaptation efficiency between task groups. The shade of color indicates the degree of adaptation.

Figure 6: Results with different numbers M of source datasets.

Table 1: Comparison between UPT and baselines over all testing sets in terms of accuracy (%) and standard deviation. "FT" and "PT" refer to the fine-tuning and prompt-based fine-tuning paradigms, respectively. The methods in bold refer to our approach and its variants. The scores of baselines are reproduced using their open-source code.

Table 2: Results of model scale analysis. We report the accuracy (%) of UPT based on BERT backbones of other scales, and relative improvements compared to the models w/o. prompt learning over dissimilar datasets.

Table 4: Dataset statistics. We only sample N × K instances from the original training sets to form the few-shot training and development sets. The testing sets used in the experiments are full datasets.

Table 6: Additional experiments comparing UPT and baselines over all testing sets in terms of accuracy (%) and standard deviation.

Table 7: Ablation analysis of the KSMLM task in terms of accuracy (%).

Table 8: Comparison between different paradigms in terms of accuracy (%).