TransPrompt: Towards an Automatic Transferable Prompting Framework for Few-shot Text Classification

Recent studies have shown that prompts improve the performance of large pre-trained language models for few-shot text classification. Yet, it is unclear how prompting knowledge can be transferred across similar NLP tasks for the purpose of mutual reinforcement. Based on continuous prompt embeddings, we propose TransPrompt, a transferable prompting framework for few-shot learning across similar tasks. In TransPrompt, we employ a multi-task meta-knowledge acquisition procedure to train a meta-learner that captures cross-task transferable knowledge. Two de-biasing techniques are further designed to make it more task-agnostic and unbiased towards any specific task. After that, the meta-learner can be adapted to target tasks with high accuracy. Extensive experiments show that TransPrompt outperforms strong single-task and cross-task baselines over multiple NLP tasks and datasets. We further show that the meta-learner can effectively improve performance on previously unseen tasks, and that TransPrompt also outperforms strong fine-tuning baselines when learning with full training sets.


Introduction
Fine-tuning Pre-trained Language Models (PLMs) has become the standard practice for training models for a majority of NLP tasks (Devlin et al., 2019; Liu et al., 2019b; Qiu et al., 2020). To ensure high accuracy, it is necessary to obtain a sufficient amount of training data for downstream tasks, which is often the bottleneck in low-resource scenarios.
The application of ultra-large PLMs such as GPT-3 (Brown et al., 2020) proves that such models can learn to solve a task with very few training samples. Inspired by these works, Gao et al. (2020) propose a prompt-based approach to fine-tune BERT-style PLMs in a few-shot learning setting, which adapts PLMs to produce specific tokens corresponding to each class, instead of learning a prediction head. The effectiveness of prompts has also been shown in Schick and Schütze (2020), Scao and Rush (2021), Schick and Schütze (2021) and others. However, designing high-performing prompts is challenging and requires a very large validation set. To alleviate this problem, Liu et al. (2021) propose continuous prompt embeddings with fully differentiable parameters, avoiding the cumbersome manual prompt engineering process. Despite the remarkable success, we notice that current prompt-based approaches have a few limitations. For few-shot learning, the performance on downstream tasks is still constrained by the number of training instances. It would be highly desirable if the model could acquire transferable knowledge from similar NLP tasks before it is adapted to specific tasks with few samples. However, it is unclear how the knowledge in prompt encoders and PLMs with prompting techniques can be transferred across tasks. A natural question arises: how can we design a prompting framework for BERT-style models that captures transferable knowledge across similar NLP tasks to improve the performance of few-shot learning?
A straightforward solution to the above question is to adopt multi-task fine-tuning across these similar NLP tasks. However, when the training data is scarce, the fine-tuned PLM easily over-fits to specific instances (Nakamura and Harada, 2019). In machine learning, the meta-learning paradigm is extensively studied, producing models that can be quickly adapted to a group of similar tasks with few learning steps (Wang et al., 2020c; Huisman et al., 2020). For PLMs, Wang et al. (2020a) discover that training a meta-learner is effective for capturing transferable knowledge across different domains. Yet, this method is not designed for prompt-based few-shot learning and lacks a mechanism to learn unbiased representations for all the tasks.

Figure 1: The high-level architecture of the TransPrompt framework. In the toy example, Task 1 and Task 2 are existing tasks, while Task 3 is a new task for the meta-learner to generalize to. (Best viewed in color.)
In this paper, we present TransPrompt, a prompting framework that allows PLMs to capture cross-task transferable knowledge for few-shot text classification, with the high-level architecture shown in Figure 1. TransPrompt first employs a Multi-task Meta-knowledge Acquisition (MMA) procedure to learn the transferable representations of prompt encoders and PLMs jointly across similar NLP tasks. To reduce over-fitting and make the underlying PLM more task-agnostic and less biased towards any specific task, we propose two de-biasing techniques, namely prototype-based de-biasing and entropy-based de-biasing. The learned model can be viewed as the meta-learner for a group of similar NLP tasks.
After MMA, TransPrompt takes the Task-aware Model Specification (TMS) step, which can be further divided into two cases. i) When the model is adapted to tasks already seen during MMA, a variation of P-tuning (Liu et al., 2021) can be applied for effective adaptation. ii) When it is required to fit a previously unseen task, a model generalization strategy is employed, specifically leveraging the universal prompting knowledge in the model. This is often the case when re-training the meta-learner across all the tasks is infeasible due to data privacy or computation efficiency issues. Note that our setting is slightly different from existing works on few-shot learning for PLMs (Gao et al., 2020; Liu et al., 2021) in that we focus on few-shot learning over a series of similar NLP tasks. When the tasks during TMS are the same as those in MMA, our approach can be viewed as a transfer learning algorithm that learns how knowledge can be transferred better. When TransPrompt is required to fit new tasks during TMS, it is placed in a meta-learning setting.

For evaluation, we test the TransPrompt framework on three sets of few-shot NLP tasks (including seven public datasets in total): i) sentiment analysis; ii) Natural Language Inference (NLI); and iii) paraphrase. Experimental results show that TransPrompt consistently outperforms both strong single-task and cross-task baselines.
We further show that i) the meta-learner trained by TransPrompt generalizes effectively to unseen tasks; and ii) TransPrompt also outperforms popular fine-tuning algorithms when learning with full training sets. In summary, we make the following major contributions in this work:

• We introduce the novel TransPrompt framework to learn cross-task transferable knowledge for few-shot text classification.
• A prompt-based meta-learner training algorithm with two de-biasing techniques is presented to capture transferable knowledge.
• Experiments on multiple types of NLP tasks show that TransPrompt consistently outperforms strong baselines for both few-shot learning and standard fine-tuning.

Related Work
We summarize the related work on PLMs, PLM prompting, transfer learning and meta-learning.

Pre-trained Language Models
With large-scale pre-training, PLMs have achieved significant improvements on various NLP tasks (Qiu et al., 2020). BERT (Devlin et al., 2019) learns contextual representations with transformer encoders. Other transformer encoder-based PLMs include Transformer-XL (Dai et al., 2019), XLNet (Yang et al., 2019), StructBERT (Wang et al., 2020b), Big Bird (Zaheer et al., 2020) and many others. Generative architectures are used in T5 (encoder-decoder; Raffel et al., 2020) and GPT-3 (decoder-only; Brown et al., 2020). As the neural architecture design of PLMs is not our major focus, we do not elaborate further.

Prompt Learning for PLMs
The huge GPT-3 model (Brown et al., 2020) enables few-shot learning without fine-tuning, but relies on handcrafted prompts. To facilitate automatic prompt construction, Gao et al. (2020) generate prompts from the T5 model (Raffel et al., 2020). Jiang et al. (2020) mine prompts from the training corpus. AutoPrompt (Shin et al., 2020) employs token-based gradient search to detect prompts. However, these approaches focus on discrete prompts only. P-tuning (Liu et al., 2021) is pioneering work on learning continuous prompt embeddings with differentiable parameters. Our work further extends P-tuning (Liu et al., 2021) by allowing PLMs to learn from similar tasks to improve few-shot text classification.

Transfer Learning and Meta-learning
Transfer learning aims to transfer knowledge or resources from source domains to target domains (Zhuang et al., 2021). For deep neural networks, it is common practice to learn similar tasks by multi-task learning (Liu et al., 2019a). With the popularity of PLMs, fine-tuning has become the standard practice for learning from PLMs for similar tasks (Arase and Tsujii, 2021). In contrast, meta-learning aims to learn models that can quickly adapt to different tasks with little training data available (Wang et al., 2020c; Huisman et al., 2020), typically formulated as an N-way K-shot problem. Meta-learning algorithms have been applied to few-shot NLP tasks, such as text classification (Geng et al., 2020), relation extraction (Gao et al., 2019), question answering (Hua et al., 2020) and knowledge base completion (Sheng et al., 2020). Similar to Wang et al. (2020a), Pan et al. (2021) and Wang et al. (2021), the proposed TransPrompt framework can be viewed as a combination of transfer learning and meta-learning, which learns transferable knowledge from similar tasks to improve the performance of few-shot text classification, either for existing tasks or new tasks.

The TransPrompt Framework
We formally present our task and the techniques of the proposed TransPrompt framework in detail.

Overview
We begin with a brief summary of our task. Let $T_1, \cdots, T_M$ be $M$ similar few-shot text classification tasks. The $m$-th task can be formulated as $T_m: x \rightarrow y$, where $x$ and $y \in \mathcal{Y}$ represent the input text and the classification label, respectively. $\mathcal{Y}$ is the pre-defined label set with $|\mathcal{Y}| = N$, where $N$ is a pre-defined constant. In our setting, we assume that there are $K$ training samples associated with each class $y \in \mathcal{Y}$ in each task $T_m$. Hence, we have a training set $D_m$ for each task $T_m$, each containing $N \times K$ samples. The total number of training instances across the $M$ tasks is $N \times K \times M$. In TransPrompt, we train a meta-learner $F_{meta}$ with parameters initialized from any PLM, based on the $M$ few-shot training sets $D_1, \cdots, D_M$. After that, $F_{meta}$ is adapted to each task $T_m$ based on its own training set $D_m$. The task-specific model is denoted as $F_m$. As $F_{meta}$ is designed to digest the transferable knowledge across tasks, rather than performing simple multi-task learning, $F_{meta}$ can also be adapted to previously unseen tasks. Due to data privacy or computation efficiency issues, the few-shot training set $\tilde{D}$ of a similar task $\tilde{T}$ may not be available during the training process of $F_{meta}$; in this case, we explore how TransPrompt can be used to generate an accurate model $\tilde{F}$ based on $F_{meta}$ and $\tilde{D}$, even though $F_{meta}$ has no knowledge of the new task $\tilde{T}$ when it is trained during MMA.
In the following, we introduce the detailed techniques of the TransPrompt framework, which consists of two major stages, i.e., Multi-task Meta-knowledge Acquisition (MMA) and Task-aware Model Specification (TMS). Finally, we discuss how to apply TransPrompt to standard fine-tuning scenarios where we have relatively large training sets, instead of solving the N-way K-shot problem.

Multi-task Meta-knowledge Acquisition
For clarity, we illustrate the general architecture of the meta-learner for MMA in Figure 2.

Figure 2: Architecture of the meta-learner for MMA (the multi-task meta-knowledge encoder with entropy-based de-biasing and regularization components, outputting class probabilities).

Prompt Encoding
As the TransPrompt framework is placed in the multi-task setting, for each task $T_m$, we have a task-specific prompt template $t^{(m)}(x)$ as follows:

$$t^{(m)}(x) = [P^{(m)}_1] \cdots [P^{(m)}_I] \; x \; [MASK],$$

where $[P^{(m)}_i]$ is a prompt pseudo token (as proposed in Liu et al. (2021)), $I$ is the total number of pseudo tokens, and $[MASK]$ is a special token as the placeholder for the model output. We also define a universal prompt template $t^{(*)}(x)$ shared by all the tasks:

$$t^{(*)}(x) = [P^{(*)}_1] \cdots [P^{(*)}_I] \; x \; [MASK].$$

For an instance $(x, y) \in D_m$, the prompt embedding $PE^{(m)}(x)$ is computed as:

$$PE^{(m)}(x) = \mathrm{AvgPool}\big(f^{(m)}(t^{(m)}(x)),\, f^{(*)}(t^{(*)}(x))\big),$$

where $f^{(m)}$ and $f^{(*)}$ are the task-specific and universal prompt encoders; following Liu et al. (2021), we use bidirectional LSTM networks with multi-layer perceptrons as prompt encoders. The average pooled results from both the task-specific and universal prompt encoders are treated as the prompt embedding. The prompt embedding $PE^{(m)}(x)$ forms a sequence that is fed into the PLM:

$$\big(PE^{(m)}(x),\, h_{[x]},\, h_{[MASK]}\big),$$

where $h_{[x]}$ is the sequence embedding of the input $x$, and $h_{[MASK]}$ is the masked output token embedding. As the prompt parameters are fully differentiable, during back-propagation they effectively capture both task-specific and universal knowledge.
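To make the prompt-encoding step concrete, the following PyTorch sketch shows one possible implementation of a BiLSTM-plus-MLP prompt encoder and the average pooling of task-specific and universal prompt embeddings. All module and parameter names (PromptEncoder, num_pseudo_tokens, hidden_size) are our own illustrative choices, not the authors' released code.

import torch
import torch.nn as nn

class PromptEncoder(nn.Module):
    """Encodes I trainable pseudo tokens with a BiLSTM + MLP (P-tuning style)."""
    def __init__(self, num_pseudo_tokens: int, hidden_size: int):
        super().__init__()
        # Trainable embeddings for the pseudo tokens [P_1] ... [P_I].
        self.pseudo_embeddings = nn.Parameter(torch.randn(num_pseudo_tokens, hidden_size))
        self.lstm = nn.LSTM(hidden_size, hidden_size // 2, num_layers=2,
                            bidirectional=True, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(hidden_size, hidden_size), nn.ReLU(),
                                 nn.Linear(hidden_size, hidden_size))

    def forward(self) -> torch.Tensor:
        out, _ = self.lstm(self.pseudo_embeddings.unsqueeze(0))  # (1, I, H)
        return self.mlp(out).squeeze(0)                          # (I, H)

def prompt_embedding(task_enc: PromptEncoder, uni_enc: PromptEncoder) -> torch.Tensor:
    """PE^(m)(x): average-pool the task-specific and universal encoder outputs."""
    return 0.5 * (task_enc() + uni_enc())

The resulting (I, H) tensor replaces the input embeddings at the pseudo-token positions before the sequence is fed into the PLM.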

Training the Meta-learner
A naive approach to obtaining the meta-learner is to apply the P-tuning process (Liu et al., 2021) across the $M$ tasks with the $M+1$ prompt encoders. In practice, however, this does not guarantee satisfactory results. As large PLMs can easily suffer from over-fitting during few-shot learning (Gao et al., 2020), in cross-task scenarios, the meta-learner would unfortunately memorize non-transferable knowledge from non-target tasks. To alleviate this problem, we propose two de-biasing techniques to obtain a more unbiased meta-learner encoded with transferable knowledge, namely i) prototype-based de-biasing and ii) entropy-based de-biasing.
Prototype-based De-biasing. This technique aims to give more importance to prototypical instances across tasks during the training process of the meta-learner. Here, we extend Snell et al. (2017) to construct a lightweight Multi-task Prototypical Network $G$. In the network $G$, the class centroid embedding $c_m(y)$ ($y \in \mathcal{Y}$) for each task $T_m$ is computed and stored as:

$$c_m(y) = \frac{1}{|D_{m,y}|} \sum_{x \in D_{m,y}} E(x),$$

where $D_{m,y}$ is the subset of $D_m$ such that each instance in $D_{m,y}$ has the label $y$, and $E(x)$ is the representation of $x$ generated by the meta-learner described previously. For each instance $(x, y) \in D_m$, we pass the text $x$ through the network to generate the cross-task prototype score, denoted as $s(x)$:

$$s(x) = \zeta \cdot \mathrm{sim}\big(E(x), c_m(y)\big) + \frac{1-\zeta}{M-1} \sum_{\tilde{m} \neq m} \mathrm{sim}\big(E(x), c_{\tilde{m}}(y)\big),$$

where $0 < \zeta < 1$ is a pre-defined balancing factor, and $\mathrm{sim}(\cdot, \cdot)$ is the similarity function between two embeddings. We can see that an instance receives a higher score if it is semantically related to the centroids from both the task $T_m$ itself and the other tasks, and hence is more transferable across tasks. By treating $s(x)$ as the optimization weight, the overall loss function $L(\Theta)$ of $F_{meta}$ is given by:

$$L(\Theta) = \sum_{m=1}^{M} \sum_{(x, y) \in D_m} s(x) \cdot l(x, y; \Theta) + \lambda_1 \|\Theta\|^2,$$

where $\Theta$ is the collection of all model parameters, $l(x, y; \Theta)$ is the sample-wise cross-entropy loss, and $\lambda_1$ is the regularization hyper-parameter.
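As a minimal sketch of the prototype-scoring computation, the snippet below computes class centroids and the cross-task score $s(x)$, assuming cosine similarity for $\mathrm{sim}(\cdot, \cdot)$ (the similarity function is left abstract above) and illustrative tensor layouts of our own choosing.

import torch
import torch.nn.functional as F

def class_centroids(embeddings: torch.Tensor, labels: torch.Tensor,
                    num_classes: int) -> torch.Tensor:
    """c_m(y): mean meta-learner representation E(x) over D_{m,y}, for each label y."""
    return torch.stack([embeddings[labels == y].mean(dim=0) for y in range(num_classes)])

def prototype_score(e_x: torch.Tensor, label: int, task_id: int,
                    centroids: torch.Tensor, zeta: float = 0.5) -> torch.Tensor:
    """s(x) for an instance of task `task_id`; `centroids` has shape (M, N, H)."""
    M = centroids.size(0)
    own = F.cosine_similarity(e_x, centroids[task_id, label], dim=0)
    others = torch.stack([F.cosine_similarity(e_x, centroids[m, label], dim=0)
                          for m in range(M) if m != task_id])
    # Weight similarity to the instance's own task centroid against the average
    # similarity to the same class's centroids in the other tasks.
    return zeta * own + (1 - zeta) * others.mean()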
Entropy-based De-biasing. One potential risk of applying the prototype-based de-biasing technique alone is obtaining a meta-learner that is not task-agnostic. Consider three tasks $T_1$, $T_2$ and $T_3$, where $T_1$ and $T_2$ are highly similar while $T_3$ is more dissimilar. Instances in $D_1$ and $D_2$ would naturally receive high prototype scores, making the meta-learner biased towards $T_1$ and $T_2$ while paying little attention to $T_3$. Hence, when the meta-learner is required to fit $T_3$, it may provide a poor parameter initialization. To make the meta-learner more task-agnostic, inspired by Jamal and Qi (2019), we consider the model prediction entropy $H(D_m)$ over $D_m$:

$$H(D_m) = -\sum_{x \in D_m} \sum_{\hat{y} \in \mathcal{Y}} \Pr(\hat{y} \mid x) \log \Pr(\hat{y} \mid x),$$

where $\Pr(\hat{y} \mid x)$ is the predicted probability of $x$ being assigned to the class $\hat{y} \in \mathcal{Y}$. When $H(D_m)$ is used as a part of the model regularizers, the meta-learner is less over-trained on any specific task.
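A sketch of the entropy term, computed from the model's predicted class distributions over a task's training set (the logits layout is our assumption):

import torch

def prediction_entropy(logits: torch.Tensor) -> torch.Tensor:
    """H(D_m): total Shannon entropy of the predicted class distributions,
    where `logits` has shape (|D_m|, N)."""
    probs = torch.softmax(logits, dim=-1)
    return -(probs * torch.log(probs + 1e-12)).sum()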
By plugging the term $H(D_m)$ into the loss function $L(\Theta)$, we obtain the new loss function $L'(\Theta)$:

$$L'(\Theta) = L(\Theta) - \lambda_2 \sum_{m=1}^{M} H(D_m),$$

where $\lambda_2$ is the regularization hyper-parameter.

Optimization Procedure. Despite its simple formula, minimizing $L'(\Theta)$ is a non-trivial problem. This is because calculating $s(x)$ requires the model parameters of the PLM, which are not available before the training process; on the other hand, optimizing $L'(\Theta)$ requires the values of $s(x)$ for all training samples. This poses a "chicken-and-egg" problem. We employ a dual optimization process to minimize $L'(\Theta)$. In the initial stage, all $s(x)$ values are uniformly initialized. Next, we fix the $s(x)$ values as constants and minimize $L'(\Theta)$ with respect to $\Theta$. An inference procedure on the PLM is then applied to re-compute all $s(x)$ values. This process iterates for a certain number of epochs. Readers can also refer to Algorithm 1 (shown in the experiments section) for an algorithmic overview.
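The dual optimization can be organized as in the following sketch, which mirrors Algorithm 1; the callables weighted_step and score_fn are hypothetical stand-ins for the de-biased gradient update and the prototype-score inference pass, respectively.

import torch

def train_meta_learner(datasets, weighted_step, score_fn, num_epochs: int):
    """Alternate between (i) gradient updates on L'(Θ) with s(x) frozen and
    (ii) refreshing every s(x) by inference with Θ frozen."""
    # Uniform initialization: s(x) = 1 for all training instances.
    scores = [[1.0] * len(D_m) for D_m in datasets]
    for _ in range(num_epochs):
        # (i) One epoch of weighted updates; scores are treated as constants.
        for m, D_m in enumerate(datasets):
            for i, (x, y) in enumerate(D_m):
                weighted_step(x, y, weight=scores[m][i])
        # (ii) Inference pass: re-compute s(x) under the updated parameters.
        with torch.no_grad():
            for m, D_m in enumerate(datasets):
                for i, (x, y) in enumerate(D_m):
                    scores[m][i] = float(score_fn(x, y, m))
    return scores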

Task-aware Model Specification
After MMA, the meta-learner can be adapted to specific tasks with ease. For a task $T_m$ that has already been "seen" by the meta-learner, we fine-tune the corresponding prompt encoder and the PLM by minimizing the loss function $L^{(m)}(\Theta)$:

$$L^{(m)}(\Theta) = \sum_{(x, y) \in D_m} l(x, y; \Theta) + \lambda_1 \|\Theta\|^2,$$

which is a variant of P-tuning (Liu et al., 2021) with better parameter initialization.
For a previously unseen task $\tilde{T}$, the model generalization strategy is employed. Here, we use the universal prompt encoder to initialize its prompt encoder. The entire model is then trained over the dataset $\tilde{D}$, with the loss function $\tilde{L}(\Theta)$ as follows:

$$\tilde{L}(\Theta) = \sum_{(x, y) \in \tilde{D}} l(x, y; \Theta) + \lambda_1 \|\Theta\|^2.$$

As the meta-learner is highly generalizable, it provides a good initialization for the few-shot learning task $\tilde{T}$.
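For the unseen-task case, the warm start can be as simple as copying the universal encoder's weights, as in this sketch (universal_encoder is assumed to be a trained instance of the PromptEncoder sketched earlier):

import copy

# Initialize the new task's prompt encoder from the trained universal encoder,
# then fine-tune it together with the PLM on the few-shot set D-tilde.
new_task_encoder = copy.deepcopy(universal_encoder)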

Learning with Full Training Sets
TransPrompt can also be applied to standard fine-tuning when we have relatively large training sets, with a few modifications. During MMA, we notice that when the problem is not N-way K-shot, the sizes of $D_1, \cdots, D_M$ can be significantly different. Optimizing $L'(\Theta)$ directly on these datasets would make the meta-learner biased towards large datasets. To address this problem, when we sample a batch from $D_1, \cdots, D_M$, instead of random selection, we employ stratified sampling where training instances are selected with probability proportional to the smoothed dataset distribution $\Pr(D_m)$:

$$\Pr(D_m) = \frac{|D_m|^{\gamma}}{\sum_{\tilde{m}=1}^{M} |D_{\tilde{m}}|^{\gamma}},$$

where $\gamma > 0$ is a smoothing factor. This results in the over-sampling of small datasets and the under-sampling of large datasets.
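A sketch of the smoothed sampling distribution, assuming the power-smoothing form given above; with a small $\gamma$ (e.g. the default 0.001 used in our experiments), the probabilities become near-uniform across datasets regardless of their sizes:

import torch

def dataset_sampling_probs(sizes: list, gamma: float = 0.001) -> torch.Tensor:
    """Pr(D_m) proportional to |D_m|^gamma: a small gamma flattens the
    distribution, over-sampling small datasets and under-sampling large ones."""
    weights = torch.tensor(sizes, dtype=torch.float) ** gamma
    return weights / weights.sum()

# Example: wildly different sizes yield near-uniform sampling for small gamma.
print(dataset_sampling_probs([3700, 67000, 364000]))  # ~ [0.333, 0.333, 0.334]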

Experiments
In this section, we conduct extensive experiments to evaluate the TransPrompt framework and compare it against strong baselines.

Datasets and Experimental Settings
Following Gao et al. (2020), we select seven public datasets to evaluate TransPrompt, divided into three sets of NLP tasks: sentiment analysis (SST-2 (Socher et al., 2013), MR (Pang and Lee, 2005) and CR (Hu and Liu, 2004)), NLI (MNLI (Williams et al., 2018) and SNLI (Bowman et al., 2015)) and paraphrase (MRPC (Dolan and Brockett, 2005) and QQP). The statistics of these datasets are reported in Table 1. The training/development/testing splits are the same as in Gao et al. (2020). For few-shot learning, the evaluation protocols are also the same as in Gao et al. (2020). The underlying PLM is the RoBERTa large model (with 355M parameters) (Liu et al., 2019b) and we set $K = 16$. We measure the average performance in terms of accuracy across 5 different randomly sampled training and development splits. Refer to Gao et al. (2020) for more experimental settings. We employ standard BERT fine-tuning (Devlin et al., 2019), the LM-BFF prompting model (Gao et al., 2020) (with both manually-compiled and automatically-mined prompts) and P-tuning (Liu et al., 2021) (which produces state-of-the-art performance for PLM-based few-shot learning) as single-task baselines. Because we focus on learning knowledge across tasks, we also use the multi-task versions of BERT fine-tuning, LM-BFF (Gao et al., 2020) and P-tuning (Liu et al., 2021), together with Meta Fine-tuning (Wang et al., 2020a), as cross-task baselines. Specifically, we employ separate prompts (either discrete prompts or continuous prompt embeddings) for different tasks in the multi-task versions of LM-BFF and P-tuning. As we consider three sets of NLP tasks, we constrain knowledge transfer to occur within the same set of NLP tasks (for example, the cross-task models for sentiment analysis are jointly trained over the training sets of SST-2, MR and CR). Besides, we are interested in how TransPrompt can be applied when learning with full training sets. We follow the base-scale experimental settings in Liu et al. (2021), with the RoBERTa base model (with 109M parameters) as the underlying PLM.

Algorithm 1: Meta-learner Training
1: for each instance $(x, y) \in \bigcup_{m=1}^{M} D_m$ do
2:   Uniformly set $s(x) = 1$;
3: end for
4: while the number of training epochs does not reach the limit do
5:   while the current training epoch is not finished do
6:     Sample a batch $B = \{(x, y)\}$ from $\bigcup_{m=1}^{M} D_m$;
7:     Use $B$ to update $F_{meta}$ by minimizing $L'(\Theta)$;
8:   end while
9:   for each instance $(x, y) \in \bigcup_{m=1}^{M} D_m$ do
10:    Compute $s(x)$ based on the updated model;
11:  end for
12: end while
13: return the meta-learner $F_{meta}$ (i.e., the parameters of the PLM and the $M+1$ prompt encoders).
For fair comparison, we reproduce all baselines based on their open-source code under the same settings. Our own TransPrompt algorithm is implemented in PyTorch and run on NVIDIA V100 GPUs. By default, we set $\zeta = 0.5$, $\gamma = 0.001$ and $\lambda_2 = 0.01$. The parameter regularizers are the same as in Liu et al. (2021). The model is trained with the Adam optimizer (Kingma and Ba, 2015) and a batch size of 16. The model architecture of the prompt encoders is the same as in Liu et al. (2021); therefore, the number of additional parameters introduced by TransPrompt remains minimal. We further tune the learning rates and epochs, with results reported in the following experiments.

General Experimental Results
The results of TransPrompt and all baselines on all seven testing sets for few-shot learning are shown in Table 2. From the experimental results, we draw the following conclusions. i) Prompting baselines (such as LM-BFF (Gao et al., 2020) and P-tuning (Liu et al., 2021)) outperform standard fine-tuning by a large margin. This shows that prompts are useful for few-shot learning. Based on our reproduction results, LM-BFF and P-tuning have similar performance, and automatically-mined prompts are slightly better than manually-compiled prompts for LM-BFF. ii) As for cross-task baselines, the multi-task version of P-tuning is more effective than that of LM-BFF, which shows that continuous prompt embeddings are more suitable for multi-task learning than discrete prompts. iii) The performance gains of TransPrompt over all three sets of tasks and seven datasets are consistent. Overall, the average improvement is around 3% in terms of accuracy, compared to the strongest baseline (i.e., the multi-task version of P-tuning). We also conduct paired t-tests over the results produced on all tasks. The results show that the improvement of TransPrompt is statistically significant (with p-value $p < 0.01$).

Table 2: The few-shot testing results of TransPrompt and baselines in terms of accuracy (%). "man", "auto" and "mtl" refer to manually-compiled prompts, automatically-mined prompts and multi-task learning, respectively. * refers to the multi-task variants of the original approaches. Hereinafter the same.

Detailed Model Analysis
In the following, we study how TransPrompt improves performance in various aspects.

Ablation Study. In the TransPrompt framework, we propose two de-biasing techniques to improve the effectiveness of the meta-learner, i.e., prototype-based and entropy-based de-biasing. Here, we remove each technique individually and then both together, implementing three variants of TransPrompt. The results of the ablation study are shown in Table 3. As seen, both de-biasing techniques prove effective for TransPrompt. In particular, prototype-based de-biasing plays a slightly more important role than entropy-based de-biasing in 6 out of 7 tasks. We conclude that de-biasing the meta-learner is crucial for acquiring cross-task knowledge.

Parameter Tuning. We further tune the number of learning epochs and the learning rate during the training process of TransPrompt, and report the performance over the development sets. Due to space limitations, we illustrate the results over SST-2, MR and CR, shown in Figure 3. We first fix the learning rate to 1e-5 and tune the number of learning epochs. Figure 3(a) shows that the performance of the meta-learner becomes stable after 20 epochs (tested on the combination of the three development sets of SST-2, MR and CR). Figure 3(b) gives the results of the three tasks during TMS. We then fix the number of learning epochs to 20 and tune the learning rate from 1e-5 to 5e-4. As seen in Figure 3(c), the learning rate should be set in the range of 1e-5 to 5e-5.

Model Generalization to New Tasks
One advantage of TransPrompt is that it can train a meta-learner with cross-task transferable knowledge encoded. In this set of experiments, we consider three sentiment analysis tasks: SST-2, MR and CR. Each time, we train the meta-learner over two of the three datasets (the MMA step), and then generalize the model to the remaining task (the TMS step). For example, we train the meta-learner over the few-shot SST-2 and MR datasets and then take the TMS step over the few-shot CR dataset. Here, the meta-learner has no knowledge of CR before the TMS step. We test whether using the meta-learner is better than simply applying LM-BFF (Gao et al., 2020) or P-tuning (Liu et al., 2021) initialized from PLMs. Table 4 clearly shows that the meta-learner brings improvements in all three cases, and hence generalizes to new tasks accurately.

Learning with Full Datasets
Apart from few-shot learning, we also investigate how TransPrompt performs when the full training sets are available, compared to other approaches. The results are presented in Table 5. On average, TransPrompt outperforms all single-task baselines by around 1% to 5% in terms of accuracy. This shows that our proposed paradigm can also help in non-few-shot learning scenarios by learning from a group of similar NLP tasks.
Another interesting finding is that when it comes to multi-task learning, the performance of LM-BFF (Gao et al., 2020) and P-tuning (Liu et al., 2021) drops, compared to the single-task setting. A likely cause is that with a large amount of training data from other tasks, existing prompt-based approaches may capture non-transferable knowledge that is harmful to the target task. In contrast, the two-step paradigm of TransPrompt learns different types of knowledge at different steps (i.e., the universal knowledge in MMA, and the task-specific knowledge in TMS), and hence produces better results. Overall, TransPrompt is competitive in standard fine-tuning scenarios when datasets from similar NLP tasks are available.

Case Studies
For a more intuitive understanding of which instances are more transferable across tasks, Table 6 presents several review texts from SST-2, MR and CR with high and low prototype scores. Although these texts come from different tasks, our TransPrompt algorithm is able to find texts that express general polarities instead of overly specific points. For instance, "time waster", "remarkable" and "5 stars" are strong indicators of polarity, and the texts containing them receive high scores from TransPrompt. In contrast, review texts with low scores are overly specific and hence less transferable across tasks. This suggests that our meta-learner truly captures transferable knowledge for effective knowledge transfer.

Table 6: Cases of review texts in SST-2, MR and CR with high and low cross-task prototype scores.

Conclusion and Future Work
In this paper, we present the TransPrompt framework for few-shot learning across similar NLP tasks based on continuous prompt embeddings. Experimental results show that TransPrompt consistently outperforms strong baselines in both few-shot learning and standard fine-tuning settings. Additionally, we find that the meta-learner trained by TransPrompt can be easily adapted to previously unseen tasks. In the future, we will explore how TransPrompt can be applied to other PLMs apart from BERT-style models and to other NLP tasks.