Learning to Initialize: Can Meta Learning Improve Cross-task Generalization in Prompt Tuning?

Prompt tuning (PT), which only tunes the embeddings of an additional sequence of tokens per task while keeping the pre-trained language model (PLM) frozen, has shown remarkable performance in few-shot learning. Despite this, PT has been shown to rely heavily on a good initialization of the prompt embeddings. In this work, we study meta prompt tuning (MPT) to systematically explore how (and whether) meta learning can improve cross-task generalization in PT by learning to initialize the prompt embeddings from other relevant tasks. We empirically analyze a representative set of meta learning algorithms in a wide range of adaptation settings with different source/target task configurations on a large set of few-shot tasks. With extensive experiments and analysis, we demonstrate the effectiveness of MPT. We find the improvement to be particularly significant on classification tasks. For other kinds of tasks, such as question answering, we observe that while MPT can outperform PT in most cases, it does not always outperform multi-task learning. We further provide an in-depth analysis from the perspective of task similarity.


Introduction
Humans can easily learn to perform new tasks from only a few examples by leveraging previously acquired knowledge from other relevant tasks. Such capability is a hallmark of human intelligence (Carey and Bartlett, 1978). However, models often suffer from over-fitting when tasked to learn from a few labeled examples (Lake et al., 2017; Linzen, 2020), a problem commonly termed few-shot learning (FSL).
With the recent advancements in developing large-scale pre-trained language models (PLMs), prompt-based methods have shown promising results in FSL. Brown et al. (2020) show that, by virtue of in-context (meta) learning, a frozen GPT-3 model can achieve good results on a variety of few-shot tasks through manually designed prompts, which are task instructions along with a few examples expressed in natural language. However, the performance of in-context learning has been shown to be highly sensitive to the design of such "discrete" prompts (Zhao et al., 2021). It is also limited by the maximum sequence length supported by the PLMs (Li and Liang, 2021). Along this line, efforts have been made to automatically search for and optimize discrete prompts (Shin et al., 2020; Schick and Schütze, 2021; Gao et al., 2021).
As an alternative to discrete prompts, recent efforts attempt to learn "soft" prompts that add additional trainable parameters (Liu et al., 2021b; Li and Liang, 2021; Lester et al., 2021), showing better results than discrete prompts (Liu et al., 2021a). Lester et al. (2021) introduce prompt tuning (PT), which prepends a sequence of tunable tokens to the input and optimizes their embeddings while keeping the PLM frozen. Despite its strong few-shot performance, PT has been shown to be sensitive to the initialization of the embeddings, which might limit its practical application (Qin and Joty, 2022b). To address this, Gu et al. (2022) propose pre-trained prompt tuning (PPT) to pre-train soft prompts using self-supervised tasks on unlabeled data. It relies on carefully designed pre-training tasks tailored to the downstream tasks, and the pre-training objectives are only applicable to classification tasks. Vu et al. (2022) introduce soft prompt transfer (SPoT), which uses the soft prompts learned from a set of source tasks through multi-task learning to initialize the prompt for a target task. Both PPT and SPoT demonstrate cross-task generalization (Fig. 1): learning a new task can benefit from learning other related tasks (Ye et al., 2021).
In a recent survey, Lee et al. (2022) claim that meta learning (Schmidhuber, 1987) can play an important role in cross-task generalization in NLP. Different from multi-task learning, which considers the performance on the source tasks to learn the initial parameters, meta learning aims to find initial parameters suitable for adapting to a target few-shot task. Hence, it can outperform multi-task learning in several scenarios with full-model fine-tuning (Dou et al., 2019; Chen et al., 2020b). However, to our knowledge, there is no systematic study on the role of meta learning in PT. In a recent work, Huang et al. (2022) adopt MAML (Finn et al., 2017) for pre-training soft prompts. One major limitation of their study is that it covers only one type of meta learning algorithm and only sentiment classification tasks, lacking a comprehensive understanding of cross-task generalization. Min et al. (2022) and Chen et al. (2022) show the effectiveness of in-context learning for PLMs, whereas we mainly focus on optimization-based meta learning.
To systematically study meta prompt tuning (MPT) for cross-task generalization, we conduct experiments on a large collection of few-shot tasks involving different types of datasets with a unified text-to-text format (Ye et al., 2021). We investigate a wide range of adaptation settings with different source/target task types, which helps better understand the capabilities and limitations of meta learning in PT. With extensive experiments, we aim to address the following research questions:
• Q1. Can MPT improve cross-task generalization in PT? Is it better than multi-task learning?
• Q2. What happens with more labelled data for source/target tasks (beyond few-shot settings)?
• Q3. Does it help with more diverse source tasks?
• Q4. Is the performance gain of MPT consistent across different backbone models?
To answer these questions, we empirically analyze MAML (Finn et al., 2017), FoMAML and Reptile (Nichol et al., 2018), which constitute a representative set of meta learning methods. Experimental results show that MPT can indeed help cross-task generalization; e.g., MAML improves the performance of PT by more than 20% on classification tasks. However, we also notice that MPT does not always outperform multi-task learning, especially on non-classification tasks. We provide an in-depth analysis from the perspective of task similarity. As for Q2, we find that MPT does benefit cross-task generalization beyond few-shot settings. For Q3, we observe that increasing the diversity of source tasks does not necessarily improve cross-task generalization. Finally, the consistent gain of MPT across different models shows its robustness to model type and size. In summary, the two main contributions of this work are:
• To the best of our knowledge, we are the first to extensively explore how meta learning helps cross-task generalization in prompt tuning.

• With extensive experiments and analysis, we show the effectiveness and limitations of meta prompt tuning in various source/target settings.

Related Work
Few-shot Learning (FSL) FSL aims to learn a task with only a few labeled examples, which often leads to over-fitting. Existing methods to address this problem mainly focus on optimizing the hypothesis space of the few-shot tasks (Triantafillou et al., 2017; Finn et al., 2017; Hu et al., 2018) or augmenting the few-shot data (Gao et al., 2020; Qin and Joty, 2022a). Recently, large-scale pre-trained language models (PLMs) have demonstrated strong FSL ability through prompt-based methods, including both discrete (Brown et al., 2020; Ding et al., 2022) and soft prompts (Lester et al., 2021).
Prompt-based Learning (PL) PL is a new paradigm which prepends a task-specific template or prompt to the input for learning new tasks (Liu et al., 2021a). Initial PL methods mainly focus on designing, searching for, or optimizing discrete prompts (Brown et al., 2020; Shin et al., 2020; Gao et al., 2021). However, discrete prompts are hard to optimize. To address this, recent PL methods attempt to optimize prompts in a continuous space, i.e., learn soft prompts (Li and Liang, 2021; Liu et al., 2021b; Lester et al., 2021), showing impressive FSL performance (Qin and Joty, 2022b). In addition to prompt design, several recent studies have explored the applications (Zhu et al., 2022; Li et al., 2022; Qin et al., 2023; Zhao et al., 2023) and analysis (Zhong et al., 2021; Le Scao and Rush, 2021) of PL.
Meta Learning Meta learning, or learning to learn, has been applied to boost few-shot performance on various NLP tasks, e.g., relation extraction (Han et al., 2018) and machine translation (Gu et al., 2018). Meta learning algorithms can be divided into three main categories. First, black-box methods adopt additional meta learners to help adaptation (Santoro et al., 2016; Garnelo et al., 2018; Mishra et al., 2018; Brown et al., 2020). Second, non-parametric methods explore how to learn metrics that can compare the distances between different samples, i.e., learning to compare (Koch et al., 2015; Vinyals et al., 2016; Snell et al., 2017). Finally, optimization-based methods aim to learn better parameter initialization to effectively and efficiently adapt to unseen tasks, i.e., learning to initialize (Finn et al., 2017; Nichol et al., 2018; Kedia et al., 2021). Lee et al. (2022) claim that meta learning can be effective for cross-task generalization, especially the optimization-based methods. These can be applied to various problems in a model-agnostic way to improve FSL on target tasks with model fine-tuning (Ye et al., 2021).

Summary.
Existing work shows that meta learning can improve cross-task few-shot generalization with full-model fine-tuning. However, there is no systematic study on whether (and how) meta learning can do so with prompt tuning of PLMs. To fill this research gap, our work provides a comprehensive understanding of the effectiveness and limitations of meta learning in prompt tuning.

Preliminaries
In this section, we revisit the basics of prompt tuning and optimization-based meta learning.

Prompt Tuning
Following Lester et al. (2021), we reframe all tasks into a text-to-text format. Given a training dataset D_T = {(x_1, y_1), ..., (x_N, y_N)} for a task T, different from traditional model fine-tuning, prompt tuning (PT) is a parameter-efficient learning method which freezes the PLM Θ and prepends the input text x_i with a sequence of tunable soft tokens P, parameterized by prompt embeddings θ. The prompt embeddings θ are initialized from the vocabulary of the PLM and optimized through gradient descent with the following objective:

L(θ, D_T) = − Σ_{(x,y)∈D_T} log p(y | [P_θ; x]; Θ)    (1)
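To make the idea concrete, here is a minimal, runnable sketch of prompt tuning: a fixed linear map stands in for the frozen PLM Θ, a small vector stands in for the prompt embeddings θ, and a squared-error loss replaces the log-likelihood. All names, dimensions, and the toy model are our own illustrative assumptions, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Frozen PLM": a fixed linear map from [prompt; input] to an output.
d_prompt, d_in, d_out = 4, 3, 2
W = rng.normal(size=(d_out, d_prompt + d_in))  # frozen, never updated

def forward(prompt, x):
    return W @ np.concatenate([prompt, x])

def loss_and_grad(prompt, x, y):
    # Squared error; only the prompt receives gradients.
    pred = forward(prompt, x)
    err = pred - y
    grad = W[:, :d_prompt].T @ (2 * err)  # dL/dprompt
    return float(err @ err), grad

# One (x, y) "training example" and a randomly initialized prompt.
x = rng.normal(size=d_in)
y = rng.normal(size=d_out)
prompt = rng.normal(size=d_prompt)

losses = []
for _ in range(500):
    l, g = loss_and_grad(prompt, x, y)
    losses.append(l)
    prompt -= 0.05 * g  # tune the prompt only; W stays frozen

print(losses[0], losses[-1])
```

The key point the sketch mirrors is that the parameter count being optimized (here, 4 numbers) is tiny compared to the frozen model, which is exactly what makes PT sensitive to how those few parameters are initialized.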

Optimization-based Meta Learning
The main goal of optimization-based meta learning (or learning to initialize) is to learn initial parameters that can effectively and efficiently adapt to a new task T_new with limited data. We denote the initial parameters (meta-parameters) as θ*.
To obtain θ*, the model needs to learn from a series of meta-training tasks T_meta = {T_1, ..., T_M}. The dataset D_i of each task T_i is divided into two disjoint sets: a support set S_i and a query set Q_i. The objective for learning θ* is

θ* = argmin_θ Σ_{T_i ∈ T_meta} L(θ − α ∇_θ L(θ, S_i), Q_i)    (2)

where L is the objective function defined in Eq. (1), θ is the set of parameters to meta-learn, and α is the inner learning rate. Denoting the overall loss as L_{T_meta}(θ) = Σ_{T_i ∈ T_meta} L(θ'_i, Q_i), with θ'_i = θ − α ∇_θ L(θ, S_i) being the inner-updated value of θ, we use gradient descent to update θ further in the meta-training stage:

θ ← θ − β ∇_θ L_{T_meta}(θ)    (3)

where β is the outer learning rate. This is the Model-Agnostic Meta-Learning (MAML) algorithm (Finn et al., 2017). Notice that optimizing Eq. (3) requires calculating second-order gradients, which can be quite memory-consuming. To alleviate this, First-order MAML (FoMAML) and Reptile (Nichol et al., 2018) use first-order approximations, allowing lower memory costs.
After the meta-training stage, θ* serves as the initial parameters for learning an unseen meta-testing task T_new, which is usually few-shot.
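To illustrate how the three algorithms differ, the following sketch computes their meta-gradients on a one-dimensional quadratic task loss L(θ) = a(θ − c)², where the second-order MAML term can be written analytically. The quadratic task family and all constants are toy assumptions for illustration only; Reptile's meta-gradient is written in one common form, (θ − θ_k)/α after k inner steps.

```python
# Per-task toy loss: L(theta) = a * (theta - c)^2, so L'(theta) = 2a(theta - c)
# and L''(theta) = 2a, making the second-order MAML term exact.
def grad(a, c, theta):
    return 2 * a * (theta - c)

alpha = 0.1  # inner learning rate

def maml_meta_grad(a, c, theta):
    # Exact: d/dtheta L(theta - alpha L'(theta)) = L'(theta') * (1 - alpha * L''(theta))
    theta_prime = theta - alpha * grad(a, c, theta)
    return grad(a, c, theta_prime) * (1 - 2 * alpha * a)

def fomaml_meta_grad(a, c, theta):
    # First-order approximation: drop the (1 - alpha * L'') Jacobian factor.
    theta_prime = theta - alpha * grad(a, c, theta)
    return grad(a, c, theta_prime)

def reptile_meta_grad(a, c, theta, k=3):
    # Reptile: treat (theta - theta_k) / alpha after k inner steps as the meta-gradient.
    theta_k = theta
    for _ in range(k):
        theta_k -= alpha * grad(a, c, theta_k)
    return (theta - theta_k) / alpha

print(maml_meta_grad(1.0, 2.0, 0.0))    # exact second-order meta-gradient
print(fomaml_meta_grad(1.0, 2.0, 0.0))  # cheaper first-order approximation
print(reptile_meta_grad(1.0, 2.0, 0.0))
```

FoMAML and Reptile avoid ever differentiating through the inner update, which is what removes the second-order memory cost mentioned above.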

Approach
In this section, we first introduce the problem setting and evaluation metric. Then, we describe the key methods for meta prompt tuning (MPT).

Problem Setting
To evaluate cross-task generalization in prompt tuning, we select a large and diverse collection of few-shot tasks from Ye et al. (2021), covering various types including classification, question answering and generation. We partition the set of all tasks T_all into two disjoint parts: source tasks T_src and target tasks T_tgt. Details of the tasks and partitions are provided later in our experiment setup (§5).
Following Min et al. (2022), we divide the whole learning process into two stages (Fig. 1):
• Upstream learning on source tasks. In this stage, the model has access to T_src, which is regarded as the meta-training tasks T_meta in Eq. (2). We divide the dataset D_i of every source task T_i into training (or support) and validation (or query) sets, and conduct optimization-based meta learning or multi-task learning on these sets to obtain meta-parameters θ*. Note that we use both support and query sets for model training in multi-task learning to ensure fair data access for both methods.
• Downstream learning on target tasks. After the upstream learning stage, we use the learned meta-parameters θ* as the initial point for learning the target tasks T_tgt, where every target task is learned and evaluated independently. This two-stage learning paradigm naturally reflects cross-task generalization, where the model needs to learn an unseen task given previously acquired knowledge from other tasks.

Evaluation Metric
We evaluate the model performance on a set of target tasks T_tgt. As T_tgt may cover various task types, simply averaging the performance of different target tasks is unreasonable. Following Ye et al. (2021), we use average relative gain (ARG) as the main evaluation metric. We first calculate the relative gain (RG) for each target task, i.e., the relative performance improvement before and after applying the upstream (meta or multi-task) learning on the source tasks. Then we average the relative gains of all target tasks to obtain the final result, which indicates the overall performance improvement.
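In code, the metric is straightforward. The (after, before) scores below are made-up numbers purely for illustration:

```python
def relative_gain(score_after, score_before):
    # RG: relative improvement over the plain prompt-tuning baseline, in %.
    return 100.0 * (score_after - score_before) / score_before

def average_relative_gain(pairs):
    # ARG: mean of per-task relative gains, so tasks with different metric
    # scales (accuracy, F1, ROUGE, ...) contribute comparably.
    gains = [relative_gain(after, before) for after, before in pairs]
    return sum(gains) / len(gains)

# Hypothetical (after, before) scores for three target tasks.
print(average_relative_gain([(60.0, 50.0), (33.0, 30.0), (81.0, 75.0)]))
```

Normalizing each task by its own baseline is what lets ARG aggregate classification accuracy and, say, ROUGE scores into a single number.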

Meta Prompt Tuning (MPT)
As shown in Fig. 2, the key idea of MPT is to apply optimization-based meta-training as upstream learning to a set of source tasks in order to learn meta-parameters, which in this case are the prompt embeddings. The learned prompt embeddings serve as the initialization for learning unseen target tasks, referred to as meta-testing or downstream learning.

Meta-training
We meta-train the prompt embeddings on the source tasks T_src. Without loss of generality, we take MAML (Finn et al., 2017) as an example. In every iteration, we first sample one source task T_i, which has a support set S_i and a query set Q_i. Then we sample a support batch B_s^i from S_i and a query batch B_q^i from Q_i. Denoting the trainable prompt embeddings as θ, B_s^i and B_q^i are used for one gradient update with the following objective:

θ ← θ − β ∇_θ L(θ − α ∇_θ L(θ, B_s^i), B_q^i)

where L is the task loss defined in Eq. (1), and α and β are the inner and outer learning rates, respectively. During the meta-training stage, we iterate over the tasks in T_src to update the prompt embeddings θ for a fixed number of steps. The learned meta-parameters θ* are used in the meta-testing stage.
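The loop above can be sketched end-to-end on a toy task family where each task's loss is a shifted quadratic and the second-order term is available in closed form. The task distribution, learning rates, and the scalar "prompt embedding" are all illustrative assumptions, not the paper's actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, beta = 0.1, 0.05  # inner / outer learning rates

# Toy task family: L_i(theta) = (theta - c_i)^2 with task optima c_i ~ N(3, 0.1).
# For simplicity, the support and query batches of a task share the same c_i.
def sample_task():
    return rng.normal(3.0, 0.1)

def grad(c, theta):
    return 2 * (theta - c)

theta = 0.0  # meta-parameters (a scalar standing in for prompt embeddings)
for _ in range(500):
    c = sample_task()
    theta_prime = theta - alpha * grad(c, theta)        # inner step on support batch
    meta_grad = grad(c, theta_prime) * (1 - 2 * alpha)  # exact second-order MAML term
    theta -= beta * meta_grad                           # outer step on query batch

print(theta)  # ends up near the average task optimum of 3.0
```

Starting a new task from this meta-learned θ then needs only a step or two of inner adaptation, whereas starting from 0.0 would need many, which is the "learning to initialize" effect the section describes.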

Meta-testing
In meta-testing, the model is expected to learn unseen target tasks T_tgt. For each target task T_i, we use the learned meta-parameters θ* to initialize the prompt embeddings for the task. Denoting the training set of T_i as D_i^tr, the learning objective during meta-testing is defined as:

min_θ Σ_{(x,y)∈D_i^tr} −log p(y | [P_θ; x]; Θ)

where Θ is the frozen PLM, (x, y) ∼ D_i^tr is a training sample, and the prompt tokens P_θ are initialized from θ*.
We evaluate the checkpoint with the best validation performance on the test set and calculate the average relative gain over the test sets of T_tgt.

Experimental Setup
We first describe the source/target task partitions, then introduce the methods compared in our work. Finally, we present the implementation details.

Task Partitions
We experiment with ten different source/target task partitions, as shown in Table 1. Depending on the type of the target tasks, we can divide these ten settings into several groups:
• R→R (Random→Random): We first experiment with the R→R setting where both source and target tasks are randomly selected, meaning that they can cover any task type. This setting mimics the learning paradigm of humans and reflects whether cross-task generalization can help obtain a general-purpose few-shot learner.
• X→Cls (X=Cls, Both, Non-Cls): The target tasks involve classification, while the source tasks can be classification tasks, non-classification tasks, or both. This setting helps us better understand the influence of the source task distribution.
• X→Non-Cls (X=Cls, Both, Non-Cls): The only difference from the previous setting is the type of target tasks. We investigate how meta learning improves cross-task generalization when the target tasks are non-classification tasks.
• X→QA (X=QA, Non-QA): Compared to the previous group, this one is more fine-grained. We only select target tasks from question answering (QA) instead of all non-classification tasks, and experiment with different source task types, including QA and non-QA tasks.
• NP→P (Non-Paraphrase Cls→Paraphrase): This group has the finest granularity in our setting. We choose paraphrase identification, a sub-category of classification, as the target, and non-paraphrase classification as the source.
The final two groups help understand how meta learning performs in more fine-grained scenarios.
Note that we ensure that there is no overlap between the source and target tasks. Following Ye et al. (2021), we use 16 samples per class in the training (or support) and validation (or query) sets for classification tasks, and 32 samples per set for non-classification tasks. For every task, we sample the training and validation sets 5 times with different random seeds to reduce variance in few-shot evaluation and to cover more diverse samples in upstream learning. We provide full details of the tasks and partitions in Appendix A.1.

Methods Compared
We mainly use T5-Large (Raffel et al., 2019) as the backbone language model and compare the following methods in our work.
• Prompt Tuning (PT) on target tasks. This is our baseline without upstream learning. We directly apply PT (Lester et al., 2021) to target tasks and use its performance as the basis for computing the average relative gain of the other methods.
• MPT with MAML. We apply MAML (Finn et al., 2017) in the upstream learning (meta-training) stage. The learned meta-parameters are used to initialize the prompt embeddings for learning target tasks.
• MPT with FoMAML and Reptile. We also investigate two first-order meta learning algorithms: FoMAML (Finn et al., 2017) and Reptile (Nichol et al., 2018). Compared to MAML, they are more memory-efficient.
• Multi-task learning (MTL).We conduct multitask learning on source tasks instead of meta learning to obtain initial parameters.This is a straight-forward yet effective method as demonstrated by Vu et al. (2022).
• Fine-tuning on target tasks. Fine-tuning is the dominant paradigm where the whole language model is tuned for learning target tasks. We include it to verify whether cross-task generalization can help PT outperform fine-tuning.
In addition, we conduct experiments with different backbone models to verify MPT's robustness.
For downstream learning, we mainly follow the settings in Ye et al. (2021).
Since it is infeasible to search for optimal hyperparameters for each of the meta- and multi-task learning methods in each of the settings, we select them based on the R→R setting. We randomly select 5 tasks that are in neither the source nor the target sets as validation tasks for hyperparameter search. The hyperparameters with the best validation performance (ARG) are used for upstream learning. We select the inner learning rate, the outer learning rate and the total training steps for MAML, and adopt the same three hyperparameters for FoMAML and Reptile.

Results and Analysis
We now address the four research questions raised in §1 with empirical results.
Q1. Can meta prompt tuning improve cross-task generalization? Is it better than multi-task learning?
The ARG scores of different methods w.r.t. PT in various settings are shown in Table 2; more detailed results for every target task are in Appendix A.2.
• MPT can indeed help cross-task generalization. From the results in Table 2, we observe that MPT outperforms the baseline PT in most cases, with positive ARG scores. Out of 30 different runs for the three meta learning methods in ten different settings (see the first block of results), MPT achieves better performance than PT in 23 runs, demonstrating its effectiveness in cross-task generalization.
For the R→R setting, MAML achieves the best performance, showing that it is a good general-purpose few-shot learner. For adapting to classification tasks, MAML outperforms PT by 20.16% if the prompt embeddings are initialized from other classification tasks. The results in a more fine-grained setting (NP→P) also indicate the ability of MAML to learn classification tasks. While Reptile performs best (20.44%) in this setting, MAML still outperforms PT by a large margin (11.14%).
However, as shown in Table 2, MAML falls behind FoMAML when adapting to non-classification tasks. Among the three meta learning methods, FoMAML achieves the best performance (9.81%) on non-classification target tasks in the Both→Non-Cls setting, showing effective knowledge transfer. We observe similar results in the more fine-grained QA/Non-QA→QA settings, where FoMAML outperforms MAML and Reptile significantly. While Reptile has been claimed empirically to be better than MAML/FoMAML (Lee et al., 2022), it falls short of MAML/FoMAML in many cases. This might be because MAML and FoMAML are more similar to each other than to Reptile from a gradient perspective (Nichol et al., 2018), and since the hyperparameter search is done based on MAML (§5.3), the chosen hyperparameters may be suboptimal for Reptile.
In addition, we can see that meta learning helps PT outperform fine-tuning in several settings including Cls→Cls (MAML, FoMAML), Both→Cls (FoMAML) and NP→P (MAML, Reptile), which demonstrates the superiority of MPT.
• MPT does not always outperform multi-task learning (MTL).While meta learning is specifically designed for quickly adapting to unseen target tasks, it does not always outperform MTL in PT.
From Table 2, we can observe that MTL achieves better performance than MPT in many cases, especially on non-classification target tasks. We analyze the reasons as follows:
• Meta learning methods have been shown to be highly sensitive to hyperparameters (Antoniou et al., 2019), which we could not tune exhaustively due to memory/time constraints (see Appendix A.5 for a hyperparameter sensitivity analysis). As mentioned in §5.3, we select the hyperparameters of MAML using the R→R setting and then use the same hyperparameters for all meta learning methods in all settings, which might limit the performance of MPT.
• There might be less shared structure (or features) among non-classification tasks compared to classification tasks. The classification tasks mostly involve sentence-level classification, and in some cases the task labels correlate well (e.g., AG News and DBpedia), so they share some common semantics in both source and target tasks. In contrast, although QA can help summarization in content selection (Arumae and Liu, 2019), it is more difficult for MPT to capture transferable knowledge among non-classification tasks, as the success of meta learning ultimately depends on how much the tasks share (Finn, 2022).
To provide an in-depth analysis of the difference between classification and non-classification tasks, we examine it from the perspective of task similarity. We follow Lin et al. (2022), who show that the correlation between the input subspaces of two tasks (the norm of one subspace projected onto the other) can serve as a similarity score between them. We randomly pick 5 (cls, cls) task pairs as similar tasks. For dissimilar tasks, we randomly pick 5 (QA, summarization) task pairs. The average similarity score for similar task pairs is 0.768, while for dissimilar task pairs the score is only 0.306 (see Appendix A.6 for detailed results), which verifies that classification tasks share more structure than non-classification tasks.
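The subspace-correlation score can be sketched as follows. The synthetic low-rank "task features" and the exact normalization are our own assumptions, standing in for the real task representations used by Lin et al. (2022):

```python
import numpy as np

rng = np.random.default_rng(0)

def input_subspace(features, k=5):
    # Top-k left singular vectors of a (dim x n_samples) feature matrix,
    # taken as the task's input subspace.
    u, _, _ = np.linalg.svd(features, full_matrices=False)
    return u[:, :k]

def subspace_similarity(u1, u2):
    # Squared norm of u1 projected onto span(u2), normalized to [0, 1];
    # equals 1 for identical subspaces, ~k/dim for unrelated random ones.
    return np.linalg.norm(u2.T @ u1) ** 2 / u1.shape[1]

# Synthetic "task features": two tasks sharing a 5-dim subspace vs an unrelated one.
shared = np.linalg.qr(rng.normal(size=(64, 5)))[0]
task_a = shared @ rng.normal(size=(5, 200)) + 0.05 * rng.normal(size=(64, 200))
task_b = shared @ rng.normal(size=(5, 200)) + 0.05 * rng.normal(size=(64, 200))
task_c = rng.normal(size=(64, 200))  # unrelated task

sim_ab = subspace_similarity(input_subspace(task_a), input_subspace(task_b))
sim_ac = subspace_similarity(input_subspace(task_a), input_subspace(task_c))
print(sim_ab, sim_ac)  # similar tasks score much higher than dissimilar ones
```

This mirrors the qualitative gap reported above (0.768 for similar vs 0.306 for dissimilar pairs): tasks whose inputs occupy overlapping subspaces score near 1, unrelated ones near the chance level.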
Given the performance gap between MPT and MTL in some settings, we believe that exploring more advanced MPT methods could be a promising research direction.
Q2. What happens with more labelled data for source/target tasks (beyond few-shot settings)? As mentioned in §5.1, we mainly explore how MPT improves cross-task generalization when both the source and target tasks are few-shot, which corresponds to the way humans learn (Lake et al., 2017). We used 16 samples per class for classification tasks, and 32 samples per dataset for non-classification tasks. To validate whether more labelled data for source/target tasks can influence the performance of MPT, we conduct controlled experiments with {32, 64, 128, all} samples per class for source/target tasks in the Cls→Cls setting.
• Source. We report the results of MAML and MTL with more labelled data for the source tasks in Fig. 3. We observe that: (i) MPT outperforms PT (ARG = 0) and MTL in all cases, including when using the full dataset, showing its robustness to data size. (ii) Increasing the number of samples in the source tasks does not necessarily lead to better cross-task generalization for MPT; the best ARG is achieved with 16 shots rather than the full dataset, which justifies using few-shot source tasks.
(iii) The performance of MTL improves with more data for source tasks, showing a different learning pattern from MPT.
• Target. The results with more labelled data for the target tasks are reported in Table 3.

Q3. Does MPT help with more diverse source tasks? MPT aims to learn to initialize the prompt embeddings from source tasks, which may cover different types. We hypothesize that the diversity of the source tasks might influence its performance. To verify this, we analyze the influence of different source task selections on the same target tasks in two settings: varying the type and the number of tasks.
• Type of tasks. The results of learning from different types of source tasks are reported in Table 2. The performance of MPT on non-classification target tasks improves when using more diverse source tasks, e.g., from Non-Cls/Cls→Non-Cls to Both→Non-Cls. However, for adapting to classification tasks, the best ARG is achieved when all source tasks are classification tasks, i.e., the Cls→Cls setting. Hence, we conclude that increasing the type diversity of source tasks does not necessarily improve cross-task generalization, which is consistent with the finding of Ye et al. (2021).
• Number of tasks. To investigate the impact of the number of source tasks, we conduct controlled experiments with {12, 24} source tasks sampled from the original 45 source tasks in the Cls→Cls setting (see Appendix A.4 for the full list). From Table 4, we observe that the performance of MPT keeps improving as the number of source tasks increases, showing better cross-task generalization.
It is worthwhile to note that while our work provides some insights on the choice of source tasks, more systematic studies on how to select the most suitable source tasks for a given set of target tasks are needed. We hope that future analysis can provide a more comprehensive understanding of the relationship between source and target tasks.

Q4. Is the performance gain of MPT consistent across different backbone models? Our experiments and analysis so far use T5-Large as the backbone model. To verify whether the performance gain of MPT is consistent across different backbone models, we extend the experiments to T5-Base, T5-XLarge, BART-Large and GPT2-Large in the NP→P setting. From the results shown in Table 5, we can see that MPT still outperforms PT and MTL by a large margin when using other PLMs as the backbone model, showing its robustness to model size and type. In addition, the consistent gain of MPT with T5-XLarge also suggests the effectiveness of MPT for larger PLMs, which have been shown to perform better in prompt tuning (Lester et al., 2021).
While PT shows strong few-shot learning ability, fine-tuning (FT) remains the dominant paradigm. As shown in Table 2, FT outperforms PT when adapting to classification tasks even in few-shot settings, which might be because PT has only a few tunable parameters. Though MPT is based on PT, its performance gain over FT in all cases suggests that it can learn to initialize the prompt embeddings from source tasks, enabling effective knowledge transfer.
Case Study To take a closer look at the influence of different source task types on a particular target task, we further conduct a case study where we ensure that the task under consideration appears in multiple target task partitions. Results are shown in Table 6; for example, the first block indicates that Amazon_Polarity appears as a target task in both the R→R and Cls→Cls settings. We observe that there is no consistent conclusion on how the source tasks should be chosen for a specific target task.

Conclusion
In this paper, we have introduced meta prompt tuning (MPT), which learns to initialize the prompt embeddings for adapting to a target task. We have identified key research questions and systematically studied where and how meta learning can improve cross-task generalization in prompt tuning. We have empirically analyzed a representative set of meta learning methods in a variety of adaptation settings on a large, diverse collection of few-shot tasks. Extensive experimental results and analysis verify the effectiveness of MPT. Given these findings, in the future we would like to explore more advanced meta learning algorithms that can consistently outperform multi-task learning.

Limitations
Although comprehensive, our study of MPT in this work has a couple of limitations:
• As mentioned in §5.3, because it is infeasible to search for optimal hyperparameters for each of the meta learning methods in each of the ten settings, we use the R→R setting as our main representative setting. This could be one of the reasons for MPT underperforming MTL on some non-classification tasks (noted in §6-Q1).
• We mainly focus on how upstream meta learning can improve the performance on target tasks. However, meta learning also enables faster convergence. We leave how it could help reduce the convergence time of PT as future work.
Aside from that, meta prompt tuning (MPT) as a method is memory-intensive. Optimization-based meta learning methods, especially MAML, are memory-intensive, which limits the tuning of the inner batch size and inner update steps (§5.3). One potential solution is to build more memory-efficient meta learning libraries.

A Appendix
A.1 Task List

We report the full list of tasks used in the ten different settings in Table 9. All tasks are taken from CROSSFIT (Ye et al., 2021).

A.2 Relative gain of Every Target Task
We mainly report the average relative gain (ARG) in our experiments (§6). In this section, we show the detailed relative gain of each target task in Fig. 4 ∼ Fig. 13.

A.3 Absolute Scores for Every Target Task
We show detailed absolute scores for each target task in Fig. 14 ∼ Fig. 23.

A.4 Details of Sampled Tasks
We sample {12, 24} tasks from the original 45 source tasks in the Cls→Cls setting to investigate the influence of the number of source tasks.The details of sampled tasks are shown in Table 10.

A.6 Task Similarity Analysis
As discussed in §6, we use the correlation between the input subspaces of two tasks as the similarity score between them. Detailed results for the randomly picked similar and dissimilar task pairs are shown in Table 7.

A.7 Pilot Experiments on Prompt Transfer
We conduct pilot experiments to explore the transferability of soft prompts between different source tasks and a given single target task. We randomly pick 3 target tasks in the R→R setting and conduct prompt tuning on these tasks to obtain their corresponding prompt embeddings {θ_t^1, θ_t^2, θ_t^3}. We then conduct prompt tuning on 30 randomly selected source tasks to obtain the soft prompts {θ_s^1, ..., θ_s^30}. As shown in Lin et al. (2022), the correlation between the input subspaces of two tasks (the norm of one subspace projected onto the other) can serve as a similarity score between them, which may also indicate transferability. For each source/target task, we regard the soft prompt as the task embedding (Zhou et al., 2022) and obtain its subspace by Singular Value Decomposition (SVD), following Saha et al. (2021). We then calculate the correlation scores between a given target task and all source tasks following Lin et al. (2022).
Finally, for each target task, we apply MPT with 3 different sets of source tasks: (i) the 5 source tasks with the highest correlation scores, (ii) 5 randomly picked source tasks, and (iii) the 5 source tasks with the lowest correlation scores. The relative gain for every target task is shown in Table 8. We observe that using the 5 source tasks with the highest correlation scores achieves better performance than the other two settings, indicating that input subspaces can be used to measure the soft prompt transferability between different source tasks and a given single target task.
Note that the current experiments and analysis are for a single target task. Measuring transferability against the average performance over many target tasks needs more exploration.

Figure 1: Illustration of cross-task generalization, where the model is expected to learn an unseen target task given the knowledge acquired from previously learned source tasks.

Figure 2: Overview of Meta Prompt Tuning (MPT). In the meta-training stage, we conduct optimization-based meta learning on source tasks to obtain meta-parameters (i.e., soft prompts). The meta-parameters will then be used to initialize prompt embeddings for learning unseen target tasks in the meta-testing stage.


Table 1 :
Statistics of the ten distinct source/target task partitions. See Appendix A.1 for details about each partition.

Table 3 :
ARG (%) of different methods when more labelled data is used in target tasks.

Table 4 :
ARG (%) of MPT (MAML) when using different number of source tasks in the Cls→Cls setting.

Table 5 :
Average relative gain (ARG %) of all methods with different backbone models in the NP→P setting.'MTL' stands for 'multi-task learning'.
margin in all settings. (iii) MTL is unstable in terms of ARG scores; while it outperforms PT in the 64-shot (1.96%) and full-data (0.53%) settings, it falls behind PT in all other settings, indicating that MPT is a better choice when adapting to classification tasks.

Table 6 :
Relative gain in % for MPT and MTL when the same target task appears in different partitions.

Table 7 :
Similarity scores of randomly picked similar and dissimilar task pairs.

Table 8 :
Relative gain in % for MPT when using different sets of source tasks.

Table 11 :
References for all datasets.