Self-supervised Meta-Prompt Learning with Meta-Gradient Regularization for Few-shot Generalization

Prompt tuning is a parameter-efficient method, which learns soft prompts and conditions frozen language models to perform specific downstream tasks. Though effective, prompt tuning under few-shot settings on the one hand heavily relies on a good initialization of soft prompts. On the other hand, it can easily overfit to few-shot training samples, thereby undermining generalizability. Existing works leverage pre-training or supervised meta-learning to initialize soft prompts but they fail to data-efficiently generalize to unseen downstream tasks. To address the above problems, this paper proposes a novel Self-sUpervised meta-Prompt learning framework with MEta-gradient Regularization for few-shot generalization (SUPMER). SUPMER leverages self-supervised meta-learning with a diverse set of well-designed meta-training tasks to learn a universal prompt initialization for efficient adaptation using only unlabeled data. Additionally, it jointly meta-learns a gradient regularization function to transform raw gradients into a domain-generalizable direction, thus alleviating the problem of overfitting. Extensive experiments show that SUPMER achieves better performance for different few-shot downstream tasks, and also exhibits a stronger domain generalization ability. The code for SUPMER will be available at https://github.com/beepkh/SUPMER.


Introduction
Recent NLP accomplishments witnessed the rapid development of pre-trained language models (PLMs) (e.g., BERT, Devlin et al., 2019; T5, Raffel et al., 2020; GPT-3, Brown et al., 2020). Fine-tuning, which tunes the entire set of PLM parameters, has achieved outstanding performance in various NLP tasks. However, as the scale of pre-trained models increases, tuning the entire set of parameters can become unaffordable. More recently, prompt-based methods, which simply insert a piece of carefully designed text into the input (e.g., "It was ⟨X⟩.") and predict target words (e.g., "great" or "terrible") at the mask position with frozen PLMs, have demonstrated remarkable effectiveness. But it has been observed that the performance of prompt-based methods is greatly affected by the design of prompts. In light of this, prompt tuning (PT; Lester et al., 2021), a parameter-efficient tuning method, is proposed to prepend only some additional learnable tokens called soft prompts to the input text, with all PLM parameters frozen. Though prompt tuning is an efficient and effective paradigm, Gu et al. (2022) show that it performs much worse than fine-tuning under few-shot settings. We argue that the performance is unsatisfactory mainly due to two limitations: 1) The performance of PT is highly sensitive to the soft prompt initialization, especially for few-shot tasks. As shown in Figure 1 (a), different soft prompt initializations lead to significant performance variations.
2) Few-shot PT risks overfitting to some spurious correlations as soft prompts are tuned on limited training samples, thus undermining the generalizability of PLMs. As shown in Figure 1 (b), the performance of few-shot vanilla PT degrades significantly in the final training steps. Some recent efforts address the first limitation, leveraging pre-training or supervised meta-learning for soft prompt initialization. Pre-trained prompt tuning (PPT) (Gu et al., 2022) utilizes self-supervised tasks to pre-train soft prompts and then applies them in the few-shot scenario. However, without explicitly optimizing the fast adaptation ability of the model, PPT suffers from a train-test mismatch between the pre-training data and the downstream data. So it limits generalization to unseen few-shot tasks, especially when there is a significant disparity in task domains or formats. MetaPrompting (Hou et al., 2022), as another effort, seeks assistance from model-agnostic meta-learning (MAML; Finn et al., 2017) for fast adaptation in few-shot settings. However, in each task, MetaPrompting requires plenty of labeled data within certain classes to perform supervised meta-learning for prompt initialization, which is often inaccessible in practical few-shot scenarios. And the learned initialization can only generalize to the remaining classes of the same task in a few-shot manner, exhibiting weak task transferability. Furthermore, all these existing works ignore the second limitation, i.e., the propensity for few-shot prompt tuning to lead to overfitting.
To address the shortcomings of existing works, we propose SUPMER, a Self-sUpervised meta-Prompt learning framework with MEta-gradient Regularization. It leverages self-supervised meta-learning to learn a universal and efficient soft prompt initialization, together with a meta-gradient regularization function that mitigates overfitting. This comprehensive process only requires a one-time execution and enables seamless adaptation to different downstream few-shot tasks, while also facilitating faster convergence for downstream prompt tuning.
Specifically, to address the first limitation, we design a novel self-supervised meta-learning method for prompt initialization, which automatically generates a diverse set of meta-training tasks from large-scale unlabeled corpora and explicitly learns to fast adapt across these tasks. To ensure task diversity, we initially design a collection of anchor self-supervised meta-training tasks with different formats. And then a curriculum-based task augmentation method is further proposed to enrich the task distribution dynamically in terms of the current model capability.
For the second issue, we integrate a meta-gradient regularization function into meta-prompt learning. As we simulate distribution shift through task augmentation, the meta-gradient regularization parameters are jointly optimized to align gradient directions across different distributions within our proposed meta-prompt learning paradigm. Consequently, in downstream tasks, these optimized parameters can be directly utilized to transform raw gradients over few-shot samples into a domain-generalizable direction, preventing prompt tuning from overfitting to some domain-specific correlations.
Overall, our contributions are mainly three-fold: (1) We propose a novel self-supervised meta-prompt learning framework to better initialize soft prompts, where only unlabeled pre-training data are used to construct different meta-training tasks, with curriculum-based task augmentation for further task enrichment.
(2) We incorporate a novel meta-gradient regularization function into our meta-prompt learning framework, which meta-learns to transform the raw gradient during few-shot learning into a domain-generalizable direction, thus preventing prompt tuning from overfitting to domain-specific correlations.
(3) Comprehensive experiments on few-shot learning and domain generalization validate the superiority of our method, which even outperforms full-model tuning in few-shot learning. It also exhibits a stronger domain generalization ability.

Related Work
Soft Prompt Tuning. Soft prompt tuning is one of the most parameter-efficient tuning methods widely used in NLP (Liu et al., 2023) and vision-language tasks (Zhou et al., 2022; Li et al., 2023a), which only tunes a small number of (extra) parameters to attain strong performance. Specifically, it freezes the PLM parameters and prepends some trainable continuous embeddings (i.e., soft prompts) to the input sequence (Lester et al., 2021) or to every layer of the pre-trained model (Li and Liang, 2021; Liu et al., 2022).
To efficiently train task-adaptive soft prompts in few-shot scenarios, some studies (Vu et al., 2022; Asai et al., 2022; Sun et al., 2022) employ task adaptation techniques, obtaining source prompts from source tasks in a supervised way and interpolating them into the target prompts. Other works focus on training improved prompt initializations. PPT (Gu et al., 2022) pre-trains the soft prompts with some self-supervised tasks on unlabeled corpora, but it doesn't explicitly optimize the fast adaptation ability of the model. MetaPrompting (Hou et al., 2022) utilizes supervised meta-learning for soft prompt initialization, splitting each dataset into two sets with disjoint data classes. One split is used to initialize soft prompts while the other serves as the downstream task. In comparison, SUPMER differs from MetaPrompting in the following ways: 1) for each downstream task, MetaPrompting focuses on a fixed supervised dataset to reinitialize soft prompts, whereas SUPMER can universally generalize to different unseen tasks, using large-scale unlabeled corpora for initialization; 2) MetaPrompting doesn't freeze PLM parameters, while SUPMER only tunes the soft prompts, as general soft prompt tuning methods do.
Meta-Learning. Meta-learning, also known as learning to learn, optimizes the ability to learn new tasks quickly and efficiently, utilizing experience from previously seen tasks. It can be classified into three types: metric-based methods (Koch et al., 2015; Vinyals et al., 2016; Snell et al., 2017), model-based methods (Graves et al., 2014; Mishra et al., 2018; Qiao et al., 2018), and gradient-based methods (Hochreiter et al., 2001; Ravi and Larochelle, 2017; Nichol et al., 2018; Li et al., 2020). In this work, we focus on a gradient-based meta-learning algorithm (i.e., MAML; Finn et al., 2017). Compared to typical meta-learning methods that rely on human-annotated meta-training tasks, we automatically generate abundant tasks in a self-supervised way, also integrating a meta-gradient regularization function into MAML to steer gradients towards a domain-generalizable direction.

Method
In this section, we describe the whole framework of SUPMER (shown in Figure 2). With pre-defined preliminaries, we first introduce the way to construct anchor self-supervised meta tasks and the foundation of task augmentation to densify task distributions. Then we elaborate on the SUPMER model, including the meta-gradient regularization function. Finally, we upgrade the original task augmentation method into a curriculum-based one. Besides, we formalize all tasks in a text-to-text format following the T5 fashion (Raffel et al., 2020).

Preliminaries
Prompt Tuning. In prompt tuning (Lester et al., 2021), given a training sample $(x_i, y_i)$ from task $\mathcal{D}_\tau$, we apply a prompt template $P$ converting $x_i$ into a new sequence $P(x_i)$ and then concatenate a set of soft prompts $\theta$ to the beginning of $P(x_i)$. A verbalizer $\mathcal{V}$ maps $y_i$ to corresponding label tokens $\mathcal{V}(y_i)$ in the vocabulary of PLMs. So the objective of prompt tuning can be formulated as follows:

$$\max_{\theta} \sum_i \log p\big(\langle X \rangle = \mathcal{V}(y_i) \mid [\theta; P(x_i)]\big), \tag{1}$$

where $\theta$ denotes the soft prompt embedding (the only tunable parameters in prompt tuning), $\langle X \rangle$ lets PLMs predict target tokens at the masked positions, and $[\cdot\,; \cdot]$ is the concatenation operation.
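The setup described above can be sketched with a toy stand-in for the PLM. Everything here (the names `embed_table`, `out_head`, `forward`, the pooled scoring) is an illustrative assumption, not the actual T5 architecture; the point is only that the soft prompts are prepended to the embedded template and are the sole trainable parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_prompt, seq_len, vocab = 16, 4, 10, 32

# Frozen "PLM" pieces (stand-ins): an embedding table and an output head.
embed_table = rng.normal(size=(vocab, d_model))
out_head = rng.normal(size=(d_model, vocab))

# Soft prompts theta: the ONLY tunable parameters in prompt tuning.
theta = rng.normal(scale=0.1, size=(n_prompt, d_model))

def forward(theta, token_ids):
    """Prepend soft prompts to the embedded template P(x_i) and
    return log-probabilities over the vocabulary at the mask."""
    x = embed_table[token_ids]               # (seq_len, d_model)
    h = np.concatenate([theta, x], axis=0)   # [theta; P(x_i)]
    logits = h.mean(axis=0) @ out_head       # toy pooled "PLM" head
    logits -= logits.max()                   # stable log-softmax
    return logits - np.log(np.exp(logits).sum())

token_ids = rng.integers(0, vocab, size=seq_len)
log_probs = forward(theta, token_ids)
```

Training would then maximize `log_probs` at the verbalizer token's index while keeping `embed_table` and `out_head` frozen.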
Model-Agnostic Meta-Learning. Assuming access to a task distribution $p(\mathcal{T})$, the goal of meta-learning is to utilize tasks $\tau_i \sim p(\mathcal{T})$, referred to as meta-training tasks or meta tasks, to train a learning procedure that generalizes to unseen tasks from the distribution. Model-Agnostic Meta-Learning (MAML) (Finn et al., 2017) is a gradient-based bi-level optimization meta-learning method, which consists of inner-loop task-specific learning and outer-loop fast adaptation across tasks. Specifically, a task $\tau$ is composed of the support set $D^s_\tau$ and the query set $D^q_\tau$. In the inner loop of MAML, a model learns to adapt to a new task $\tau_i$ using its support set in the following way:

$$\theta_i' = \theta - \alpha_1 \nabla_\theta \mathcal{L}_{D^s_{\tau_i}}(\theta), \tag{2}$$

where $\alpha_1$ is the inner-loop learning rate and $\theta$ is the model's parameters. The optimized parameters $\theta_i'$ are then evaluated on the query set of task $\tau_i$ with the loss function $\mathcal{L}_{D^q_{\tau_i}}$. In the outer loop, this loss across meta-training tasks is treated as the final training loss to update $\theta$:

$$\theta \leftarrow \theta - \beta_1 \nabla_\theta \sum_{\tau_i \sim p(\mathcal{T})} \mathcal{L}_{D^q_{\tau_i}}(\theta_i'), \tag{3}$$

where $\beta_1$ is the outer-loop learning rate.
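The bi-level loop just described can be illustrated on a toy quadratic objective. Note one simplifying assumption: the sketch uses the first-order MAML approximation (the outer gradient is taken at the adapted parameters, ignoring second-order terms), which is not the exact second-order update:

```python
import numpy as np

def loss(theta, data):
    """Toy task loss: fit theta to a task-specific target vector."""
    return 0.5 * np.sum((theta - data) ** 2)

def grad(theta, data):
    return theta - data  # analytic gradient of the quadratic loss

def maml_step(theta, tasks, alpha1=0.1, beta1=0.1):
    """One meta-update over a batch of (support, query) tasks."""
    outer_grad = np.zeros_like(theta)
    for support, query in tasks:
        # Inner loop: adapt to the task on its support set (Eq. 2).
        theta_i = theta - alpha1 * grad(theta, support)
        # Outer loop: accumulate query-set loss gradients (Eq. 3),
        # using the first-order approximation for simplicity.
        outer_grad += grad(theta_i, query)
    return theta - beta1 * outer_grad / len(tasks)

rng = np.random.default_rng(0)
theta = np.zeros(4)
center = rng.normal(size=4)
tasks = [(center + 0.1 * rng.normal(size=4),
          center + 0.1 * rng.normal(size=4)) for _ in range(8)]

before = np.mean([loss(theta, q) for _, q in tasks])
for _ in range(50):
    theta = maml_step(theta, tasks)
after = np.mean([loss(theta, q) for _, q in tasks])
```

After meta-training, the average query-set loss should drop, reflecting an initialization that adapts well across the task batch.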

Constructing Anchor Meta Tasks
Supervised datasets with a large amount of labeled data are often unavailable in many NLP tasks, while unlabeled data is more easily accessible and generally covers broader semantic concepts. So we utilize the unlabeled data from a large corpus to create anchor self-supervised meta-training tasks.
The unlabeled data are first grouped into different clusters. We utilize PLMs to derive semantically meaningful embeddings for sentences in the corpus, and then apply unsupervised K-means to cluster these unlabeled sentences. Based on the results of K-means, we design three different formats of self-supervised meta-training tasks: sentence-pair classification, multi-choice classification, and single-sentence classification.
Specifically, sentence-pair classification involves predicting whether two sentences are adjacent in the same document or come from the same cluster after K-means clustering. Multi-choice classification identifies the correct sentence among several candidates, which is either adjacent to a query sentence or from its same cluster. And single-sentence classification aims to associate each sentence with its correct cluster label, as determined by K-means. On this basis, for each task format, we distribute meta-training data into different tasks to construct anchor meta-training tasks with well-balanced task distributions. We group samples with similar embeddings into the same task based on the results of K-means. A more detailed description of anchor meta-training task construction is given in Appendix A.2.
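The cluster-then-group step can be sketched as follows. The minimal K-means and the synthetic "sentence embeddings" are illustrative assumptions standing in for PLM-derived embeddings of a real corpus:

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Minimal K-means: returns cluster assignments and centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each embedding to its nearest centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(axis=1)
        # Recompute centroids from the current assignments.
        for c in range(k):
            if np.any(assign == c):
                centroids[c] = X[assign == c].mean(axis=0)
    return assign, centroids

# Pretend these are PLM sentence embeddings from the unlabeled corpus:
# three loose groups of 50 "sentences" each, in 8 dimensions.
rng = np.random.default_rng(1)
emb = np.concatenate([rng.normal(loc=m, scale=0.2, size=(50, 8))
                      for m in (-2.0, 0.0, 2.0)])
assign, centroids = kmeans(emb, k=3)

# Group sentences by cluster: each group seeds one anchor meta task,
# so samples with similar embeddings land in the same task.
tasks = {c: np.flatnonzero(assign == c) for c in range(3)}
```

In practice the paper uses real PLM encoders and a large corpus; the grouping logic at the end is the part that mirrors the anchor-task construction.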

Vanilla Task Augmentation
With a set of anchor meta-training tasks, in this section we first introduce the vanilla task augmentation to densify the task distribution. Extending the idea of mixup (Zhang et al., 2018), we augment the task set through task interpolation, which linearly combines features and corresponding labels of samples from the query sets of different tasks. In §3.5 we further upgrade the vanilla task augmentation method into a curriculum-based one, which dynamically controls the task interpolation in terms of the current model capability. Specifically, for a task composed of a support set and a query set, we denote the hidden representations of the query-set samples in task $\tau_k$ as $H^q_k$. Given an anchor task $\tau_i$, we first randomly select another task $\tau_j$. While retaining the support set of $\tau_i$, we reconstruct its query set by interpolating the hidden representations $(H^q_i, H^q_j)$ and corresponding labels $(Y^q_i, Y^q_j)$ from the query sets of $\tau_i$ and $\tau_j$, which can be accomplished using mixup:

$$\tilde{H}^q = (1-\lambda)\, H^q_i + \lambda\, H^q_j, \qquad \tilde{Y}^q = (1-\lambda)\, Y^q_i + \lambda\, Y^q_j, \tag{4}$$

where the mixing ratio $\lambda \in [0, 1]$ is drawn from a Beta distribution $\mathrm{Beta}(\alpha, \alpha)$, and $\alpha$ is a hyper-parameter. The process of task augmentation not only enriches the task distribution, but also simulates the distribution shift between the support set and the query set within one task, as we only leverage interpolation between the query sets of different anchor meta-training tasks. In §3.4 we will show the effect of this distribution deviation.
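The query-set interpolation can be sketched as follows. The function name and tensor shapes are illustrative, and the convention that λ weights the foreign task τ_j (so a larger λ pulls the query set further from the support set) is one possible choice:

```python
import numpy as np

def interpolate_query(Hq_i, Yq_i, Hq_j, Yq_j, alpha=0.5, rng=None):
    """Mixup-style task interpolation on the query set only.
    The support set of task i is kept as-is elsewhere, so the mixed
    query set simulates a support/query distribution shift."""
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)          # mixing ratio from Beta(a, a)
    Hq = (1 - lam) * Hq_i + lam * Hq_j    # interpolate representations
    Yq = (1 - lam) * Yq_i + lam * Yq_j    # interpolate soft labels
    return Hq, Yq, lam

rng = np.random.default_rng(0)
Hq_i, Hq_j = rng.normal(size=(16, 8)), rng.normal(size=(16, 8))
# One-hot labels, so the mixed labels remain valid distributions.
Yq_i = np.eye(4)[rng.integers(0, 4, 16)]
Yq_j = np.eye(4)[rng.integers(0, 4, 16)]
Hq, Yq, lam = interpolate_query(Hq_i, Yq_i, Hq_j, Yq_j, rng=rng)
```

Because only the query side is mixed, every augmented task carries a built-in deviation between its support and query distributions.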

Meta-Prompt Learning with Meta-Gradient Regularization
In this section we introduce the algorithm of our meta-prompt learning framework, which is a bi-level meta-learning paradigm learning a task-universal soft prompt initialization $\theta$ for efficient adaptation. It jointly meta-learns a meta-gradient regularization function $\psi_\phi$ that transforms raw gradients into a domain-generalizable direction to prevent prompt tuning from overfitting.
Specifically, considering that the inner-loop update of MAML (i.e., Eq. (2)) over limited samples might overfit to some domain-specific correlations, we propose to learn a gradient regularization function $\psi_\phi(\cdot)$, making a direct transformation to the raw gradients obtained from the support set $D^s_{\tau_i}$. The function first performs an affine transformation $h(\cdot)$ (e.g., rotation) to modulate the raw gradients $g$, and then an update gate vector $z$ is employed to combine $g$ and $h(g)$ into the final gradients:

$$\psi_\phi(g) = z \odot h(g) + (1 - z) \odot g. \tag{5}$$

Obviously, the value of $z$ can be used to control how much the transformed gradients $h(g)$ contribute to the output of $\psi_\phi(g)$. We hope to determine this weight based on the input samples themselves, setting $z = \sigma(WH + b)$, where $H$ is the hidden representations of the input samples. Formally, we now transform Eq. (2) into:

$$\theta_i' = \theta - \alpha_1 \psi_\phi\big(\nabla_\theta \mathcal{L}_{D^s_{\tau_i}}(\theta)\big). \tag{6}$$

After adapting the soft prompt embeddings to the support set $D^s_{\tau_i}$, in the outer loop we optimize the prompt initialization $\theta$ based on these adapted embeddings $\theta_i'$ via Eq. (3). Besides, the meta-gradient regularization parameters $\phi$ are also optimized using the same loss to learn a better gradient transformation, with $\beta_2$ as the learning rate:

$$\phi \leftarrow \phi - \beta_2 \nabla_\phi \sum_{\tau_i \sim p(\mathcal{T})} \mathcal{L}_{D^q_{\tau_i}}(\theta_i'). \tag{7}$$

Overall, the total meta-prompt learning objective can be formulated as follows:

$$\min_{\theta, \phi} \sum_{\tau_i \sim p(\mathcal{T})} \mathcal{L}_{D^q_{\tau_i}}\Big(\theta - \alpha_1 \psi_\phi\big(\nabla_\theta \mathcal{L}_{D^s_{\tau_i}}(\theta)\big)\Big). \tag{8}$$

Downstream Prompt Tuning. The above meta-prompt learning framework only requires a one-time execution. The optimized prompt initialization $\theta^*$ and meta-gradient regularization parameters $\phi^*$ are then universal for different downstream tasks.
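The gate-based gradient transformation can be sketched as below. Parameterizing h(·) as a learned matrix and pooling the sample representations with a mean are assumptions made for illustration, not the paper's exact design:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def psi(g, H, A, W, b):
    """Meta-gradient regularization psi_phi (a sketch):
    h(g) = A @ g is an affine transformation of the raw gradient g,
    and a sample-conditioned gate z blends h(g) with g."""
    h_g = A @ g                          # affine transform of raw gradient
    z = sigmoid(W @ H.mean(axis=0) + b)  # gate from input representations H
    return z * h_g + (1 - z) * g

rng = np.random.default_rng(0)
d = 8
g = rng.normal(size=d)          # raw gradient from the support set
H = rng.normal(size=(16, d))    # hidden representations of input samples
A = np.eye(d) + 0.1 * rng.normal(size=(d, d))  # phi: transform params
W = 0.1 * rng.normal(size=(d, d))              # phi: gate weights
b = np.zeros(d)                                # phi: gate bias
g_reg = psi(g, H, A, W, b)

# Inner-loop update with the regulated gradient (Eq. (6)-style):
alpha1 = 0.1
theta = rng.normal(size=d)
theta_prime = theta - alpha1 * g_reg
```

A sanity check on the gate: with an identity transform and zero gate parameters, z = 0.5 everywhere and the output reduces to the raw gradient.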
During downstream prompt tuning, we fix $\phi^*$ and further adapt $\theta^*$ to testing tasks via Eq. (6).

Analysis of SUPMER.
Here we give some analysis of how SUPMER could enhance generalizability, with more complete proof in Appendix A.1.
Given that $x = \theta - \alpha_1 \psi_\phi(\nabla_\theta \mathcal{L}_{D^s}(\theta))$ and $x_0 = \theta$, focusing on a single meta-training task, we can apply a first-order Taylor expansion around the point $x_0$ to reformulate Eq. (8) as:

$$\mathcal{L}_{D^q}(x) \approx \mathcal{L}_{D^q}(\theta) - \alpha_1 \nabla_\theta \mathcal{L}_{D^q}(\theta) \cdot \psi_\phi\big(\nabla_\theta \mathcal{L}_{D^s}(\theta)\big). \tag{9}$$

Based on the aforementioned discussion, we can reach the following conclusions: (1) The update of $\theta$ minimizes the expected loss on the query set.
(2) The optimization of both $\theta$ and $\phi$ maximizes the inner product between the regulated gradients from the support set and the gradients from the query set. The inner product of two vectors is larger if they point in similar directions. Recalling that we simulate the distribution shift between the support set and the query set, the optimization of $\theta$ and $\phi$ tries to align the gradient directions across different distributions. To improve the alignment between the domain-specific gradients, the gradient regularization parameters $\phi$ are optimized to retain some domain-invariant information of the meta-training data, and can then be utilized to regulate raw gradients obtained from few-shot samples into a domain-generalizable direction in downstream prompt tuning, thus avoiding overfitting to some spurious correlations.
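The first-order argument can be checked numerically on a toy example: for quadratic support/query losses and a fixed gradient regularizer, the query loss after one regulated inner step is well approximated by the query loss at θ minus α₁ times the inner product of the query gradient with the regulated support gradient. The quadratic losses and the simple scaling regularizer below are stand-ins chosen purely for verifiability:

```python
import numpy as np

rng = np.random.default_rng(0)
d, alpha1 = 6, 1e-3

# Quadratic support/query losses with different optima (a toy
# stand-in for the support/query distribution shift).
a_s, a_q = rng.normal(size=d), rng.normal(size=d)
L_s = lambda t: 0.5 * np.sum((t - a_s) ** 2)
L_q = lambda t: 0.5 * np.sum((t - a_q) ** 2)
g_s = lambda t: t - a_s   # support-set gradient
g_q = lambda t: t - a_q   # query-set gradient

psi = lambda g: 0.9 * g   # a fixed, simple gradient regularizer

theta = rng.normal(size=d)
# Exact query loss after one regulated inner-loop step ...
exact = L_q(theta - alpha1 * psi(g_s(theta)))
# ... vs. its first-order Taylor expansion around theta: the query
# loss at theta minus alpha1 times the gradient inner product.
approx = L_q(theta) - alpha1 * g_q(theta) @ psi(g_s(theta))
```

For a small inner-loop step the two quantities agree closely, so minimizing the post-adaptation query loss indeed pushes the inner product upward.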

Curriculum-based Task Augmentation
In §3.4 we show that SUPMER can help align the optimization direction across two distributions with deviation, which is simulated by performing task augmentation exclusively on the query sets. From Eq. (4) it is evident that the mixing ratio $\lambda$ of mixup controls the extent of the distribution deviation, with a larger $\lambda$ resulting in a more noticeable deviation. However, in the previously discussed method, $\lambda$ is sampled from a fixed Beta distribution. In this section, we propose a more flexible sampling approach, which upgrades the original task augmentation method into a curriculum-based one, gradually increasing the task difficulty and achieving a more reasonable distribution shift.
The curriculum-based task augmentation dynamically adjusts the mixing ratio $\lambda$ instead of drawing it from a fixed Beta distribution. Specifically, a batch of meta tasks is sampled in each training epoch. For each task, we can obtain the gradients on the support set $g^s_i$ and the gradients on the query set $g^q_i$, along with their cosine similarity. We leverage the average cosine similarity $\bar{s}_{k-1}$ of all tasks in a batch during the last epoch to derive the mixing ratio $\lambda_k$ for the current epoch $k$:

$$\lambda_k = \Big(\frac{\bar{s}_{k-1} + 1}{2}\Big)^{m}, \tag{10}$$

where $m$ is the curve parameter. In this way, when our model is not yet capable of aligning the optimization directions across different distributions at the beginning, a smaller $\lambda$ is preferable to create a smaller distribution deviation. Then $\lambda$ tends to gradually increase as the model's capability improves, resulting in a larger distribution deviation and a corresponding increase in task difficulty.
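One plausible instantiation of such a similarity-driven schedule is sketched below. The exact functional form (mapping the average cosine similarity s̄ ∈ [-1, 1] into [0, 1] and shaping it with a curve parameter m) is an assumption for illustration; the essential behavior is that λ grows as support/query gradients become better aligned:

```python
import numpy as np

def mixing_ratio(avg_cos_sim, m=2.0):
    """Curriculum mixing ratio (one plausible schedule): map the
    average support/query gradient cosine similarity s in [-1, 1]
    to [0, 1] and shape it with curve parameter m."""
    return ((avg_cos_sim + 1.0) / 2.0) ** m

def batch_avg_cos(gs, gq):
    """Average cosine similarity between support-set and query-set
    gradients over a batch of tasks."""
    sims = [g1 @ g2 / (np.linalg.norm(g1) * np.linalg.norm(g2))
            for g1, g2 in zip(gs, gq)]
    return float(np.mean(sims))

rng = np.random.default_rng(0)
gs = [rng.normal(size=8) for _ in range(4)]
gq = [g + 0.1 * rng.normal(size=8) for g in gs]  # well-aligned gradients
s = batch_avg_cos(gs, gq)
lam = mixing_ratio(s)
```

Early in training, poorly aligned gradients give a small λ (mild deviation); as alignment improves, λ rises toward 1 and the augmented tasks become harder.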
We present the pseudo-codes of SUPMER in Appendix A.4.

Experimental Setup
We evaluate our approach in two problem settings: 1) Few-shot learning with different NLP downstream tasks; 2) domain generalization.
We choose A as the source domain and the other five (B, D, E, K, R) constitute the target domains.On this basis, we sample 16 instances per label from the training set of the source domain to tune soft prompts.And then we directly use the soft prompts learned from the source domain to evaluate performance on the test set of each domain.
We list the details of baselines in Appendix B.
Implementation Details. We solve all downstream tasks in a text-to-text format and run each experiment with 5 different random seeds. For all prompt tuning methods, we follow Lester et al. (2021) to design soft prompts composed of 100 soft tokens, with far fewer tunable parameters than full-model tuning. For our SUPMER, following PPT (Gu et al., 2022), we sample 10GB of data from OpenWebText (Gokaslan et al., 2019), a large-scale unlabeled corpus, to construct self-supervised meta-training tasks. The meta-training stage only requires a one-time execution. In downstream prompt tuning, we freeze the meta-gradient regularization parameters, and the soft prompts are the only tunable parameters. We give more details of training hyper-parameters in Appendix C.

Main Result
Table 1 and Table 2 show the main results of few-shot learning and domain generalization. From the results, we have the following observations. First, in few-shot learning, SUPMER achieves better performance than all baselines on 10 of 12 datasets, whether using T5-base or Flan-T5-XL as the backbone. And the average accuracy of SUPMER over all datasets reaches 71.3% on T5-base, significantly outperforming other baselines (e.g., improving the performance by +1.3 points compared to FT). Notably, when utilizing the larger Flan-T5-XL as the backbone, SUPMER demonstrates even more substantial performance gains (e.g., improving the average performance by +2.5 points compared to FT), which indicates that our approach unlocks greater capabilities for stronger models that have undergone instruction tuning with a higher number of parameters.
Specifically, SUPMER consistently outperforms all other prompt tuning methods with the same number of tunable parameters across all datasets. This indicates that our method offers soft prompts with better few-shot generalization ability. It is noteworthy that SUPMER utilizes exactly the same unlabeled data as PPT and Unified-PPT for soft prompt initialization, yet it considerably outperforms these two baselines, demonstrating that the performance improvement is primarily attributable to our methodology rather than the meta-training data itself. Additionally, SUPMER outperforms baseline methods with more tunable parameters (e.g., full-model tuning) on the majority of datasets, achieving superior performance with fewer parameters. Second, SUPMER is superior to all baselines in almost all domain-generalization setups. For example, compared to MetaPT, which meta-trains soft prompts with a supervised sentiment analysis dataset, SUPMER exhibits average gains of 1.1% on T5-base and 1.4% on Flan-T5-XL. So it can be inferred that SUPMER shows stronger robustness to domain shifts, exhibiting better generalization to unseen tasks or domains.
Third, for both few-shot learning and domain generalization on Flan-T5-XL, SUPMER demonstrates superior performance across almost all datasets and domains in contrast to few-shot inference. This provides further evidence that for LMs such as Flan-T5-XL with inherent few-shot inference capabilities, our approach can significantly enhance their abilities with a parameter-efficient tuning strategy, without providing any in-context examples during inference.
Fourth, SUPMER also results in lower variances on most datasets. Few-shot learning is often notorious for its instability, and our method makes few-shot prompt tuning more stable.

Ablation Study
Analysis of Generalization. Figure 3 shows that SUPMER not only accelerates convergence to the optimal performance, realizing fast adaptation, but also consistently maintains its optimal performance across prolonged training periods.
Effect of Sample Size. We also discuss how the performance of SUPMER and other baselines varies when the number of training samples increases on SST-5 and SUBJ. As shown in Figure 4, with T5-base as the underlying PLM, when the number of training samples per label grows from 4 to 64, SUPMER is consistently better than other prompt tuning methods. And the performance gap between these methods is gradually reduced as the number of training data increases.
Self-Supervised vs. Supervised. To illustrate that self-supervised meta-learning can better generalize to unseen tasks compared to supervised meta-learning, we also collect a set of labeled datasets (ensuring no overlap with downstream testing datasets) to formulate meta-training tasks for soft prompt initialization, and conduct the experiments of few-shot learning on T5-base. The results are displayed in Table 3 (rows 1 and 2). As our collected labeled data contains many sentiment analysis datasets (e.g., Yelp5), SUPMER (only labeled) and SUPMER (only unlabeled) perform similarly on sentiment analysis tasks (i.e., SST-2, SST-5, MR, CR). But on other tasks, using unlabeled data consistently achieves better results than utilizing only labeled data, also with a higher average accuracy over all datasets, which validates the superiority of self-supervised meta-learning.
Effect of Integrating Labeled Data. To further explore the impact of integrating labeled data and substantiate the efficacy of SUPMER following this integration, we amalgamate the original unlabeled meta-training data with our collected labeled data mentioned above, with a mixing ratio of labeled to unlabeled of 1:2. The amalgamated data is employed for constructing meta-training tasks to meta-train SUPMER. Moreover, following PPT (Gu et al., 2022) and MetaPT (Huang et al., 2022), we also leverage pre-training and vanilla MAML to initialize soft prompts using the same amalgamated data. The experimental results of few-shot learning on T5-base are shown in Table 3 (rows 3-5). First, we can see that SUPMER (labeled+unlabeled) outperforms SUPMER (unlabeled) and SUPMER (labeled), as it allows us to harness the high-quality advantages of labeled data while also exploiting the broader semantic concepts encapsulated by unlabeled data. Second, after the integration of labeled data, SUPMER still consistently demonstrates significantly superior performance compared to baseline methods employing the same data for prompt initialization, which further underscores the effectiveness of SUPMER.
Effect of Individual Components. We train the following ablation models. 1) only sp / mc / ss: we retain sentence-pair classification / multi-choice classification / single-sentence classification as the only anchor meta-training task format. 2) w/o ta: we entirely remove the task augmentation method.
3) w/o curriculum: we only retain the vanilla task augmentation without the curriculum-based idea. 4) w/o mgr: we remove the meta-gradient regularization function. All experiments follow the settings in §4.1 and are conducted on T5-base. We report the average accuracy of few-shot learning and domain generalization in Table 4. More detailed results are in Appendix D.
The results of Rows 1-3 indicate that considering diversified task formats during meta-training helps efficiently generalize to different tasks, as downstream tasks often contain various task formats. Rows 4 and 5 highlight that task augmentation plays an essential role in our framework, with curriculum-based augmentation further enriching the task distribution and realistically simulating the distribution shift. Moreover, Row 6 validates the superiority of meta-gradient regularization in avoiding overfitting to some domain-specific correlations, thus achieving better performance.

Conclusion
In this paper, we present SUPMER, a self-supervised meta-prompt learning framework with meta-gradient regularization. With a diverse set of well-designed self-supervised meta-training tasks, SUPMER jointly meta-learns a universal prompt initialization and an effective gradient regularization function for efficient few-shot generalization. Extensive experiments on few-shot learning and domain generalization show that SUPMER outperforms other prompt methods and full-model tuning, achieving state-of-the-art performance.

Limitations
Although SUPMER performs superbly in a variety of problem scenarios, there still exist some limitations in our work: 1) We did not conduct any data filtering or cleaning operations on the meta-training data, which could potentially result in the inclusion of some biased content. 2) Our experiments are solely conducted on English tasks, and also do not involve some kinds of NLP tasks (e.g., language generation; Li et al., 2022c) or vision-language tasks (Zhang et al., 2022b; Li et al., 2022b; Zhang et al., 2019; Li et al., 2021).
To address these limitations, in the future we plan to conduct further cleansing and filtering of the current meta-training data. Besides, we intend to evaluate the few-shot performance of our framework in the multilingual setting and also broaden the scope of tasks, including retrieval (Pan et al., 2023), language generation (Li et al., 2022c), and vision-language tasks (Li et al., 2023b; Chen et al., 2023; Li et al., 2022a; Zhang et al., 2022a). Furthermore, we hope our work can pave the way for future research on better leveraging parameter-efficient methods under few-shot settings.

Appendices A Additional Information for SUPMER A.1 Complete Analysis of SUPMER
In this section, we provide a more comprehensive and complete analysis of SUPMER. We will show that during meta-training, the optimization of the soft prompt embeddings $\theta$ and the meta-gradient regularization parameters $\phi$ tends to maximize the inner product between the gradients obtained from the support set after regulation and the gradients from the query set.
Specifically, to update the parameters $\theta$ and $\phi$, we first evaluate their gradients, denoted as $g_\theta$ and $g_\phi$. Considering the original algorithm of MAML, each task consists of a support set and a query set, and only one step of gradient descent is applied in the inner-loop optimization. To make our statement more direct, we denote the loss functions based on the support set and the query set as $L_0$ and $L_1$. In SUPMER, ignoring the regularized loss, only $L_1$ is directly utilized to optimize $\phi$, while $\theta$ is optimized in a bi-level meta-optimization paradigm. Here we define the following terms related to $\theta$, similar to Nichol et al. (2018), for each $i \in \{0, 1\}$:

$$g_{\theta_i} = \frac{\partial L_i(\theta_i)}{\partial \theta_i} \quad \text{(gradient obtained during SGD)},$$
$$\bar{g}_{\theta_i} = \frac{\partial L_i(\theta_0)}{\partial \theta_0} \quad \text{(gradient at the initial point)},$$
$$\bar{H}_{\theta_i} = \frac{\partial^2 L_i(\theta_0)}{\partial \theta_0^2} \quad \text{(Hessian at the initial point)},$$

where $\psi_\phi(\cdot)$ is the meta-gradient regularization operation, $\theta_0$ denotes the initial soft prompt embeddings for each step, and $\theta_1$ denotes the embeddings after the inner-loop optimization. Obviously we have $g_{\theta_0} = \bar{g}_{\theta_0}$. Firstly, we perform a Taylor series expansion to approximate the SGD gradients $g_{\theta_1}$ obtained from the query set as follows:

$$g_{\theta_1} = \frac{\partial L_1(\theta_1)}{\partial \theta_1} \approx \bar{g}_{\theta_1} + \bar{H}_{\theta_1}(\theta_1 - \theta_0) = \bar{g}_{\theta_1} - \alpha_1 \bar{H}_{\theta_1} \psi_\phi(\bar{g}_{\theta_0}). \tag{12}$$

Then we analyze the gradient descent operation in the inner-loop optimization based on the support set. Define $U$ as the gradient descent operation, so that $U(\theta_0) = \theta_0 - \alpha_1 \psi_\phi\big(\frac{\partial L_0(\theta_0)}{\partial \theta_0}\big)$. We can get $\frac{\partial U(\theta_0)}{\partial \theta_0}$ and $\frac{\partial U(\theta_0)}{\partial \phi}$ as follows:

$$\frac{\partial U(\theta_0)}{\partial \theta_0} = I - \alpha_1 \frac{\partial \psi_\phi(\bar{g}_{\theta_0})}{\partial \theta_0}, \tag{13}$$
$$\frac{\partial U(\theta_0)}{\partial \phi} = -\alpha_1 \frac{\partial \psi_\phi(\bar{g}_{\theta_0})}{\partial \phi}. \tag{14}$$

So based on Eqs. (12), (13), and (14), we can finally approximate the gradients $g_\theta$ and $g_\phi$ as:

$$g_\theta = \Big(\frac{\partial U(\theta_0)}{\partial \theta_0}\Big)^{\top} g_{\theta_1} \approx \bar{g}_{\theta_1} - \alpha_1 \frac{\partial}{\partial \theta_0}\big(\bar{g}_{\theta_1} \cdot \psi_\phi(\bar{g}_{\theta_0})\big), \tag{15}$$
$$g_\phi = \Big(\frac{\partial U(\theta_0)}{\partial \phi}\Big)^{\top} g_{\theta_1} \approx -\alpha_1 \frac{\partial}{\partial \phi}\big(\bar{g}_{\theta_1} \cdot \psi_\phi(\bar{g}_{\theta_0})\big). \tag{16}$$

Thus, $-\frac{\partial}{\partial \theta_0}\big(\bar{g}_{\theta_1} \cdot \psi_\phi(\bar{g}_{\theta_0})\big)$ and $-\frac{\partial}{\partial \phi}\big(\bar{g}_{\theta_1} \cdot \psi_\phi(\bar{g}_{\theta_0})\big)$ indicate the optimization direction, which increases the inner product between gradients from the query set and regulated gradients from the support set. To further consolidate our analysis, we also track the normalized gradient inner product during the first 5000 steps of meta-training. As shown in Figure 5, the normalized gradient inner product gradually increases during meta-training.
On this basis, since there exists a distribution shift between the support set and the query set after task augmentation, our method aligns the gradient directions across different distributions, which helps enhance model generalization. In other words, the trainable parameters descend in a coordinated manner such that the input-output correspondence is as close as possible across two distributions with deviation. Besides, the meta-gradient regularization parameters $\phi$ also retain some domain-invariant information of the meta-training data in the above process. Considering that $\phi$ is fixed in downstream tasks, $\phi$ can be applied to encourage the alignment between the domain-specific gradients and avoid prompt tuning overfitting to some domain-specific correlations.

A.2 Constructing Anchor Meta Tasks
Given a sentence x from unlabeled corpora, we can derive a semantically meaningful sentence embedding $H = f^{enc}_\theta(x)$ with a PLM, e.g., the T5 encoder. We then apply K-means to cluster these unlabeled sentences according to their embeddings:

$$P = \arg\min_{P} \sum_{c=1}^{K} \sum_{x \in C_c} \left\| f^{enc}_\theta(x) - \mu_c \right\|^2,$$

where $\mu_c$ indicates the learned centroid of cluster $C_c$ and $P$ indicates the partition of all sentences. K-means clustering leads to more abundant formats and objectives of meta-training tasks. Based on the results of K-means, we design three formats of anchor self-supervised meta-training tasks: sentence-pair classification, multi-choice classification, and single-sentence classification. Here we introduce each of them in detail.
Sentence-pair Classification. Sentence-pair classification takes a pair of sentences (x_0, x_1) as input, where x_0 is the anchor sentence. We carry out a next-sentence-prediction task and a sentence-similarity task in sentence-pair classification, with the label list Y = [0, 1, 2]. For the former, following Gu et al. (2022), we set two sentences next to each other as label 0, two from the same document but not adjacent as label 2, and two from different documents as label 1.

Single-sentence Classification. For each sentence, we use the cluster r_i to which the sentence belongs, simply taking r_i as the pseudo label for meta-training, and construct 4-way classification tasks. As for the design of the verbalizer, we transform single-sentence classification into the format of multi-choice classification: we insert the centroid $\mu_c$ of each cluster into the template and use it to represent the corresponding cluster.

On this basis, for each task format, we separate all data into different tasks to construct anchor meta-training tasks with good task distributions. Through K-means, sentences with similar embeddings are clustered into the same group. So in sentence-pair classification and multi-choice classification, we group samples whose anchor sentence comes from the same cluster into the same meta-training task. And in single-sentence classification, for each meta-training task, we randomly select N clusters as N classes and then sample k sentences from each cluster to construct an N-way k-shot classification task (N = 4). In this way, we construct all anchor meta-training tasks.
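The clustering-then-grouping recipe for single-sentence tasks can be sketched as follows. This is a toy illustration, not the released SUPMER code: the random vectors stand in for T5-encoder sentence embeddings, and the helper names, cluster count, and iteration budget are our own choices.

```python
import numpy as np

def kmeans(X, n_clusters, n_iters=50, seed=0):
    """Plain Lloyd's algorithm: returns (centroids, cluster labels)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), n_clusters, replace=False)].copy()
    for _ in range(n_iters):
        # Assign each point to its nearest centroid, then recompute centroids.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        for c in range(n_clusters):
            members = X[labels == c]
            if len(members) > 0:
                centroids[c] = members.mean(axis=0)
    return centroids, labels

def sample_nway_kshot(labels, n_way=4, k_shot=4, seed=0):
    """Build one N-way k-shot single-sentence task: pick N clusters as
    classes, then k sentences per cluster, using cluster ids as pseudo labels."""
    rng = np.random.default_rng(seed)
    valid = [c for c in np.unique(labels) if np.sum(labels == c) >= k_shot]
    clusters = rng.choice(valid, size=n_way, replace=False)
    task = []
    for class_id, c in enumerate(clusters):
        idx = rng.choice(np.where(labels == c)[0], size=k_shot, replace=False)
        task.extend((int(i), class_id) for i in idx)  # (sentence index, pseudo label)
    return task

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 16))        # stand-ins for T5-encoder sentence embeddings
_, labels = kmeans(X, n_clusters=8)
task = sample_nway_kshot(labels, n_way=4, k_shot=4)
print(len(task))                      # 16 examples: 4 classes x 4 shots
```

In the full framework the same cluster assignments also drive the sentence-pair and multi-choice formats, by grouping samples whose anchor sentence falls in the same cluster.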

A.3 Additional Loss to Train Meta-Gradient Regularization Parameters
In the meta-training stage, we optimize the meta-gradient regularization parameters ϕ via Eq. (7), utilizing the same loss that optimizes the soft prompt embeddings. Here we introduce a regularized loss to attach additional restrictions when updating the meta-gradient regularization parameters. Notably, a higher value of b_k in Eq. (10) indicates a higher probability of a larger distribution deviation between the support set and the query set. Furthermore, in Eq. (5) we also tend to increase z to achieve a more pronounced gradient transformation under a more noticeable distribution deviation. From this perspective, z has a similar monotonicity to b_k, and both range between 0 and 1. Thus we further add a regularized loss $L_{reg} = \|z - b_k\|^2$ to constrain the value of z, and finally modify Eq. (7) by adding the term $\lambda L_{reg}$, where λ is the coefficient of the regularized loss.

A.4 Pseudo-Codes of SUPMER

We show the pseudo-code for the meta-training process of SUPMER in Alg. 1, and the process of curriculum-based task augmentation is described in Alg. 2.
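As a minimal sketch (our own illustration, not the paper's implementation), the regularized objective for ϕ described in Appendix A.3 could look like the following, treating z and b_k as scalars in [0, 1] and writing the coefficient λ as `lam`:

```python
def regularized_phi_loss(query_loss, z, b_k, lam=0.1):
    """Objective for the meta-gradient regularization parameters phi:
    the ordinary query-set loss plus lam * ||z - b_k||^2, which ties the
    gradient-transformation strength z to the estimated distribution
    deviation b_k (both lie in [0, 1])."""
    return query_loss + lam * (z - b_k) ** 2

# The penalty vanishes when z matches b_k and grows quadratically otherwise.
print(regularized_phi_loss(0.8, z=0.9, b_k=0.4))  # adds lam * 0.5**2 to the loss
```

Since both z and b_k are bounded in [0, 1], the penalty is at most λ and acts as a soft constraint rather than a hard projection.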

B Dataset & Baseline Details
Few-shot Learning. We conduct few-shot learning experiments on 6 different downstream English tasks with 12 datasets. Since some of the datasets' test sets are not publicly available, following Karimi Mahabadi et al. (2022), we leverage the original validation sets of SST-2, CB, RTE, QNLI, WiC, MRPC, and QQP as substitutes for the unavailable test sets. The validation sets for few-shot learning are sampled from the original training sets, ensuring no overlap with our designated test sets. Besides, we download the datasets of SST-2, SST-5, MR, CR, and SUBJ from Gao et al. (2021). The remaining datasets are obtained from the HuggingFace Datasets library (Lhoest et al., 2021). CB, RTE, BoolQ, and WiC are from the SuperGLUE benchmark (Wang et al., 2019), while QNLI, MRPC, and QQP are from the GLUE benchmark (Wang et al., 2018) with a Creative Commons license (CC BY 4.0). We give the statistics of all these datasets in Table 5.

Domain Generalization. Similar to Calderon et al. (2022), we evaluate on the sentiment analysis task across 6 different domains: Airlines (A), Books (B), DVDs (D), Electronics (E), Kitchen appliances (K), and Restaurants (R). Each domain has a total of 2,000 manually labeled binary-class examples for testing, including 1,000 positive and 1,000 negative examples.

Baselines. We first compare with baseline methods that have the same number of parameters as SUPMER. These methods utilize prompt tuning (Lester et al., 2021) to handle downstream tasks, with the key distinction lying in the initialization of the soft prompts. Vanilla prompt tuning (PT; Lester et al., 2021) directly tunes the soft prompts on the downstream task, with prompts randomly initialized from a normal distribution. PPT (Gu et al., 2022) pre-trains soft prompts in a self-supervised way with 3 formats of pre-training tasks: sentence-pair classification, multiple-choice classification, and single-text classification. Unified-PPT (Gu et al., 2022) formulates all three formats into a unified task form. MetaPT (Huang et al., 2022) uses a supervised sentiment analysis dataset, Yelp5, as the meta-training data and directly leverages MAML to initialize soft prompts.
To further demonstrate the effectiveness of our method, we also consider baseline methods with more tunable parameters, including Prefix-Tuning (Li and Liang, 2021) and P-tuning-v2 (Liu et al., 2022), which add prompts at each layer of the PLM. We also compare with full-model tuning (FT), which fine-tunes all parameters of the PLM.
Given that Flan-T5-XL was also designed with few-shot inference in mind, we additionally compare with two baseline methods on Flan-T5-XL: zero-shot inference and few-shot inference. For both, we directly employ Flan-T5-XL for downstream evaluation, coupled with carefully designed task instructions for each dataset. Furthermore, in few-shot inference, we also provide an appropriate number of few-shot examples to form a demonstration context.

C Training Details
We apply the T5-base model (Raffel et al., 2020) (220M parameters) and the Flan-T5-XL model (Chung et al., 2022) (3B parameters) as the underlying PLMs, using the HuggingFace PyTorch implementation (Wolf et al., 2020). We run experiments on 8 GeForce RTX 3090 24G GPUs, and the meta-training process of SUPMER takes about 140 GPU hours. Next we describe the details of the training hyper-parameters in the case of leveraging T5-base as the PLM.

C.1 Training Hyper-parameters for Downstream Tasks
In our experiments, we leverage full-model tuning and prompt tuning to solve downstream tasks, including few-shot learning and domain generalization. In few-shot learning, following prior work (Schick and Schütze, 2021; Karimi Mahabadi et al., 2022), we set the maximum sequence length of each example to 256 for CR, SUBJ, CB, RTE, and WiC, and 128 for the other datasets. In domain generalization, the maximum sequence length of each example is set to 256. We run each experiment 5 times with random seeds [10, 20, 30, 40, 50] and report the average accuracy as well as the standard deviation. For both full-model tuning and prompt tuning, we use AdamW as the optimizer. We use a batch size of 32 and train the model for 200 epochs, evaluating the model every 10 steps. We report the results for the hyper-parameters performing best on the validation set for each task.
Besides, for full-model tuning, all parameters of the PLM are fine-tuned without adding soft prompts. We search over the learning rates [1e-5, 2e-5, 3e-5] and choose the one that obtains the highest validation performance. Moreover, to fine-tune the Flan-T5-XL model, we use ZeRO (Rajbhandari et al., 2020) stage 2 as provided in DeepSpeed (Rasley et al., 2020) to reduce GPU memory usage.
For prompt tuning, we freeze all PLM parameters and only tune the soft prompts, composed of 100 soft tokens. As a result, prompt tuning has only 77K tunable parameters with T5-base and 205K with Flan-T5-XL, updating around 3,000 and 15,000 times fewer parameters, respectively, compared to full-model tuning. We find that prompt tuning requires a much larger learning rate than full-model tuning; we search for the learning rate in [1e-1, 2e-1, 3e-1] and choose the model with the best performance on the validation set.
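These counts can be sanity-checked with simple arithmetic: a soft prompt of n tokens only adds n × d_model trainable embedding parameters, where d_model is 768 for T5-base and 2048 for Flan-T5-XL (the standard configuration widths):

```python
# Back-of-the-envelope check of the tunable-parameter counts quoted above.
def prompt_params(n_tokens, d_model):
    """Trainable parameters of a soft prompt: one d_model-wide embedding per token."""
    return n_tokens * d_model

t5_base_prompt = prompt_params(100, 768)      # 76,800  -> the ~77K above
flan_xl_prompt = prompt_params(100, 2048)     # 204,800 -> the ~205K above
print(t5_base_prompt, flan_xl_prompt)
print(round(220_000_000 / t5_base_prompt))    # ~2865x fewer, i.e. roughly 3000x
print(round(3_000_000_000 / flan_xl_prompt))  # ~14648x fewer, i.e. roughly 15000x
```

The ratios closely match the "around 3,000 and 15,000 times fewer" figures in the text.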

C.2 Training Hyper-parameters for Prompt Initialization
Pre-training for prompt initialization. Gu et al. (2022) propose two frameworks for unsupervised prompt pre-training, named PPT and Unified-PPT. PPT designs three formats of unsupervised pre-training tasks (sentence-pair classification, multiple-choice classification, and single-text classification), and Unified-PPT further formulates them into a unified task form. We implement PPT and Unified-PPT following the hyper-parameters provided in Gu et al. (2022) and reset the pre-trained language model to T5-base and Flan-T5-XL. Specifically, for both PPT and Unified-PPT, we sample 10GB of unlabeled data from OpenWebText to construct pre-training tasks for each task format, with 5% of the data split off for validation. We apply the "inverse square root" learning rate scheduler with no warm-up steps and set the learning rate to 0.1. We set the batch size to 256 with a maximum sequence length of 512, and train the soft prompts for at most 200,000 steps. We evaluate performance on the validation set every 2,000 steps and choose the prompts with the lowest validation loss.

Meta-training for prompt initialization. In our SUPMER framework, we sample 10GB of unlabeled data from OpenWebText to construct self-supervised meta-training tasks, splitting 5% of the data to construct validation tasks. For each task format, we set the number of clusters to 250. We sample 4 meta-training tasks per batch and train the prompt embeddings θ and the meta-gradient regularization parameters ϕ for at most 100,000 steps. We also evaluate performance on the validation set every 2,000 steps, choosing the θ and ϕ with the lowest validation loss for downstream tasks. Moreover, to illustrate the superiority of self-supervised meta-learning, we also imitate MetaPT (Huang et al., 2022) to initialize soft prompts via supervised meta-learning. MetaPT uses a supervised sentiment analysis dataset, Yelp5, as the meta-training data, which has 650,000 training samples covering only the domain of restaurants. Following Huang et al. (2022), we group all labeled data into 10 clusters through K-means. We set the inner-loop learning rate to 0.08 and the outer-loop learning rate to 0.025, with the early-stop patience set to 6. Other hyper-parameters are consistent with those in SUPMER.

D Full Results of Ablation Study
In this section, we first give detailed experimental results of the ablation study to illustrate the effect of individual components. We evaluate each ablation model over all 12 datasets of few-shot learning and all 6 domains of domain generalization, with T5-base as the underlying PLM. We run each experiment 5 times with random seeds [10, 20, 30, 40, 50] and report the average performance as well as the standard deviation. The detailed results for few-shot learning and domain generalization are shown in Table 7 and Table 8. We can see that each component is critical in our framework.
Besides, in §4.4, to explore the superiority of self-supervised meta-learning and the impact of integrating additional labeled data for soft prompt initialization, we conduct few-shot learning experiments on T5-base, considering different data and methods for soft prompt initialization. We also carry out domain generalization experiments leveraging different data with various prompt initialization methods, with the results presented in Table 9. From Tables 3 and 9, it is evident that self-supervised meta-learning utilizing unlabeled data exhibits better adaptability to unseen tasks than its supervised counterparts, and that amalgamating both labeled and unlabeled data for the construction of meta-training tasks is a more advantageous strategy. When employing both labeled and unlabeled data for prompt initialization, SUPMER continues to show markedly superior results compared to baseline methods in both few-shot learning and domain generalization.

Figure 1 :
Figure 1: (a) Performance of PT with different prompt initializations. (b) Performance after different numbers of training steps for vanilla PT and SUPMER.

Figure 2 :
Figure 2: The framework of SUPMER. We employ task interpolation to enrich the distribution of self-supervised meta-training tasks. Concurrently, we integrate a meta-gradient regularization function into meta-prompt learning. Furthermore, during meta-prompt learning we also dynamically adapt the mixing ratio of task interpolation, upgrading the vanilla task augmentation into a curriculum-based one.

Figure 3 :
Figure 3: The performance after different training steps on CB and MRPC.

Figure 5 :
Figure 5: Normalized gradient inner products in the first 5000 steps during meta-training.

Table 1 :
Results of few-shot learning. For each dataset we report the average accuracy and standard deviation over five random seeds (zero-shot & few-shot inference produce nearly consistent results each time, as they do not require parameter tuning). Bold fonts indicate the best results. We can see that SUPMER achieves better performance.

Table 2 :
Results of domain generalization. For MetaPT we calculate the average performance only across domains A, B, D, E, and K (without R).

Table 3 :
Results of few-shot learning on T5-base, considering different data and methods for prompt initialization.

Table 4 :
Results of the ablation study to illustrate the effect of individual components. We report the average accuracy over all 12 datasets in few-shot learning and all 6 domains in domain generalization (DG).

Table 5 :
Statistics of all 12 datasets for few-shot learning. K is the number of labels. We sample N × K instances from the original training set to construct the few-shot training and validation sets, and #Test shows the size of the test set.

Table 6 :
Hyper-parameters for SUPMER. ϕ denotes the meta-gradient regularization parameters, λ is the coefficient of the regularized loss, and m is the curve parameter in the curriculum-based task augmentation.

Table 7 :
Detailed results of the ablation study for few-shot learning, illustrating the effect of individual components. In the first three rows we keep only one anchor task format during meta-training: sp stands for sentence-pair classification, mc for multi-choice classification, and ss for single-sentence classification. w/o ta means entirely removing task augmentation; w/o curriculum retains only the vanilla task augmentation without the curriculum-based idea; w/o mgr means removing the meta-gradient regularization method.

Table 8 :
Detailed results of the ablation study for domain generalization, illustrating the effect of individual components.

Table 9 :
Results of domain generalization on T5-base, considering different data and methods for prompt initialization. As our collected labeled data includes Yelp5, a sentiment analysis dataset in the domain of restaurants, we conduct the domain generalization experiments only across domains A, B, D, E, and K (without R).
Table 6 lists all training hyper-parameters for SUPMER. It is worth noting that for most hyper-parameters in Table 6 we simply set a default value by experience without tuning them. We tune only the hyper-parameters that are also tuned in the other baselines (e.g., the learning rate), ensuring all methods have the same number of tunable hyper-parameters in our experiments.