Continued Pretraining for Better Zero- and Few-Shot Promptability

Recently introduced language model prompting methods can achieve high accuracy in zero- and few-shot settings while requiring few to no learned task-specific parameters. Nevertheless, these methods still often trail behind full model finetuning. In this work, we investigate if a dedicated continued pretraining stage could improve “promptability”, i.e., zero-shot performance with natural language prompts or few-shot performance with prompt tuning. We reveal settings where existing continued pretraining methods lack promptability. We also identify current methodological gaps, which we fill with thorough large-scale experiments. We demonstrate that a simple recipe, continued pretraining that incorporates a trainable prompt during multi-task learning, leads to improved promptability in both zero- and few-shot settings compared to existing methods, up to 31% relative. On the other hand, we find that continued pretraining using MAML-style meta-learning, a method that directly optimizes few-shot promptability, yields subpar performance. We validate our findings with two prompt tuning methods, and, based on our results, we provide concrete recommendations to optimize promptability for different use cases.


Introduction
Conditioning language models (LMs) on manually written or learned continuous prompts allows them to solve tasks with high accuracy and minimal parameter overhead (Brown et al., 2020; Li and Liang, 2021; Lester et al., 2021, i.a.). However, prompting performance often still lags behind traditional full finetuning. Natural language prompts usually underperform trained models even when manually curated (Brown et al., 2020; Sanh et al., 2022). Similarly, while learned prompts yield higher accuracy, they do not work as well when the training data is scarce (Gu et al., 2022), when the model is small or moderately sized (Lester et al., 2021), or when the tasks are difficult (He et al., 2022).
To reduce the gap between prompt and full model tuning, past work has shown that continued pretraining on data that resembles the downstream prompting setup induces better "promptability", i.e., zero-shot performance with natural language (NL) prompts and few-shot performance of prompt tuning (Sanh et al., 2022; Gu et al., 2022). However, in this paper, we identify several shortcomings of these methods. First, continued pretraining on NL prompts (Sanh et al., 2022) sometimes causes performance degradation with prompt tuning. Second, continued pretraining approaches that learn only a universal prompt initialization (Gu et al., 2022; Vu et al., 2022) bring only marginal improvement on the P3 datasets (Bach et al., 2022).
To further improve zero- and few-shot promptability, we investigate gaps in existing methods with different parameter configurations and training procedures. First, we explore the effect of incorporating a learned continuous prompt into multi-task learning (MTL), and find that it significantly improves zero- and few-shot promptability across the board. In addition, we explore MAML-style meta-learning (Finn et al., 2017; Nichol et al., 2018) as an alternative to the standard continued pretraining paradigm, but find that it underperforms simple MTL, despite its previous success on few-shot learning tasks (Li et al., 2017; Gu et al., 2018; Qian and Yu, 2019, i.a.). We analyze this phenomenon and present several explanations.
Through large-scale experiments, each involving continued pretraining on over 9B tokens (§A), we make several contributions: (1) we thoroughly evaluate continued pretraining methods, both existing and our proposed ones, in many setups; (2) we demonstrate that a simple continued pretraining recipe improves over existing methods by up to 31%; (3) we show that MAML-style meta-learning underperforms multi-task learning and provide explanations; (4) we provide concrete recommendations to improve promptability in various use cases.

Prompting
We review two types of prompting that we use: natural language (NL) prompting and prompt tuning.
Traditionally, NLP tasks are solved by task-specific models that predict a label y ∈ Y from input x ∈ X. We can consider LMs as functions that score any source and target text pair, LM : V* × V* → R, with vocabulary V. Past work found that large LMs can be repurposed to solve many tasks by casting x, y into a text format using a template function f : X ∪ Y → V* and taking as prediction arg max_{y′ ∈ Y} LM(f(x), f(y′)).
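To make the scoring-based prediction rule concrete, here is a minimal runnable sketch. The scorer `lm_score` is a hypothetical stand-in for a real model's score of a target given a source (here a trivial word-overlap count, purely to make the example executable), and the template function is our own illustrative example, not one of the P3 templates.

```python
def lm_score(source: str, target: str) -> float:
    # Hypothetical scorer: rewards targets whose words appear in the source.
    # A real LM would return, e.g., the log-probability of the target text.
    src_words = set(source.lower().split())
    return sum(w in src_words for w in target.lower().split())

def template(x: str) -> str:
    # A natural language template f(.) casting the raw input into prompt text.
    return f"Review: {x} Sentiment:"

def predict(x: str, label_space: list[str]) -> str:
    # arg max over verbalized labels y' of LM(f(x), f(y'))
    return max(label_space, key=lambda y: lm_score(template(x), y))

print(predict("a positive heartwarming film", ["negative", "positive"]))  # → positive
```

The same pattern underlies zero-shot NL prompting: only the template and label verbalizations change per task, never the model parameters.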
NL prompts, or instructions, are manually constructed f(·). Without task-specific training, they have been successfully used to elicit predictions from LMs to perform tasks with high accuracy (Brown et al., 2020; Logan IV et al., 2022).
Sharing this motivation, prompt tuning learns a continuous prompt to condition the model. It takes the source text embedded by the LM input embeddings, s ∈ R^{N×d} with length N and dimension d, and prepends learnable embeddings E ∈ R^{L×d}, where L is a hyperparameter, to obtain a new embedded sequence of length L + N. We consider hybrid prompt tuning, where s is the embedding of the templatized f(x), i.e., prompt tuning is always performed in addition to NL templates. This setup has been widely adopted due to its demonstrated better performance (Gu et al., 2022; Min et al., 2022). We also study a variant of prompt tuning, sometimes called prefix tuning (Li and Liang, 2021), where the learnable vectors are added not only to the input but to all transformer layers. See Lester et al. (2021) and Li and Liang (2021) for more details on these methods. Following the terminology of Liu et al. (2022b), we refer to the input-level method as shallow prompt tuning and the layer-specific method as deep prompt tuning.
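The tensor manipulation behind shallow prompt tuning is just a concatenation along the sequence axis. A minimal numpy sketch (toy dimensions; a real model would use its actual embedding size and a frozen embedding layer):

```python
import numpy as np

rng = np.random.default_rng(0)

d = 8   # embedding dimension (toy size; real models use e.g. 1024)
L = 20  # number of learnable prompt vectors, as in our setup
N = 5   # length of the templatized source sequence f(x)

# s: the embedded source text, produced by the (frozen) LM input embeddings.
s = rng.normal(size=(N, d))
# E: the learnable prompt embeddings, the only trainable parameters in
# shallow prompt tuning.
E = rng.normal(size=(L, d))

# Shallow prompt tuning prepends E to s before the transformer runs,
# yielding a sequence of length L + N.
inputs = np.concatenate([E, s], axis=0)
assert inputs.shape == (L + N, d)
```

Deep (prefix) tuning repeats the same idea at every transformer layer rather than only at the input embeddings.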

Improving Promptability
In this section, we describe existing methods to improve promptability and a new paradigm that combines their advantages.
While prompt tuning sometimes performs close to full model finetuning (Lester et al., 2021; Liu et al., 2022b), there is often still a substantial gap, such as with limited training data (Gu et al., 2022), non-gigantic models (Lester et al., 2021), or challenging tasks (He et al., 2022). We therefore study ways to improve LMs' "promptability." We focus on a low-resource setup and consider zero-shot NL prompts and few-shot learned prompts (which, again, are in conjunction with NL prompts; §2). For the former, better promptability increases performance when LMs face textual prompts of new tasks. For the latter, it more effectively leverages limited training examples for higher accuracy.
We investigate if promptability can improve with a continued pretraining stage after LM pretraining (or LM adaptation for LM-adapted T5 (Lester et al., 2021)) and before task-specific finetuning. The model is trained on a collection of tasks that have NL prompts and evaluated on unseen tasks. The methods that we explore below differ in how the continued pretraining stage is performed. We use the notation MTL-T_P_ to abbreviate those methods that are based on multi-task learning, where the blanks _ specify different configurations of the transformer (T) and the prompt (P) components during MTL. Architecturally, a method may continue to pretrain only the T5 model without prompt parameters, in which case we use P✗ to denote the lack of them; otherwise, both transformer and prompt parameters exist during MTL. We use 🔥 and ❄ to denote whether the corresponding component is trained or frozen in MTL, respectively. This notation describes the continued pretraining stage only: in the final finetuning stage, all methods include both the transformer and prompt components, but only the latter is updated.
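To make the configuration space concrete, the sketch below encodes the three MTL variants as a plain Python mapping, using 🔥 for trained and ❄ for frozen (the dictionary encoding is our own illustration, not anything from a released codebase):

```python
# Each configuration records whether a prompt exists during continued
# pretraining and whether each component is trained in that stage.
configs = {
    # T0-style: train the transformer, no prompt parameters.
    "MTL-T🔥P✗": {"prompt_exists": False, "transformer_trained": True, "prompt_trained": None},
    # Gu et al.-style: freeze the transformer, train only a prompt.
    "MTL-T❄P🔥": {"prompt_exists": True, "transformer_trained": False, "prompt_trained": True},
    # The combination studied in this work: train both.
    "MTL-T🔥P🔥": {"prompt_exists": True, "transformer_trained": True, "prompt_trained": True},
}
```

In the final finetuning stage, by contrast, every method has both components and updates only the prompt.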
Continued pretraining has been studied in limited settings. Sanh et al. (2022) proposed T0 by multi-task training a T5 model (Raffel et al., 2020) as continued pretraining. They updated T5 parameters through learning on continued pretraining tasks, not including a prompt component, and showed that this training improves zero-shot NL promptability. Following our nomenclature, we refer to this paradigm as MTL-T🔥P✗. Additionally, Gu et al. (2022) employed a similar stage, incorporating and multi-task training a shallow prompt as continued pretraining, while freezing the transformer parameters in this stage. They showed that this strategy helps few-shot promptability during finetuning. We refer to this paradigm as MTL-T❄P🔥.
In this work, we study the gains of the previous two continued pretraining approaches, as well as a model that synthesizes them, MTL-T🔥P🔥, which we are the first to propose. For few-shot downstream tuning, the learned prompt can act as a good initialization compared to MTL-T🔥P✗. In the zero-shot setup, prior work has discovered that including certain text in a prompt, such as "Let's think step by step," can adjust the reasoning of LMs to yield substantially improved performance across tasks (Kojima et al., 2022; Askell et al., 2021). The learned prompt here could function analogously. Compared to MTL-T❄P🔥, on the other hand, the additional capacity brought by more updatable parameters could further boost model performance.
MAML-style meta-learning (Finn et al., 2017) directly optimizes for the downstream updates and can outperform MTL for full model finetuning (Dou et al., 2019; Bansal et al., 2020a). Yet, it remains unexplored for prompting. We examine first-order MAML (FOMAML; Finn et al., 2017), performing T steps of prompt tuning in the inner loop and updating all parameters in the outer loop. We also evaluate a version of Reptile (Nichol et al., 2018) adapted for our setting that performs T steps of prompt tuning followed by one step of full model tuning, and uses the resulting Reptile gradient for model updates. Both have the same architecture as MTL-T🔥P🔥, with all parameters trainable. We provide a detailed description and theoretical discussion of these procedures in §B. See the original papers for more details.

Experimental Setup
We use P3, a collection of NL-templatized examples for a variety of datasets, for training and evaluation, using the standard splits in Sanh et al. (2022). Not only is there no dataset overlap between training and evaluation, but no task overlap either (e.g., sentiment vs. QA), making it challenging. We report dataset statistics in §A. We perform continued pretraining for one epoch over all training datasets. Each dataset has multiple templates, each evaluated with accuracy. As different datasets have different numbers of answer choices and hence different baseline accuracy, we report Average Relative Gain (ARG; Ye et al., 2021) as a single summary metric by averaging across all templates the relative accuracy improvement over a random baseline. We perform significance testing using bootstrap with 1,000 iterations, in each iteration randomly sampling evaluation examples and comparing the two models in question. §D reports per-dataset results.
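The ARG computation can be sketched in a few lines; this is our own minimal reading of the metric as described here (per-template relative improvement over the random baseline, averaged), with exact details following Ye et al. (2021):

```python
def average_relative_gain(results):
    # results: list of (accuracy, random_baseline) pairs, one per template.
    # Each template's relative gain is its improvement over the random
    # baseline of its dataset, normalized by that baseline.
    gains = [(acc - base) / base for acc, base in results]
    return 100 * sum(gains) / len(gains)  # in percent; random performance = 0.0 ARG

# Two templates on a binary task (baseline 0.5) and one on a 4-choice
# task (baseline 0.25):
print(average_relative_gain([(0.60, 0.5), (0.55, 0.5), (0.30, 0.25)]))
```

Normalizing by each dataset's random baseline is what lets accuracies from 2-choice and 4-choice tasks be averaged into one number.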
Following Sanh et al. (2022), we initialize the continued pretraining stage from T5 finetuned with an LM objective (Lester et al., 2021), making it more amenable to prompting. We experiment with two sizes: T5-Large with 770M parameters and T5-XL with 3B parameters. We retrain T0 (Sanh et al., 2022), i.e., MTL-T🔥P✗, to eliminate confounding factors in the training procedure. We also reproduce Gu et al. (2022)'s experiment in our setup, i.e., MTL-T❄P🔥, pretraining a shallow prompt with other parameters frozen. During few-shot finetuning, we train on the same 16 examples for 100 epochs. §C reports additional hyperparameters.

Results
Table 1 reports our results. From No Cont. Pretraining, we find that continued pretraining is crucial for prompt tuning with low resources: without it, only few-shot deep prompt tuning yields slightly above-random performance. These results contradict previous findings that few-shot prompt tuning works well without this stage (Min et al., 2022). We believe this is due to the challenging nature of the P3 evaluation datasets, compared to the simple sentence classification tasks previously investigated. This is consistent with what He et al. (2022) observed in the full-data setting, where deep prompt tuning performs sub-optimally on difficult tasks.
Existing methods for continued pretraining have their drawbacks. In contrast to Gu et al. (2022), we found that MTL-T❄P🔥 with a shallow prompt does not perform substantially above random. We attribute this to (1) their simpler evaluation tasks which, unlike ours, have decent prompt-tuned performance without continued pretraining; and (2) their hand-designed pretraining tasks that match their evaluation tasks, while P3 conversely avoids training-evaluation task overlap, requiring generalizability. Vu et al. (2022) also found MTL-T❄P🔥 to be effective, though with high resources. We also compare with T0, i.e., MTL-T🔥P✗, where both the official model and our reproduction suffer from degraded performance when few-shot shallow prompt tuned (compared to 0-shot), likely because the prompt added during finetuning is intrusive, and the limited gradient updates are not sufficient to recover from it. We note that the official T0 model is not well-optimized: even without hyperparameter tuning, our implementation is significantly better (p < 0.001 for all).
MTL-T🔥P🔥 significantly outperforms MTL-T🔥P✗, the strongest existing method we examine, across all settings (p < 0.005 for all) except for few-shot deep prompt tuning on T5-XL (p = 0.21). For zero-shot NL promptability, the improvement could be due to the extra model capacity, or to the multi-task trained prompt adjusting the reasoning of the LM, analogous to the text-based "Let's think step by step" effect (Kojima et al., 2022). For few-shot shallow prompt tuning, unlike MTL-T🔥P✗, MTL-T🔥P🔥 does not degrade in performance, resulting in 31% higher ARG than MTL-T🔥P✗ on T5-Large. This is likely because of the model's familiarity with the prompt, though the limited capacity of shallow prompt tuning does not yield benefits either. Nevertheless, with deep prompt tuning, which gives the model sufficient conditioning capacity, few-shot tuning does lead to a performance increase, again outperforming MTL-T🔥P✗. Here, MTL-T🔥P🔥 provides a good prompt initialization and alleviates its intrusiveness. These results emphasize the importance of continued pretraining being aware of the downstream finetuning process. Interestingly, however, the gap between these two models shrinks as the model size increases, and is no longer significant at T5-XL (p = 0.21). Also, notably, pretraining with a shallow prompt has better 0-shot performance than a deep prompt. This highlights that higher pretraining capacity is not always beneficial, and matches our motivation from text-based conditioning, which also happens at the input level.
FOMAML and Reptile surprisingly underperform MTL-T🔥P🔥 in few-shot prompt tuning, even though they specifically optimize for this procedure and have demonstrated success in NLP for full model finetuning (Dou et al., 2019; Bansal et al., 2020b, 2021, i.a.) and few-shot learning (Gu et al., 2018; Qian and Yu, 2019; Mi et al., 2019, i.a.). While Ye et al. (2021) also found FOMAML to underperform MTL, they sub-optimally performed only one inner loop update. Here, we show that this comparison holds with more appropriate hyperparameters. This could be due to the smaller number of gradient updates: to perform one gradient update, MTL uses one training batch, while FOMAML with T inner loop steps or Reptile with T prompt tuning steps uses T + 1 batches. Not only might this be an inefficient use of training examples, but also of compute FLOPs, since each inner loop/prompt tuning step involves a full forward-backward pass. We attempted using a (T + 1)-times smaller meta batch size (see §B for more detail) to pretrain a deep T5-Large-sized Reptile model. When prompt-tuned, it achieves 22.8 ARG, which is even lower, possibly due to higher gradient estimation noise. Alternatively, other factors could affect the performance of meta-learning. It is, for example, well known that MAML-style meta-learning can be unstable and sensitive to architectures and hyperparameters (Antoniou et al., 2019). This instability is likely amplified by our large heterogeneous multi-task setup and our inability to afford a hyperparameter search. Furthermore, its theoretical foundation has mostly been examined only through simple optimizers, predominantly SGD (Finn et al., 2017; Nichol et al., 2018). How it interacts with optimizers more common in modern NLP, such as Adafactor (which we use), remains to be explored.
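The batch-accounting argument above can be made explicit with trivial arithmetic (the function is our own illustration of the counting, not anything from a training codebase):

```python
def batches_per_update(method: str, T: int) -> int:
    # MTL consumes one training batch per gradient update; FOMAML with T
    # inner-loop steps (or our Reptile variant with T prompt-tuning steps)
    # consumes T support batches plus one query/full-tuning batch.
    return 1 if method == "MTL" else T + 1

# With T = 7, our setting, meta-learning performs 8x fewer meta-updates
# than MTL for the same number of training examples consumed.
T = 7
assert batches_per_update("FOMAML", T) // batches_per_update("MTL", T) == 8
```

Since each of those T + 1 batches also costs a full forward-backward pass, the gap applies to FLOPs as well as to data.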
Recommendations. Based on our findings, we recommend that practitioners always incorporate a prompt during continued pretraining and train the entire model. Without downstream task-specific tuning, such as when there is no training data or insufficient compute, a shallow prompt yields better accuracy. When few-shot task-specific prompt tuning is affordable, continued pretraining with a deep prompt enables the best performance.

Conclusion
We demonstrated that the simple recipe of continued pretraining with a prompt significantly improves zero-shot NL promptability and few-shot learned promptability. MAML-based meta-learning, on the other hand, obtains worse performance, for which we provided several explanations. Nonetheless, we believe future efforts to leverage its conceptual advantages could be fruitful, perhaps aided by our observations. We also hope to study the effect of continued pretraining with other parameter injection methods (Houlsby et al., 2019; Hu et al., 2022; Liu et al., 2022a).

Limitations
Due to the expensive nature of our experiments, each involving continued pretraining on over 9B tokens (§A), we could not afford to perform hyperparameter tuning, and instead took hyperparameters from prior work. It is nevertheless possible that careful hyperparameter tuning might yield slightly different trends from what we observed. Furthermore, because of computational constraints, we were unable to perform experiments on the largest released T5 model with 11B parameters. Though we validated our findings on two model sizes, larger models sometimes demonstrate qualitatively different results (Srivastava et al., 2022; Lampinen et al., 2022; Wei et al., 2022). We would be excited to see if our experiments could be reproduced at a larger model scale.

A Dataset Details
We use P3 as our training and evaluation datasets (Bach et al., 2022). It contains 35 datasets grouped into 8 tasks: Multiple-Choice QA, Extractive QA, Closed-Book QA, Sentiment, Topic Classification, Structure-To-Text, Summarization, and Paraphrase Identification. Examples in each dataset are templatized using multiple human-written templates. Across the 35 datasets, there are a total of 313 templates. For continued pretraining, we follow Sanh et al. (2022) and only use the training split of each dataset. Four tasks are held out for evaluation in P3: Sentence Completion, Natural Language Inference, Coreference Resolution, and Word Sense Disambiguation. They consist of 11 evaluation datasets (considering the three splits of ANLI as separate datasets) and 116 templates in total. We use the training split of each dataset for few-shot experiments, and, following Sanh et al. (2022), evaluate on the validation splits. The only exception is StoryCloze, which does not have a training split, so we use its validation split for training and evaluate on its test split. Unlike T0, we do not evaluate on the BIG-Bench datasets (Srivastava et al., 2022) as they had not stabilized as a collection at the time of this work. All the prompts in P3 are in English.
To make training more efficient, we right-truncate all source sequences to 768 tokens and target sequences to 192 tokens. For the continued pretraining stage, this affects 2% of all training examples.

B Meta-Learning Details
In this section, we elaborate on our meta-learning training procedures. Algorithm 1 contains pseudocode for our first-order MAML (FOMAML) procedure. In the inner loop, we perform T steps of prompt tuning on a cloned model using support data. In the outer loop, we use query data to evaluate the prompt-tuned model and compute gradients. We use the first-order approximation, where the gradient is taken not with respect to the entire prompt tuning process but only the forward pass with query data, because it is computationally more tractable, and past work has shown that this first-order approximation hurts performance little, if at all (Finn et al., 2017; Dou et al., 2019). Theoretically, to perfectly simulate the downstream prompt tuning procedure, we should use the same batch of support data for the T steps of updates. Nevertheless, this would traverse the training data much more slowly, so we use different support batches. Our theoretical analysis through the perspective of Reptile below also justifies this.
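A minimal numpy sketch of one FOMAML meta-update may help fix the structure: a toy quadratic loss stands in for the LM loss (so gradients are analytic and the sketch stays dependency-free), `p` plays the role of the prompt parameters and `w` the transformer parameters, and all names and learning rates are illustrative, not the actual training configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def loss_and_grads(p, w, target):
    # Toy quadratic loss standing in for the LM loss on one batch.
    tp, tw = target
    loss = np.mean((p - tp) ** 2) + np.mean((w - tw) ** 2)
    return loss, 2 * (p - tp) / p.size, 2 * (w - tw) / w.size

def fomaml_step(p, w, support_batches, query_batch, alpha=0.1, beta=0.1):
    # Inner loop: T steps of prompt tuning on a clone, updating only the
    # prompt p, with one support batch per step (as in our setup).
    p_t = p.copy()
    for batch in support_batches:
        _, gp, _ = loss_and_grads(p_t, w, batch)
        p_t = p_t - alpha * gp
    # Outer loop (first-order): evaluate the prompt-tuned model on query
    # data and apply the resulting gradient to *all* original parameters.
    _, gp_q, gw_q = loss_and_grads(p_t, w, query_batch)
    return p - beta * gp_q, w - beta * gw_q

p, w = rng.normal(size=4), rng.normal(size=6)
target = (np.zeros(4), np.zeros(6))
support = [target] * 7  # T = 7 inner-loop steps
loss_before, _, _ = loss_and_grads(p, w, target)
p, w = fomaml_step(p, w, support, target)
loss_after, _, _ = loss_and_grads(p, w, target)
assert loss_after < loss_before  # the meta-update reduces the query loss
```

The first-order shortcut is visible in the last two lines of `fomaml_step`: the query gradient is computed at the adapted parameters but applied directly to the originals, with no backpropagation through the inner loop.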
In preliminary experiments, we found a naïve adoption of Reptile (Nichol et al., 2018) to yield subpar performance. As there is no inner- and outer-loop distinction in Reptile, performing prompt tuning leads to only the prompt parameters being updated throughout the entire continued pretraining stage, likely causing the performance degradation. This effect is also seen in our multi-task learning setup with the MTL-T❄P🔥 model. Thus, we propose to adapt Reptile to better suit prompt tuning, which we illustrate in Algorithm 2. It is similar to FOMAML, but instead of taking the outer loop's gradient as the meta-learning gradient, it uses the clone's total parameter displacement. The Reptile gradient, in our adaptation, takes the prompt's gradient during the T steps and the entire model's gradient for one step. Taking the Reptile gradient from Algorithm 2 and using ϕ_{T+1} to represent the parameters after the outer-loop full-finetuning update, g_Reptile = (ϕ_0 − ϕ_{T+1}) / α = Σ_{t=0}^{T} g_t, where g_t is the gradient used for the t-th update of the clone and α is the inner learning rate. Expanding each g_t around the initial parameters ϕ_0, as in Nichol et al. (2018), gives g_t ≈ ḡ_t − α H̄_t Σ_{j<t} ḡ_j, where ḡ_t and H̄_t are the gradient and Hessian of the t-th batch's loss at ϕ_0. We can see that all three meta-learning gradients have a similar effect: they contain only a mixture of lone gradient terms (the ḡ_t), which act as a pure multi-task learning objective, and Hessian-times-gradient terms, which Nichol et al. (2018) termed "AvgGradInner" and showed to encourage expected similarity between the gradients of different data batches, improving generalization.
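Our Reptile adaptation can be sketched in the same toy style: a quadratic loss is a hypothetical stand-in for the LM loss, `p` and `w` are illustrative prompt and transformer parameters, and the meta-optimizer is plain SGD here (Adafactor in the actual experiments).

```python
import numpy as np

def grads(p, w, target):
    # Toy quadratic loss (stand-in for the LM loss); analytic gradients.
    tp, tw = target
    return 2 * (p - tp) / p.size, 2 * (w - tw) / w.size

def reptile_step(p, w, batches, alpha=0.1, beta=0.5):
    # Clone the parameters, run T steps of prompt tuning (prompt only),
    # then one step of full-model tuning, all with inner learning rate alpha.
    p_t, w_t = p.copy(), w.copy()
    for batch in batches[:-1]:  # T prompt-tuning steps
        gp, _ = grads(p_t, w_t, batch)
        p_t = p_t - alpha * gp
    gp, gw = grads(p_t, w_t, batches[-1])  # one full-model step
    p_t, w_t = p_t - alpha * gp, w_t - alpha * gw
    # Reptile gradient: the clone's total displacement, rescaled by alpha.
    g_p, g_w = (p - p_t) / alpha, (w - w_t) / alpha
    # The meta-optimizer applies this gradient to the original parameters.
    return p - beta * alpha * g_p, w - beta * alpha * g_w

p, w = np.ones(4), np.ones(6)
target = (np.zeros(4), np.zeros(6))
p, w = reptile_step(p, w, [target] * 8)  # T = 7 prompt steps + 1 full step
```

Because the displacement includes the T prompt-tuning steps, both the prompt and the transformer receive meaningful updates, unlike in naïve Reptile where only the prompt would ever move.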
Back to our use of different data batches in FOMAML's inner loop and Reptile's prompt tuning steps. If the inner loop used the same support (i.e., training) data, as in the derivation above, the "AvgGradInner" terms would become somewhat degenerate, with the same term scaled T or T(T−1)/2 times. With different inner loop batches, on the other hand, there are more diverse Hessian-gradient interactions between different batches of data, encouraging generalization between more tasks.

C Training Details
Due to the expense of our experiments, we did not perform any hyperparameter tuning. For all continued pretraining runs, we follow Raffel et al. (2020) and Sanh et al. (2022) and use Adafactor (Shazeer and Stern, 2018) with a 0.001 learning rate. We use a batch size of 4,096, which we calculated to be close to what Sanh et al. (2022) used. We clip gradients to unit norm. For shallow prompt tuning, we follow Min et al. (2022) and use L = 20 prompt tokens, each with the same dimension as the word embedding size, on the source side only. For deep prompt tuning, we similarly use 20 hidden vectors that are prepended in every transformer layer, on both the source and target side for added capacity. For meta-learning, we use a batch size of 16, simulating our 16-shot evaluation (see below), and a meta batch size of 128. We perform 7 steps of inner loop updates (FOMAML) / prompt tuning (Reptile), following Bansal et al. (2020b) and Bansal et al. (2021), similarly using Adafactor with learning rate 0.001. All continued pretraining experiments run for one epoch over the training datasets with no checkpoint selection. In few-shot finetuning, we train on one batch of 16 randomly selected examples for 100 epochs (the same batch throughout training), following Min et al. (2022). Like Min et al. (2022), we do not manually balance the label distribution in these examples, unlike in prior work (Gao et al., 2021; Logan IV et al., 2022).
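For easy reference, the hyperparameters above can be collected in one place. The dataclass below is our own summary for illustration (the field names are not from any released codebase):

```python
from dataclasses import dataclass

@dataclass
class ContinuedPretrainingConfig:
    # Values as described in this section; names are illustrative.
    optimizer: str = "Adafactor"
    learning_rate: float = 1e-3
    batch_size: int = 4096        # MTL continued pretraining
    grad_clip_norm: float = 1.0
    prompt_length: int = 20       # shallow and deep prompt tuning
    meta_batch_size: int = 128    # meta-learning only
    inner_batch_size: int = 16    # matches the 16-shot evaluation
    inner_loop_steps: int = 7     # FOMAML inner loop / Reptile prompt tuning
    fewshot_examples: int = 16    # one fixed batch throughout finetuning
    fewshot_epochs: int = 100

config = ContinuedPretrainingConfig()
```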
We perform all experiments on 80GB A100 GPUs. Each continued pretraining run uses four (sometimes eight) of them. The largest MTL model takes 10 days to pretrain with four GPUs, while the largest meta-learning model takes 14 days.

D Per-Dataset Results
In Figures 1 to 3, we compare the per-dataset accuracy of MTL-T🔥P✗ (our reproduction), MTL-T🔥P🔥, FOMAML, and Reptile. We omit MTL-T❄P🔥 due to its near-random performance.

Table 1 :
Average Relative Gain (ARG) on the P3 evaluation datasets. Random performance is 0.0 ARG. Each row under Our methods represents four pretrained models, for Large/XL × Shallow/Deep, while MTL-T🔥P✗ shares the pretrained model between Shallow and Deep as it has no pretrained prompt. We bold the highest number in each column.
* : These models have no prompt and hence no Shallow/Deep distinction in 0-shot experiments.