Late Prompt Tuning: A Late Prompt Could Be Better Than Many Prompts

Prompt tuning is a parameter-efficient tuning (PETuning) method for utilizing pre-trained models (PTMs) that simply prepends a soft prompt to the input and only optimizes the prompt to adapt PTMs to downstream tasks. Although it is parameter- and deployment-efficient, its performance still lags behind other state-of-the-art PETuning methods. Besides, the training cost of prompt tuning is not significantly reduced due to the back-propagation through the entire model. Through empirical analyses, we shed some light on the lagging performance of prompt tuning and recognize a trade-off between the propagation distance from label signals to the inserted prompt and the influence of the prompt on model outputs. Further, we present Late Prompt Tuning (LPT) that inserts a late prompt into an intermediate layer of the PTM instead of the input layer or all layers. The late prompt is obtained by a neural prompt generator conditioned on the hidden states before the prompt insertion layer and therefore is instance-dependent. Through extensive experimental results across various tasks and PTMs, we show that LPT can achieve competitive performance to full model tuning and other PETuning methods under both full-data and few-shot scenarios while possessing faster training speed and lower memory cost.


Introduction
Pre-trained models (Devlin et al., 2019; Radford et al., 2019; Yang et al., 2019; Raffel et al., 2020; Lewis et al., 2020; Liu et al., 2022a; Qiu et al., 2020; Lin et al., 2021) have pushed most NLP tasks to the state of the art. Model tuning (or fine-tuning) is a popular method for utilizing PTMs on downstream tasks that tunes all parameters of a PTM for every task. Despite its strong performance, it incurs prohibitive adaptation costs, especially for supersized PTMs (Brown et al., 2020; Wang et al., 2021a). Parameter-efficient tuning (PETuning) is a new tuning paradigm that adapts PTMs to downstream tasks by tuning only a very small number of internal or additional parameters.
Prompt tuning (Lester et al., 2021) is a simple and popular PETuning method that prepends a sequence of soft prompt tokens to the input and only optimizes the prompt to adapt PTMs to downstream tasks. It has an absolute advantage in parameter efficiency and facilitates mixed-task inference, which makes the deployment of PTMs convenient. However, compared with other advanced PETuning methods, e.g., Adapter (Houlsby et al., 2019; Mahabadi et al., 2021), LoRA (Hu et al., 2022), and BitFit (Zaken et al., 2022), prompt tuning suffers from lower performance and a slower convergence rate. Compared with full model tuning, although the number of trainable parameters in prompt tuning is reduced by ∼17,000× (from 355M to 21K on RoBERTa LARGE), the training speed only increases by ∼1.5×, and the memory cost only reduces by 29.8% (see Section 6.5 for details). P-tuning v2 (Liu et al., 2022b) improves the performance of prompt tuning by inserting soft prompts into every hidden layer of PTMs, but it is difficult to optimize and needs more training steps to attain competitive performance.
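Mechanically, prompt tuning's input construction can be sketched in a few lines. This is a toy numpy illustration with made-up sizes, not the actual implementation: in practice the soft prompt is the only tensor updated by gradient descent while the PTM stays frozen.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, d_model, prompt_len = 100, 16, 5
embedding = rng.normal(size=(vocab_size, d_model))    # frozen PTM embeddings
soft_prompt = rng.normal(size=(prompt_len, d_model))  # the only trainable tensor

input_ids = np.array([3, 17, 42, 8])                  # a toy token sequence
token_embeds = embedding[input_ids]                   # (4, d_model)

# Prepend the prompt; the model then runs its usual forward pass
# over the (prompt_len + seq_len)-long sequence.
model_input = np.concatenate([soft_prompt, token_embeds], axis=0)
print(model_input.shape)  # (9, 16)
```

The prompt sits at the very bottom of the network, which is exactly why the gradient signal from the labels has to travel back through every layer to reach it.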
In this paper, we explore why prompt tuning performs poorly and find there is a trade-off between the propagation distance from label signals to the inserted prompt and the influence of the prompt on model outputs. The key to prompt tuning is to make the soft prompt carry task-related information through downstream training. The trained prompt can interact with text inputs during the model forward pass to obtain text representations with task-related information. Since the prompt is inserted into the input in prompt tuning, it has a strong ability to influence the outputs of the PTM through sufficient interactions with text inputs. However, there is a long propagation path from label signals to the prompt. This leads us to ask: does this long propagation path cause a lot of task-related information to be lost during propagation and thus hurt performance? To verify the impact of the propagation distance on performance, we conduct pilot experiments by shortening it in Section 4 and find that performance first increases and then decreases as the distance shortens. This finding inspires us to present the late prompt (i.e., inserting the prompt into an intermediate hidden layer of the PTM). The late prompt not only receives more task-related information at each update, owing to the shorter propagation path of task-related information, but also maintains an adequate ability to influence the outputs of the PTM. Despite the higher performance and faster convergence rate of the late prompt compared with prompt tuning, the hidden states produced by the PTM before the prompt insertion layer are underutilized. To further improve performance and take full advantage of these contextual hidden representations, we introduce a prompt generator to generate the soft prompt (termed the instance-aware prompt) for each instance using the corresponding hidden states.
Based on the late and instance-aware prompt, we present Late Prompt Tuning (LPT) to improve prompt tuning. Since the soft prompt is inserted into an intermediate layer of the PTM, we have no need to compute gradients for model parameters below the prompt insertion layer, which speeds up training and reduces memory cost. Extensive experimental results show that LPT outperforms most prompt-based tuning methods and can be comparable with adapter-based tuning methods and even full model tuning. Especially in the few-shot scenario with only 100 training samples, LPT outperforms prompt tuning by 12.4 points and model tuning by 5.0 points in average performance over ten text classification tasks. Besides, compared with model tuning on RoBERTa LARGE, it is 2.0× faster in training speed and reduces memory cost by 56.6%. Figure 1 shows an overall comparison between LPT and its counterparts. To sum up, the key contributions of this paper are: • We explore why prompt tuning performs poorly, find that it is due to the long propagation path from label signals to the input prompt, and present a simple variant named late prompt tuning to address the issue.
• Combining the late and instance-aware prompts, we present LPT, which not only attains comparable performance with adapterbased tuning methods and even model tuning but also greatly reduces training costs.
• We verify the versatility of LPT in the full-data and few-shot scenarios across 10 text classification tasks and 3 PTMs. Code is publicly available at https://github.com/xyltt/LPT.

Related Work
Adapter-based tuning. One research line of PETuning is adapter-based tuning (Ding et al., 2022), which inserts adapter modules between model layers and optimizes only these adapters during downstream training for model adaptation.
Adapter (Houlsby et al., 2019) inserts adapter modules with a bottleneck architecture between every pair of consecutive Transformer (Vaswani et al., 2017) sublayers. AdapterDrop (Rücklé et al., 2021) improves efficiency by removing adapters from lower layers. Compacter (Mahabadi et al., 2021) uses low-rank optimization and parameterized hypercomplex multiplication (Zhang et al., 2021) to compress adapters. Adapter-based tuning methods achieve results comparable to model tuning when training data is sufficient, but do not work well in the few-shot scenario (Wang et al., 2022).
Prompt-based tuning. Another main research line of PETuning is prompt-based tuning, which inserts additional soft prompts into the hidden states instead of injecting new neural modules into PTMs. Prompt tuning (Lester et al., 2021) and P-tuning (Liu et al., 2021) insert a soft prompt into the word embeddings only, and can achieve competitive results when applied to supersized PTMs. Prefix-tuning (Li and Liang, 2021) and P-tuning v2 (Liu et al., 2022b) insert prompts into every hidden layer of the PTM. BBT (Sun et al., 2022b) optimizes the inserted prompt with derivative-free optimization. Some prompt-based tuning methods, like prompt tuning and BBT, formulate downstream tasks as pre-training tasks (e.g., the masked language modeling task) to close the gap between pre-training and downstream training (Sun et al., 2022a). There are also some prompt-based methods with instance-aware prompts. IDPG (Wu et al., 2022) uses a prompt generator with parameterized hypercomplex multiplication (Zhang et al., 2021) to generate a soft prompt for every instance. Context-tuning (Tang et al., 2022) uses the BERT model (Devlin et al., 2019) as the prompt generator and focuses on NLG tasks. IPL (Jin et al., 2022) first calculates relevance scores between prompt tokens and inputs, then uses the scores to re-weight the original prompt tokens, but it tunes all parameters of the PTM. All the above methods with instance-aware prompts share the same weakness: they need to encode the inputs using an extra encoder, which slows down training and increases inference latency. There are also other popular PETuning methods, such as BitFit (Zaken et al., 2022), which only tunes the bias terms, and LoRA (Hu et al., 2022), which optimizes low-rank decomposition matrices of the weights within self-attention layers.

Problem Formulation
Given a PTM M, in the setting of model tuning, we first reformulate inputs with a single sentence as E([CLS] ⟨S 1 ⟩ [SEP]) and inputs with a sentence pair as E([CLS] ⟨S 1 ⟩ [SEP] ⟨S 2 ⟩ [SEP]), where E is the embedding layer of M. The final hidden state of the [CLS] token is used to predict the label. In the setting of prompt tuning, we insert a randomly initialized soft prompt p into the word embeddings, and also modify the original inputs using different manual templates with a [MASK] token for different tasks. For example, an input with a single sentence from a sentiment analysis task will be transformed into concat(p, E([CLS] ⟨S 1 ⟩ It was [MASK]. [SEP])). Then, we map the original labels Y to some words in the vocabulary V of M, which formulates downstream tasks as a language modeling task to close the gap between pre-training and downstream training. The final hidden state of the [MASK] token is used to predict the label.
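The templating and verbalizer step above can be sketched as follows. This is a minimal illustration using the sentiment template quoted in the text; the label words are hypothetical choices for a binary task.

```python
# Wrap a raw sentence in the manual template with a [MASK] slot,
# following the sentiment-analysis template quoted in the text.
def build_input(sentence: str) -> str:
    return f"[CLS] {sentence} It was [MASK]. [SEP]"

# Verbalizer: map original labels Y to words in the PTM vocabulary V
# (the specific label words here are illustrative, not from the paper).
verbalizer = {0: "terrible", 1: "great"}

x = build_input("A gripping, beautifully shot film.")
print(x)
# The PTM's scores for the label words at the [MASK] position
# are then used as the classification logits.
```

This reformulation is what lets the frozen masked language model produce class predictions without a new classification head.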
In the setting of our proposed method LPT, we use a prompt generator (PG) to generate an independent prompt p for every input. In addition, the layer into which the prompt is inserted is an intermediate layer of the PTM instead of the word embeddings, and we refer to this layer as the prompt layer (PL).

Why Does Prompt Tuning Perform Poorly?
The workflow of prompt tuning is to make the inserted soft prompt carry task-related information through downstream training. In the inference phase, this prompt can interact with test inputs during layer-upon-layer propagation so that the hidden representations of these inputs also contain task-related information. There are strong interactions between the prompt and text inputs because prompt tuning inserts the prompt into the word embeddings. However, there is a long propagation path from label signals to the prompt. Therefore, we speculate that the poor performance of prompt tuning is due to the long propagation path of task-related information, which causes a lot of task-related information to be lost during propagation through the frozen model and thus hurts performance. To verify this conjecture, we conduct pilot experiments on the TREC (Voorhees and Tice, 2000) and RTE (Dagan et al., 2005) datasets using RoBERTa LARGE (Liu et al., 2019).

Does shortening the propagation distance improve performance?
We start by considering a simple experimental setting where the soft prompt is inserted into different layers of RoBERTa LARGE, and we then look at how performance changes as the prompt layer changes. As shown in the left plots of Figure 2, performance first increases and then decreases as the prompt layer rises, reaching its highest value when the prompt layer is in the range of 12 to 14. In addition, we also explore the convergence rates at different prompt layers. For simplicity, we only consider three different prompt layers: 1, 13, and 24. The middle plots in Figure 2 show that the model has the fastest convergence rate when the prompt layer is 13. This trend is consistent with the performance trend shown in the left plots. From these results, we can preliminarily conclude that properly shortening the propagation distance improves performance. However, performance starts to degrade when we shorten the propagation path of task-related information too aggressively. We attribute this to the interaction between the prompt and the inputs becoming very weak when the propagation path is unduly short, which leads to a slighter influence of the prompt on model outputs and a gradual decline in performance.
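The pilot setup can be mimicked with a toy stack of layers. This numpy sketch is not an actual Transformer; the point is only to show where the prompt joins the forward pass when the prompt layer varies.

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, d = 6, 8
# Toy "layers": identical linear maps with a tanh nonlinearity.
layers = [rng.normal(scale=0.3, size=(d, d)) for _ in range(n_layers)]

def forward(x, prompt, prompt_layer):
    h = x
    for i, W in enumerate(layers, start=1):
        if i == prompt_layer:            # insert the (late) prompt here
            h = np.concatenate([prompt, h], axis=0)
        h = np.tanh(h @ W)
    return h

x = rng.normal(size=(4, d))   # toy sequence of 4 token states
p = rng.normal(size=(2, d))   # toy prompt of length 2

# prompt_layer=1 mimics traditional prompt tuning (prompt at the input);
# a larger prompt_layer mimics the late prompt.
out = forward(x, p, prompt_layer=3)
print(out.shape)  # (6, 8): prompt tokens join the sequence from layer 3 on
```

With a late prompt layer, the prompt skips the lower layers entirely, so the backward pass from the labels to the prompt is correspondingly shorter.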
Task-related information in hidden states. To quantify the task-related information carried in the soft prompt, we follow Wang et al. (2021b) and adopt the mutual information I(h, y) between the hidden states and the label of each input. The estimation method for I(h, y) is provided in Appendix A. The right plots of Figure 2 show I(h, y) at different layers. We note that I(h, y) gradually increases with the forward pass of the prompt (i.e., the effect of the prompt on the hidden states gradually increases) when the prompt layer is 13. Its I(h, y) in the last layer is the highest among the three prompt layer settings, which means that the soft prompt carries more task-related information. The other two prompt layer settings both collapse, especially on the RTE task, because they do not achieve a good trade-off between the propagation distance and the effect of the prompt on the hidden states.
The above observations suggest that our conjecture about the poor performance of prompt tuning is correct: the long propagation path of task-related information leads to poor performance and a slow convergence rate, and properly shortening the propagation distance can improve performance.

LPT: Late Prompt Tuning
From the experimental results in Section 4, we observe that using a late prompt can greatly improve the performance of prompt tuning. Moreover, the late prompt brings two other advantages: (1) no gradient calculation is needed for model parameters below the prompt layer; (2) the hidden states produced by the model before the prompt layer can be used to generate a good instance-dependent prompt for each instance. Based on these advantages, we propose an efficient prompt-based tuning method, LPT, which combines late and instance-aware prompts. An illustration of LPT is shown in Figure 3. In this section, we introduce the two different prompt generators used in LPT and how to determine the prompt layer.

Prompt Generators
Naive prompt generator (NPG). The prompt generator is a simple feed-forward layer with a bottleneck architecture. Assume the prompt length is l; then we can generate an independent prompt for each instance as

p = W 2 σ(W 1 h [CLS] + b 1 ) + b 2 ,

where W 1 ∈ R m×d and W 2 ∈ R ld×m are projection matrices, b 1 and b 2 are bias terms, h [CLS] is the hidden state of the [CLS] token at the layer before the prompt layer, σ is a nonlinearity, and the output is reshaped into l prompt vectors. d is the dimension of hidden states. Since m ≪ d, the prompt generator does not have too many parameters. However, the number of parameters within W 2 increases with the prompt length l.
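Assuming NPG has the bottleneck form p = W2 σ(W1 h_[CLS] + b1) + b2 with the output reshaped into l prompt vectors (the exact formulation is reconstructed; sizes here are illustrative), it can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, l = 16, 4, 5  # hidden size, bottleneck size (m << d), prompt length

W1 = rng.normal(scale=0.1, size=(m, d)); b1 = np.zeros(m)
W2 = rng.normal(scale=0.1, size=(l * d, m)); b2 = np.zeros(l * d)

def npg(h_cls):
    # Down-project the [CLS] hidden state through the bottleneck,
    # then up-project to l * d and reshape into an l-token prompt.
    z = np.tanh(W1 @ h_cls + b1)
    return (W2 @ z + b2).reshape(l, d)

prompt = npg(rng.normal(size=d))  # instance-dependent prompt
print(prompt.shape)  # (5, 16)
```

Note that W2 holds l·d·m parameters, which is why the generator's size grows with the prompt length l, motivating PPG below.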
To tackle this problem, we propose the following pooling prompt generator.
Pooling prompt generator (PPG). PPG introduces a pooling operation between the down-projection and up-projection operations, which directly obtains a prompt of length l by pooling over the input sequence (i.e., pooling the input of length n down to a prompt of length l):

p = W 2 Pool(σ(W 1 h + b 1 )) + b 2 ,

where h ∈ R d×n denotes the hidden states of the original input of length n, and the up-projection W 2 no longer depends on l, making the generator more lightweight. In this paper, we consider both Average Pooling and Max Pooling, referred to as APPG and MPPG, respectively.
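A sketch of PPG under the same assumptions (down-project, pool the n positions down to l, up-project back to d). Pooling over contiguous chunks of the sequence is one plausible reading of "pooling the input with length n to the prompt with length l"; the names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, l, n = 16, 4, 5, 20  # hidden size, bottleneck, prompt length, input length

W1 = rng.normal(scale=0.1, size=(m, d)); b1 = np.zeros(m)
W2 = rng.normal(scale=0.1, size=(d, m)); b2 = np.zeros(d)  # no l in W2's shape

def ppg(H, mode="avg"):
    # H: (n, d) hidden states of the full input sequence.
    Z = np.tanh(H @ W1.T + b1)                       # down-project: (n, m)
    chunks = np.array_split(Z, l, axis=0)            # n positions -> l chunks
    pool = np.mean if mode == "avg" else np.max      # APPG vs. MPPG
    P = np.stack([pool(c, axis=0) for c in chunks])  # pooled: (l, m)
    return P @ W2.T + b2                             # up-project: (l, d)

prompt = ppg(rng.normal(size=(n, d)))
print(prompt.shape)  # (5, 16)
```

Because the up-projection maps m back to d per prompt position, the parameter count is independent of l, unlike NPG.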

How to Determine Prompt Layer?
Generating a good prompt requires a good contextual representation of the input. In this subsection, we explore how to choose the prompt layer so that LPT attains a good trade-off between performance and efficiency, through pilot experiments on the TREC (Voorhees and Tice, 2000) and RTE (Dagan et al., 2005) datasets. As shown in Figure 4, the performance of NPG declines significantly when the prompt layer is in the range of 14 to 24. However, different from NPG, APPG and MPPG retain high performance as the prompt layer approaches the output layer, especially on the TREC dataset. We believe this is because hidden states from the higher layers can help generate a better prompt, while NPG only uses the [CLS] token as the representation of the entire input when generating the prompt, which leads to a loss of information. According to the above observations, LPT with APPG or MPPG can achieve a better trade-off for both relatively simple (TREC) and difficult (RTE) tasks. However, in this work, to ensure that all methods (NPG, APPG, and MPPG) achieve good performance while maintaining relatively low training costs, we simply choose the most intermediate layer of the PTM as the prompt layer. That is, we choose the 13th layer as the prompt layer for RoBERTa LARGE.

Evaluation Datasets
We evaluate our method on 5 single-sentence and 5 sentence-pair classification tasks, including 6 tasks from the GLUE benchmark (Wang et al., 2019) and 4 other popular tasks: MPQA (Wiebe et al., 2005), MR (Pang and Lee, 2005), Subj (Pang and Lee, 2004), and TREC (Voorhees and Tice, 2000). All details about data statistics and splits can be found in Appendix B.

Experiment Settings
We evaluate our method in both full-data and few-shot scenarios on three PTMs: RoBERTa LARGE (Liu et al., 2019), DeBERTa LARGE (He et al., 2021), and GPT2 LARGE (Radford et al., 2019). According to the conclusion from Section 5.2, we choose the 13th layer as the prompt layer for RoBERTa LARGE and DeBERTa LARGE, and the 19th layer for GPT2 LARGE, unless otherwise specified. More implementation details are provided in Appendix C.

Baselines
We randomly sample the training samples from the original training sets. Besides, we randomly sample 1000 samples from the original training sets as development sets, with no overlap with the sampled training sets. For the tasks from the GLUE benchmark (Wang et al., 2019), the original development sets are used as the test sets, and the test sets remain unchanged for the 4 other tasks.
Tables 2 and 3 show the overall comparison of all methods in the few-shot scenario. LPT w/ NPG outperforms all baselines in the two different few-shot settings. Especially when the training set has only 100 samples, LPT w/ NPG outperforms model tuning by 5 points and Adapter by 7.1 points. This indicates that our method generalizes better when training data is very scarce. However, we note that LPT w/ MPPG and LPT w/ APPG do not perform as well in the few-shot scenario as they do in the full-data scenario. We speculate that the pooling layer needs sufficient training data to reach its optimal state of retaining only useful information. Nevertheless, both LPT w/ MPPG and LPT w/ APPG are still superior to all baselines when the training set has 100 samples.

Results on other PTMs
To verify the generality of our conclusion about why prompt tuning performs poorly and the versatility of the proposed method LPT, we also conduct experiments on two other popular PTMs, DeBERTa LARGE (He et al., 2021) and GPT2 LARGE (Radford et al., 2019). The results are shown in Table 4. Only using the late prompt to shorten the propagation path of task-related information (i.e., LPT w/o PG) is also far superior to the traditional prompt tuning method on these two PTMs. This result strengthens the reliability of our conclusion. Moreover, LPT with different prompt generators further improves performance, closing the gap with model tuning.

Efficiency Evaluation
We compare the efficiency of our method with all the baselines on the RoBERTa LARGE (Liu et al., 2019) and GPT2 LARGE (Radford et al., 2019) models. For each backbone, we select the largest batch size such that the model tuning method can fit the fixed budget of an NVIDIA RTX 3090 GPU (24GB), and the other methods use the same batch size as model tuning. We set the length of all inputs to 256 and evaluate accuracy in the few-shot scenario where the number of training samples is 100 for all methods.
In Table 5, we report the accuracy, tunable parameters, training speed (tokens per millisecond), and memory cost (GB) of each method. Our methods not only outperform all the prompt-based methods considered in terms of training speed and memory cost, but also obtain the highest performance. Compared with AdapterDrop, which has efficiency similar to LPT, our method LPT w/ NPG outperforms it by 20.1 and 7 points on RoBERTa LARGE and GPT2 LARGE, respectively. In addition, we explore the impact of the choice of prompt layer on all efficiency metrics; the specific experimental results are in Appendix D. Overall, given a large-scale PTM with millions or billions of parameters, such as RoBERTa (Liu et al., 2019), DeBERTa (He et al., 2021), and GPT2 (Radford et al., 2019), higher training speed and lower memory cost are of paramount importance for practical applications, and LPT offers a better trade-off between training budget and performance.

Analyses
Effect of prompt layer. To enhance the reliability of the conclusion from Section 5.2 (i.e., that the most intermediate layer of the PTM is the optimal choice of prompt layer), we also conduct the same experiments as in Section 5.2 on two other PTMs, DeBERTa LARGE (He et al., 2021) and GPT2 LARGE (Radford et al., 2019). As shown in Figure 5, the most intermediate layer is also the optimal choice of prompt layer on the DeBERTa LARGE and GPT2 LARGE models, especially for LPT w/ NPG. These results enhance the reliability of our conclusion that a better trade-off between performance and efficiency can be achieved by selecting the most intermediate layer of the PTM as the prompt layer.
Visualization of instance-aware prompt. We select the Subj dataset (Pang and Lee, 2004) with 1000 development samples for this analysis. For simplicity, we only visualize the instance-aware prompts of the LPT w/ NPG method. As shown in Figure 6, we use the same color to mark samples whose representations are close. We can clearly observe that our method generates similar prompts for instances with relatively similar sentence representations. Conversely, the prompts of instances with quite different sentence representations are also quite different. The visualization indicates that our method learns a specific prompt for each instance and can be aware of the important information in the instance to better drive PTMs.

Conclusion
In this paper, we explore why prompt tuning performs poorly and find there is a trade-off between the propagation distance from label signals to the inserted prompt and the influence of the prompt on model outputs. With this discovery, we present a more efficient and effective prompt tuning method, LPT, with late and instance-aware prompts. Experimental results in full-data and few-shot scenarios demonstrate that LPT can achieve comparable or even better performance than state-of-the-art PETuning methods and full model tuning while having a higher training speed and lower memory cost.

Limitations
Although we showed that our proposed method can greatly improve performance and reduce training costs for diverse NLU tasks on three different PTMs (i.e., RoBERTa LARGE, DeBERTa LARGE, and GPT2 LARGE), larger PTMs with billions or more parameters and NLG tasks were not considered. However, our core idea of using late and instance-aware prompts is simple and can be easily transferred to other backbone architectures and other types of tasks. It would be interesting to investigate whether our findings hold for other backbone models and task types, and we will explore this in future work.

A Details for Mutual Information Estimation
Because the mutual information cannot be calculated directly, we estimate it by training a new classifier using the hidden states h as inputs and the original labels as outputs. Then, we estimate I(h, y) using the performance achieved by the classifier. Since I(h, y) = H(y) − H(y|h) = H(y) − E (h,y) [−log p(y|h)] (Wang et al., 2021b), we can train a new classifier q ψ (y|h) to approximate p(y|h), such that we have I(h, y) ≈ H(y) − E (h,y) [−log q ψ (y|h)]. Because H(y) is a constant, we ignore it here. Based on the above, we can use the negated loss of q ψ (y|h), i.e., −(1/N) Σ N i=1 −log q ψ (y i |h i ), as the estimate of I(h, y) up to a constant. As a further simplification, we use the performance of this new classifier to estimate the mutual information I(h, y). Because RoBERTa LARGE (Liu et al., 2019) has 24 layers in total, excluding the embedding layer, we can obtain 24 hidden states for each input. Hence, we need to train 24 new classifiers for each method. To speed up the training process, we use a 6-layer RoBERTa LARGE as q ψ.
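The estimator can be illustrated numerically with a toy stand-in: instead of training a classifier, we assume hypothetical per-sample probabilities q_ψ(y_i|h_i) and plug them into I(h, y) ≈ H(y) − (1/N) Σ −log q_ψ(y_i|h_i).

```python
import math

labels = [0, 1, 1, 0, 1, 1]                    # toy binary labels
q_correct = [0.9, 0.8, 0.95, 0.7, 0.85, 0.9]   # hypothetical q_psi(y_i | h_i)

# Label entropy H(y) from the empirical label distribution.
p1 = sum(labels) / len(labels)
H_y = -(p1 * math.log(p1) + (1 - p1) * math.log(1 - p1))

# Cross-entropy loss of the (hypothetical) classifier.
ce = sum(-math.log(q) for q in q_correct) / len(q_correct)

# MI estimate: a better classifier (lower loss) means the hidden
# states carry more label information.
mi_estimate = H_y - ce
print(mi_estimate)
```

Since H(y) is the same constant across layers, comparing the classifier's loss (or accuracy) across layers is enough to rank their mutual information, which is the simplification the text describes.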

B Datasets
For the SST-2 (Socher et al., 2013), MNLI (Williams et al., 2018), MRPC (Dolan and Brockett, 2005), QNLI (Rajpurkar et al., 2016), QQP, and RTE (Dagan et al., 2005) datasets, which are from the GLUE benchmark (Wang et al., 2019), we use their original data splits. For the 4 other datasets, we select a certain number of samples from the training set as the development set, and the number of samples for each label is determined according to its proportion in the original training set. The dataset statistics after splitting are shown in Table 6.

C Implementation Details
The search space of hyperparameters considered in this paper is shown in Table 7. As an additional note, we use the same number of training epochs or steps for all methods. For adapter-based tuning methods, we set the down-projection size m to 16. We set the prompt length to 20 for prompt tuning (Lester et al., 2021) and P-tuning v2 (Liu et al., 2022b), and to 5 for S-IDPG-PHM (Wu et al., 2022) and LPT w/ NPG. For LPT w/ MPPG and LPT w/ APPG, since the number of tunable parameters is invariant to the prompt length, we also search the prompt length in the range of {10, 15, 20}. We use the AdamW optimizer (Loshchilov and Hutter, 2019) for all methods in this work. We use the PyTorch (Paszke et al., 2019) and HuggingFace Transformers (Wolf et al., 2020) libraries to implement all methods. All experiments are conducted on 8 NVIDIA RTX 3090 GPUs. We follow Gao et al. (2021) and show the manual templates and label words used in Table 8 and Table 9, respectively. Note that, since the vocabulary of the GPT2 model does not have the [MASK] token, we simply use it to represent the positions that need to be predicted.

D Efficiency Evaluation on Different Prompt Layers
We select the prompt layer from the range {7, 13, 19} to explore the influence of different prompt layers on the trade-off between efficiency and performance. The experiment settings are consistent with those described in Section 6.5.
Table 10 shows the performance, the number of tunable parameters, training speed, and memory cost for LPT with three different prompt layers.
When the prompt layer is the 13th layer, both performance and training efficiency are better than when it is the 7th layer.When the prompt layer is the 19th layer, the efficiency is further improved while the performance degrades a lot.

Figure 1 :
Figure 1: Overall comparison between LPT and baselines with only 100 training samples for each task. All methods are evaluated on 10 text classification tasks using RoBERTa LARGE. The radius of every circle indicates training speed (tokens per millisecond). LPT w/ NPG and LPT w/o PG represent LPT with the naive prompt generator and without a prompt generator, respectively. The details can be found in Section 5.

Figure 2 :
Figure 2: Left: The performance achieved by inserting a soft prompt into different layers of RoBERTa LARGE. Middle: Comparison of convergence rates for different prompt layers. Right: The estimated mutual information between hidden states of each layer and the label. 'PL' denotes the prompt layer. 'PL = 1' denotes traditional prompt tuning (Lester et al., 2021). We show the mean and standard deviation of performance over 3 different random seeds.

Figure 3 :
Figure 3: An illustration of LPT. Left: Naive (NPG) and pooling (PPG) prompt generators. Right: The forward and backward pass of LPT.

Figure 4 :
Figure 4: The change trend of performance with different prompt layers for three different prompt generators. The backbone model is RoBERTa LARGE. We show the mean and standard deviation of performance over 3 different random seeds.

Figure 5 :
Figure 5: The change trend of performance with different prompt layers on DeBERTa LARGE (upper) and GPT2 LARGE (lower). We show the mean and standard deviation of performance over 3 different random seeds.

Table 1 :
Overall comparison in the full-data scenario. All methods are evaluated on test sets except for the tasks from the GLUE benchmark. We report the mean and standard deviation of performance over 3 different random seeds for all methods except model tuning. The best results are highlighted in bold and the second best results are marked with underline. Prompt Tuning-256 indicates the prompt tuning method with prompt length 256. All results are obtained using RoBERTa LARGE.

Table 2 :
Results in the few-shot scenario with 100 training samples. We report the mean and standard deviation of performance over 4 different data splits for all methods. Bold and Underline indicate the best and the second best results. All results are obtained using RoBERTa LARGE.
Wu et al. (2022) …prompt, that is, S-IDPG-PHM. And we do not use supplementary training like Wu et al. (2022) to enhance performance.

2 https://github.com/thunlp/OpenDelta

6.4 Main Results

Results in full-data scenario. The overall comparison of results in the full-data scenario is shown in Table 1. We can observe that: (i) our method with only the late prompt, that is, LPT w/o PG, can greatly improve the performance of traditional prompt tuning under the same number of tunable parameters, and is even comparable with P-tuning v2, which inserts prompts into each layer of the PTM; (ii) …

Results in few-shot scenario. We further evaluate our method in the few-shot scenario. Following Wu et al. (2022), we consider two settings where the number of training samples is 100 and 500, respectively.

Table 3 :
Results in the few-shot scenario with 500 training samples. We report the mean and standard deviation of performance over 4 different data splits for all methods. Bold and Underline indicate the best and the second best results. All results are obtained using RoBERTa LARGE.

Table 4 :
Results on two single-sentence and two sentence-pair tasks using the DeBERTa LARGE and GPT2 LARGE models as the backbone. Bold and Underline indicate the best and the second best results.

Table 5 :
Comparison of parameter efficiency, training efficiency, and memory cost for all methods on two different backbone models. All methods are evaluated on the RTE dataset.

Table 10 :
Trade-off between performance and training efficiency. 'PL' denotes the prompt layer. Bold and Underline mark the best and the second best results, respectively. All methods are evaluated on the RTE dataset using the RoBERTa LARGE model.

Besides, we set the down-projection size m of S-IDPG-PHM and LPT to 256 and 128, respectively. The hyperparameters r and α in LoRA are set to 8 and 16 on RoBERTa LARGE, and to 4 and 32 on GPT2 LARGE. The batch size of the GPT2 model listed in Table 7 refers to the number of samples in a single forward pass. Due to the large scale of GPT2 LARGE, we use gradient accumulation to avoid out-of-memory errors, with an accumulation step of 2 or 4.