XPrompt: Exploring the Extreme of Prompt Tuning

Prompt tuning learns soft prompts to condition frozen Pre-trained Language Models (PLMs) for performing downstream tasks in a parameter-efficient manner. While prompt tuning has gradually reached the performance level of fine-tuning as the model scale increases, there is still a large performance gap between prompt tuning and fine-tuning for models of moderate and small scales (typically less than 11B parameters). In this paper, we empirically show that the trained prompt tokens can have a negative impact on a downstream task and thus degrade its performance. To bridge the gap, we propose a novel Prompt tuning model with an eXtremely small scale (XPrompt) under the regime of the lottery ticket hypothesis. Specifically, XPrompt eliminates the negative prompt tokens at different granularity levels through hierarchical structured pruning, yielding a more parameter-efficient prompt with competitive performance. Comprehensive experiments are carried out on the SuperGLUE tasks, and the results indicate that XPrompt is able to close the performance gap at smaller model scales.


Introduction
Pre-trained Language Models (PLMs) have been widely applied and have achieved remarkable success in various NLP tasks (Devlin et al., 2019; Raffel et al., 2020; Zhou et al., 2020) under the pretrain-then-finetune paradigm (Liu et al., 2019). Despite its compelling performance, fine-tuning is parameter-inefficient for large-scale PLMs, because the memory footprint is proportional to the number of trainable parameters, whose gradients and optimizer states need to be stored (Guo et al., 2021).
Dawei Song and Jingang Wang are the corresponding authors. 1 The code is available at https://github.com/BD-MF/XPrompt.

Figure 1: XPROMPT outperforms the vanilla Prompt-Tuning (Lester et al., 2021) and significantly improves over Prompt-Tuning across tasks and model scales. It is worth noting that there is a small performance gap between prompt tuning and fine-tuning on T5-XXL (11B) due to different hyperparameter settings and initialization. Similar observations have been found in Figure 3-a and Figure 3-b of Lester et al. (2021).

Some trained prompt tokens can have a negative impact on the downstream task. Figure 2 provides a preliminary result of this observation. These negative prompt tokens can be circumvented under the regime of LTH. Essentially, LTH states that an over-parameterized network contains a sub-network that, when initialized and trained in isolation, can match or exceed the test accuracy of the original network after training for at most the same number of iterations. Such a sub-network is called a lottery ticket, and the collection of tickets is referred to as the winning tickets in PLMs (Liang et al., 2021). In the problem of prompt tuning, we refer to the winning tickets as the collection of positive prompt tokens that can achieve the same performance as using the entire collection of prompts, and to the losing tickets as the collection of negative prompt tokens.
Therefore, the key is to identify the winning tickets and eliminate the losing ones in the collection of trained prompt tokens. In particular, we propose to eliminate the losing tickets through hierarchical structured pruning, which first removes negative tokens at the token level and then prunes the remaining ones at a finer granularity, i.e., the piece level, for a better trade-off between effectiveness and efficiency. In line with LTH, weight rewinding (Renda et al., 2020) is adopted to retrain the identified positive soft prompts. With the elimination of negative prompt tokens, a more parameter-efficient PROMPT of an eXtremely small scale (XPROMPT) is obtained.
To verify the effectiveness of XPROMPT, we conduct an extensive set of experiments on SuperGLUE (Wang et al., 2019) in both high-resource and low-resource scenarios. As shown in Figure 1 and Table 1, the results demonstrate that XPROMPT significantly improves over prompt tuning methods across tasks and model scales. For models of moderate scales, XPROMPT closes the gap and achieves performance comparable to fine-tuning. For models of large scales, XPROMPT also leads to large performance gains over Prompt-Tuning, and even exceeds fine-tuning on most tasks.

Pre-trained Language Models
Pre-trained Language Models (PLMs) have achieved remarkable success in various NLP tasks (Zhou et al., 2020; Raffel et al., 2020; Brown et al., 2020). BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019) are two pioneers that learn contextual representations with masked language modeling (MLM) and next sentence prediction pre-training tasks. Recently, a series of large-scale PLMs have emerged with different pre-training designs, such as GPT-2 (Radford et al., 2019), GPT-3 (Brown et al., 2020), ELECTRA (Clark et al., 2020), XLNet (Yang et al., 2019), BART (Lewis et al., 2020) and T5 (Raffel et al., 2020). However, with the exploding number of parameters, fine-tuning becomes parameter-inefficient and computationally expensive, since all parameters in the PLM must be updated and maintained. Moreover, one has to fine-tune different models for different tasks and store them separately, which is resource-intensive.

Prompt Learning in NLP
With the development of GPT-3 (Brown et al., 2020), prompt learning has drawn much attention in the NLP community (Liu et al., 2021a; Ding et al., 2022; Zhang et al., 2022); it enables efficient learning by adding a number of prompt tokens to the input. Prompt learning has been proven effective in various downstream tasks (Davison et al., 2019; Gong and Eldardiry, 2021; Radford et al., 2019; Wang et al., 2021; Khashabi et al., 2020). Recently, prompts have been extended from discrete tokens (tokens in the vocabulary) to continuous tokens (trainable embeddings), i.e., soft prompts (Li and Liang, 2021; Zhong et al., 2021; Qin and Eisner, 2021). For example, Lester et al. (2021) propose a parameter-efficient prompt tuning approach that tunes only the soft prompts while freezing all parameters of the PLM. Prompt tuning achieves great success and shows that it can reach the performance of fine-tuning with large PLMs. However, there is still a large performance gap between prompt tuning and fine-tuning for models of moderate scales. More recently, Vu et al. (2021) propose a prompt-based transfer learning approach, SPOT, which learns a prompt on source tasks and then uses it to initialize the prompt for a target task. Most recently, He et al. (2022) propose HyperPrompt, which uses hypernetworks to generate hyper-prompts and obtains superior performance. However, it needs to tune all parameters, and it shows that tuning only task-conditioned parameters is not enough to achieve results competitive with full model fine-tuning for multi-task learning.

Lottery Ticket Hypothesis
The lottery ticket hypothesis (Frankle and Carbin, 2019) finds that an over-parameterized network contains a subnetwork that is initialized such that, when trained in isolation, it can match the test accuracy of the original network after training for at most the same number of iterations. The subnetwork is called a lottery ticket. In NLP, the collection of lottery tickets is referred to as the winning tickets in highly over-parameterized models, e.g., PLMs (Liang et al., 2021; Yang et al., 2022b,a). Such winning tickets have demonstrated the ability to transfer across tasks and datasets (Morcos et al., 2019; Yu et al., 2020; Desai et al., 2019). Recently, Chen et al. (2021) have shown the existence of winning tickets in PLMs. Liang et al. (2021) observe that the generalization performance of the winning tickets can even exceed that of the full model.

Preliminary
Built upon the text-to-text approach of T5 (Raffel et al., 2020), prompt tuning formulates all tasks as text generation by prepending additional tunable soft prompt tokens to the input and updating only the parameters of the inserted soft prompt tokens. Specifically, given a series of n input tokens X = {x_1, x_2, ..., x_n}, T5 first generates the token embeddings X_e ∈ R^{n×e}, where e is the dimension of the embedding space. It also generates soft prompt embeddings P_e = {p_1, p_2, ..., p_m} ∈ R^{m×e}, where m is the length of the soft prompt. The soft prompts are then prepended to the input sequence as [P_e; X_e] ∈ R^{(m+n)×e}. The goal of prompt tuning is to maximize the likelihood of the labels Y by optimizing only over P_e:

max_{P_e} log p(Y | [P_e; X_e])

Prompt tuning becomes more effective as the model scale increases. However, there is still a significant performance gap between prompt tuning and fine-tuning, especially for models of small and moderate scales. Our hypothesis is that not all soft prompt tokens contribute equally to the performance after training on the target task; certain soft prompt tokens may have negative impacts on the task. Therefore, combining the idea of the lottery ticket hypothesis, we propose XPROMPT with hierarchical structured pruning to identify the optimal soft prompts and bridge the performance gap.
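The prepending operation above can be sketched in a few lines of plain Python. This is a toy illustration with made-up dimensions and random values, not the paper's T5-based implementation, which operates on real embedding matrices:

```python
# Sketch of prepending soft prompt embeddings to input token embeddings.
# Shapes follow the text: X_e is n x e, P_e is m x e, result is (m+n) x e.
import random

def make_embeddings(rows, dim, seed=0):
    """Generate a toy rows x dim embedding matrix (stand-in for T5 embeddings)."""
    rng = random.Random(seed)
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(rows)]

def prepend_prompt(prompt_emb, input_emb):
    """Return [P_e; X_e]: soft prompts concatenated before the input rows."""
    return prompt_emb + input_emb  # row-wise concatenation

e = 4                                  # embedding dimension (toy value)
X_e = make_embeddings(3, e, seed=1)    # n = 3 input tokens
P_e = make_embeddings(2, e, seed=2)    # m = 2 soft prompt tokens

combined = prepend_prompt(P_e, X_e)
assert len(combined) == len(P_e) + len(X_e)  # (m + n) rows
```

During training, only the rows corresponding to P_e would receive gradient updates; the rest of the model stays frozen.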

XPROMPT
The overall process of XPROMPT is illustrated in Figure 3 and consists of three main stages: Prompt-Tuning, Hierarchical Structured Pruning and Rewinding. Specifically, prompt tuning learns an initial set of values for all soft prompt tokens on the target task. During hierarchical structured pruning, token-level and piece-level pruning are repeatedly conducted to identify the optimal soft tokens and pieces at different compression ratios. Finally, a weight rewinding technique is applied to re-train the soft prompts.

Figure 4: The illustration of Hierarchical Structured Pruning, with token-level pruning followed by piece-level pruning. The shade of the color indicates the importance score: the darker the color, the higher the importance score of the corresponding structure (token or piece).

Prompt Tuning
Prompt tuning approaches prepend a number of soft prompt tokens to the input, and tune only the soft prompts while keeping all parameters of the PLM frozen.
Prompt tuning has been proven effective in various downstream tasks. In our prompt tuning stage, following previous work (Liang et al., 2021), we conduct a complete tuning on the target task to obtain the embeddings of all the soft prompt tokens. These trained soft prompts are used as the initialization for hierarchical structured pruning.

Hierarchical Structured Pruning
Hierarchical structured pruning is designed to separate negative prompt tokens from the trained prompt tokens and identify an optimal set of soft prompts. The approach is illustrated in Figure 4. Token-level pruning is first used to identify negative prompt tokens; however, the remaining prompt tokens may still contain negative pieces. Thus, piece-level pruning is then applied to identify more fine-grained negative prompt pieces within each prompt token. Together, token-level and piece-level pruning achieve a better trade-off between effectiveness and efficiency.

Token-level Pruning
To identify negative prompt tokens among the trained prompt tokens, we associate a mask variable γ_i with each soft prompt token vector p_i:

p̂_i = γ_i · p_i

where γ = {γ_1, γ_2, ..., γ_m}, γ_i ∈ {0, 1}, and a value of 0 indicates that the corresponding soft prompt token is pruned.
We then calculate the importance score (Michel et al., 2019) of each token to distinguish the negative prompt tokens from the others. The importance score is defined as the expected sensitivity of the model outputs to the mask variables. Formally, the importance score I_{p_i} of each soft prompt token p_i is calculated as:

I_{p_i} = E_{x ∼ D_x} |∂L(x) / ∂γ_i|

where L is the loss function and D_x is the training data distribution.
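As a toy illustration of this score (not the paper's implementation, which backpropagates through the PLM), one can estimate the expected absolute gradient with respect to each mask variable by central finite differences on a stand-in loss function:

```python
# Sketch of the token-level importance score: the expected absolute
# gradient of the loss w.r.t. each token's mask variable, estimated
# at gamma = 1 by central finite differences (toy loss; illustrative).

def importance_scores(loss_fn, data, m, eps=1e-4):
    """I_{p_i} = E_x |dL(x)/d gamma_i|, averaged over the data."""
    scores = [0.0] * m
    for x in data:
        for i in range(m):
            gamma_plus = [1.0] * m
            gamma_minus = [1.0] * m
            gamma_plus[i] += eps
            gamma_minus[i] -= eps
            grad = (loss_fn(x, gamma_plus) - loss_fn(x, gamma_minus)) / (2 * eps)
            scores[i] += abs(grad)
    return [s / len(data) for s in scores]

# Toy loss over three mask variables: token 0 matters a lot, token 2
# barely matters -- so token 2 should receive the lowest score.
def toy_loss(x, gamma):
    weights = [-2.0, 1.0, 0.01]
    return sum(w * g * x for w, g in zip(weights, gamma))

scores = importance_scores(toy_loss, data=[1.0, 2.0], m=3)
assert scores[2] == min(scores)   # near-useless token gets the lowest score
```

In practice the gradient ∂L/∂γ_i comes directly from automatic differentiation rather than finite differences; the finite-difference form is only used here to keep the sketch dependency-free.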
Essentially, the importance score of each soft prompt token indicates its individual contribution to the model performance. A low importance score means that the corresponding soft prompt token makes a small or even negative contribution to the model; in other words, such a soft prompt token carries negligible prompt information for generating the outputs. On the contrary, a large importance score implies a major contribution with more meaningful prompt information. Therefore, the prompt tokens with low importance scores are most likely negative prompt tokens, and they are pruned during the token-level pruning stage.
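A minimal sketch of this pruning step, assuming a simple rule that prunes the lowest-scoring fraction of tokens (the scores and the ratio below are illustrative, not values from the paper):

```python
# Sketch of token-level pruning: zero out the mask gamma for the
# prune_ratio fraction of tokens with the lowest importance scores.

def token_level_mask(scores, prune_ratio):
    """Return gamma: 1 keeps the soft prompt token, 0 prunes it."""
    m = len(scores)
    n_prune = int(m * prune_ratio)
    # Token indices sorted from least to most important.
    order = sorted(range(m), key=lambda i: scores[i])
    gamma = [1] * m
    for i in order[:n_prune]:
        gamma[i] = 0
    return gamma

scores = [0.9, 0.05, 0.4, 0.01]          # toy importance scores
gamma = token_level_mask(scores, prune_ratio=0.5)
assert gamma == [1, 0, 1, 0]             # the two lowest-score tokens pruned
```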

Piece-level Pruning
Token-level pruning finds the most important soft prompt tokens. However, it may not be sufficient, as fine-grained negative prompt pieces may still remain in the embedding of each soft prompt token, and different pieces of the embedding may affect downstream tasks differently. Therefore, we further conduct piece-level pruning to eliminate the negative prompt pieces within each token. In particular, we divide the embedding vector of each soft prompt token p_ie into k pieces of equal size, q_e = {q_1e, q_2e, ..., q_ke}, and treat each piece as an independent unit that can be optimized with gradient updates. A mask variable ζ_i is associated with each piece of the soft prompt token to identify the negative prompt pieces:

q̂_ie = ζ_i · q_ie

where ζ = {ζ_1, ζ_2, ..., ζ_k}, ζ_i ∈ {0, 1}, and a value of 0 indicates that the corresponding piece is pruned.
We then calculate the importance score I_{q_i} of each piece of every prompt token embedding to prune the low-importance pieces:

I_{q_i} = E_{x ∼ D_x} |∂L(x) / ∂ζ_i|

Similar to the token-level importance score, a low piece-level importance score indicates that the piece makes a small or even negative contribution to the model performance; such low-importance pieces contain limited information for generating the outputs. We repeatedly conduct both token-level and piece-level pruning to obtain sub-prompt tokens and pieces at different compression ratios.
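The piece splitting and masking can be sketched as follows. The values are toy placeholders; in the paper each piece would be a slice of a trained prompt token embedding, and the masks ζ would come from the piece-level importance scores:

```python
# Sketch of piece-level pruning: split one token embedding into k equal
# pieces, then zero out the pieces whose mask zeta_i is 0 (illustrative).

def split_into_pieces(embedding, k):
    """Divide an embedding vector into k equal-length pieces."""
    assert len(embedding) % k == 0
    size = len(embedding) // k
    return [embedding[j * size:(j + 1) * size] for j in range(k)]

def apply_piece_mask(pieces, zeta):
    """zeta_i = 0 prunes piece i; surviving pieces keep their values."""
    return [[v * z for v in piece] for piece, z in zip(pieces, zeta)]

emb = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]   # toy token embedding, e = 8
pieces = split_into_pieces(emb, k=4)              # four pieces of length 2
masked = apply_piece_mask(pieces, zeta=[1, 0, 1, 0])
assert masked[1] == [0.0, 0.0] and masked[0] == [1.0, 2.0]
```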

Rewinding
The lottery ticket hypothesis (LTH) (Frankle and Carbin, 2019) states that sparse subnetworks (here, the unpruned prompts) can be trained in isolation to the same accuracy as the original network (all prompts), by training, pruning, and then rewinding the unpruned weights. Following this idea, we adopt the weight rewinding technique (Renda et al., 2020) to re-train the soft prompts after the two-level hierarchical structured pruning. Specifically, we reset the parameters of the selected optimal soft prompts to their values after the prompt tuning stage. The other soft prompts are pruned by setting the corresponding mask variables to 0. Finally, we re-train the soft prompts using the original learning strategies of prompt tuning.
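A minimal sketch of the rewinding step, assuming the post-prompt-tuning prompt values and a token-level mask are available (toy values, illustrative only; re-training afterwards is not shown):

```python
# Sketch of weight rewinding: surviving prompt tokens are reset to their
# values from after the initial prompt-tuning stage, pruned tokens are
# masked to zero, and prompt training would then restart from this state.

def rewind(post_tuning_prompts, gamma):
    """Reset surviving tokens to post-prompt-tuning values; mask the rest."""
    return [
        [v * g for v in token]       # g = 0 zeros out a pruned token
        for token, g in zip(post_tuning_prompts, gamma)
    ]

post_tuning = [[0.5, -0.3], [0.1, 0.9], [-0.7, 0.2]]  # toy m=3, e=2 prompts
gamma = [1, 0, 1]                                      # token 1 was pruned
rewound = rewind(post_tuning, gamma)
assert rewound[0] == [0.5, -0.3]     # survivor keeps its post-tuning values
assert rewound[1] == [0.0, 0.0]      # pruned token is zeroed out
```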

Datasets
To cover broad and diverse NLP tasks in our experiments, we evaluate our method on various datasets of the SuperGLUE benchmark (Wang et al., 2019) in both high-resource and low-resource scenarios. Due to restricted test-set access for SuperGLUE, following previous works (Lester et al., 2021; Ding et al., 2021), we tune the prompt model on the training set for a fixed number of steps and report results on the validation set using the best checkpoint. The detailed descriptions, statistics and metrics of the SuperGLUE tasks are provided in Table 9 of Appendix E. The soft prompt templates and generation verbalizers are provided in Table 10 of Appendix E.

Baselines
Fine-Tuning We compare with the standard fine-tuning approach (Raffel et al., 2020; Aribandi et al., 2021) of T5, where all the pre-trained parameters are fine-tuned on each target task separately.

Prompt-Tuning
The vanilla prompt tuning approach of Lester et al. (2021) showed that prompt tuning is a competitive technique for adapting frozen PLMs to downstream tasks. P-Tuning (Liu et al., 2021c) is a prompt-based method that uses the masked PLM to convert the target task into a cloze problem; it employs soft-prompting techniques to optimize prompts in the continuous space. We also compare with its second version, P-TuningV2 (Liu et al., 2021b).
Prefix-Tuning (Li and Liang, 2021) is a lightweight alternative to fine-tuning for natural language generation tasks, which only optimizes a small continuous task-specific vector (called prefix).Prefix-Tuning prepends the prefix to inputs of every transformer layer independently.

Implementation
Our method is implemented with the OpenPrompt library (Ding et al., 2021), a unified and extensible toolkit for prompt learning. We translate each SuperGLUE dataset into a text-to-text format following Raffel et al. (2020), except that we omit the task names prepended to inputs indicating which SuperGLUE task an example belongs to.
Our XPROMPT is built on top of the pre-trained T5 checkpoints of three scales: Large, XL and XXL, with 770M, 3B and 11B parameters, respectively. Following previous studies (Lester et al., 2021; Ding et al., 2021), we train our prompts for 100 epochs with a constant learning rate of 0.3 and a batch size of 16. Lester et al. (2021) show that increasing the prompt length beyond 20 tokens yields only marginal gains, so throughout our experiments we set the default number of prompt tokens to 20 to control the number of trainable parameters, and use sampled vocabulary to initialize the prompt parameters. The number of pieces in each token is set to 16. The pruning ratios are linearly searched from {10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%}. Weight rewinding is applied only once to re-train the pruned soft prompts. The best checkpoints are selected via early stopping on the development set. The models are trained using the Adafactor (Shazeer and Stern, 2018) optimizer with weight decay 1e-5.
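The search over pruning ratios can be sketched as a simple grid search over the candidate ratios. The dev-set score function below is a made-up stand-in for a full train-prune-rewind-evaluate cycle:

```python
# Sketch of a linear search over pruning ratios: evaluate each candidate
# ratio with a dev-set scoring function and keep the best one.

def search_prune_ratio(ratios, dev_score):
    """Return the pruning ratio with the highest dev-set score."""
    return max(ratios, key=dev_score)

ratios = [r / 10 for r in range(1, 10)]     # 10% .. 90%
# Toy score surface peaking at ratio 0.6 (illustrative stand-in only).
best = search_prune_ratio(ratios, lambda r: -(r - 0.6) ** 2)
assert abs(best - 0.6) < 1e-9
```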

Results on High-resource Scenarios
XPROMPT significantly improves the performance of prompt tuning and helps close the gap with fine-tuning across all model scales. XPROMPT yields improvements of 3.26%, 2.96%, and 1.88% in average score on T5-Large, T5-XL, and T5-XXL, respectively. We also observe that the performance of Prompt-Tuning and P-Tuning is comparable at the same model scale. Moreover, P-TuningV2 outperforms Prompt-Tuning and P-Tuning on CB, RTE, and BoolQ. However, XPROMPT achieves superior performance over P-TuningV2 at similar model scales, demonstrating its effectiveness. It is worth noting that Prefix-Tuning performs worse on most NLU tasks, since it is designed for natural language generation (NLG) tasks.
It is clear from Table 1 that XPROMPT enables prompt tuning to match the fine-tuning performance on all tasks with T5-XL, and even to exceed fine-tuning performance on most tasks at the T5-XXL scale. For example, XPROMPT achieves the best average score of 89.32% with T5-XL, leaving only a 0.25% gap to fine-tuning. It is worth mentioning that XPROMPT significantly outperforms fine-tuning on WiC, CB and RTE with T5-XL, as well as on COPA and WiC with T5-Large. For T5-XXL in particular, XPROMPT achieves the best scores of 97.11%, 100.00%, 94.94%, 90.87% and 88.90% on WSC, CB, RTE, BoolQ and MultiRC respectively, leading to +1.91%, +0.0%, +2.84%, +0.47% and +0.30% improvements over fine-tuning. We also observe that there are certain gaps between prompt tuning and fine-tuning, especially for models of small and moderate scales (see Figure 1). However, our XPROMPT narrows the gap significantly across all model scales, demonstrating that it learns efficient and informative soft prompts that empower downstream tasks effectively.

Results on Low-resource Scenarios
XPROMPT performs much better in low-resource scenarios. Since prompt learning is surprisingly effective in the low-resource regime (Schick and Schütze, 2021), we also explore the effect of XPROMPT in low-resource scenarios. Following the setting used in Schick and Schütze (2021), we randomly select 32 examples as the new training set for each task using a fixed random seed. We tune the prompt model on the 32-shot training set and directly report the full dev-set results using the best checkpoint.
As demonstrated in Table 2, our XPROMPT further improves the performance of prompt tuning and outperforms the baseline models at the same scale on BoolQ, WiC, and RTE. For example, XPROMPT achieves the best score of 62.85% on WiC, a +2.04% improvement over Prompt-Tuning. These few-shot results suggest that although overfitting is severe when training with limited data, XPROMPT consistently lifts the performance of prompt tuning.

Analysis and Discussion
To better understand the effectiveness of XPROMPT and explore the impact of its various factors, we further conduct a series of ablation studies and analyses.

Do Positive and Negative Prompts Exist?
We identify both positive and negative prompts through hierarchical structured pruning. For positive prompts, the first piece of evidence is the large performance improvement of XPROMPT over vanilla prompt tuning across all tasks and model scales, which shows the effectiveness of these positive prompts. Another piece of evidence is the high sparsity of pruning. Figure 9 and Figure 10 in Appendix D show the original and pruned gradient saliency maps (Simonyan et al., 2014) of the importance scores on the WSC task: the gray elements in Figure 10 indicate prompt tokens or pieces pruned due to low importance scores, and the remaining parts are the winning tickets. The performance of XPROMPT with 15% positive sub-prompts is 4.8% higher than that of full prompt tuning.
The negative prompts perform worse than Prompt-Tuning and XPROMPT. To further investigate the existence and effect of negative prompts, we conduct another experiment to compare prompt tuning performance under different configurations. Specifically, in addition to the vanilla Prompt-Tuning (using all prompts) and our XPROMPT, we introduce three variants: Reversed XPROMPT, Random Prompt and Length Prompt. Reversed XPROMPT reverses the masked sub-prompt structures in XPROMPT, essentially using all the low-score prompt tokens and pieces. Random Prompt masks tokens and pieces randomly at the rewinding stage. Length Prompt retrains prompt tuning with the same prompt length as the resulting XPROMPT.

Table 4: The results of different pruning levels on four SuperGLUE tasks using T5-Large and T5-XL models.

Granularity of Pruning
Token-level pruning and fine-grained piece-level pruning are both important. To further investigate the effects of the two-level pruning, we conduct extensive ablation experiments on four SuperGLUE tasks; the results are included in Table 4.
In general, both levels of structured pruning outperform vanilla Prompt-Tuning, demonstrating the effectiveness of both token-level and piece-level pruning. The results also show the existence of sub-prompt structures in trained prompts that can be further optimized. Notably, XPROMPT outperforms either single-level pruning alone, which suggests that combining the two levels of structured pruning further benefits the training of soft prompts for downstream tasks.

Table 5: The results of different prompt lengths on four SuperGLUE tasks using the T5-XL model.

Prompt Length
Increasing the prompt length beyond 20 yields only marginal gains for XPROMPT. To explore the effect of prompt length on XPROMPT, we train XPROMPT for the T5-XL model with prompt lengths in {10, 20, 100}. The results are reported in Table 5. Although prompt length plays an important role for XPROMPT and Prompt-Tuning, the improvements are limited when increasing the prompt length beyond 20 tokens. This observation is consistent with the findings of Lester et al. (2021), and it is why we set the number of prompt tokens to 20 in all our experiments.

Table 7: The results of XPROMPT Transfer on two SuperGLUE tasks using the T5-XL model. XPROMPT Transfer without rewinding only uses the resulting prompts of the source task through XPROMPT to initialize the prompts of the target task, skipping the rewinding phase.

Prompt Initialization and Transfer
Motivated by the soft prompt transfer approach SPOT (Vu et al., 2021), and to explore the effect of task transfer and different prompt initialization methods, we introduce an XPROMPT-based transfer learning approach, XPROMPT Transfer. It first trains the prompts through XPROMPT on the source task and then uses the learned prompts to initialize the prompts on the target task. More details are provided in Appendix C.
Prompt initialization plays an important role in XPROMPT, and XPROMPT Transfer can lead to performance gains. We compare two prompt initialization methods for XPROMPT, random uniform and sampled vocabulary; the results are shown in Table 6. We observe that sampled vocabulary performs best, and XPROMPT also leads to performance gains under random uniform initialization. Furthermore, we compare our XPROMPT Transfer with TaskTransfer, which only uses the resulting prompts of the source task to initialize the prompts of the target task; the results are shown in Table 7. XPROMPT Transfer without the rewinding stage outperforms TaskTransfer, yielding large performance gains through pruning and rewinding. These results further validate our hypothesis and the effect of XPROMPT Transfer.

Conclusions
This paper aims to close the large performance gap between prompt tuning and fine-tuning, especially for models of small and moderate scales. By exploring the lottery ticket hypothesis in the context of prompt tuning, we have proposed a novel hierarchical structured pruning approach, XPROMPT, to separate the positive prompts from the negative ones at both the token level and the piece level. Extensive experimental results have demonstrated that XPROMPT yields a more parameter-efficient prompt at an extremely small scale, yet with competitive effectiveness. Taken as a whole, our work sheds light on the development of more efficient and effective prompt-based learning approaches.

Limitations
Eliminating negative prompt tokens at different granularity levels through hierarchical structured pruning requires rewinding the pruned model at different compression ratios. Therefore, a key question is left under-explored: how to find the optimal compression ratio without trial training, which would automate the training process and improve efficiency. Moreover, there are other prompt tuning scenarios that we plan to investigate further, including the multi-task learning scenario (He et al., 2022), the out-of-domain (domain shift) scenario (Lester et al., 2021) and the prompt ensembling scenario (Lester et al., 2021). We leave these for future research.

A More Results of P-TuningV2
We observe that the performance of Prompt-Tuning and P-Tuning is comparable at the same model scale. Moreover, P-TuningV2 outperforms Prompt-Tuning and P-Tuning on CB, RTE, and BoolQ. However, XPROMPT achieves superior performance over P-TuningV2 at similar model scales, demonstrating its effectiveness.

B Token and Piece Importance Score Distribution
Figure 6 and Figure 7 show the distributions of the importance scores of prompt tokens and prompt token pieces on the WSC task. It is clear that most prompt tokens have low importance scores and only a few have large importance scores. These results further support our hypothesis on the existence of negative prompts and their stability.

C XPROMPT Transfer
As shown in Figure 8, given a source task and a target task, XPROMPT Transfer first trains the prompts through our XPROMPT on the source task and then uses the resulting prompts to initialize the prompts of the target task, followed by XPrompt training on the target task. Different from SPOT, we do not use the raw trained prompts to initialize the target task's prompts; instead, the initializing prompts have gone through XPROMPT's pruning and rewinding, which can provide more cross-task information. The results of different prompt initialization methods are shown in Table 6, and the results of XPROMPT Transfer are shown in Table 7.


Figure 2: The performance comparison of Prompt-Tuning, Negative Prompt Masking and Random Prompt Masking with T5-XL (3B) on three SuperGLUE tasks. Prompt-Tuning uses all prompt tokens. Negative Prompt Masking masks selected (negative) prompt tokens with low importance scores. Random Prompt Masking randomly masks the same number of tokens as Negative Prompt Masking.

Figure 3: The illustration of our proposed XPROMPT approach. XPROMPT consists of three stages, namely Prompt-Tuning, Hierarchical Structured Pruning and Rewinding. In all stages, the parameters of T5 are frozen; only the parameters of the prompts are tuned. The prompts trained in the previous stage are fed into the next stage as the initialization prompts. The change of color represents the prompts' parameters being tuned or pruned.

Figure 6: The distribution of prompt tokens' importance scores on the WSC task.

Figure 7: The distribution of prompt token pieces' importance scores on the WSC task.

Figure 8: The illustration of the XPROMPT Transfer approach. XPROMPT Transfer first trains the prompts through XPROMPT on source task A and then uses the resulting prompts to initialize the prompts of target task B, followed by XPrompt training on target task B.

Table 1: Results (%) on seven SuperGLUE tasks. Our method and better results are in bold (the larger, the better). The small number next to each score indicates the performance improvement (↑) over the vanilla Prompt-Tuning. Methods with '*' indicate results reported in Aribandi et al. (2021). We only present the results of Prefix-Tuning on T5-Large, since it can diverge with larger models (Ding et al., 2022). The '-' entries for Prefix-Tuning indicate diverged results on the corresponding task.
Table 1 and Table 8 (in the appendix) present the main results on SuperGLUE. We compare XPROMPT with strong prompt learning baselines, including Prompt-Tuning, Prefix-Tuning, P-Tuning and P-TuningV2, for different PLMs and model scales. It can be seen that XPROMPT outperforms vanilla Prompt-Tuning by a large margin across all tasks and model scales.

Table 2: Few-shot results (Acc, %) on three SuperGLUE tasks for the T5-XL model with 20 soft prompt tokens. Methods with '‡' indicate results reported in Schick and Schütze (2021). XPROMPT is better than vanilla Prompt-Tuning and P-Tuning in low-resource scenarios.

Table 3: Comparison of the number of tunable parameters for the T5-XL model with 20 prompt tokens. The percentage is the number of tunable parameters in XPROMPT relative to Prompt-Tuning.

Table 6: The results of different prompt initialization methods for XPROMPT on two SuperGLUE tasks using the T5-XL model.

Table 8: Results (%) on three SuperGLUE tasks for different models at similar model scales. The better results are in bold. Methods with '†' indicate results reported in Liu et al. (2021b). XPROMPT surpasses P-TuningV2 on models of similar scales.