On Transferability of Prompt Tuning for Natural Language Processing

Prompt tuning (PT) is a promising parameter-efficient method for utilizing extremely large pre-trained language models (PLMs), which can achieve performance comparable to full-parameter fine-tuning by tuning only a few soft prompts. However, PT requires much more training time than fine-tuning. Intuitively, knowledge transfer can help to improve its efficiency. To explore whether we can improve PT via prompt transfer, we empirically investigate the transferability of soft prompts across different downstream tasks and PLMs in this work. We find that (1) in the zero-shot setting, trained soft prompts can effectively transfer to similar tasks on the same PLM, and also to other PLMs with a cross-model projector trained on similar tasks; (2) when used as initialization, trained soft prompts of similar tasks and projected prompts of other PLMs can significantly accelerate training and also improve the performance of PT. Moreover, to explore what decides prompt transferability, we investigate various transferability indicators and find that the overlapping rate of activated neurons strongly reflects the transferability, which suggests that how the prompts stimulate PLMs is essential. Our findings show that prompt transfer is promising for improving PT, and further research should focus more on how prompts stimulate PLMs. The source code can be obtained from https://github.com/thunlp/Prompt-Transferability.


Introduction
Pre-trained language models (PLMs), such as BERT (Devlin et al., 2019) and GPT (Radford et al., 2018), have achieved great performance on various natural language processing (NLP) tasks (Han et al., 2021). Recently, researchers have found that extremely large PLMs can achieve remarkable improvements, and various large PLMs, containing up to hundreds of billions of parameters, are continually being developed (Brown et al., 2020; Raffel et al., 2020; Zhang et al., 2021; Zeng et al., 2021; Sun et al., 2021).
Considering the extremely large scale of these state-of-the-art PLMs, conventional full-parameter fine-tuning becomes extremely expensive. Hence, various parameter-efficient tuning methods (Houlsby et al., 2019; Ben Zaken et al., 2021; Li and Liang, 2021; Liu et al., 2021) have been explored, among which prompt tuning (PT) has attracted broad research attention. PT prepends soft prompts, which are essentially learnable virtual tokens, to the input sequences and only trains them while keeping all the PLM's parameters fixed. The training objective is to generate the desired outputs in the same way as the pre-training tasks. PT can match the downstream task performance of fine-tuning with only thousands of tunable parameters (Lester et al., 2021) when the PLM has billions of parameters.
Although PT is an effective approach to utilizing extremely large PLMs, it requires much more training time than fine-tuning to reach convergence, as shown in Figure 2; hence, it is worthwhile to explore how to improve the efficiency of PT. In this work, we attempt to improve PT via prompt transfer across different tasks and models. Knowledge transfer across tasks (Vu et al., 2020) and models (Qin et al., 2021) has been widely used to improve the efficiency and effectiveness of NLP systems. Intuitively, soft prompts are the only tuned parameters in PT and thus should concentrate the knowledge required to solve tasks conditioned on PLMs. Transferring trained prompts is therefore a promising way to accelerate PT.
As shown in Figure 1, we empirically analyze the transferability of prompts across different tasks (cross-task transfer setting) and PLMs (cross-model transfer setting) in this paper. The empirical analysis is conducted on 17 NLP tasks of 6 types and two representative PLM series: RoBERTa (Liu et al., 2019b) and T5 (Raffel et al., 2020). In cross-task transfer, prompt transfer can be done by directly reusing the trained prompts of the source task on the target task. In cross-model transfer, however, directly reusing prompts is intractable since the semantic spaces of different PLMs are inconsistent; hence, we develop various prompt projectors to project the soft prompts trained on the source PLM into the semantic space of the target PLM. We conduct two lines of experiments: (1) We investigate the zero-shot transfer performance and find that the transferability of prompts is influenced by task types. In cross-task transfer, soft prompts can directly transfer to same-type tasks and achieve non-trivial performance, but transfer poorly to different-type tasks requiring different language skills. In cross-model transfer, we can successfully train a prompt projector with PT on a task, but the trained projector also generalizes well only to tasks of the same type as the projector-training task. (2) To accelerate PT, we propose to transfer prompts as initialization. In cross-task transfer, we start PT with the trained soft prompts of similar tasks as initialization, while in cross-model transfer, the initialization is the projected prompts of the same task trained on the source PLM. The two methods are dubbed TPT_TASK and TPT_MODEL, which are short for transferable prompt tuning. Experiments show that both can accelerate PT to some extent and also achieve a certain performance improvement.
Furthermore, we explore why prompts can transfer and what decides their transferability. To this end, we design various prompt similarity metrics from different perspectives and examine how well they can serve as transferability indicators, i.e., how well they correlate with prompt transfer performance. We find that our novel method of measuring prompt similarity via model activations in feed-forward layers correlates better with prompt transferability than prompt-embedding distance-based metrics. This suggests that the prompts essentially stimulate the PLM's inner abilities, which are distributed among its neurons, to do specific NLP tasks, and future prompt transfer work should focus more on how PLMs respond to different prompts' stimulation rather than on the prompts' embedding properties.
To summarize, our contributions are three-fold: (1) We thoroughly analyze the transferability of prompts across different tasks and models, and show that improving PT with prompt transfer is possible and promising. (2) We propose to transfer prompts as initialization, which enhances both PT's efficiency and effectiveness. (3) We explore the effectiveness of various prompt similarity metrics serving as transferability indicators and demonstrate that how the prompts stimulate PLMs decides the transferability, which may facilitate further transferable PT research.

Related Work
Prompt Tuning GPT-3 (Brown et al., 2020) demonstrates remarkable few-shot performance by prepending textual prompts before the inputs, which helps the PLM to directly generate the desired outputs of NLP tasks. Motivated by this, many works have tried to improve various NLP tasks by creating manually-crafted (Schick and Schütze, 2021a,b; Mishra et al., 2021) or automatically-searched (Jiang et al., 2020; Shin et al., 2020; Gao et al., 2021) hard prompts, which are discrete tokens but not necessarily human-readable. Furthermore, soft prompts (Hambardzumyan et al., 2021; Qin and Eisner, 2021; Zhong et al., 2021; Liu et al., 2021) have been proposed, which are tunable embeddings rather than tokens in the vocabulary and can be directly trained with task-specific supervision. Lester et al. (2021) demonstrate that the prompt tuning (PT) method can match the performance of full-parameter fine-tuning when the PLM has billions of parameters. This suggests that PT is a promising way to utilize extremely large PLMs. However, the much longer training time needed to reach convergence makes PT inefficient. In this work, we show that prompt transfer can improve PT's efficiency and effectiveness to some extent via knowledge transfer, and empirically analyze the transferability of prompts across tasks and PLMs.
Knowledge Transfer Cross-task knowledge transfer (Ruder, 2017) has been a long-standing way to improve the effectiveness and efficiency of NLP systems. In the PLM era, some works propose to tune the PLMs on intermediate tasks (Phang et al., 2018; Pruksachatkun et al., 2020; Gururangan et al., 2020; Wang et al., 2019a; Vu et al., 2020; Poth et al., 2021) before fine-tuning on specific target tasks to achieve certain benefits. Vu et al. (2020) empirically analyze the transferability between tasks in this setting.
These explorations are all for fine-tuning. Considering the potential of PT, we believe the transferability and knowledge transfer methods for PT are worth exploring. As a prior attempt, Lester et al. (2021) demonstrate that PT's cross-domain transferability is stronger than that of fine-tuning.
Similar to our work, recent work (Vu et al., 2021) also explores cross-task transfer with prompt initialization and prompt similarity metrics based on cosine similarity. However, Vu et al. (2021) focus on improving the effectiveness of PT, whereas we attempt to improve its efficiency. Additionally, we explore more transferability indicators, especially the overlapping rate of activated neurons, and also investigate cross-model transfer, which is inspired by previous cross-model knowledge transfer works such as Net2Net (Chen et al., 2016), knowledge distillation (Hinton et al., 2015), and knowledge inheritance (Qin et al., 2021).

Preliminary
Here we introduce the basic knowledge about PT (§ 3.1) as well as the downstream tasks (§ 3.2) and models (§ 3.3) investigated in our experiments.

Prompt Tuning
In this work, we study the PT method that is capable of tuning large PLMs (Lester et al., 2021; Liu et al., 2021), i.e., we only explore the PT method that freezes PLM parameters. PT prepends some virtual tokens, i.e., the soft prompts, to the inputs of the PLM to provide knowledge about downstream tasks. The soft prompts are essentially tunable embedding vectors, trained with an objective that enforces the PLM to generate the desired outputs of the downstream task in the same way as the pre-training objective.
Formally, given an input sequence with $n$ tokens $X = \{x_1, x_2, \ldots, x_n\}$, we first prepend $l$ randomly initialized soft prompts $P = \{p_1, p_2, \ldots, p_l\}$ before them, where $p_i \in \mathbb{R}^d$ is an embedding vector and $d$ is the input dimension of the PLM. The training objective is to maximize the likelihood of decoding the desired output $y$:

$$\max_{P} \; p\big(y \mid [P; X]\big), \qquad (1)$$

where only $P$ is learnable. For language understanding tasks, $y$ is the label token corresponding to the label of $X$. For conditional generation tasks, $y$ is a sequence. In particular, for models pre-trained with the masked language modeling objective, such as RoBERTa, we additionally prepend a special [MASK] token before the prompts and train the prompts so that the PLM fills $y$ into it.
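To make the mechanics concrete, below is a minimal PyTorch sketch of this setup. The toy PLM, dimensions, and data are illustrative stand-ins rather than the paper's actual implementation; only the prompt length $l = 100$ and the AdamW optimizer with learning rate 0.001 follow Appendix A.3.

```python
import torch
import torch.nn as nn

# A tiny stand-in for a frozen PLM (illustrative only).
class ToyPLM(nn.Module):
    def __init__(self, vocab_size=1000, d=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d)
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d, vocab_size)

    def forward(self, inputs_embeds):                   # (batch, seq, d)
        return self.lm_head(self.encoder(inputs_embeds))

plm = ToyPLM()
for param in plm.parameters():                          # freeze all PLM parameters
    param.requires_grad = False

l, d = 100, 64
soft_prompt = nn.Parameter(torch.randn(l, d) * 0.02)   # the only trainable tensor
optimizer = torch.optim.AdamW([soft_prompt], lr=1e-3)  # cf. Appendix A.3

def prompt_tuning_step(input_ids, label_id):
    x = plm.embed(input_ids)                            # (n, d) token embeddings
    x = torch.cat([soft_prompt, x], dim=0)              # prepend the l soft prompts
    logits = plm(x.unsqueeze(0))[0]                     # (l + n, vocab)
    # Maximize the likelihood of the desired output y at the decoding position
    # (position 0 here stands in for the [MASK] position used with RoBERTa).
    loss = nn.functional.cross_entropy(logits[0:1], torch.tensor([label_id]))
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

loss = prompt_tuning_step(torch.randint(0, 1000, (32,)), label_id=5)
```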

Investigated NLP Tasks
To comprehensively study prompt transferability across various NLP tasks, we involve 17 diverse tasks, which can be divided into 6 types: sentiment analysis (SA), natural language inference (NLI), ethical judgment (EJ), paraphrase identification (PI), question answering (QA), and summarization (SUM).

Investigated Models
We investigate prompt transferability for two series of PLMs: RoBERTa (Liu et al., 2019b) and T5 (Raffel et al., 2020), which represent two mainstream pre-training types: masked language modeling and sequence-to-sequence pre-training. Considering that RoBERTa can only predict a single token (or a fixed number of tokens) under the prompt tuning paradigm, for the conditional generation tasks (QA and SUM) that output multiple tokens, we only investigate T5. We mainly report results for the largest version of each PLM, i.e., RoBERTa_LARGE and T5_XXL. More detailed results for the other sizes are provided in the appendix.

Cross-Task Transfer
We empirically study the cross-task transferability of soft prompts (§ 4.1) and try to improve the effectiveness and efficiency of PT with transfer (§ 4.2).

Zero-shot Transfer Performance
To study the cross-task transferability, we first examine PT's zero-shot transfer performance, i.e., we conduct PT on a source task, then directly reuse the trained prompts on other target tasks and evaluate their performance. The results are shown in Figure 3, from which we can observe the following: (1) For tasks within the same type, transferred soft prompts generally perform well and may even outperform vanilla PT on the target task, especially when the source task has more data (e.g., transferring from IMDB to Movie in Figure 3 (a) and from restaurant to laptop in Figure 3 (b)). This demonstrates that it is promising to improve PT's effectiveness and efficiency with knowledge transfer from similar tasks.
(2) For tasks of different types, the transferability of soft prompts is generally poor, and transferred soft prompts often achieve performance similar to randomly initialized prompts.
(3) However, some tasks can transfer to different-type tasks to some extent, such as the QA and SUM tasks to SA tasks in Figure 3. Considering this, it is worthwhile to explore what decides the transferability between prompts, and we conduct a preliminary study in § 6.
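For clarity, the evaluation protocol above can be summarized by the following sketch, which continues the toy setup of § 3.1; `decode_fn` is a hypothetical stand-in for the task-specific decoding described there.

```python
import torch

# Zero-shot cross-task transfer: reuse a prompt trained on a source task,
# unchanged, on the target task's evaluation set.
def zero_shot_transfer(source_prompt, target_eval_set, plm, decode_fn):
    correct = 0
    with torch.no_grad():
        for input_ids, label in target_eval_set:
            pred = decode_fn(plm, source_prompt, input_ids)  # same decoding as vanilla PT
            correct += int(pred == label)
    return correct / len(target_eval_set)
```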

Transfer with Initialization
To improve the effectiveness and efficiency of PT with cross-task transfer, we explore a cross-task transferable prompt tuning (TPT_TASK) method, which initializes the soft prompts with the well-trained prompts of the most similar task and then starts PT.
For a target task, we start TPT_TASK with the trained prompts of the source task achieving the best zero-shot transfer performance in Figure 3. From the performance and training time comparisons in Table 1, we can see that TPT_TASK mostly achieves better or comparable performance to vanilla PT starting from random initialization, and generally takes less training time.
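A minimal sketch of TPT_TASK, continuing the toy setup of § 3.1; the saved-prompt path below is hypothetical.

```python
import torch
import torch.nn as nn

# TPT_TASK: warm-start PT from a trained source-task prompt instead of random init.
source_prompt = torch.load("prompts/imdb.pt")        # hypothetical saved prompt, shape (l, d)
soft_prompt = nn.Parameter(source_prompt.clone())    # replaces the random initialization
optimizer = torch.optim.AdamW([soft_prompt], lr=1e-3)
# ...then run the ordinary PT training loop of § 3.1 unchanged...
```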

Cross-Model Transfer
We further study the cross-model transferability of soft prompts. Intuitively, cross-model transfer allows us to train prompts on a small and computationally efficient PLM and use them on a massive and computationally expensive PLM, which is much more efficient and environment-friendly. We investigate the feasibility of cross-model transfer by transferring from a source PLM (RoBERTa_LARGE) to a larger and heterogeneous target PLM (T5_XXL), which should be the most difficult setting. Appendix C shows the experimental results of other settings. Directly reusing trained soft prompts between different PLMs is infeasible since their embedding spaces differ. Hence, we investigate how to do cross-model prompt projection (§ 5.1) and evaluate the transfer performance (§ 5.2). Furthermore, we explore improving PT with cross-model transfer initialization (§ 5.3).

Cross-Model Prompt Projection
To project the trained soft prompts of a PLM into the semantic space of a different PLM, we train projectors with various objectives and examine their effectiveness. Training a good cross-model projector may require task-specific supervision, but the trained projector should generalize to different tasks so that the efficiency of learning new tasks on the target model can be improved.
Table 1: Performance on 17 NLP tasks of vanilla prompt tuning (PT) and prompt tuning with transfer initialization (TPT_TASK), which initializes PT with the prompt performing best in zero-shot transfer, as well as the convergence speedup (the quotient of the PT convergence time by the TPT_TASK convergence time) and the comparable-result speedup (the quotient of the PT convergence time by the training time of TPT_TASK achieving performance comparable to PT). N/A represents tasks that RoBERTa_LARGE cannot conduct, or for which we fail to speed up training with TPT_TASK.

Formally, given the prompt of the source PLM $P^s = \{p^s_1, \ldots, p^s_l\}$, we concatenate the $l$ virtual tokens into a unified vector $\mathbf{P}^s \in \mathbb{R}^{l d_s}$. The projector $\mathrm{Proj}(\cdot)$ projects it to $\tilde{\mathbf{P}}^s \in \mathbb{R}^{l d_t}$ in the semantic space of the target PLM, where $d_s$ and $d_t$ are the input embedding dimensions of the source and target PLMs, respectively. We parameterize the projector with a two-layer perceptron:

$$\tilde{\mathbf{P}}^s = \mathrm{Proj}(\mathbf{P}^s) = W_2\,\sigma(\mathbf{P}^s W_1 + b_1) + b_2,$$

where $\sigma$ is a non-linear activation function. We investigate two learning objectives to train the projector:

Distance Minimizing We first try to learn cross-model projections by minimizing the distance between the projected prompt and the parallel prompt $\mathbf{P}^t$ originally trained on the target PLM with the same task, i.e., the training objective is to minimize their $L_2$-distance $\lVert \mathrm{Proj}(\mathbf{P}^s) - \mathbf{P}^t \rVert_2$.
Task Tuning We then try to train the cross-model projector with task-specific supervision signals on the target PLM. Specifically, we directly tune the projected prompts on some tasks and back-propagate the supervision signals to train the projector weights, so that the projector can learn how to stimulate the target PLM and thus may generalize to transferring the prompts of other tasks. Both methods rely on some tasks (parallel trained soft prompts or training data) to train the projector. The projector learning methods are agnostic to the specific training tasks used; we choose laptop and MNLI in our experiments.
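A minimal sketch of the projector and the two objectives, assuming illustrative dimensions (e.g., $d_s = 1024$ for RoBERTa_LARGE and $d_t = 4096$ for T5_XXL) and hypothetical stand-ins for the parallel prompt and the PT task loss:

```python
import torch
import torch.nn as nn

l, d_s, d_t, d_h = 100, 1024, 4096, 768    # prompt length and (illustrative) sizes

# Two-layer perceptron over the flattened source prompt.
projector = nn.Sequential(
    nn.Linear(l * d_s, d_h),
    nn.LeakyReLU(),                         # sigma; LeakyReLU per Appendix C.1
    nn.Linear(d_h, l * d_t),
)

P_s = torch.randn(l, d_s)                   # stands in for a trained source prompt
P_proj = projector(P_s.reshape(1, -1)).reshape(l, d_t)

# (a) Distance Minimizing: match the parallel prompt trained on the target PLM.
P_t = torch.randn(l, d_t)                   # stands in for the parallel target prompt
loss_distance = torch.dist(P_proj, P_t, p=2)            # ||Proj(P_s) - P_t||_2

# (b) Task Tuning: feed the projected prompt into the frozen target PLM on a
# training task (e.g., laptop or MNLI) and back-propagate the task loss into
# the projector weights; pt_task_loss is a hypothetical stand-in for Equation 1.
def task_tuning_loss(target_plm, projected_prompt, batch, pt_task_loss):
    return pt_task_loss(target_plm, projected_prompt, batch)
```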

Zero-shot Transfer Performance
The zero-shot transfer performance of the various projector-learning methods is shown in Table 2 (a); more results on other PLMs are given in Appendix C.2. We can observe that: (1) Distance Minimizing works well to transfer the prompts of the projector-training task, but falls back to random performance on the other, unseen tasks, which is not practically usable. This is consistent with our findings in § 6 that embedding distances do not strongly correlate with prompt transferability. (2) Task Tuning performs better and successfully generalizes to unseen tasks of the same type as the projector-training tasks (e.g., NLI tasks for the projectors trained with MNLI), which proves the feasibility of practical cross-model prompt transfer. (3) The projectors trained with Task Tuning still cannot work for different-type tasks, which may be limited by the cross-task prompt transferability investigated in § 4.1. This urges further attention to developing universal cross-model projections.

Transfer with Initialization
Similar to § 4.2, we further study whether the projected soft prompts can initialize PT on the target PLM to accelerate training as well as improve performance. We propose cross-model transferable prompt tuning, TPT_MODEL, which adopts the Task Tuning projectors to project the soft prompts trained on the source PLM into the target PLM and initializes PT with the projected prompts.
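Continuing the projector sketch of § 5.1, TPT_MODEL then amounts to the following (names illustrative):

```python
import torch
import torch.nn as nn

# TPT_MODEL: project the source PLM's trained prompt into the target PLM's
# space with a Task Tuning projector, then use it to initialize PT there.
with torch.no_grad():
    init = projector(P_s.reshape(1, -1)).reshape(l, d_t)
soft_prompt_target = nn.Parameter(init.clone())   # initialization for PT on the target PLM
optimizer = torch.optim.AdamW([soft_prompt_target], lr=1e-3)
# ...then run the ordinary PT training loop of § 3.1 on the target PLM...
```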
The performance and speedup are shown in Table 2 (b). We can see that, for tasks within the same type as the projector-training task, compared to vanilla PT, TPT_MODEL can mostly achieve comparable or better performance with much less training time, which demonstrates that practical cross-model prompt transfer is promising for improving the efficiency and effectiveness of PT.

Transferability Indicators

Furthermore, we explore what decides prompt transferability. We design various prompt similarity metrics and examine how well they correlate with the zero-shot transfer performance: if a similarity metric correlates strongly with transfer performance, the characteristics captured by this metric decide the prompt transferability. Moreover, the prompt similarity metrics can quantify task similarities using the trained soft prompts as task embeddings and may help in developing cross-task transfer methods. As a straightforward example, if we build a prompt warehouse containing prompts of diverse tasks, we can retrieve the prompts of similar tasks for a new task with a certain similarity metric and better improve PT with TPT_TASK.

Prompt Similarity Metric
We explore the following two kinds of metrics:

Embedding Similarity We first regard the trained soft prompts as mere embeddings in the vector space and calculate their Euclidean similarity and cosine similarity, among which cosine similarity is also explored by Vu et al. (2021). Given two groups of trained prompts containing $l$ virtual tokens, $P^{t_1} = \{p^{t_1}_1, \ldots, p^{t_1}_l\}$ and $P^{t_2} = \{p^{t_2}_1, \ldots, p^{t_2}_l\}$, which correspond to tasks $t_1$ and $t_2$, we first concatenate the $l$ virtual tokens of each group to get two concatenated embeddings $\mathbf{P}^{t_1}, \mathbf{P}^{t_2} \in \mathbb{R}^{ld}$, and then compute their Euclidean similarity and cosine similarity:

$$E_{\mathrm{concat}}(t_1, t_2) = -\lVert \mathbf{P}^{t_1} - \mathbf{P}^{t_2} \rVert_2, \qquad C_{\mathrm{concat}}(t_1, t_2) = \cos\big(\mathbf{P}^{t_1}, \mathbf{P}^{t_2}\big).$$

We further explore a simple way to make the metrics invariant to token positions: we compute Euclidean distances and cosine similarities for every virtual token pair in the two groups and use the averaged results as the final similarity metrics:

$$E_{\mathrm{average}}(t_1, t_2) = -\frac{1}{l^2}\sum_{i=1}^{l}\sum_{j=1}^{l} \lVert p^{t_1}_i - p^{t_2}_j \rVert_2, \qquad C_{\mathrm{average}}(t_1, t_2) = \frac{1}{l^2}\sum_{i=1}^{l}\sum_{j=1}^{l} \cos\big(p^{t_1}_i, p^{t_2}_j\big).$$

Model Stimulation Similarity In the second way, we depict prompt similarities based on how the prompts stimulate the PLMs, i.e., we examine the similarities between the responses of PLMs to the two soft prompts. Motivated by Geva et al. (2021) and Dai et al. (2021), who both find that the activation of neurons in the feed-forward layers of Transformers (Vaswani et al., 2017) corresponds to specific model behaviors, we propose to use the overlapping rate of activated neurons as a similarity metric of prompts. Specifically, the feed-forward network $\mathrm{FFN}(\cdot)$ in a Transformer layer is:

$$\mathrm{FFN}(x) = \max(xW_1 + b_1, 0)\,W_2 + b_2,$$

where $x \in \mathbb{R}^d$ is the input embedding, $W_1 \in \mathbb{R}^{d \times d_m}$ and $W_2 \in \mathbb{R}^{d_m \times d}$ are trainable matrices, and $b_1, b_2$ are bias vectors. The term $\max(xW_1 + b_1, 0)$ can be regarded as the non-negative activation values of $d_m$ hidden neurons (Geva et al., 2021). We then set all the positive elements of $\max(xW_1 + b_1, 0)$ to 1 to get the one-hot activation state vector $s$. We feed an input sequence $\{P, \texttt{<s>}\}$ into the PLM, where <s> is the special token indicating the start of a sentence; for RoBERTa, a [MASK] token is additionally prepended. This sequence is in the format of PT inputs but without a specific input sentence.
We use the activation states at the positions used to decode outputs, which should be more task-specific: for T5, we use the decoder module's activation states at the first position; for RoBERTa, we use the activation states at the [MASK] position. Finally, we concatenate the activation states of the PLM's $L$ layers to get the overall activation states:

$$\mathbf{s} = [s_1; s_2; \ldots; s_L].$$

We can also use the activation states of only a subset of layers in the similarity computation. In experiments, we find that the higher layers tend to be more task-specific, which is consistent with probing results (Liu et al., 2019a); hence we use the activation states of the top 3 layers in the experiments below. We calculate the overlapping rate of activated neurons $\mathrm{ON}(P^{t_1}, P^{t_2})$ between the trained soft prompts of tasks $t_1$ and $t_2$ with the cosine similarity:

$$\mathrm{ON}(P^{t_1}, P^{t_2}) = \cos\big(\mathbf{s}^{t_1}, \mathbf{s}^{t_2}\big).$$
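The metrics above can be sketched as follows. This is a reconstruction from the text rather than the authors' code; `ffn_activation_states` is a hypothetical hook returning the binarized $\max(xW_1 + b_1, 0)$ values at the decoding position for each layer.

```python
import torch
import torch.nn.functional as F

def c_concat(P1, P2):
    # Cosine similarity of the flattened (concatenated) prompts of shape (l, d).
    return F.cosine_similarity(P1.reshape(1, -1), P2.reshape(1, -1)).item()

def c_average(P1, P2):
    # Position-invariant variant: average cosine similarity over all l x l token pairs.
    sims = F.cosine_similarity(P1.unsqueeze(1), P2.unsqueeze(0), dim=-1)   # (l, l)
    return sims.mean().item()

def on_similarity(ffn_activation_states, plm, P1, P2, top_k=3):
    # Overlapping rate of activated neurons: cosine similarity between the
    # concatenated one-hot activation states of the top-k layers.
    s1 = ffn_activation_states(plm, P1)[-top_k:].reshape(1, -1)   # (1, k * d_m)
    s2 = ffn_activation_states(plm, P2)[-top_k:].reshape(1, -1)
    return F.cosine_similarity(s1, s2).item()
```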

Experimental Results
To evaluate the effectiveness of the above similarity metrics for soft prompts, we (i) test whether the similarity metrics can distinguish the trained prompts of the same task from those of different tasks, and (ii) examine whether these metrics align with the zero-shot transfer performance. Regarding (i), we compare the similarities under the investigated metrics for two trained prompts within the same task (trained with different random seeds) and between different tasks in Table 3. From the results, we can observe that all the metrics distinguish well between prompts of the same task and those of different tasks. This suggests that the trained soft prompts of different tasks form distinguishable clusters in the embedding space and also stimulate different abilities within the PLM.
Moreover, to evaluate (ii), how well the similarity metrics align with the cross-task transfer performance, we quantify the correlations between the similarities and the zero-shot transfer performance in Figure 3. Specifically, for each target task's prompt, we rank the various source tasks' prompts by similarity scores and by zero-shot transfer performance, and then compute Spearman's rank correlation (Spearman, 1987) between the two rankings. The overall results are shown in Table 4. We can see that: (1) The overlapping rate of activated neurons (ON) works better than all the embedding similarities, which suggests that model stimulation matters more for prompt transferability than embedding distances. (2) ON works much worse on T5_XXL (11B parameters) than on RoBERTa_LARGE (330M parameters). We conjecture that this is because larger PLMs have higher redundancy (Aghajanyan et al., 2021), which means prompts can activate different redundant neurons to do similar jobs and thus reduce the sensitivity of the ON metric. This is supported by experiments showing that the Spearman's correlation scores of ON drop as the PLM scale increases (Figure 4), where $C_{\mathrm{average}}$ also exhibits a similar trend. We encourage future work to explore how to overcome PLM redundancy for better transferable PT. As a preliminary trial, we find that by taking the intersection of the activation states of 3 prompts trained with different random seeds, ON's correlation score on T5_XXL rises from 36.9% to 46.3%.
We further explore whether the prompt similarity metrics also work in the cross-model transfer setting by testing whether they work between the projected prompts and the original prompts of the same task. In Table 5, we show the similarities of prompts projected with Task Tuning projectors under the two best metrics, $C_{\mathrm{average}}$ and ON. We can see that: (1) The ON metric shows that the projected prompts are highly similar to the original prompts for tasks of the same type as the projector-training tasks, but not so similar for different-type tasks, which is quite consistent with the cross-model zero-shot transfer performance in Table 2. (2) However, $C_{\mathrm{average}}$ cannot reflect this phenomenon, which again shows that the perspective of model stimulation is more promising for understanding transferability.

Conclusion
We empirically investigate the transferability of prompts in this paper. In the cross-task setting, we find that soft prompts can transfer to similar tasks without training. In the cross-model setting, we successfully project prompts into the space of other PLMs. Further, we utilize trained prompts of other tasks or other PLMs as initialization to significantly accelerate training and improve effectiveness. Moreover, we explore various prompt transferability indicators and show that how the prompts stimulate PLMs is important to transferability. We hope the empirical analyses and the model stimulation idea can facilitate further research on transferable and efficient PT.

A.2 Evaluation Metrics
For classification tasks (SA, NLI, EJ, and PI), we use accuracy (Acc.) as their evaluation metric. As for generation tasks (QA and SUM), we utilize F1 and ROUGE-L (Lin, 2004), respectively.

A.3 Prompt Tuning Setting
In the experiments, for all the investigated tasks, we use AdamW (Loshchilov and Hutter, 2019) as the optimizer and set the learning rate to 0.001. We set the length of the soft prompts $l$ to 100. All the soft prompts are randomly initialized and optimized with Equation 1. In the inference stage, RoBERTa predicts the label tokens at the [MASK] position, and T5 directly uses its decoder to do generation. For the classification tasks (SA, NLI, EJ, and PI), we obtain answers in a ranking manner, i.e., we rank the label tokens by their likelihoods and take the label whose token has the highest likelihood as the prediction. For the conditional generation tasks (QA and SUM), we directly take the outputs of the PLMs as their answers.
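A sketch of this ranking-manner decoding; the verbalizer token ids below are hypothetical placeholders.

```python
# Score each label's verbalizer token at the [MASK] (or first decoder) position
# and predict the label whose token has the highest likelihood.
def rank_labels(position_logits, label_token_ids):
    scores = {label: position_logits[tok_id].item()
              for label, tok_id in label_token_ids.items()}
    return max(scores, key=scores.get)

label_token_ids = {"positive": 22173, "negative": 33407}   # hypothetical ids
# pred = rank_labels(logits_at_mask, label_token_ids)
```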

B.2 Unifying Label Tokens
We hypothesize that the poor transferability between different task types may result from the fact that different-type tasks usually use different label tokens, e.g., yes and no for NLI tasks but positive and negative for SA tasks. To verify whether this factor influences the transferability, we unify the label tokens of different tasks into the same set of numbers (1, 2, . . .) and choose RoBERTa_BASE for the experiments. In Figure 6, we can observe that the transferability between different-type tasks is generally not improved in this way. This indicates that different-type tasks indeed require distinct abilities, which prohibits reusing prompts between them.

B.3 Speedup Calculation
In this paper, we compute the convergence speedup and the comparable-result speedup as follows:

$$\text{Convergence Speedup} = \frac{\text{PT convergence time}}{\text{TPT convergence time}},$$

$$\text{Comparable-result Speedup} = \frac{\text{PT convergence time}}{\text{time of TPT achieving a comparable result to PT}}.$$

We calculate the training loss and the evaluation score every 100 steps during training. When the training loss stops dropping and the evaluation score stops increasing for 300 steps, we take that point as the convergence point. For the convergence speedup, the PT convergence time is divided by the TPT convergence time. For the comparable-result speedup, the PT convergence time is divided by the time of TPT achieving performance comparable to PT.
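A sketch of this bookkeeping under the stated assumptions: checks every 100 steps, with "for 300 steps" read as three consecutive checks without improvement.

```python
# Returns the training step at which a run is considered converged, given
# per-check training losses and evaluation scores (one entry per 100 steps).
def convergence_step(losses, scores, check_every=100, patience_checks=3):
    best_loss, best_score, stale = float("inf"), float("-inf"), 0
    for i, (loss_v, score_v) in enumerate(zip(losses, scores)):
        if loss_v < best_loss or score_v > best_score:       # still improving
            best_loss = min(loss_v, best_loss)
            best_score = max(score_v, best_score)
            stale = 0
        else:
            stale += 1
            if stale >= patience_checks:                     # 300 steps w/o progress
                return (i - patience_checks + 1) * check_every
    return len(losses) * check_every

# convergence_speedup       = pt_convergence_time / tpt_convergence_time
# comparable_result_speedup = pt_convergence_time / time_tpt_matches_pt_score
```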

C.1 Implementation Details of Projector
As mentioned in § 5.1, given the prompt of the source PLM $P^s = \{p^s_1, \ldots, p^s_l\}$, we concatenate its $l$ virtual tokens into a unified vector $\mathbf{P}^s \in \mathbb{R}^{l d_s}$, where $d_s$ is the hidden size of the source PLM. To transfer $\mathbf{P}^s$ to the target PLM, whose hidden size is $d_t$, we design a projection function $\mathrm{Proj}(\cdot)$ parameterized by a two-layer perceptron:

$$\tilde{\mathbf{P}}^s = \mathrm{Proj}(\mathbf{P}^s) = W_2\,\sigma(\mathbf{P}^s W_1 + b_1) + b_2, \qquad (9)$$

where $W_1 \in \mathbb{R}^{l d_s \times d_h}$ and $W_2 \in \mathbb{R}^{d_h \times l d_t}$ are trainable matrices, $b_1 \in \mathbb{R}^{d_h}$ and $b_2 \in \mathbb{R}^{l d_t}$ are biases, and $\sigma$ is a non-linear activation function. For the training configurations of the projector, the optimizer is AdamW (Loshchilov and Hutter, 2019), the training batch size is 16, and the learning rate is 0.005. For $\sigma$, we experiment with activation functions including LeakyReLU (Xu et al., 2015) and find their performance on various PLMs is similar. The reported results are based on LeakyReLU.
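A direct transcription of Equation 9 with the stated training configuration; $d_h$ and the prompt/model sizes below are illustrative (the batch size of 16 applies to the projector's training batches).

```python
import torch
import torch.nn as nn

l, d_s, d_t, d_h = 100, 768, 1024, 768            # illustrative sizes

class PromptProjector(nn.Module):
    # Proj(P_s) = W2 sigma(P_s W1 + b1) + b2  (Equation 9)
    def __init__(self):
        super().__init__()
        self.linear1 = nn.Linear(l * d_s, d_h)    # W1, b1
        self.linear2 = nn.Linear(d_h, l * d_t)    # W2, b2
        self.sigma = nn.LeakyReLU()               # reported results use LeakyReLU

    def forward(self, P_s_flat):                  # (batch, l * d_s)
        return self.linear2(self.sigma(self.linear1(P_s_flat)))

projector = PromptProjector()
optimizer = torch.optim.AdamW(projector.parameters(), lr=0.005)   # stated config
```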

C.2 More Zero-shot Transfer Performance
In § 5.2, we showed the zero-shot transfer performance of various projector-learning methods in the setting of transferring from RoBERTa_LARGE to T5_XXL. Here we explore more cross-model transfer settings, transferring between various PLMs of different scales and heterogeneous frameworks: from BERT_BASE to RoBERTa_BASE, from RoBERTa_BASE to RoBERTa_LARGE, and from T5_BASE to T5_XXL. The results in Table 7 are all consistent with § 5.2.

C.3 Technical Details of TPT MODEL
In § 5.3, we demonstrate that cross-model transferable prompt tuning (TPT_MODEL) can improve performance and reduce training time. However, when we apply TPT_MODEL to more PLMs, we find that the projected prompts may have quite different $L_2$ norms from the original prompts, especially for small-scale PLMs (e.g., from BERT_BASE to RoBERTa_BASE). Specifically, we obtain the projected prompts with the trained Task Tuning projector and find that the projected prompts are hard to optimize on some tasks, as shown in Figure 7 [Without LayerNorm]. Thus, we add a layer normalization operation (Ba et al., 2016) to the projector to regularize the norm of the projected prompt:

$$\tilde{\mathbf{P}}^s = \mathrm{LayerNorm}\big(W_2\,\sigma(\mathbf{P}^s W_1 + b_1) + b_2\big).$$

With LayerNorm, the projected prompts work well in TPT_MODEL and achieve better performance and speedup, as shown in Figure 7 [With LayerNorm].

D.1 Distinguishing Prompts with Similarity Metrics

We categorize all prompts into three groups: same tasks (prompts trained with different seeds on the same dataset), same-type tasks, and different-type tasks. Table 9 shows that all the similarity metrics successfully distinguish task types.

D.2 Correlation Between Prompt Transferability and Prompt Similarity
In § 6, we provide the overall averaged Spearman's rank correlation scores (%) between various similarity metrics and the zero-shot transfer performance of soft prompts for RoBERTa_LARGE and T5_XXL.
Here, we further show the Spearman's rank correlation scores grouped by task type on more PLMs. The results are shown in Table 10 and Table 11.

D.3 PLMs' Redundancy Influence Indicators
From Table 10, we find that the correlation between prompt transferability and prompt similarity drops as the PLM size increases. We conjecture that this phenomenon may result from PLMs' high redundancy (Aghajanyan et al., 2021).
To try to overcome this, we simultaneously utilize the prompts trained with three random seeds on the same dataset and take the intersection of their activation states as the activated neurons in the similarity (ON) computation. We call this similarity ON_I. Using it, the correlation score of ON rises significantly, as shown in Table 10.
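A sketch of ON_I under this description, with activation-state vectors as in § 6.1 (one per seed):

```python
import torch
import torch.nn.functional as F

def on_i(states_t1, states_t2):
    # states_*: list of 3 one-hot activation-state vectors, one per random seed.
    s1 = torch.stack(states_t1).prod(dim=0)   # 1 only where active under all seeds
    s2 = torch.stack(states_t2).prod(dim=0)
    return F.cosine_similarity(s1.unsqueeze(0), s2.unsqueeze(0)).item()
```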

D.4 Overlapping Rate of Activated Neurons in Different Layers
To further understand model stimulation in PLMs, we investigate ON in different layers of PLMs. Specifically, on RoBERTa_BASE, we measure the similarity between different prompts with the activation states of layers 1 to 3 (Figure 8), layers 4 to 6 (Figure 9), layers 7 to 9 (Figure 10), layers 10 to 12 (Figure 11), and all 12 layers (Figure 12), respectively.