Parameter-Efficient Multi-task Fine-tuning by Learning to Transfer Token-wise Prompts

Prompt tuning has proven successful on various tasks by incorporating a small number of trainable parameters while freezing large pre-trained language models (PLMs). However, it remains unsettled how to generate more appropriate prompts for individual examples and how to extend prompt tuning to multi-task learning scenarios by leveraging cross-task features. To address these challenges, we propose token-wise prompt tuning (TPT), in which a bank of finer-grained soft prompt tokens is built for multi-task learning via a memory network. Tokens are retrieved from the bank against an input example and assembled into an instance-dependent prompt. Extensive experimental results on 14 datasets demonstrate that models enhanced by TPT perform far better than fully fine-tuned models and achieve state-of-the-art performance while tuning only 0.035% of the parameters.


Introduction
The Transformer architecture (Vaswani et al., 2017) has yielded impressive performance on various natural language processing (NLP) tasks and has been widely established as the building block for PLMs. The dominant paradigm is to pre-train on large-scale unlabeled datasets and then fine-tune on task-related datasets (Devlin et al., 2019; Raffel et al., 2019; Radford and Narasimhan, 2018). However, performing full parameter fine-tuning for each task becomes prohibitively expensive as model scale grows. Thus, there has been growing interest in developing parameter-efficient fine-tuning (PEFT) methods (Houlsby et al., 2019; Hu et al., 2021; Ben-Zaken et al., 2021) that strive to achieve results comparable to full parameter fine-tuning with a small number of trainable parameters.
Prompt learning (Lester et al., 2021), as a new and effective PEFT method, can boost the model's performance on various tasks by simply adding additional context to the input. Although promising, there are at least two limitations: (a) it overlooks the inherent differences among instances. Even a well-learned prompt might not be suitable for all data instances within a large population, as highlighted by Scao and Rush (2021). (b) It fails to leverage rich cross-task features, as the learned prompts are exclusively designed for individual tasks, making it difficult for these prompts to be reused or transferred across tasks (Vu et al., 2021).
To overcome the first limitation, we propose a novel approach to automatically generate a more suitable prompt for each input example. Bari et al. (2022) proposed retrieving non-trainable tokens, also referred to as hard prompts, from the embedding layer of the language model, which can enhance both the training and inference processes of the model. Compared to directly copying the fixed embedding layer as a source of extra information for each example, training an additional embedding layer on the target task can provide more appropriate information. Therefore, we go further along this line by decomposing the trainable soft prompt into finer-grained soft prompt tokens; these tokens constitute the token-wise prompt bank, which can be viewed as a trainable embedding layer. A memory network (Weston et al., 2014) is used to store, tune, and combine these tokens, as illustrated in Figure 1. In contrast to previous methodologies that treat the soft prompt as a single whole, our approach dissects it into fine-grained prompt tokens. This refined breakdown widens the search space, facilitating a more exhaustive combination of soft tokens and ultimately leading to the generation of superior prompts.
To address the second issue, we extend the process of constructing a token-wise prompt bank (i.e., the memory network of token-wise soft prompts) to multi-task learning scenarios. Many features can be shared between different tasks, and these features can be learned through multi-task learning (Mahabadi et al., 2021). Additionally, prior work by Vu et al. (2020) demonstrated that performing prompt tuning on intermediate tasks before doing so on the target task can yield even better results. Following their recipe, we first pre-train the token-wise prompt bank across multiple source tasks, then utilize the resulting bank as initialization to train the token-wise prompt bank specifically for the target task.
Finally, we extend our approach, called token-wise prompt tuning (TPT), by combining the token-wise prompt bank with task-specific prompt tuning, as illustrated in Figure 2. In our approach, all examples within a given task share a task-level prompt, which is generated by task-specific prompt tuning. Additionally, for each individual example, an instance-level prompt is retrieved based on the similarity between the input example and the tokens in the token-wise prompt bank. These two prompts are concatenated together, incorporating both instance-level and task-level features, as part of the input to facilitate model inference. Extensive experimental results on 14 datasets demonstrate the effectiveness of our method.
The contributions of this study can be summarized as follows:

• This study is among the first to introduce token-wise prompt tuning by decomposing soft prompts into tokens and constructing a bank of trainable tokens via a memory network.

• We extend the token-wise prompt bank to multi-task learning scenarios, which demonstrates a remarkable boost in transfer learning on both seen and unseen tasks.

• Empirical results on 14 different datasets demonstrate the effectiveness of TPT, which outperforms existing prompt-based methods by a significant margin in accuracy and even outperforms full parameter fine-tuning on both GLUE and SuperGLUE while tuning only 0.035% of the parameters.

Related Work
Task-dependent prompt. This line of research focuses on generating more effective prompts for specific target tasks. Specifically, Brown et al. (2020) introduced the utilization of a small set of manually crafted sentences as a prompt, typically consisting of task descriptions and relevant examples. The prompt is fed to the frozen model as part of the input and, particularly when well-designed, can enable the pre-trained model to achieve performance comparable to fine-tuned models.
Auto-prompt (Shin et al., 2020), LM-BFF (Gao et al., 2021), and EFL (Wang et al., 2021a) extend this direction by automating the generation of discrete prompts. However, optimizing prompts within discrete spaces is challenging and likely to be sub-optimal. Prompt tuning (Lester et al., 2021), prefix tuning (Li and Liang, 2021), and P-tuning (Liu et al., 2021) adopt an alternative strategy by introducing continuous vectors, known as soft prompts, in front of the input sequence. Only these continuous vectors need to be adjusted during training, so the optimization problem in discrete spaces is converted to a continuous one that can be handled through simple gradient descent.
Moreover, later work has started to consider prompt tuning in the transfer learning scenario. Su et al. (2021) and SPoT (Vu et al., 2021) explore the transferability of prompts learned from different tasks and address the sensitivity of prompt tuning to initialization through transfer learning. Wang et al. (2023) and PANDA (Zhong et al., 2022) proposed learning a transferable prompt on source tasks through knowledge distillation.
Instance-dependent prompt.This line of research takes into account the individual characteristics of different examples and generates distinct prompts tailored to each specific example.
In particular, Levine et al. (2022) and IDPG (Wu et al., 2022) generate instance-wise prompts via multi-layer perceptrons (MLPs) based on the input encoded by the language model. Li et al. (2022) and Wang et al. (2021b) maintain a prompt pool that stores the prompts learned over source tasks, where each prompt is classified into specific categories and assigned key vectors. The encoded input serves as the query vector, and the target prompts are obtained by weighting the prompts in the pool according to the query-key calculations. In addition, ATTEMPT (Asai et al., 2022) calculates weights simply based on the similarity between the input and the prompts learned from the source tasks, without the need for pre-computed clusters.
Unlike these approaches, which weight the soft prompt as a whole, we utilize finer-grained prompt tokens for combination, and only the tokens retrieved according to the input receive gradients during training. Therefore, a more appropriate prompt can be generated for each example. SPT (Bari et al., 2022) proposed using a retrieved non-trainable hard prompt as a prefix to guide the training of the prompt. In contrast, our method retrieves trainable soft prompt tokens and can be extended to multi-task learning scenarios to incorporate cross-task features.

Preliminaries
Prompt Tuning. Given a pre-trained LM with parameters $\theta$ and a target task $T_{target}$ with training data $\mathcal{D} = \{X_i, y_i\}_{i=1}^{N}$, conventional full parameter fine-tuning (FT) seeks to maximize the likelihood of decoding the desired output $y_i$ given input $X_i$ over the training data $\mathcal{D}$:

$$\max_{\theta} \sum_{i=1}^{N} \log p_{\theta}(y_i \mid X_i). \quad (1)$$

Unlike FT, prompt tuning freezes the pre-trained language model and only trains a very small number of parameters. Specifically, it prepends $m$ randomly initialized vectors, also known as the soft prompt $P = \{p_1, p_2, \cdots, p_m\}$ with $p_i \in \mathbb{R}^d$, before the input $X_i$. The optimization goal of prompt tuning is:

$$\max_{P} \sum_{i=1}^{N} \log p_{\theta}(y_i \mid [P; X_i]), \quad (2)$$

where $\theta$ is frozen and only $P$ is trainable.
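To make this setup concrete, the following is a minimal PyTorch-style sketch of vanilla prompt tuning. It assumes a HuggingFace-style seq2seq model that accepts `inputs_embeds` and `labels`; all names, shapes, and initialization constants are illustrative rather than the authors' implementation:

```python
import torch
import torch.nn as nn

class PromptTuning(nn.Module):
    """Vanilla prompt tuning: theta is frozen, only the m soft prompt vectors train."""

    def __init__(self, frozen_lm, m: int = 100, d: int = 768):
        super().__init__()
        self.lm = frozen_lm
        for p in self.lm.parameters():
            p.requires_grad = False                           # freeze theta
        self.prompt = nn.Parameter(torch.randn(m, d) * 0.02)  # P = {p_1, ..., p_m}

    def forward(self, input_embeds, labels=None):
        # prepend the shared soft prompt to every example: [P; X_i]
        batch = input_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        full = torch.cat([prompt, input_embeds], dim=1)
        # the LM loss corresponds to maximizing log p_theta(y_i | [P; X_i])
        return self.lm(inputs_embeds=full, labels=labels)
```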
Prompt Transfer. Transfer learning methods attempt to learn a new target task given a collection of source tasks $T_{source} = \{T_1, T_2, \cdots, T_t\}$, and have long been a way to improve the effectiveness and efficiency of NLP systems (Ruder, 2017). Recent studies (Vu et al., 2021; Su et al., 2021) have demonstrated the applicability of transfer learning in the context of prompt tuning, also referred to as prompt transfer. Instead of training the prompt from scratch on the target task, these approaches employ the source prompts $P_{source}$, which are trained according to Equation (2) over the source tasks. These source prompts can then serve as either initialization vectors or weighted vectors for training the target prompt.

Method
Our proposed method, TPT (illustrated in Figure 2), consists of two stages: pre-training the token-wise prompt bank (Section 4.1) and joint prompt tuning (Section 4.2).
TPT pre-trains a token-wise prompt bank that integrates cross-task features on various source tasks $T_{source} = \{T_1, T_2, \cdots, T_t\}$, and then utilizes the resulting bank as the initial token-wise prompt bank of the next stage to generate an instance-level retrieved prompt for each example. In addition, all examples of the target task $T_{target}$ share the same task-level soft prompt. These two kinds of prompts are concatenated with the input as the final input to the frozen LM, providing both instance-level and task-level features that enhance the model's training and inference, leading to improved performance.

Pre-training Token-wise Prompt Bank
We first pre-train a token-wise prompt bank over $t$ high-resource source tasks $T_{source}$ via the memory network. The examples from multiple datasets are mixed together, enabling the implicit integration of cross-task features within the learning process of the token-wise prompt bank and thereby endowing it with a powerful capacity for knowledge transfer.
Figure 2: The overall process of TPT. The first step is to pre-train a token-wise prompt bank that absorbs cross-task features on multiple source tasks. The second step utilizes this prompt bank as initialization, transfers the knowledge of the source tasks to the target task, generates a retrieved prompt for each example, and jointly trains it with the soft prompt on the target task. The instance-level retrieved prompt and the task-level soft prompt provide richer contextual information that helps the model train and infer better.

Formally, given the training data $\mathcal{D}_{source} = \{X_i, y_i\}$ mixed from the source tasks, where $X_i \in \mathbb{R}^{l \times d}$, $l$ is the input length, and $d$ is the dimension of the hidden state, the input $X_i$ and a randomly initialized token-wise prompt bank $B$ are simultaneously sent to the retrieval module, and an instance-level retrieved prompt $R_i$ is generated for $X_i$ according to the similarity results. The retrieved prompt $R_i$ is prepended in front of the input sequence $X_i$ to form $[R_i; X_i]$, which serves as the final input of the frozen LM. The training objective is to maximize the likelihood of conditional generation over the mixed data:

$$\max_{B} \sum_{(X_i, y_i) \in \mathcal{D}_{source}} \log p_{\theta}(y_i \mid [R_i; X_i]),$$

where $\theta$ is frozen and only the retrieved tokens of $B$ are updated.
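As a rough sketch of this pre-training stage, the bank can be trained like any other parameter while the LM stays frozen. The dataset stand-ins below are hypothetical placeholders (not the paper's exact six-task mixture), and the bank size is an assumption:

```python
import torch
import torch.nn as nn
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

n_tokens, d = 1000, 768                                  # assumed bank size and hidden dim
bank = nn.Parameter(torch.randn(n_tokens, d) * 0.02)     # token-wise prompt bank B

# Stand-ins for preprocessed source tasks; each yields (input_embeds, label) pairs.
def fake_task(n_examples, seq_len=16):
    return TensorDataset(torch.randn(n_examples, seq_len, d),
                         torch.randint(0, 2, (n_examples,)))

# Mix the examples of all source tasks into a single training stream.
source_tasks = [fake_task(500) for _ in range(6)]
loader = DataLoader(ConcatDataset(source_tasks), batch_size=32, shuffle=True)

optimizer = torch.optim.Adam([bank], lr=0.3)             # only B receives updates
```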

Soft Prompt Decomposition
Unlike previous prompt-based methods, which weight or combine prompts as a whole (Asai et al., 2022; Li et al., 2022; Wang et al., 2021b), TPT disassembles prompts into finer-grained prompt tokens for these operations, enabling a more comprehensive amalgamation of soft tokens and thereby expanding the range of possible combinations.
For this reason, what is stored in the bank is not the soft prompts trained on the source tasks, but smaller units of prompts, namely the $n$ soft prompt tokens $B = \{t_1, t_2, \cdots, t_n\}$, where $t_j \in \mathbb{R}^d$. The $n$ soft prompt tokens in the token-wise prompt bank calculate similarity scores with the input, and the $k$ tokens with the highest scores are retrieved and concatenated into the instance-level retrieved prompt $R_i = \{t_{r^i_1}, t_{r^i_2}, \cdots, t_{r^i_k}\}$, where $r^i_k$ indicates the index in the token-wise prompt bank of the $k$-th token of the retrieved prompt $R_i$ generated for the $i$-th example $X_i$.

During the training process, only the $k$ tokens $\{t_{r^i_1}, \cdots, t_{r^i_k}\}$ retrieved from the token-wise prompt bank receive gradients for adjustment; the tokens that have not been retrieved remain untouched.

Similarity Score Estimation
The retrieval module controls which tokens to select from the token-wise prompt bank each time an instance-level retrieved prompt is generated by calculating the similarity between the example and prompt tokens in the token-wise prompt bank.
Specifically, the retrieval module generates the similarity scores $S_i = \{s^i_1, s^i_2, \cdots, s^i_n\}$ between the input $X_i$ and the $n$ tokens in the bank, where $s^i_j$ denotes the similarity score between $X_i$ and the $j$-th token. The tokens located at the $k$ indexes $\{r^i_1, r^i_2, \cdots, r^i_k\}$ of the token-wise prompt bank that possess the highest similarity scores with the input are retrieved and concatenated into the retrieved prompt of that example, as follows:

$$\{r^i_1, r^i_2, \cdots, r^i_k\} = \mathrm{Index}(\mathrm{TopK}(S_i)),$$

where the $\mathrm{TopK}()$ function returns the largest $k$ values of the given input, and the $\mathrm{Index}()$ function returns the subscript of a given value, i.e., its index in the bank.
To deal with inputs of various lengths, we apply max-pooling over the input $X_i$ to generate a pooled input $\bar{x}_i \in \mathbb{R}^d$ with the same dimension as the soft prompt tokens $t_j$ in the token-wise prompt bank. The pooled $\bar{x}_i$ is fed into the retrieval module, which calculates the similarity score between $\bar{x}_i$ and $t_j$ as their inner product:

$$s^i_j = \langle \bar{x}_i, t_j \rangle,$$

where $\langle \cdot, \cdot \rangle$ denotes the inner product.
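A compact sketch of the retrieval module under these definitions (shapes and the argument names are assumptions; `bank` is the trainable matrix holding the $n$ tokens):

```python
import torch

def retrieve_prompt(bank: torch.Tensor, input_embeds: torch.Tensor, k: int) -> torch.Tensor:
    """Retrieve the k bank tokens most similar to each pooled input.

    bank:         (n, d) token-wise prompt bank B = {t_1, ..., t_n}
    input_embeds: (batch, l, d) embedded input X_i
    returns:      (batch, k, d) instance-level retrieved prompt R_i
    """
    pooled = input_embeds.max(dim=1).values  # max-pooling -> pooled x_i, shape (batch, d)
    scores = pooled @ bank.T                 # s^i_j = <x_i, t_j>, shape (batch, n)
    _, idx = scores.topk(k, dim=-1)          # Index(TopK(S_i)), shape (batch, k)
    # advanced indexing gathers rows; in backward, only these rows of B get gradients
    return bank[idx]
```

Because the top-k indices are discrete, gradients flow only through the gathered rows of the bank, which matches the description above that non-retrieved tokens remain untouched.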

Joint Prompt Tuning
During joint prompt tuning, the token-wise prompt bank trained in the first stage is used as initialization to generate the instance-level retrieved prompt according to the method described in Section 4.1. Transfer learning is performed on this basis, and the knowledge of the source tasks is transferred to the target task.
In addition, for all examples of the target task, we initialize a shared task-level soft prompt $P = \{p_1, p_2, \cdots, p_m\}$, where $P \in \mathbb{R}^{m \times d}$. The instance-level retrieved prompt $R_i$ and the task-level soft prompt $P$ are concatenated in front of the input to form $[R_i; P; X_i]$, which is fed into the frozen LM as contextual information.
During the training process, the retrieved prompt $R_i$ and the soft prompt $P$ are adjusted simultaneously, and the optimization objective becomes maximizing the likelihood of decoding the desired output $y_i$ given input $X_i$ over the training data $\mathcal{D} = \{X_i, y_i\}_{i=1}^{N}$:

$$\max_{P, B} \sum_{i=1}^{N} \log p_{\theta}(y_i \mid [R_i; P; X_i]),$$

where $\theta$ is frozen. In contrast to vanilla prompt tuning, which only provides a shared task-level soft prompt $P$ for all examples of the target task, TPT additionally provides an instance-level retrieved prompt $R_i$ specific to each example $X_i$ as a complement. This instance-level prompt $R_i$ captures the particular information related to the input $X_i$, while the task-level prompt $P$ encompasses the overall information from the training data $\mathcal{D}$ to which $X_i$ belongs. By incorporating these distinct levels of features, a more comprehensive context is established for $X_i$, facilitating enhanced model training and inference.
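Reusing the `retrieve_prompt` sketch from above, the joint forward pass can be outlined as follows. This is a minimal illustration, with the value of `k` and the HuggingFace-style `inputs_embeds`/`labels` interface as assumptions:

```python
import torch

def tpt_forward(lm, bank, task_prompt, input_embeds, labels, k: int = 20):
    """Joint prompt tuning step: feed [R_i; P; X_i] to the frozen LM.

    bank:        (n, d) trainable token-wise prompt bank B
    task_prompt: (m, d) trainable task-level soft prompt P
    k:           assumed number of retrieved tokens per example
    """
    retrieved = retrieve_prompt(bank, input_embeds, k)        # R_i, (batch, k, d)
    batch = input_embeds.size(0)
    task = task_prompt.unsqueeze(0).expand(batch, -1, -1)     # P broadcast over batch
    full = torch.cat([retrieved, task, input_embeds], dim=1)  # [R_i; P; X_i]
    # LM parameters stay frozen; the loss updates only R_i's bank rows and P
    return lm(inputs_embeds=full, labels=labels)
```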

Experiments
Following previous prompt-based methods (Lester et al., 2021; Asai et al., 2022), we perform experiments on 14 different datasets under full-dataset and few-shot settings; the experimental results show the effectiveness of TPT in various scenarios.

Datasets and Tasks
The TPT method consists of a first stage of pre-training on source tasks and a second stage of task adaptation on the target task. Specifically, we utilize 6 high-resource datasets as source tasks and select 14 tasks from GLUE (Wang et al., 2018), SuperGLUE (Wang et al., 2019), and SciTail (Khot et al., 2018) as target tasks for evaluation.

Models
Following the standard approach in previous prompt-based methods (Lester et al., 2021; Asai et al., 2022; Wang et al., 2023), we mainly experiment with the publicly available pre-trained T5-base model with 220M parameters. In our ablation study, we also consider the T5-small (60M) and T5-large (770M) models.

Baselines
We compare TPT with the following baselines: (1) full parameter fine-tuning (FT), where all the model parameters are tuned during adaptation on each downstream task, while the other methods tune only the specific components mentioned below; (2) prompt tuning (PT) (Lester et al., 2021), where target prompt vectors are initialized from randomly sampled top vocabulary embeddings; (3) SPoT (Vu et al., 2021) and ATTEMPT (Asai et al., 2022), which initialize target prompts by retrieving or aggregating prompts trained over source tasks; (4) Adapter (Houlsby et al., 2019) and AdapterDrop (Rücklé et al., 2020), which insert adapter layers in the middle of the model; (5) BitFit (Ben-Zaken et al., 2021), which only adjusts the bias terms; and (6) MPT (Wang et al., 2023), which learns the target prompt through knowledge distillation.

Implementation Details
To pre-train the token-wise prompt bank, we conduct a 5-epoch training phase on a mixture of 6 high-resource source tasks. For joint prompt tuning, we reuse the trained token-wise prompt bank to generate the instance-level retrieved prompt and use the prompt trained on the target task or an intermediate task to initialize the soft prompt. Unless specified otherwise, we use T5-base as the base LM for TPT; more details are given in Appendix A.2. If a dataset does not have a public test split with annotations, we use the development set as our test set or split the development set into development and test sets, following Davison (2021).
In few-shot experiments, for each number of shots k, following Mahabadi et al. (2021) and Asai et al. (2022), we randomly sample from the training set 3 times with different random seeds and report the mean performance. In addition, the task-level soft prompt is either initialized randomly or, following Vu et al. (2021), initialized with the prompt trained on the MNLI dataset.

Results
We present the main results: full-data adaptation in Section 5.5.1, few-shot adaptation in Section 5.5.2, and parameter efficiency in Section 5.5.3.

Full-data adaptation
Table 1 presents the per-task performance of different methods on all datasets.
As shown in Table 1, TPT establishes state-of-the-art (SOTA) performance on the GLUE and SuperGLUE datasets compared with these baselines. Specifically, we achieve SOTA on the high-resource datasets MNLI (93.2%) and SST-2 (94.7%), and on the low-resource datasets RTE (82.3%), WSC (67.3%), and CB (94.6%), which demonstrates the effectiveness of TPT across various data-resource scenarios. Compared to vanilla prompt tuning, TPT obtains a relative improvement of 13.4% on GLUE and 19.4% on SuperGLUE, surpassing vanilla prompt tuning across all datasets by a large margin. This result further shows that the instance-level retrieved prompt composed of prompt tokens is complementary to the task-level soft prompt generated by prompt tuning, and that this complementary effect is universal. Moreover, TPT outperforms full parameter fine-tuning (FT) on GLUE by 0.7% and on SuperGLUE by 4.7%, despite tuning only 0.245% as many task-specific parameters.

Few-shot adaptation
Following Mahabadi et al. (2021), Asai et al. (2022), and Wang et al. (2023), we conduct few-shot experiments on BoolQ, CB, and SciTail to further verify the effectiveness of TPT under the resource-constrained setup. Table 2 shows the results of our approach and other baselines, which include full parameter fine-tuning, Adapter, prompt tuning, SPoT, HyperFormer, ATTEMPT, and MPT. Specifically, TPT outperforms the other methods in certain cases, achieving SOTA on BoolQ (4- and 16-shot) and CB (16- and 32-shot). These results clearly indicate that TPT can effectively transfer cross-task features from source tasks to target tasks in few-shot domain adaptation.

Parameter efficiency
Figure 3 compares the performance of different models against their number of updated parameters on GLUE and SuperGLUE. TPT-f is a variant of TPT in which the parameters of the bank are frozen during joint training, so only the parameters of the task-level soft prompt need to be adjusted. TPT outperforms all other baselines on both GLUE and SuperGLUE with only a small number of parameter adjustments, especially compared with full-parameter fine-tuning. TPT-f still maintains very high accuracy (y-axis) while adjusting an even smaller number of parameters per task (x-axis). TPT-f adjusts as many parameters as vanilla prompt tuning, yet its performance on GLUE and SuperGLUE exceeds prompt tuning by a large margin, which shows that both TPT and TPT-f are highly parameter-efficient.
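In implementation terms, a minimal sketch of the TPT-f variant, assuming the `bank` and `task_prompt` parameters from the earlier sketches:

```python
# TPT-f: freeze the bank after source pre-training; only the task-level prompt trains.
bank.requires_grad_(False)
optimizer = torch.optim.Adam([task_prompt], lr=0.3)
```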

Ablation Study
Model Scaling. We empirically analyze how increasing the backbone LM size affects TPT performance. Figure 4 shows the performance of TPT as well as full parameter fine-tuning, Adapter, ATTEMPT, prompt tuning, and MPT with three different T5 models (T5-small, T5-base, T5-large). These results show that TPT largely benefits from increases in backbone LM size, which is aligned with the findings of Lester et al. (2021). Furthermore, TPT demonstrates effectiveness across a wide range of model scales, spanning from 60M to 770M parameters. As the model size increases, the advantages of TPT become increasingly pronounced. In particular, when the model size is large, TPT surpasses other baselines on all three datasets. In addition, we also conducted experiments on a larger-scale model (T5-3B); Table 3 shows that our TPT method remains strongly competitive in the era of large models.

Method  GLUE
PT      79.7
TPT     88.1

Table 3: Experimental results on a large language model (T5-3B).
Effectiveness of the token-wise prompt bank. We also conduct experiments to assess the effectiveness of solely utilizing the retrieved prompt, abbreviated as RP, generated from the token-wise prompt bank. Instead of concatenating the instance-level retrieved prompt with the task-level soft prompt and performing joint prompt tuning, we exclusively prepend the instance-level retrieved prompt, composed of tokens retrieved from the token-wise prompt bank, to the input during task adaptation. Specifically, RP-S signifies training the prompt bank for the target task from scratch, while RP-M involves pre-training a token-wise prompt bank on multiple source tasks and subsequently employing it as initialization to train the token-wise prompt bank for the target task. In addition, the training method of RP-W is similar to that of RP-S. But unlike RP-S, which decomposes soft prompts into finer-grained prompt tokens and then retrieves and adjusts these tokens in the token-wise prompt bank, RP-W treats soft prompts as a whole and then retrieves and adjusts these prompts in the prompt bank.
The results presented in Table 4 reveal that, when using only the retrieved prompt, RP-S outperforms vanilla PT and RP-W by a large margin, and RP-M surpasses ATTEMPT by a large margin. This validates that dismantling soft prompts into finer-grained prompt tokens and then combining them can generate a more suitable prompt for each example, and it also demonstrates the effectiveness of our token-wise prompt bank. Moreover, RP-M performs better than RP-S, which indicates that multi-task learning on source tasks can facilitate a beneficial transfer effect on both seen and unseen target tasks.

Table 4: The effectiveness of the token-wise prompt bank. "RP" indicates that only the instance-level prompt retrieved from the token-wise prompt bank is used. "-S" means training the bank from scratch on the target task, "-M" means performing multi-task learning on multiple source tasks and then performing transfer learning on the target task to train the bank, and "-W" means treating the soft prompt as a whole.

Combination Methods. We also explore the impact of two different methods of combining instance-level prompts and task-level prompts on performance: (1) following the approach of ATTEMPT (Asai et al., 2022), the values at corresponding positions in the two prompts are directly added; (2) prepending the instance-level prompt in front of the task-level prompt, as in Bari et al. (2022). The results in Table 5 show that the second method yields superior performance. This finding suggests that processing the task-level features and instance-level features separately, rather than adding them into a single merged vector, leads to better outcomes.

Prompt Initialization. We explore the impact of soft prompt initialization in the context of the joint prompt tuning process. Our investigation focuses on three distinct initialization strategies: (1) Random Initialization: this approach replicates embeddings of the most frequent tokens in the vocabulary.

(2) SPoT Initialization: following the methodology of SPoT, we employ the prompt trained on the MultiNLI dataset as the initialization for the sentence-level classification target task. (3) Target Task Initialization: we utilize the prompt trained specifically for the target task as the initialization. By examining these different strategies, we aim to understand the effects of soft prompt initialization on the overall performance of the joint prompt tuning process. The results presented in Table 6 demonstrate that initializing the task-level soft prompt in any of the three ways for joint prompt tuning is much better than vanilla prompt tuning: soft prompts initialized by all of these methods show significantly improved performance on all datasets after being prefixed with our proposed instance-level prompt, verifying that the instance-level retrieved prompt is complementary to all of these different task-level soft prompts.
Furthermore, employing the prompt trained on the target task as the initialization yields the most favorable outcomes, while the randomly initialized prompt exhibits relatively poor results. This observation also indicates that a more task-related soft prompt can play a greater role during joint prompt tuning.

Conclusions
In this study, we have introduced TPT, a novel parameter-efficient fine-tuning method designed to address the challenges of generating more suitable prompts for individual examples and extending prompt tuning to multi-task learning scenarios to capture cross-task features. TPT harnesses a memory network to construct a finer-grained token-wise prompt bank, retrieves an instance-level prompt from it for each example, and combines it with a task-level soft prompt, outperforming strong baselines on 14 datasets while tuning only a tiny fraction of the parameters.

Limitations
We have demonstrated the potential of integrating instance-dependent prompts, derived from token-wise prompts, with task-specific prompts to enhance performance. It would be intriguing to examine the feasibility of generating task-specific prompts on-the-fly, leveraging the assembly and retrieval of their token-wise, fine-grained prompts. Additionally, our future research will focus on the creation of a generalized token-wise soft prompt model applicable across a wide spectrum of NLP tasks, rather than being restricted to a select few.

A.1 Detailed Results
The following provides more detailed information for the experimental section of this paper.
Table 7 provides a comprehensive breakdown of the outcomes obtained from GLUE and SuperGLUE evaluations, specifically focusing on task-level soft prompt initialization. This approach combines the task-specific soft prompt with the instance-dependent retrieved prompt in order to optimize prompt tuning. The table compares the results for the three distinct methods employed in this initialization process.
Table 8 presents a comprehensive analysis of the outcomes obtained from GLUE and SuperGLUE assessments when exclusively relying on the instance-level retrieved prompt as the supplementary context. The aim of this investigation is to evaluate the efficacy of the token-wise prompt bank in generating appropriate prompts for each input example.
Table 9 provides a detailed presentation of the results obtained from GLUE and SuperGLUE evaluations, specifically focusing on the two distinct combination methods employed to combine the instance-dependent prompt and the task-specific prompt. The table highlights the outcomes achieved by these combination approaches.
Finally, Table 10 presents a comprehensive analysis of TPT-f, a variant of TPT, in the context of GLUE and SuperGLUE evaluations. TPT-f effectively reduces the number of adjustable parameters in comparison to TPT, while demonstrating comparable performance.

A.2 Training details
Hyperparameters. For TPT, we use a prompt length of m = 100 for each prompt, a learning rate of 0.3 for prompt tuning to train the task-specific prompt, and a weight decay of $1 \times 10^{-5}$. We also use a learning rate of 0.3 for pre-training the token-wise prompt bank and for joint prompt tuning, and optimize the objective function using Adam (Kingma and Ba, 2014). In particular, we use a learning rate of 0.1 for SuperGLUE and for the Yelp, WinoGrande, SciTail, and PAWS multi-task experiments, and 0.3 for the other experiments. We also tried different schedulers: when training the task-level soft prompt, we choose a constant learning rate of 0.3, and for the other experiments we also try a linear scheduler.
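For convenience, the hyperparameters stated above can be collected in one place. The values below are taken from the text; the dictionary keys themselves are just illustrative naming:

```python
# Hyperparameters reported in the text, gathered for reference.
TPT_HPARAMS = {
    "prompt_length_m": 100,         # task-level soft prompt length
    "lr_task_prompt": 0.3,          # prompt tuning of the task-specific prompt
    "lr_bank_pretraining": 0.3,     # pre-training the token-wise prompt bank
    "lr_joint_tuning": 0.3,         # 0.1 for SuperGLUE and the Yelp/WinoGrande/
                                    # SciTail/PAWS multi-task experiments
    "weight_decay": 1e-5,
    "optimizer": "Adam",            # Kingma and Ba (2014)
    "scheduler": "constant",        # constant 0.3 for the task-level prompt;
                                    # a linear scheduler was also tried elsewhere
}
```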
Few-shot Adaptation Experiment Details. Following Mahabadi et al. (2021), we run each few-shot adaptation experiment three times and take the mean performance. We cite the performance of full parameter fine-tuning (FT), Adapter (AD), and HyperFormer (HF) from Mahabadi et al. (2021), as well as prompt tuning (PT), SPoT (ST), ATTEMPT (ATP) (Asai et al., 2022), and MPT (Wang et al., 2023). The task-level soft prompt is either randomly initialized or initialized with the prompt trained on MNLI.
Per-device batch size for TPT and prompt tuning. For T5-small and T5-base, we set the per-GPU batch size to 100 and 32, respectively, while for T5-large we use a batch size of 16.

Figure 1 :
Token-wise prompting solution. An input text of length $l$, $\{x_1, x_2, \cdots, x_l\}$, is sent to a pooling module that produces a feature vector $\bar{x}$ for the input. A retrieval module retrieves the most similar prompt tokens, $\{t_3, t_5, \cdots, t_n\}$, from a prompt bank based on the similarity scores estimated between $\bar{x}$ and the $n$ soft prompt tokens $\{t_1, t_2, \cdots, t_n\}$ stored in the bank. The retrieved prompt tokens are assembled into an instance-dependent prompt and concatenated with the original input. The concatenated result is sent to a frozen LM for both inference and training, in which only the retrieved prompt tokens are tuned through error back-propagation.

Table 1 :
Results on GLUE and SuperGLUE. All results are based on T5-base models. The middle of the table shows the results of the prompt-based methods, the top shows the results of other PEFT methods, and the bottom shows the results of our proposed TPT. For these experiments, we exclude SQuAD and ReCoRD from the source prompt inventories for comparison with prior work. We use Pearson correlation for STS-B, F1 for MultiRC (Multi), and accuracy for the other tasks as metrics. "param/task" denotes the number of parameters trained for each task in GLUE.

Table 5 :
The impact of different combinations of instance-dependent prompts and task-specific prompts.

Table 6 :
The impact of different initialization methods. The results are indicated by "Intermediate task", where the prompts are initialized with those trained on the intermediate task, and by "Target task", where the prompts are initialized with those tuned on the target task.

Table 7 :
Experimental results of different initialization methods of task-level soft prompt to perform TPT on GLUE and SuperGLUE.

Table 8 :
Experimental results of a single instance-level retrieved prompt on GLUE and SuperGLUE.

Table 9 :
Experimental results of TPT on GLUE and SuperGLUE through different prompt combination methods.

Table 10 :
Experimental results of TPT variant TPT-f on GLUE and SuperGLUE.