FPT: Improving Prompt Tuning Efficiency via Progressive Training

Recently, prompt tuning (PT) has gained increasing attention as a parameter-efficient way of tuning pre-trained language models (PLMs). Despite extensively reducing the number of tunable parameters and achieving satisfying performance, PT is training-inefficient due to its slow convergence. To improve PT's training efficiency, we first make some novel observations about the prompt transferability of "partial PLMs", which are defined by compressing a PLM in depth or width. We observe that the soft prompts learned by different partial PLMs of various sizes are similar in the parameter space, implying that these soft prompts could potentially be transferred among partial PLMs. Inspired by these observations, we propose Fast Prompt Tuning (FPT), which starts by conducting PT using a small-scale partial PLM, and then progressively expands its depth and width until the full-model size. After each expansion, we recycle the previously learned soft prompts as initialization for the enlarged partial PLM and then continue PT. We demonstrate the feasibility of FPT on 5 tasks and show that FPT could save over 30% training computations while achieving comparable performance.


1 Introduction
The emergence of pre-trained language models (PLMs) has broken the glass ceiling for various NLP tasks (Han et al., 2021). Versatile semantic and syntactic knowledge acquired during pre-training can be leveraged when PLMs are adapted to a specific downstream task to boost performance. The de facto strategy for such adaptation is full-parameter fine-tuning, which is computationally expensive and profligate, since it requires tuning and storing all the parameters in the PLM for each downstream task. To remedy this, several delta tuning (Ding et al., 2022) (also known as parameter-efficient tuning) algorithms have been proposed in place of vanilla fine-tuning (Houlsby et al., 2019; Li and Liang, 2021; Hu et al., 2022; Ben Zaken et al., 2022), among which prompt tuning (PT) (Lester et al., 2021) has gained increasing attention recently. PT prepends a few virtual tokens to the input text; these tokens are tuned during training while all the other PLM parameters remain frozen. Despite its simple form, PT has been demonstrated to achieve remarkable performance on various NLP tasks. Especially when the scale of the PLM becomes extremely large, PT can achieve performance comparable to fine-tuning (Lester et al., 2021). However, despite extensively reducing the number of tunable parameters and achieving satisfying performance, PT is criticized as training-inefficient due to its slow convergence (Su et al., 2022), as illustrated in Figure 1, and this inefficiency limits the practical application of PT. Hence, in this paper, we explore how to improve PT's training efficiency.
Our motivation is based on novel observations about prompt transferability among "partial PLMs". Here a partial PLM is defined by compressing a PLM in depth or width, which is implemented by dropping several layers or masking part of the connections in the feed-forward network (FFN) of each Transformer (Vaswani et al., 2017) layer. We observe that the soft prompts of the same task learned by different partial PLMs of various sizes tend to be close in the parameter space, implying that these soft prompts could potentially be transferred among different partial PLMs. Inspired by the above observations, we propose Fast Prompt Tuning (FPT), which starts by conducting PT using a small-scale partial PLM to obtain the corresponding soft prompts. After that, we progressively expand the partial PLM's depth and width until the full-model size by rehabilitating the dropped layers and masked neurons. After each expansion, we recycle the previously learned soft prompts as initialization for the enlarged PLM and then continue PT. Since the partial PLM requires fewer computations for each step, keeping the total training steps unchanged, we can reduce the overall computations consumed while achieving comparable PT performance. In experiments, we demonstrate the feasibility of FPT on 5 NLP tasks. The experimental results show that FPT saves around 30% of training computations while achieving satisfying downstream performance.
2 Prompt Tuning on a Partial PLM

2.1 Prompt Tuning
For a given input sequence X = {x_1, x_2, ..., x_n} and its target label Y, PT first converts X into a matrix X ∈ R^{n×d}, where d is the hidden size. After that, PT prepends l tunable soft prompt tokens P ∈ R^{l×d} before X, creating a new input matrix [P; X] ∈ R^{(l+n)×d}, which is then processed by the PLM. The training objective is to maximize P(Y | [P; X]), where only P is optimized during training and the parameters of the PLM are frozen. Although PT is applied to the entire PLM by default, in this section we investigate how performance changes if we conduct PT on a partial PLM, i.e., when only part of the parameters in the PLM participate in the computation.
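As a minimal illustration of the input construction above (toy dimensions, NumPy in place of the paper's PyTorch stack; all names here are illustrative):

```python
import numpy as np

def prepend_soft_prompt(X, P):
    """Form the PT input [P; X] in R^{(l+n) x d}: l tunable soft prompt
    vectors P stacked on top of the n frozen token embeddings X."""
    assert X.shape[1] == P.shape[1], "hidden sizes must match"
    return np.concatenate([P, X], axis=0)

rng = np.random.default_rng(0)
l, n, d = 20, 5, 8                       # toy prompt length, input length, hidden size
P = rng.normal(size=(l, d))              # the only trainable parameters in PT
X = rng.normal(size=(n, d))              # frozen embeddings of x_1 .. x_n
print(prepend_soft_prompt(X, P).shape)   # (25, 8)
```

During training, gradients would flow only into P; the PLM that consumes the concatenated matrix stays frozen.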

2.2 Partial PLM Construction
Using only part of the parameters in a PLM is typically done to reduce inference computation after fine-tuning, as in early exit (Teerapittayanon et al., 2016; Xin et al., 2020) and model pruning (Chen et al., 2020; Sun et al., 2020; Fan et al., 2020), which assume that the features produced by a part of a PLM may already suffice to classify some input examples. In this paper, we instead investigate its application in reducing the training computation of PT, and propose to construct partial PLMs by shrinking the original PLM in both depth and width, as illustrated in Figure 2 (a, b). Details are listed in appendix B.
Layer Dropping. Based on previous findings (Clark et al., 2019; Jawahar et al., 2019) that adjacent layers in PLMs generally have similar functionalities, we hypothesize that removing part of these layers may not significantly hurt the overall performance, and we propose to drop a PLM's layers uniformly to construct a partial PLM with fewer layers than the original. After that, we directly build connections among the remaining layers, keeping the original order; we find empirically that this works well, although such connections do not exist during pre-training.
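A sketch of uniform layer selection is given below. The exact selection rule is our illustrative assumption; the paper itself only gives the example of retaining layers {1, 12, 24} when keeping 3 of 24 layers (appendix B), which this rule reproduces.

```python
def uniform_layer_indices(total_layers, keep):
    """Pick `keep` layer indices (1-based) spread uniformly across a
    `total_layers`-layer PLM, always retaining the first and last layer."""
    if keep == 1:
        return [1]
    step = (total_layers - 1) / (keep - 1)
    return sorted({round(1 + i * step) for i in range(keep)})

print(uniform_layer_indices(24, 3))   # [1, 12, 24]
```

The retained layers are then reconnected in their original order to form the partial PLM.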
FFN Reduction. Jaszczur et al. (2021) and Zhang et al. (2022) indicate that only part of the neurons in the FFN layers are activated for a given input. Such a sparse activation phenomenon inspires us to reduce the computation in the FFN by shrinking its width. Specifically, the FFN layer consists of two fully connected networks with a nonlinear activation function σ, and it processes an input representation x ∈ R^d as FFN(x) = σ(xW_1 + b_1)W_2 + b_2, where W_1 ∈ R^{d×d_ff} and W_2 ∈ R^{d_ff×d} are the weight matrices, and b_1 ∈ R^{d_ff} and b_2 ∈ R^d are the bias terms. We abandon a portion of W_1's columns and W_2's rows (i.e., reducing d_ff) by masking the neurons that are seldom activated. In practice, before training, we feed a few downstream examples prepended by randomly initialized soft prompts into the full-size PLM and record the neuron activation along each of the d_ff dimensions.
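A minimal sketch of the reduced FFN forward pass, assuming ReLU for σ and toy dimensions (the index set `keep` is whatever neuron-selection procedure produces):

```python
import numpy as np

def reduced_ffn(x, W1, b1, W2, b2, keep):
    """FFN(x) = sigma(x W1 + b1) W2 + b2, retaining only the intermediate
    neurons in `keep`, i.e. the corresponding columns of W1 / rows of W2."""
    h = np.maximum(x @ W1[:, keep] + b1[keep], 0.0)   # sigma = ReLU here
    return h @ W2[keep, :] + b2

rng = np.random.default_rng(0)
d, d_ff = 8, 32
W1, b1 = rng.normal(size=(d, d_ff)), rng.normal(size=d_ff)
W2, b2 = rng.normal(size=(d_ff, d)), rng.normal(size=d)
x = rng.normal(size=(5, d))                        # a length-5 toy sequence
keep = np.arange(d_ff)[: d_ff * 3 // 4]            # keep 75% of the neurons
print(reduced_ffn(x, W1, b1, W2, b2, keep).shape)  # (5, 8)
```

Keeping all d_ff neurons recovers the full FFN exactly; dropping the seldom-activated ones shrinks both matrix multiplications proportionally.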
Compound Reduction. Since the above methods are compatible with each other, we combine them to form a partial PLM smaller than the original in both depth and width.

2.3 Observations
To explore PT's performance on a partial PLM, we conduct experiments on T5 LARGE (Raffel et al., 2020). We choose 5 representative English NLP datasets, covering natural language inference (MNLI (Williams et al., 2018)), paraphrase detection (QQP), reading comprehension (SQUAD2.0 (Rajpurkar et al., 2018) and RECORD (Zhang et al., 2018)), and summarization (XSUM (Narayan et al., 2018)). For both layer dropping and FFN reduction, we evaluate performance when reducing the number of Transformer layers or the FFN intermediate dimension to various fractions of the original size. We train all models for the same number of steps; details are described in appendix B.
Overall Performance. The overall results are shown in Table 1. We observe that for each method, despite abandoning a large portion of parameters, a partial PLM preserves most of the PT performance of the full-size PLM. As expected, performance improves as more parameters are retained. In addition, we find that performance is less sensitive to FFN reduction than to layer dropping: there is only a 1.10% performance drop on average when 25% of the neurons are masked. These results indicate that the resulting partial PLMs retain most of the functionalities of the original PLM.
Prompt Embedding Visualization. Taking a step further, we visualize the learned prompt embeddings of different partial PLMs using t-SNE (van der Maaten and Hinton, 2008) in Figure 3, with details described in appendix C. We observe that for the same task, the soft prompts obtained by different partial PLMs tend to form a compact cluster in the parameter space. This phenomenon implies that the soft prompts corresponding to the same task (1) have great potential for transferring among different partial PLMs, and (2) could serve as a better initialization that leads to faster convergence. Apart from the visualization, we further report the cosine similarity of the learned prompts in appendix D to verify the above phenomenon from another angle.

3 Fast Prompt Tuning

In this section, we propose Fast Prompt Tuning (FPT), which aims at accelerating PT via progressive training (Gong et al., 2019). Progressive training is typically leveraged to improve pre-training efficiency (Chen et al., 2022; Qin et al., 2022); instead, we focus on its application in a PLM's downstream adaptation.

3.1 Methodology
Formally speaking, as visualized in Figure 2 (c), we split the original PT training process into N stages.
We start with a small-size partial PLM M_1 and then progressively rehabilitate its depth and width until reaching the full-size model M_N, creating a series of partial PLMs {M_i}_{i=1}^{N−1} with growing sizes. The architectures of the partial PLMs are constructed using the same methods as in § 2.2.
During each training stage i, we conduct PT on a partial PLM M_i and obtain the learned soft prompts P_i. Based on the observation that M_i retains a large portion of the functionalities of the full-size PLM M_N, we conjecture that M_i can serve as a substitute for M_N in learning how to deal with the downstream task. In addition, considering that the soft prompts learned by different partial PLMs are close in the parameter space, we can transfer the knowledge learned by M_i to M_{i+1} by recycling P_i. Specifically, after each model expansion, we directly use P_i as the initialization for training M_{i+1} in the next stage. Since fewer parameters of each partial PLM participate in both the forward and backward passes, the computations are reduced. Keeping the total number of training steps the same, FPT thus accelerates training compared with vanilla PT.
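The stage loop above can be sketched as follows. The model-construction and prompt-tuning routines are toy stand-ins (the real ones are the partial PLM construction of § 2.2 and ordinary PT); all names are illustrative:

```python
import numpy as np

def fast_prompt_tuning(build_partial_plm, prompt_tune, sizes, stage_steps, l=20, d=8):
    """Skeleton of FPT: run PT on progressively larger partial PLMs,
    recycling the learned prompt as initialization after each expansion."""
    rng = np.random.default_rng(0)
    P = rng.normal(size=(l, d)) * 0.01          # random init, stage 1 only
    for size, steps in zip(sizes, stage_steps):
        model = build_partial_plm(size)         # expand depth / width
        P = prompt_tune(model, P, steps)        # only P is trained; model frozen
    return P                                    # final full-size prompt

# Toy stand-ins: a "model" is just its relative size; "tuning" nudges the prompt.
P = fast_prompt_tuning(lambda s: s,
                       lambda m, P, steps: P + 0.001 * steps,
                       sizes=[0.25, 0.5, 0.75, 1.0],
                       stage_steps=[6000, 6000, 6000, 12000])
print(P.shape)   # (20, 8)
```

The key point is that P survives every expansion unchanged in shape, because PT's trainable parameters live only at the embedding layer, independent of the model's depth or width.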

3.2 Experiments and Analyses
We follow most of the experimental settings in § 2 and describe the training details in appendix B. We report FLOPs and wall-clock training time for vanilla PT and FPT to compare training efficiency. We evaluate both T5 LARGE and T5 XL (a larger T5 model) on each task, training for 30k and 15k steps, respectively. We test FPT's performance when progressively expanding the model's depth, width, and both. Unless otherwise specified, for most of FPT's methods, we split the training process into 4 stages: each of the first three stages takes 20% of the steps, while the last stage takes 40%.
Results. We list the results in Table 2, from which we observe that: (1) On average, all three variants of FPT achieve performance comparable to PT while using fewer computations (e.g., FPT CR saves around 30% FLOPs). On several tasks (e.g., MNLI and SQUAD2.0), FPT even exceeds PT's performance. (2) Combining layer dropping and FFN reduction (i.e., FPT CR) is the most training-efficient. However, we also observe that saving more computations generally leads to poorer performance; among the three variants, FPT FR strikes the best balance between performance and training efficiency. (3) We further compare PT and FPT when PT consumes the same computations as each FPT variant. As reflected in the column "Improve↑", with the training computations held equal, FPT outperforms PT, and the improvement is more significant for T5 XL than for T5 LARGE, showing that FPT has great potential for large-scale PLMs. (4) Besides using FLOPs as a theoretical measure of computation, we also compare wall-clock training time among the FPT methods and vanilla PT. Wall-clock time can also be reduced by up to 30% with FPT CR. Moreover, the gap between relative FLOPs and relative wall-clock time shrinks as the model size increases for each FPT method.
We also verify the effectiveness of our partial model construction designs in appendix E, and show in appendix F that FPT's performance is not sensitive to the duration of each training stage. We leave explorations on other tasks and the effect of training budgets as future work.

Conclusion
In this work, towards improving PT's training efficiency, we first make several insightful observations by conducting PT on partial PLMs, and then propose FPT based on these observations. The results on 5 datasets demonstrate the feasibility of FPT in saving training computations. As a first attempt at accelerating PT, we encourage future work to design more sophisticated algorithms to further improve PT's training efficiency.

Limitations
For the current FPT method, there exist two main limitations: (1) FPT requires choosing a proper hyperparameter for the progressive training schedule (i.e., the duration of each training stage). For each experiment, we have to pre-define the duration of each stage empirically, although appendix F shows that, within a reasonable range, the exact duration of each training stage is not critical.
(2) FPT cannot be directly applied to other delta tuning methods (e.g., adapters and prefix-tuning). Since prompt tuning only adds trainable parameters at the embedding layer, when the partial model's size increases, the trained soft prompts can be transferred directly to a larger partial model without any modification. For other popular delta tuning methods, however, when the number of layers in the partial model increases, we would have to add newly initialized parameters.

Appendices

A Related Work
Prompt Tuning. PLMs have achieved excellent performance on many NLP tasks, relying on their powerful natural language understanding and generation capabilities (Devlin et al., 2019; Liu et al., 2019). However, with the emergence of large-scale PLMs like T5 (Raffel et al., 2020) and GPT-3 (Brown et al., 2020), tuning all the parameters of a PLM (i.e., full-parameter fine-tuning), which requires huge storage and memory costs, is not flexible for real-world applications on massive downstream tasks. Therefore, parameter-efficient delta tuning methods (Ding et al., 2022; Houlsby et al., 2019; Hu et al., 2022; Ben Zaken et al., 2022; He et al., 2022) have attracted more and more attention, among which prompt tuning (PT) (Lester et al., 2021) is a simple and effective one. By prepending a few trainable embeddings before the input sequence, PT can achieve performance comparable to full-parameter fine-tuning. As the size of the PLM grows, the performance of PT gets closer to vanilla fine-tuning (Lester et al., 2021), showing great potential for utilizing extremely large PLMs in the future. Besides, PT has also been shown to have excellent cross-task transferability (Su et al., 2022; Vu et al., 2022), and thus gains increasing attention for exploring the relations among tasks (Qin et al., 2021). However, due to the slow convergence shown in Figure 1, PT's training inefficiency is a serious drawback that may limit its practical application.
Progressive Training. Considering that pre-training usually requires tremendous computational resources, researchers have proposed progressive training to improve training efficiency (Gong et al., 2019; Zhang and He, 2020). Progressive training starts with a shallow model and gradually grows its depth along the training process by replicating existing layers (parameter recycling), substantially improving pre-training efficiency. Later works propose to progressively grow PLMs in both depth and width (Gu et al., 2021), and design better initialization methods to inherit the functionality of existing models (Chen et al., 2022). Instead of leveraging progressive training during pre-training, we apply it to a PLM's downstream adaptation, with a focus on PT. Furthermore, conventional progressive training duplicates existing parameters to grow a model up to the full size; in our setting, we already have a full-size PLM, and propose to construct partial models with growing sizes by dropping / masking existing parameters.

B Implementation Details
Our implementation is based on PyTorch (Paszke et al., 2019) and transformers (Wolf et al., 2020). The experiments are conducted with 8 NVIDIA 32GB V100 GPUs, and each experiment requires fewer than 10 hours to finish.
Partial PLM Construction. As mentioned in § 2.2, we design three methods to construct partial PLMs. Specifically, for layer dropping, we select layers uniformly. For example, to select 3 layers out of a 24-layer PLM, we select layers {1, 12, 24} to construct the partial PLM. For FFN reduction, to estimate the activation of each neuron (dimension) in FFN layer l, we first randomly sample 1,000 examples to form a small dataset D. We prepend each example X (without the label) in D with randomly initialized soft prompts and feed it into the full-size PLM to obtain the input representations x^l of FFN layer l. After that, we obtain the activation score of each neuron i as

a_i^l = (1/|D|) Σ_{X∈D} (1/|X|) Σ_{x^l∈X} 1[σ(x^l W_1^l + b_1^l)_i > 0],

where W_1^l and b_1^l are the parameters of FFN layer l, and |X| denotes the sequence length. The neurons (dimensions) with smaller activation scores (i.e., those seldom activated) are masked. Note that the T5 model is composed of both an encoder and a decoder; due to differences in input and output lengths across tasks, the computation loads of the encoder and decoder may vary a lot. Therefore, for the tasks (MNLI and QQP) with a light computation load on the decoder (i.e., a small output length), shrinking the decoder has little impact on the computational costs, so we retain the whole decoder in that setting; for the other three tasks (SQUAD2.0, RECORD, and XSUM), the decoder output length is much longer, and we conduct partial PLM construction on both the encoder and decoder. We calculate FLOPs for each experiment using the ptflops tool, and report the average FLOPs over the 5 tasks in Table 1 and Table 2.

Partial PLM Prompt Tuning. We use T5 LARGE for our experiments on partial PLM PT. Following Lester et al.
(2021), we leverage the LM-adapted version of the T5 checkpoints, which are additionally trained for 100k steps; this adapted version has been demonstrated to achieve stable and better PT performance. For the implementation of PT, we set the prompt length to 20 and randomly initialize the soft prompts. We use the Adafactor optimizer (Shazeer and Stern, 2018) with a learning rate of 0.3, a batch size of 32, and greedy decoding to generate predictions. The number of training steps is set to 30k to ensure that PT does not get stuck in a local optimum. We run all experiments 3 times with different random seeds and report the average results.
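As a concrete illustration of the activation-score estimate described under Partial PLM Construction above, here is a minimal NumPy sketch. ReLU is assumed for σ, dimensions are toy, and the calibration set is tiny; all names are illustrative:

```python
import numpy as np

def activation_scores(reps, W1, b1):
    """Estimate how often each FFN intermediate neuron fires. `reps` is a
    list of (seq_len x d) input representations to one FFN layer, drawn from
    prompt-prepended downstream examples; returns a (d_ff,) score vector."""
    scores = np.zeros(W1.shape[1])
    for x in reps:
        fired = (np.maximum(x @ W1 + b1, 0.0) > 0)   # seq_len x d_ff booleans
        scores += fired.mean(axis=0)                 # normalize by |X|
    return scores / len(reps)                        # normalize by |D|

rng = np.random.default_rng(0)
d, d_ff = 8, 32
W1, b1 = rng.normal(size=(d, d_ff)), rng.normal(size=d_ff)
reps = [rng.normal(size=(rng.integers(3, 10), d)) for _ in range(4)]
scores = activation_scores(reps, W1, b1)
keep = np.argsort(-scores)[: d_ff * 3 // 4]          # retain top-75% neurons
```

The indices in `keep` are then used to slice W_1's columns and W_2's rows, as in the FFN reduction of § 2.2.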
Fast Prompt Tuning. For the implementations of FPT, we train T5 LARGE / T5 XL for a total of 30k / 15k steps. The number of training steps for T5 XL is smaller than for T5 LARGE since we find empirically that T5 XL converges faster. As mentioned in § 3.2, unless otherwise specified, we split the whole training process into 4 stages: each of the first three stages takes 20% of the training steps, while the last stage (full-model PT) takes 40%. The exception is layer dropping on T5 XL: we find that a partial PLM with fewer than 12 layers in either the encoder or decoder achieves poor PT performance, so in this setting we only use two training stages, where the first stage takes 60% of the training steps and the second stage takes 40%. More detailed settings for the partial model construction are shown in Table 3. The experiments with T5 LARGE are run three times with different random seeds and the average results are reported, while the experiments with T5 XL are conducted once due to their huge computation consumption.

C Prompt Embedding Visualization
In Figure 3, we visualize the soft prompts of the different partial PLMs and tasks in Table 1. The embedding used for visualization is derived by averaging the soft prompt along the token-length dimension. As described in § 2.3, we run each experiment three times with different random seeds to obtain stable results. Therefore, we plot 30 points (3 runs × (9 partial PLMs + 1 full-size PLM)) for each task in Figure 3. The size of each marker denotes the performance of the soft prompts on the corresponding partial PLM, with larger markers indicating better performance. We observe that soft prompts with better performance more readily form a compact cluster.

D Prompt Embedding Similarity
To further gain insights into the transferability of the soft prompts learned by T5 LARGE's different partial PLMs defined in Table 3, in addition to the visualization conducted in § 2.3, we calculate the average cosine similarity of the soft prompts corresponding to different tasks in Table 4. Specifically, for the different partial PLMs M_1, M_2, ..., M_{N−1} and the full-size PLM M_N, we conduct PT with each model M_i on task T_j and obtain the corresponding soft prompts P_i^j ∈ R^{l×d}. We then average P_i^j along the token-length dimension to get a vector p_i^j ∈ R^d. After that, we calculate S(T_P^j, T_F^k), the average cosine similarity between (1) task j's partial PLMs' prompts and (2) task k's full-size PLM's prompts, as:

S(T_P^j, T_F^k) = (1/(N−1)) Σ_{i=1}^{N−1} cos(p_i^j, p_N^k).   (1)

From the results in Table 4, we observe that the highest similarity is achieved when j = k, showing that the prompts of the partial PLMs are closest to the same task's prompts from the full-size model. This phenomenon is aligned with the observation in Figure 3, implying that on the same task, the soft prompts learned by partial PLMs can potentially be transferred to the full-size PLM.
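The similarity metric above can be computed with a few lines (toy random prompts stand in for the learned ones; all names are illustrative):

```python
import numpy as np

def avg_cos_to_full(partial_prompts, full_prompt):
    """S(T_P^j, T_F^k): mean cosine similarity between each partial PLM's
    length-averaged prompt p_i^j and the full-size PLM's prompt p_N^k."""
    cos = lambda a, b: float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return np.mean([cos(p, full_prompt) for p in partial_prompts])

rng = np.random.default_rng(0)
# Each prompt P_i^j in R^{l x d} is first averaged over the l tokens.
prompts = [rng.normal(size=(20, 8)).mean(axis=0) for _ in range(9)]
full = rng.normal(size=(20, 8)).mean(axis=0)
print(round(float(avg_cos_to_full(prompts, full)), 3))
```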

E Effect of Partial Model Construction Designs for FPT
We construct a partial PLM by dropping a few layers or masking some neurons. As mentioned in § 2.2, for layer dropping, we retain layers uniformly; for FFN reduction, we mask the neurons that are less likely to be activated. How the retained parameters are selected is essential to the performance of FPT. To demonstrate this, in Table 5 we experiment with an alternative strategy for each of layer dropping and FFN reduction. For layer dropping, we compare our strategy of dropping layers uniformly (denoted as Uniform) with dropping the last few layers (denoted as Last).
Both methods retain the same number of layers. For example, to select 3 layers from a 24-layer PLM, the Uniform strategy retains layers {1, 12, 24}, while the Last strategy retains layers {1, 2, 3}. From Table 5, we observe that the Uniform strategy is slightly better than the Last strategy. We hypothesize that this is because the overall functionalities of a PLM are distributed uniformly across its layers, and adjacent layers tend to have similar functionalities; therefore, retaining layers uniformly preserves more functionality than retaining only the first few layers.
For FFN reduction, we compare our strategy of masking neurons based on the activation score (denoted as Activation) with randomly masking neurons (denoted as Random). For the Activation strategy, we feed 1,000 samples prepended by randomly initialized soft prompts into the PLM and record the activation score of the neurons along each dimension. The results in Table 5 show that the Activation strategy significantly outperforms the Random strategy, demonstrating the effectiveness of our method: randomly masking neurons may abandon highly activated (most informative) ones, which hinders PT's convergence. We also find empirically that the activation scores of the FFN neurons vary considerably across tasks, meaning different neurons respond differently to the input. This implies that there may exist some "functional partitions" in the FFN layers of PLMs.

F Effect of Duration for Each Training Stage
To show the effect of the duration of each training stage, following Gong et al. (2019), we conduct experiments on MNLI using T5 LARGE with the three proposed variants of FPT, and evaluate the effect of the training duration of the last two stages. Specifically, for the layer dropping variant of FPT, we conduct PT on the 18-layer partial PLM for 15k steps and save the learned soft prompts every 3k steps, yielding 15/3 = 5 sets of soft prompts. Then, using each of these 5 prompt sets as initialization, we conduct PT with the full-size PLM for 3k steps. We report the validation performance and compare FPT with vanilla PT. For the FFN reduction and compound reduction variants of FPT, we conduct similar experiments, except that we start from a partial PLM built with the corresponding construction method.
The results are shown in Figure 4, from which we can see that expanding the partial PLM's size and then conducting PT (the red line) performs better than conducting PT only on the partial PLM (the yellow line). In addition, comparing FPT (the red line) with vanilla PT (the blue line), there is a threshold s of training steps: if we expand the partial PLM before s, training efficiency improves over vanilla PT; after s, expanding the partial PLM and continuing PT does not bring consistent improvement. In general, expanding the partial PLM between 3k and 12k steps works well for all three variants of FPT, indicating that within a reasonably broad range, FPT's performance is not sensitive to the duration of each training stage. We aim to explore how to decide the optimal duration of each training stage in the future to make FPT more practical.

Figure 1: Average performance growth of T5 LARGE on the 5 investigated tasks, comparing fine-tuning and PT. The convergence speed of PT is much slower than fine-tuning in terms of training steps.

Figure 2: The framework of Fast Prompt Tuning (FPT). The top part (a, b) shows two methods to construct a partial PLM. The bottom part (c) shows FPT's training process: we conduct PT on a partial PLM, progressively expand its size, and transfer the trained prompts.

Figure 3: Visualization of the 5 investigated tasks' soft prompts for different partial PLMs. A larger marker means better performance of the corresponding soft prompts on the partial PLM.

Figure 4: Validation performance on MNLI with different training durations for the last two stages. We conduct this ablation study for each of the three variants of FPT. We compare FPT with different expansion times (red line) against vanilla PT (blue line) and PT without model expansion (yellow line). Each red dot is connected to a yellow dot by a dashed line, indicating that it is initialized from the yellow dot and optimized by conducting PT on the full-size PLM.

Table 1:
Average results for partial PLM PT on T5 LARGE with layer dropping (LD), FFN reduction (FR), and compound reduction (CR). ∆ denotes the performance degradation compared with vanilla PT in each setting. The "FLOPs" and "Wall Clock" columns are relative values compared with PT, averaged over the 5 tasks.

Table 2:
Performance of vanilla PT and the three variants of our method. FPT LD, FPT FR, and FPT CR refer to constructing partial PLMs by layer dropping, FFN reduction, and compound reduction, respectively. The column "Improve↑" denotes the performance improvement of each FPT variant over PT when PT uses the same FLOPs as that variant.

Table 5:
Average performance on the 5 investigated tasks using different strategies for layer dropping and FFN reduction on T5 LARGE.