SMoP: Towards Efficient and Effective Prompt Tuning with Sparse Mixture-of-Prompts

Prompt tuning has emerged as a successful parameter-efficient alternative to the full fine-tuning of language models. However, prior works on prompt tuning often utilize long soft prompts of up to 100 tokens to improve performance, overlooking the inefficiency associated with extended inputs. In this paper, we propose a novel prompt tuning method SMoP (Sparse Mixture-of-Prompts) that utilizes short soft prompts for efficient training and inference while maintaining performance gains typically induced by longer soft prompts. To achieve this, SMoP employs a gating mechanism to train multiple short soft prompts specialized in handling different subsets of the data, providing an alternative to relying on a single long soft prompt to cover the entire data. Experimental results demonstrate that SMoP outperforms baseline methods while reducing training and inference costs. We release our code at https://github.com/jyjohnchoi/SMoP.


Introduction
Prompt tuning (Lester et al., 2021; Liu et al., 2021) has recently gained attention as a parameter-efficient alternative to the full fine-tuning of language models. By freezing the original language model parameters and solely tuning the soft prompts (i.e., learnable token embeddings) added to the model input, prompt tuning achieves comparable performance to full fine-tuning while largely reducing the number of trainable parameters. Moreover, prompt tuning stands out for its conceptual simplicity and flexibility among other parameter-efficient fine-tuning methods (Houlsby et al., 2019; Guo et al., 2021; Hu et al., 2022), as it does not require modifications to the model structure.
Since the proposal of prompt tuning, there has been active research to enhance its efficiency and effectiveness. On one hand, several approaches propose to improve the performance of prompt tuning by integrating soft prompts into activations in each layer of the model (Li and Liang, 2021; Qin and Eisner, 2021; Liu et al., 2022), incorporating input-specific soft prompts (Jiang et al., 2022; Wu et al., 2022), or pruning and rewinding soft prompts (Ma et al., 2022). On the other hand, methods such as FPT (Huang et al., 2022) demonstrate improved training efficiency of prompt tuning in terms of convergence speed via progressive training. Although these methods have empirically shown improvements in prompt tuning, they have overlooked the inefficiency associated with the extension of input sequences caused by the inclusion of soft prompts. While increasing soft prompt length (typically up to 100 tokens) is known to benefit model performance (Lester et al., 2021; Jiang et al., 2022), it consequently yields longer input sequences, leading to increased computational requirements during training and inference (see Figure 1). Therefore, we aim to investigate the utilization of relatively short soft prompts while preserving the performance gains typically achieved with longer soft prompts.
To this end, we propose SMoP (Sparse Mixture-of-Prompts), a novel prompt tuning method that utilizes short soft prompts during training and inference. Given that using a single short soft prompt leads to inferior performance compared to longer soft prompts, our key insight is to train multiple short soft prompts that are each specialized in handling a different subset of the data. To achieve this, we draw inspiration from the Sparsely-Gated Mixture-of-Experts (Shazeer et al., 2017; Fedus et al., 2022), which sparsely activates sub-networks (i.e., experts) to increase model capacity without a proportional increase in computation. We integrate this concept into prompt tuning by employing a gating mechanism in SMoP, which guides each input instance to one of the short soft prompts based on its embedding representation. Such sparse activation enables effective utilization of short soft prompts without a significant increase in computation or degradation in performance.

Figure 2: (a) Illustration of prompt tuning (Lester et al., 2021). A soft prompt is concatenated with the embedding representations of an input instance, and the soft prompt alone is fine-tuned. Given a soft prompt of 100 tokens, the soft prompt is typically longer than or comparable in length to the input instance. (b) Illustration of our proposed method SMoP. A gating mechanism is employed to route each input instance to a short soft prompt.
To verify the efficiency and effectiveness that SMoP introduces to prompt tuning, we conduct evaluations on six natural language understanding tasks from the SuperGLUE benchmark. Experimental results demonstrate that SMoP outperforms prompt tuning with reduced training and inference costs. In particular, SMoP improves the average performance of prompt tuning on six SuperGLUE tasks by 2.5%p with T5-base and 3.4%p with T5-large, while reducing training time, memory usage, and inference computations.
Our contributions are as follows:
1. We propose SMoP (Sparse Mixture-of-Prompts), a novel prompt tuning method that utilizes short soft prompts for efficient training and inference while maintaining the performance gains often induced by increased soft prompt length.
2. SMoP sparsely activates short soft prompts via a gating mechanism that routes each instance to one of multiple soft prompts based on its embedding representation.
3. Experimental results demonstrate that SMoP outperforms the baselines on T5-base and T5-large while utilizing shorter soft prompts, thereby incurring lower training and inference costs.

Preliminaries
Full Fine-tuning Assume that we have a sequence-to-sequence model p_ϕ(y | x) parameterized by ϕ. Given an instance with a length-n sequence of embedding representations X = {x_1, x_2, ..., x_n} ∈ R^{n×e} and a corresponding label token embedding sequence Y, the objective function for full fine-tuning of the model p_ϕ is:

max_ϕ log p_ϕ(Y | X).    (1)

Prompt Tuning If we define a length-l soft prompt with embedding dimension e as P_θ, parameterized by θ ∈ R^{l×e}, the objective function of prompt tuning is:

max_θ log p_ϕ(Y | [P_θ; X]),    (2)

where ; indicates the concatenation of the two matrices. Note that the language model parameters ϕ are no longer updated. Figure 2 (a) depicts the process of prompt tuning.
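As a concrete illustration of the prompt tuning objective, the following NumPy sketch (toy dimensions; `P_theta` and `X` mirror the notation above and are illustrative, not the authors' code) shows the only change prompt tuning makes to the model input:

```python
import numpy as np

rng = np.random.default_rng(0)

e = 8   # embedding dimension (toy value; T5 uses e.g. 768 or 1024)
n = 6   # input sequence length
l = 5   # soft prompt length

X = rng.normal(size=(n, e))         # frozen input token embeddings
P_theta = rng.normal(size=(l, e))   # trainable soft prompt parameters

# Prompt tuning feeds [P_theta; X] to the frozen model; only P_theta
# receives gradient updates, so l * e parameters are trained in total.
model_input = np.concatenate([P_theta, X], axis=0)
print(model_input.shape)  # (11, 8)
```

Only the l × e entries of `P_theta` are trainable, which is why the prompt length directly controls both the trainable parameter count and the extra input length.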

SMoP: Sparse Mixture-of-Prompts
The goal of SMoP is to train multiple short soft prompts, where each prompt is specialized in a subset of the data.To achieve this, SMoP employs a gating mechanism to direct the input instance to one of the soft prompts based on its embedding representations, as shown in Figure 2 (b).
In the gating mechanism, we introduce a small linear router model L_µ parameterized by µ ∈ R^{e×k}, which decides which of the soft prompts the input should be routed to. Formally, given k soft prompt embeddings P_{θ_1}, P_{θ_2}, ..., P_{θ_k} parameterized by {θ_j}_{j=1}^{k} with θ_j ∈ R^{l×e}, the router model takes the average of the input embeddings, X̄ ∈ R^e, as its input and calculates the routing probabilities p_1, p_2, ..., p_k for the soft prompts. The routing probability of the j-th prompt is calculated as a softmax over the router outputs:

p_j = exp(X̄µ)_j / Σ_{i=1}^{k} exp(X̄µ)_i.    (3)

The input is then routed to the soft prompt with the highest probability, and the final soft prompt to be utilized is obtained as the product of the routed prompt and its probability value. The objective function of SMoP is therefore:

max_{θ,µ} log p_ϕ(Y | [p_c · P_{θ_c}; X]),    (4)

where c is the index of the prompt with the highest probability value. Note that in SMoP, while the total prompt length is k · l, the utilized prompt length remains l.
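A minimal NumPy sketch of the gating mechanism described above (toy dimensions and random weights; variable names follow the notation in the text, and the real implementation operates on batched tensors inside the model):

```python
import numpy as np

rng = np.random.default_rng(0)

e, k, l, n = 8, 4, 5, 6  # embed dim, num prompts, prompt length, input length

X = rng.normal(size=(n, e))           # input token embeddings
prompts = rng.normal(size=(k, l, e))  # k short soft prompts P_{theta_1..k}
mu = rng.normal(size=(e, k))          # weights of the linear router L_mu

def softmax(z):
    z = z - z.max()                   # stabilize before exponentiating
    ez = np.exp(z)
    return ez / ez.sum()

x_bar = X.mean(axis=0)                # router input: average input embedding
p = softmax(x_bar @ mu)               # routing probabilities (Eq. 3)

# Top-1 routing: pick the most probable prompt and scale it by its
# probability so the router weights receive a gradient signal.
c = int(np.argmax(p))
routed_prompt = p[c] * prompts[c]

# Utilized input length is l + n, not k*l + n: only one prompt is active.
model_input = np.concatenate([routed_prompt, X], axis=0)
```

Although k · l prompt tokens are stored, only l of them are prepended per instance, which is what keeps training and inference costs close to those of a single short prompt.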

Experimental Settings
Tasks To cover diverse NLP tasks in our experiments, we evaluate SMoP and baseline methods on six tasks from the SuperGLUE benchmark (Wang et al., 2019). As the official test sets for the SuperGLUE benchmark are not publicly released, we follow Chen et al. (2022a) and use the validation set as the test set, splitting the original train set into train and validation sets in a 90%/10% proportion.

Models and Baselines
Our experiments are built on the public HuggingFace (Wolf et al., 2019) implementation and pre-trained checkpoints of T5 (Raffel et al., 2020) in two scales: base and large.
To demonstrate the advantages that SMoP introduces to prompt tuning, we compare SMoP to prompt tuning (Lester et al., 2021), P-tuning (Liu et al., 2021), and full fine-tuning.
Evaluation Setup For prompt tuning methods, we experiment with soft prompts of {5, 20, 50, 100} tokens, and for SMoP, we sweep over {2, 4, 10, 20} prompts of length {1, 3, 5, 10}. We report experimental results for the setting with the best average performance over two or three runs, along with the corresponding standard deviations. We report training time (measured on a single NVIDIA RTX A6000 GPU) and memory usage as training costs, and inference FLOPs as inference costs.

Main Results
Table 1 presents the performance of SMoP and the baseline methods. Notably, SMoP achieves the highest performance among the prompt tuning methods on SuperGLUE tasks with the lowest training and inference costs. On T5-base, SMoP demonstrates an average improvement of 2.5%p, while on T5-large the improvement reaches 3.4%p. The detailed results on the SuperGLUE tasks are shown in Appendix D.
The fact that SMoP outperforms the baselines with lower training and inference costs highlights the significance of utilizing short soft prompts during training and inference. For example, SMoP saves 14.6% training time, 22.9% training memory, and 27.2% inference FLOPs on T5-large, compared to prompt tuning with a soft prompt of length 100. It is worth noting that full fine-tuning requires the fewest FLOPs for inference, as no additional tokens are added to the input, while among prompt-based methods SMoP introduces the least additional FLOPs.

Length and Number of Soft Prompts
To investigate the optimal length and number of soft prompts, we present experimental results for SMoP with varying utilized prompt lengths and numbers of prompts in Table 2.
We observe that increasing the total prompt length beyond 50 provides marginal performance gains. This finding aligns with previous research (Lester et al., 2021; Li and Liang, 2021; Ma et al., 2022) reporting that increasing soft prompt length above a certain threshold brings limited improvements in performance.
Furthermore, we notice that using 20 soft prompts generally leads to a degradation in performance. We conjecture that this may be due to the limited labeled data for training in several SuperGLUE tasks, leading to insufficient training of each soft prompt (Wang et al., 2022).
Given these findings, we primarily report the results of SMoP utilizing 4 soft prompts, each with a length of 5 tokens. Note that while SMoP generally demonstrates improvements over prompt tuning, the optimal length and number of soft prompts may vary by task or dataset.

Routing Methods
To verify the impact of the routing method in the gating mechanism of SMoP, we perform experiments with diverse routing methods: a linear router without router perturbation (w/o perturbation); taking the weighted sum of the two prompts with the highest probabilities (Top-2); Gumbel-Softmax routing, where the output probability of the router is calculated as 1 (Gumbel-Softmax); stochastic routing (Stochastic), an application of AdaMix (Zuo et al., 2022; Wang et al., 2022) to prompt tuning; and no routing (Single), which is identical to prompt tuning with a soft prompt of length 5.
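The difference between top-1 and Top-2 routing can be made concrete with a toy example (made-up probabilities and constant-valued prompts, purely for illustration):

```python
import numpy as np

p = np.array([0.1, 0.5, 0.3, 0.1])  # example routing probabilities (sum to 1)
# Four toy prompts of shape (l=5, e=8); prompt i is filled with the value i.
prompts = np.arange(4).reshape(4, 1, 1) * np.ones((4, 5, 8))

# Top-1 (SMoP's default): a single prompt, scaled by its probability.
c = int(np.argmax(p))
top1 = p[c] * prompts[c]            # 0.5 * prompt 1

# Top-2: weighted sum of the two highest-probability prompts.
i, j = np.argsort(p)[-2:]
top2 = p[i] * prompts[i] + p[j] * prompts[j]   # 0.3 * prompt 2 + 0.5 * prompt 1
```

Top-2 activates two prompts per instance, raising the utilized prompt length to 2l, which is why it costs more than top-1 routing at inference.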
Table 3 shows experimental results on three SuperGLUE tasks with the diverse routing methods. The top-1 linear router with router perturbation, our original setting, generally outperforms all other routing strategies. One exception is BoolQ, where removing the router perturbation yields slightly better performance. We speculate that in high-resource settings like BoolQ, router perturbation may not be necessary for sufficient training of each soft prompt.

Related Work

Prompt Tuning
Pre-trained language models (PLMs) have demonstrated remarkable performance on a wide range of tasks in Natural Language Processing (NLP) (Devlin et al., 2019; Liu et al., 2019). However, with the introduction of larger language models such as T5 (Raffel et al., 2020) and GPT-3 (Brown et al., 2020), fine-tuning the entire set of PLM parameters for each specific task has become notably inefficient in terms of training and deployment.
To address this inefficiency, researchers have proposed parameter-efficient fine-tuning methods (Houlsby et al., 2019; Lester et al., 2021; Pfeiffer et al., 2021; Hu et al., 2022), which fine-tune a relatively small set of task-specific parameters of the PLM while keeping the other parameters frozen. Among these methods, prompt tuning (Lester et al., 2021) is a simple and effective approach that prepends learnable token embeddings (i.e., soft prompts) to the model input and solely fine-tunes these embeddings. The simplicity and adaptability of prompt tuning have led to several advancements aimed at improving its efficiency and performance by modifying the structure of soft prompts (Liu et al., 2021; Li and Liang, 2021), using instance-specific prompts (Jiang et al., 2022; Wu et al., 2022), or adjusting the training process (Huang et al., 2022; Ma et al., 2022). Moreover, prompt tuning is known for its capability for task knowledge transfer from source task prompts to target task prompts (Vu et al., 2022; Asai et al., 2022; Wang et al., 2023). These methods have improved the overall performance of prompt tuning, but they have overlooked the inefficiency of utilizing lengthy soft prompts. SMoP is designed to alleviate this efficiency concern and is orthogonal to most of the existing prompt tuning methods.

Mixture-of-Experts
Mixture-of-Experts is a model structure in which the output of the model is computed by multiple sub-networks (i.e., experts) conditionally activated by a gating mechanism (Shazeer et al., 2017). This enables increasing the number of model parameters without incurring a proportional increase in computation. Typically, the gating mechanism determines which experts process specific tokens (Shazeer et al., 2017; Fedus et al., 2022), though it can be extended to route sequences or batches (Wang et al., 2022; Zuo et al., 2022; Pan et al., 2023). In particular, Fedus et al. (2022) present Switch Transformer, which employs the Sparsely-Gated Mixture-of-Experts layer (Shazeer et al., 2017), and Zuo et al. (2022) propose THOR, which utilizes stochastic (i.e., random) routing.
Recently, Wang et al. (2022) proposed AdaMix, a parameter-efficient fine-tuning method that integrates the concept of Mixture-of-Experts into Adapter (Houlsby et al., 2019). It follows THOR (Zuo et al., 2022) in employing stochastic routing and merging of multiple adapter modules. Both SMoP and AdaMix take inspiration from the Mixture-of-Experts structure to improve parameter-efficient fine-tuning. However, their primary motivations are distinct: SMoP uses multiple short soft prompts for efficient prompt tuning, while AdaMix aims to provide multiple views of the given task for better performance. Accordingly, SMoP employs a linear router for instance-wise prompt selection, resulting in multiple soft prompts each specialized in a subset of the task, whereas AdaMix employs stochastic routing and merging, resulting in a single adapter module per task.

Conclusion
We have presented SMoP (Sparse Mixture-of-Prompts), a novel prompt tuning method that utilizes short soft prompts for efficient training and inference while maintaining the performance gains associated with increased prompt length. To achieve this, SMoP employs a gating mechanism that routes each instance to one of multiple short soft prompts. Experimental results demonstrate that SMoP outperforms prompt tuning while reducing training and inference costs through the utilization of short soft prompts.

Limitations
Given the same total prompt length, the gating mechanism of SMoP introduces additional parameters compared to prompt tuning, incurring additional storage requirements. Comparing prompt tuning with a soft prompt of length 20 (20,480 trainable parameters) to SMoP with 4 prompts of length 5 (24,576 trainable parameters) on T5-base, SMoP adds 20% more trainable parameters, and this difference grows as more prompts are utilized.
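The parameter counts quoted above can be verified with a short calculation (assuming the embedding dimension of 1024 implied by those counts):

```python
e = 1024  # embedding dimension implied by the counts in the text

# Prompt tuning with a single soft prompt of length 20:
pt_params = 20 * e                  # 20 * 1024 = 20,480

# SMoP with k = 4 prompts of length 5, plus the e x k linear router:
k, l = 4, 5
smop_params = k * l * e + e * k     # 20,480 + 4,096 = 24,576

print(pt_params, smop_params, smop_params / pt_params)  # 20480 24576 1.2
```

The 20% overhead here comes entirely from the e × k router weights, which grow linearly with the number of prompts k.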
We further note that SMoP is orthogonal to most existing prompt tuning methods, including prompt transfer learning methods (Vu et al., 2022; Asai et al., 2022; Wang et al., 2023), as mentioned in Section 4. While our investigation has highlighted the significance of incorporating short soft prompts through sparse activation in conventional single-task prompt tuning, we believe that SMoP holds promise as a valuable direction for improving the efficiency of prompt tuning methods in the future.

Appendix A Comparison to Adapter-based Methods
To further explore the advantages of SMoP in the realm of parameter-efficient fine-tuning, we compare SMoP and prompt tuning methods to adapter-based parameter-efficient fine-tuning methods, namely Adapter (Houlsby et al., 2019), AdapterFusion (Pfeiffer et al., 2021), and LoRA (Hu et al., 2022). We provide a brief description of each method and present experimental results on six SuperGLUE tasks with the T5-base model.
Adapter-based methods add modules to the internal structure of the model. Adapter (Houlsby et al., 2019) adds bottleneck modules after the multi-head attention and feed-forward layer of each Transformer layer, while AdapterFusion (Pfeiffer et al., 2021) adds bottleneck modules only after the feed-forward layer. LoRA (Hu et al., 2022) adds a low-rank decomposition of each of the attention matrices, which is directly added to the original weights during inference. We implement these methods on top of the adapter-transformers library (https://github.com/adapter-hub/adapter-transformers).
Table 5 presents the experimental results of full fine-tuning, adapter-based methods, and prompt tuning methods on six SuperGLUE tasks with the T5-base model. While adapter-based methods generally outperform prompt tuning methods under their best configurations, SMoP reaches comparable performance while utilizing up to 190× fewer trainable parameters. In particular, when the gap in trainable parameters narrows to a factor of 33×, SMoP outperforms Adapter on 5 out of 6 tasks. Similar results are observed for AdapterFusion: SMoP shows inferior performance when the bottleneck dimension d is set to 48, but the results reverse when d is reduced to 8.
Compared to LoRA, SMoP shows slightly better performance in both configurations. One notable result is that using a lower rank in LoRA does not yield a significant decrease in performance. However, as shown in Table 4, the level of parameter efficiency of SMoP is not attainable with LoRA, as LoRA (r=1) still requires 6× more trainable parameters than SMoP. These observations highlight the parameter efficiency of SMoP compared to adapter-based approaches.
In general, adapter-based lightweight methods require additional parameters proportional to the number of layers in the backbone model, as they add an adapter module to the internal structure of the original model. In contrast, prompt tuning methods, including SMoP, introduce additional parameters exclusively at the input of the model, so the number of trainable parameters does not increase proportionally with model size (Asai et al., 2022).

B Text-to-Text Templates
We provide the text-to-text templates and verbalizers used in our experiments in Table 6.

C Hyperparameters
We train our model for {50, 100} epochs on CB, COPA, RTE, and WiC, and for {10, 20} epochs on BoolQ and MultiRC, with batch size 32. We use a learning rate of {1e-4, 5e-5, 1e-5} for full fine-tuning and adapter-based methods, and {0.5, 0.3, 0.1} for prompt tuning methods including SMoP. We perform early stopping based on validation performance and terminate training if there is no improvement for 10 epochs. We train the model with the Adafactor optimizer (Shazeer and Stern, 2018), with weight decay 1e-5 and linear learning rate decay with a warmup ratio of 0.06.
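The early-stopping rule described above (patience of 10 epochs on validation performance) can be sketched as a small helper; this is an illustrative reimplementation, not the authors' code:

```python
class EarlyStopper:
    """Signal a stop when the validation score has not improved for
    `patience` consecutive epochs."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, val_score):
        """Record one epoch's validation score; return True to stop."""
        if val_score > self.best:
            self.best = val_score
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

For example, calling `stopper.step(val_accuracy)` after each validation pass returns True once 10 epochs go by without a new best score.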

D Detailed Experimental Results
We provide task-wise results for the experiments presented in the paper. Since we use our own train/validation/test split, the results may differ from previous works such as Lester et al. (2021).

D.1 Performance
Tables 7 and 8 present the experimental results on six SuperGLUE tasks with T5-base and T5-large. Table 9 presents the memory used during training (GB), and Table 10 presents the training time (s/100 steps) for each SuperGLUE task. Table 11 presents the inference FLOPs (GFLOPs/sample) for each SuperGLUE task. For BoolQ and MultiRC on T5-large, we report the results for batch size 16 with gradient accumulation, as using batch size 32 exceeds the memory capacity of a single NVIDIA RTX A6000 GPU.

Figure 1: Accuracy (left) and training memory usage (right) with varying total prompt length on the RTE dataset. For prompt tuning (Lester et al., 2021), increasing soft prompt length improves accuracy, but also results in a significant increase in memory usage. SMoP outperforms prompt tuning while preserving memory usage by sparsely activating short (length 5) prompts.

Table 1: Experimental results on six SuperGLUE tasks. Average training costs, inference costs, and performance for baselines and SMoP are presented. The percentage value next to each cost value indicates the relative change compared to full fine-tuning, and the subscript of each average score indicates the corresponding standard deviation. The highest performance and lowest cost values among prompt tuning methods are highlighted in bold.

Table 4: Comparison of trainable parameter ratios between SMoP and LoRA. The value in parentheses for trainable params % denotes the relative difference, with SMoP as the reference point.

Table 8: Experimental results on baseline methods and SMoP on six SuperGLUE tasks with T5-large. Subscripts of each score represent the standard deviation over multiple runs.

Table 9: Peak memory (GB) during training on SuperGLUE tasks.