UniPELT: A Unified Framework for Parameter-Efficient Language Model Tuning

Recent parameter-efficient language model tuning (PELT) methods manage to match the performance of fine-tuning with far fewer trainable parameters and perform especially well when training data is limited. However, different PELT methods may perform rather differently on the same task, making it nontrivial to select the most appropriate method for a specific task, especially considering the fast-growing number of new PELT methods and tasks. In light of model diversity and the difficulty of model selection, we propose a unified framework, UniPELT, which incorporates different PELT methods as submodules and learns to activate the ones that best suit the current data or task setup via a gating mechanism. On the GLUE benchmark, UniPELT consistently achieves 1-4% gains compared to the best individual PELT method that it incorporates and even outperforms fine-tuning under different setups. Moreover, UniPELT generally surpasses the upper bound that takes the best performance of all its submodules used individually on each task, indicating that a mixture of multiple PELT methods may be inherently more effective than single methods.


Introduction
As pre-trained language models (PLMs) (Devlin et al., 2019; Brown et al., 2020) grow larger and larger, it becomes increasingly infeasible to perform conventional fine-tuning, where separate replicas of the model parameters are modified per single task. To address this issue, there has recently been a surge of studies on parameter-efficient language model tuning (PELT), namely how to effectively tune the PLMs with fewer trainable parameters. One line of work proposes to tune only a small subset of the parameters, such as the top layers or the bias terms (Ben Zaken et al., 2021). Other studies take a step further by freezing the entire PLM and adding a small number of additional trainable parameters (Houlsby et al., 2019; Li and Liang, 2021; Lester et al., 2021; Guo et al., 2021; Hu et al., 2021).
Existing PELT research generally aims at achieving performance comparable to conventional fine-tuning with as few trainable parameters as possible, and has seen significant progress: the task-specific trainable parameters used in most recent approaches (Lester et al., 2021; Guo et al., 2021) are almost negligible compared to the total parameters of the PLM (<1%). A more challenging yet barely studied problem is whether one can achieve better performance than fine-tuning with fewer parameters. Recent studies (He et al., 2021; Li and Liang, 2021) find that some PELT methods can be more effective than fine-tuning when the training data is limited, possibly due to the reduced risk of overfitting. However, as found in our analytical experiments, various PELT methods may exhibit diverse characteristics and perform rather differently on the same task, which makes it nontrivial to select the most appropriate method for a specific task, especially considering the fast-growing number of new PELT methods as well as downstream tasks. In light of the diverse performance of PELT methods and the cost of selecting the best method, we propose a unified PELT framework, named UNIPELT, which incorporates different PELT methods as submodules and learns to dynamically activate the submodules that best suit the current data or task setup. As a result, model selection is no longer needed and consistently better performance is achieved under different setups. The activation of each submodule in UNIPELT is controlled by a gating mechanism, which learns to favor (assign more weight to) the submodules that perform well on a given task. In addition, since the number of parameters introduced by each submodule is generally small, combining multiple methods leads to negligible losses in parameter efficiency: the trainable parameters in UNIPELT still total <1%.
We select two PELT methods as the representatives for our experiments, adapter-tuning (Houlsby et al., 2019) and prefix-tuning (Li and Liang, 2021), as they (and their extensions) largely represent the most popular PELT methods to date. At a high level, adapter-tuning increases model depth by inserting bottleneck layers into each Transformer layer of the PLM, while prefix-tuning increases model width by prepending continuous vectors (virtual tokens) to the input of each Transformer layer before multi-head attention. In both methods, the original parameters of the PLM are frozen and only the newly added parameters are updated.
We conduct extensive experiments on the General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2019). Experiment results show that UNIPELT is more effective and robust than using each method alone in various scenarios. Specifically, UNIPELT consistently improves over the best submodule that it incorporates by 1 to 3 points and even outperforms fine-tuning, achieving the best average performance on the GLUE benchmark under different setups. More remarkably, UNIPELT often surpasses the upper bound obtained by taking the best performance of all its submodules used individually on each task, which indicates that UNIPELT successfully learns to leverage different submodules under different setups and maintains (near) optimal performance. The fact that UNIPELT outperforms the upper bound also suggests that a mixture of PELT methods may be inherently more effective than single methods.

Contributions.
(1) We conduct analytical experiments on two representative PELT methods under the same testbed and present valuable findings. (2) We propose a unified PELT framework that can incorporate multiple PELT methods as submodules and automatically learn to activate the most appropriate submodule for a given task without model selection.
(3) Our proposed framework achieves better performance than fine-tuning and the PELT methods that it incorporates on the GLUE benchmark under different setups, with negligible losses in parameter efficiency.

PELT methods w/o Additional Parameters
PLMs are often used as feature extractors where only the top layers or the prediction head are fine-tuned. However, such approaches generally lead to degenerate model performance that is much worse than fine-tuning all parameters (Pfeiffer et al., 2021). A recent method, BitFit (Ben Zaken et al., 2021), which only fine-tunes the bias terms of the model, achieves performance comparable to fine-tuning when the training data is limited. In the extreme form, in-context prompting used by models such as GPT-3 (Brown et al., 2020) does not involve any parameter tuning; few-shot demonstrations are merely provided as part of the model input.
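As a rough illustration of BitFit-style tuning, one can freeze every parameter except the bias terms. The snippet below is a minimal PyTorch sketch under that assumption (the toy MLP stands in for a PLM; real implementations may also keep the task head and layer-norm parameters trainable):

```python
import torch.nn as nn


def apply_bitfit(model: nn.Module) -> nn.Module:
    """Freeze all parameters except bias terms (BitFit-style sketch)."""
    for name, param in model.named_parameters():
        # Only parameters registered under a "...bias" name stay trainable.
        param.requires_grad = name.endswith("bias")
    return model


# Toy example: a 2-layer MLP standing in for a PLM.
model = apply_bitfit(nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2)))
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
```

Even on this toy model, the trainable fraction drops to roughly 10% of the parameters; on a real PLM, where weight matrices dominate, the fraction is far smaller.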

PELT methods w/ Additional Parameters
Alternatively, some methods fix the entire PLM and introduce a small number of new trainable parameters. Notable examples in this category include adapter-tuning (Houlsby et al., 2019) and its extensions (Pfeiffer et al., 2021), prefix-tuning (Li and Liang, 2021) and its extensions (Lester et al., 2021), and additive methods (Zhang et al., 2020; Guo et al., 2021; Hu et al., 2021). Next, we introduce these methods (mostly in their primary versions) in more detail to facilitate the introduction of our proposed framework. An illustration is shown in Fig. 1 for better understanding.

Adapter-tuning. Adapter-tuning (Houlsby et al., 2019) is a lightweight alternative to fine-tuning, which adds a trainable bottleneck layer after the feedforward network in each Transformer layer of the PLM. A bottleneck layer consists of a down-projection and up-projection pair that shrinks and recovers the size of the token hidden states. Mathematically, if we denote the output of the feedforward network (after residual connection & layer normalization) as h_FN, the hidden size as D_hidden, and the bottleneck size as D_mid, then the output of the bottleneck layer h_A is:

h_A = W_up^T φ(W_down^T h_FN),

where W_down ∈ R^(D_hidden × D_mid), W_up ∈ R^(D_mid × D_hidden), φ is a nonlinear activation function, and the bias terms are omitted for brevity. The parameters in layer normalization and the final prediction head are sometimes also fine-tuned, depending on the specific adapter variant.
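The bottleneck layer above can be sketched in PyTorch as follows. This is a minimal illustration, not the authors' implementation; the choice of GELU for φ and the placement of the residual connection are assumptions that vary across adapter variants:

```python
import torch
import torch.nn as nn


class Adapter(nn.Module):
    """Bottleneck adapter sketch: h_A = W_up^T phi(W_down^T h_FN), plus residual."""

    def __init__(self, d_hidden: int = 768, d_mid: int = 48):
        super().__init__()
        self.down = nn.Linear(d_hidden, d_mid)  # shrink to bottleneck size D_mid
        self.up = nn.Linear(d_mid, d_hidden)    # recover hidden size D_hidden
        self.act = nn.GELU()                    # nonlinearity phi (an assumption)

    def forward(self, h_fn: torch.Tensor) -> torch.Tensor:
        # Residual around the bottleneck keeps the layer close to identity
        # at initialization; exact placement differs across variants.
        return self.up(self.act(self.down(h_fn))) + h_fn


x = torch.randn(2, 10, 768)   # (batch, seq_len, D_hidden)
out = Adapter()(x)
```

With D_hidden = 768 and D_mid = 48 (the sizes used later in this paper), each adapter adds roughly 2 × 768 × 48 weights per layer, a small fraction of a Transformer layer.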
Adapter-tuning has been shown to be on par with fine-tuning and sometimes exhibits better effectiveness in the low-resource setting (He et al., 2021). Later studies extend adapter-tuning to multi-lingual (Pfeiffer et al., 2021) and multi-task (Karimi Mahabadi et al., 2021) settings, or further reduce its trainable parameters; such variants can be easily incorporated into UNIPELT as a replacement of the vanilla adapter-tuning.

Prefix-tuning. Prefix-tuning (Li and Liang, 2021) prepends a number of task-specific trainable vectors to the input of multi-head attention in each Transformer layer, which the original tokens can attend to as if they were virtual tokens. Specifically, we denote the prefix length as L and the hidden state of the i-th token in Transformer layer k before multi-head attention as h_i^k. Then, for each h_i^k with i ≤ L, there is a corresponding trainable vector E_i^k from an embedding matrix E. The rest of the h_i^k (i > L) are the hidden states of the original tokens (in the actual natural language input), which depend on the output of the previous Transformer layer k − 1:

h_i^k = E_i^k if i ≤ L, and h_i^k = Transformer_k(h^(k−1))_i otherwise.

To allow for more expressiveness, the embedding matrix E is reparameterized by a two-layer feedforward network:

E' = W_up^T φ(W_down^T E),

where E ∈ R^(D_hidden × L), W_down ∈ R^(D_hidden × D_mid), and W_up ∈ R^(D_mid × (D_hidden · N_layer · 2)), with N_layer denoting the number of Transformer layers. The parameters of this network can be discarded after training is complete, and only N_layer · L · 2 prefix vectors of size D_hidden are left, which are prepended to the key and value states of multi-head attention in each of the N_layer Transformer layers. Prefix-tuning was originally used for natural language generation; we adapt it to understanding tasks.
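The reparameterization above can be sketched as follows. This is a minimal illustration under stated assumptions: the Tanh activation and the layout of the output tensor are choices made for clarity, not the authors' exact implementation:

```python
import torch
import torch.nn as nn


class PrefixEncoder(nn.Module):
    """Reparameterized prefix sketch: a small matrix of L virtual-token
    embeddings is expanded by a two-layer FFN into one key prefix and one
    value prefix for every Transformer layer. The FFN can be discarded
    after training; only its outputs are kept."""

    def __init__(self, prefix_len=10, d_hidden=768, d_mid=512, n_layer=12):
        super().__init__()
        self.prefix_len, self.d_hidden, self.n_layer = prefix_len, d_hidden, n_layer
        self.embed = nn.Parameter(torch.randn(prefix_len, d_hidden))  # E (L x D_hidden)
        self.mlp = nn.Sequential(
            nn.Linear(d_hidden, d_mid), nn.Tanh(),          # phi (an assumption)
            nn.Linear(d_mid, d_hidden * n_layer * 2),       # key + value per layer
        )

    def forward(self) -> torch.Tensor:
        # -> (n_layer, 2, prefix_len, d_hidden): a key/value prefix pair
        # to prepend to multi-head attention in each layer.
        out = self.mlp(self.embed)
        return out.view(self.prefix_len, self.n_layer, 2, self.d_hidden).permute(1, 2, 0, 3)


prefixes = PrefixEncoder()()
```

After training, only the N_layer × L × 2 output vectors need to be stored, which is why the large reparameterization network does not count against inference-time storage.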
Note that prefix-tuning is different from prompt-based fine-tuning methods (Schick and Schütze, 2021;Gao et al., 2021) in multiple ways: (1) Prompt-based fine-tuning is not parameter-efficient as it updates all model parameters while prefix-tuning only updates the prefix embedding matrix E.
Figure 1: Illustration of UNIPELT inside one Transformer layer. Each submodule of UNIPELT is controlled by a gating function. The trainable parameters are in green. Q, K, V, and P denote Query, Key, Value, and Prefix, respectively.

(2) The prompts are only used in the model input for prompt-based fine-tuning, but are added to every Transformer layer in prefix-tuning (stored as different vectors). (3) Prompt-based fine-tuning typically leverages carefully designed natural language prompts, while prefix-tuning uses continuous prompts (virtual tokens). A follow-up method of prefix-tuning, named prompt-tuning (Lester et al., 2021), further reduces the task-specific parameters by limiting the prefix to the first layer, but it only performs competitively with very large model sizes (billions of total parameters) and is thus not considered in our study.

Additive Methods. Additive PELT methods treat the model parameters after fine-tuning as the sum of the pre-trained parameters θ_pre-trained and task-specific differences δ_task, where θ_pre-trained is fixed and a new (sub)set of model parameters is added on top (θ_task = θ_pre-trained + δ_task). There are various ways to parameterize the task-specific differences δ_task, leading to different additive methods such as LoRA (Hu et al., 2021), diff pruning (Guo et al., 2021), and side-tuning (Zhang et al., 2020). We plan to incorporate additive methods into UNIPELT in the next version.
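The additive formulation θ_task = θ_pre-trained + δ_task can be sketched in the spirit of LoRA, where δ_task is parameterized as a low-rank product. This is an illustrative sketch, not the LoRA authors' implementation (the rank, the initialization scale, and the absence of a scaling factor are simplifications):

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Sketch of additive tuning: the frozen pre-trained weight plus a
    trainable low-rank difference delta = B @ A."""

    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad = False  # theta_pre-trained stays fixed
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # delta starts at 0

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # base(x) uses the frozen weights; the low-rank term adds delta_task.
        return self.base(x) + x @ (self.B @ self.A).T


layer = LoRALinear(nn.Linear(768, 768))
x = torch.randn(2, 768)
y = layer(x)
```

Because B is initialized to zero, the layer initially behaves exactly like the frozen pre-trained layer, and the task-specific difference grows only through training.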

Task Formulation
Given a large PLM M with size |M| that cannot be fine-tuned directly due to computational or storage cost, suppose that we have a list of PELT methods {m_i} whose trainable parameters are negligible (i.e., ∑_i |m_i| ≪ |M|). Our goal is to design a unified PELT framework that incorporates {m_i} as submodules and learns to dynamically activate (upweight) different submodules when appropriate under different scenarios, such that one can achieve satisfactory results in terms of both model effectiveness and robustness without the hassle of trying out each method individually.

Proposed Method
In our analytical experiments, we observe that different PELT methods exhibit diverse characteristics and perform rather differently on the same task. For example, prefix-tuning generally performs well on natural language inference tasks regardless of the size of training data. Also, as can be seen in Fig. 1 and Sec. 2, different PELT methods often involve different parts of the PLM architecture (e.g., before multi-head attention for prefix-tuning and after feedforward layer for adapter-tuning), making it feasible to combine multiple PELT methods without (directly) interfering with each other.
In light of the two observations above, we propose a unified PELT framework, UNIPELT, which takes a hybrid approach by incorporating multiple PELT methods as submodules. At a high level, UNIPELT learns to activate (upweight) the submodules that best suit the current task or specific data sample and deactivate (downweight) the rest. Gating Mechanism. To achieve fine-grained control of submodule (de)activation, we add a trainable gate for each submodule in every Transformer layer (see Fig. 1). Ideally, if a submodule m i is useful for a given data or task setup, the gate output for m i would be high such that m i plays a more important role in the current setup.
Specifically, for adapter-tuning, there is a residual connection between the feedforward network and the adapter submodule that sums the adapter input (before normalization) h_F and its output h_A as the final output: h'_A = h_A + h_F. We design a gating function G_A ∈ (0, 1) that estimates the importance of adapter-tuning from its direct input h_FN using a feedforward network with sigmoid activation, and then scales the adapter output:

h'_A = G_A · h_A + h_F.

Intuitively, the adapter-tuning submodule is effectively bypassed if G_A ≈ 0.
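A minimal sketch of the gated adapter follows. The text only specifies "a feedforward network with sigmoid activation" for the gate, so the single per-token linear layer below is an assumption for illustration:

```python
import torch
import torch.nn as nn


class GatedAdapter(nn.Module):
    """Adapter with a UNIPELT-style gate: h'_A = G_A * h_A + h_F,
    where G_A in (0, 1) is estimated from the adapter input h_FN."""

    def __init__(self, d_hidden: int = 768, d_mid: int = 48):
        super().__init__()
        self.down = nn.Linear(d_hidden, d_mid)
        self.up = nn.Linear(d_mid, d_hidden)
        self.gate = nn.Linear(d_hidden, 1)  # gate network (an assumption)

    def forward(self, h_fn: torch.Tensor, h_f: torch.Tensor) -> torch.Tensor:
        h_a = self.up(torch.relu(self.down(h_fn)))  # bottleneck output
        g = torch.sigmoid(self.gate(h_fn))          # G_A in (0, 1), per token
        return g * h_a + h_f                        # G_A ~ 0 bypasses the adapter


h_fn = torch.randn(2, 10, 768)
out = GatedAdapter()(h_fn, h_fn)
```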
Similarly, for prefix-tuning, we design a gating function G_P ∈ (0, 1) that is applied to the prefix vectors E_i^k while leaving the representations of the original tokens intact:

h_i^k = G_P · E_i^k for i ≤ L.

In this way, the impact of the prefix is diminished if the gate output of the prefix-tuning submodule is low. The gating function G_P is estimated from the Transformer layer input h_in with another feedforward network.
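The prefix gate can be sketched similarly. How h_in is pooled before the gate network is not specified in the text; mean-pooling over the sequence is an assumption made here for illustration:

```python
import torch
import torch.nn as nn


def gate_prefix(prefix: torch.Tensor, h_in: torch.Tensor,
                gate_net: nn.Module) -> torch.Tensor:
    """Scale the L prefix vectors by G_P in (0, 1), estimated from the
    layer input h_in; the original token representations stay untouched.
    prefix: (L, d_hidden); h_in: (seq_len, d_hidden)."""
    g = torch.sigmoid(gate_net(h_in.mean(dim=0)))  # pooled gate (assumption)
    return g * prefix                              # G_P ~ 0 diminishes the prefix


gate_net = nn.Linear(768, 1)
scaled = gate_prefix(torch.randn(10, 768), torch.randn(20, 768), gate_net)
```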
Despite the seeming simplicity of UNIPELT, we note that it is nontrivial for a unified approach to work well under different scenarios. Naively combining different PELT methods as a hybrid may lead to worse performance than using individual methods, as observed in both our experiments and prior studies (Hu et al., 2021).

Experiment Setup
Task Setup. We conduct extensive experiments on the General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2019), which involves four types of natural language understanding tasks: linguistic acceptability (CoLA), sentiment analysis (SST-2), similarity and paraphrase tasks (MRPC, STS-B, QQP), and natural language inference (MNLI, QNLI, RTE). WNLI is omitted following prior studies (Houlsby et al., 2019; Devlin et al., 2019; He et al., 2021; Ben Zaken et al., 2021) due to its adversarial nature.

Data Setup. We first consider a low-resource setting where training data is limited. We sample a small subset of the training set for each task with size K ∈ {100, 500, 1000}. As it is infeasible to submit a large number of runs to the GLUE leaderboard (2 submissions/day), we take 1,000 samples from the training set as the development set to select the best checkpoint and use the original development set as the test set, following He et al. (2021). Specifically, we randomly shuffle the training set with seed s, take the first K samples as the new training set, and the next 1,000 samples as the development set. To reduce random variance, we shuffle the data with 5 random seeds and report the average performance. Next, we consider a high-resource setting where the whole training set is used for every task, and the best performance on the GLUE development set is recorded.

Compared Methods.
We mainly compare UNIPELT with conventional fine-tuning and the PELT methods that UNIPELT incorporates, namely adapter-tuning (Houlsby et al., 2019) and prefix-tuning (Li and Liang, 2021), when used individually. We additionally compare with a baseline, UNIPELT-NoGate, where the submodules are simply used together without gating.

Implementation Details. We use BERT-base as the major model in the experiments. We adopt AdapterHub (Pfeiffer et al., 2020), a library based on HuggingFace Transformers (Wolf et al., 2019), as our codebase. We re-implement the other submodules in the same codebase to ensure a fair comparison across all compared methods. We largely follow the recommended hyperparameters of AdapterHub and keep them the same for different tasks due to practical considerations. Specifically, we set the input length to 128 and the training batch size to 16. We set the number of epochs to 50 to ensure that all methods under different setups are well trained. We adopt early stopping and set the patience to 10 non-increasing epochs. We set the learning rate of fine-tuning and adapter-tuning to 2e-5 and 1e-4 according to prior studies (Pfeiffer et al., 2020; He et al., 2021). We tune the learning rate of prefix-tuning and UNIPELT over {1e-4, 2e-4, 5e-4} on the development set and set their learning rates to 2e-4 and 5e-4, respectively. We set the prefix length L = 10 and the adapter bottleneck size D_mid = 48.
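The low-resource sampling procedure described in the data setup (shuffle with a seed, take the first K samples as the new training set and the next 1,000 as the development set) can be sketched as:

```python
import random


def low_resource_split(train_set, K, seed, dev_size=1000):
    """Sketch of the paper's low-resource sampling: shuffle deterministically
    with a seed, then slice off the new training and development sets."""
    data = list(train_set)
    random.Random(seed).shuffle(data)           # seeded shuffle, reproducible
    return data[:K], data[K:K + dev_size]       # first K train, next 1,000 dev


# Example with a toy dataset of 5,000 examples and K = 500.
train, dev = low_resource_split(range(5000), K=500, seed=42)
```

Running the same split with 5 different seeds and averaging, as in the paper, then only requires varying the `seed` argument.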

Analysis of Individual PELT Methods
In Table 1, we show the comparison results on the GLUE benchmark with various sizes of training data. As one can see, although the average performance of different methods over the 8 tasks is sometimes similar, the differences are quite significant under certain setups and can be as large as 5 to 9 points on a specific task (e.g., STS-B and MNLI, K = 500), even when excluding cases where some methods fail to learn (e.g., prefix-tuning on QQP, K = 100). Next, we take a closer look at the submodules of UNIPELT when used individually.

Analysis of Adapter-tuning. The performance of adapter-tuning is relatively stable: there is no consistently better or worse result than fine-tuning across different tasks or sizes of training data. In general, adapter-tuning is slightly worse than fine-tuning in most cases. We do not observe that adapter-tuning consistently outperforms fine-tuning in the low-resource setting as in prior studies (He et al., 2021), possibly because they tuned model hyperparameters on each task, which could be computationally prohibitive in real-world applications. For example, the bottleneck size D_mid of adapter-tuning is tuned over {64, 128, 256} in He et al. (2021), while D_mid = 48 in UNIPELT, which involves fewer parameters and is fixed across tasks. Another difference is that we only add one adapter submodule in each Transformer layer, which has been shown to be on par with adding two while using half of the parameters (Pfeiffer et al., 2021).
On the other hand, there are certain tasks (e.g., STS-B) on which adapter-tuning largely outperforms prefix-tuning regardless of the size of training data, suggesting that one should favor adapter-tuning over prefix-tuning under certain scenarios.

Analysis of Prefix-tuning. For prefix-tuning, we observe that it sometimes fails to learn effectively when the training data is limited (e.g., K = 100 on SST-2 and K = 500 on QQP), leading to unsatisfactory performance and/or huge variance across different runs. Similar phenomena have been observed in a concurrent study (Gu et al., 2021) on few-shot prompt-tuning. Overall, prefix-tuning performs poorly with very limited training data (K = {100, 500}) and becomes on par with fine-tuning as well as adapter-tuning when K reaches 1,000.
On the other hand, prefix-tuning performs especially well on certain tasks such as natural language inference (QNLI and MNLI) with various sizes of training data, which suggests that a hybrid approach that learns to activate (assign more weight to) prefix-tuning on these tasks is likely to yield decent results.

Effectiveness of UNIPELT
Now let us turn to the effectiveness of our proposed framework UNIPELT, which incorporates existing PELT methods as submodules.

Low-Resource Performance. We observe that UNIPELT consistently achieves the best average performance in the low-resource setting. Moreover, UNIPELT performs the best or 2nd best on 6/8/7 out of 8 tasks when trained with 100/500/1,000 samples, and never performs the worst in any setup across different tasks, which indicates that UNIPELT is quite robust and performs reliably under different scenarios. The improvements of UNIPELT are generally larger when there are fewer training samples, suggesting that UNIPELT performs especially well in the low-resource regime. In particular, on the tasks where both adapter-tuning and prefix-tuning fail to learn, such as CoLA (K = 100), UNIPELT manages to achieve performance close to fine-tuning.
UNIPELT vs. Upper Bound. In Table 2, we show the comparison of UNIPELT and the upper bound obtained by taking the best performance of its submodules on each task. Perhaps surprisingly, UNIPELT performs even better than the upper bound (although sometimes marginally), which indicates that UNIPELT successfully learns to leverage different submodules and maintains (near) optimal performance under different setups. The fact that UNIPELT outperforms the upper bound also suggests that a mixture of PELT methods might be inherently superior to single methods.
High-Resource Performance. In Table 3, we compare the performance of different methods on the development set of GLUE when all training samples are used. UNIPELT again achieves the best overall performance, although the gains are not as significant as in the low-resource setting. Also, simply combining multiple PELT methods without gating may not work well: although UNIPELT-NoGate never performs the worst on any task, its overall performance is rather poor, which suggests that a careful mixture of PELT methods is important for achieving better model effectiveness.

Efficiency of UNIPELT
Parameter Efficiency. Table 4 lists the number of trainable parameters in different PELT methods. A general trend is that the number of trainable parameters in recent PELT methods has been continuously decreasing. For example, for adapter-tuning, the number of task-specific parameters used to achieve competitive performance on GLUE has been reduced from 3.6% in the primary version (Houlsby et al., 2019) to 0.047% (Mahabadi et al., 2021). Prefix-tuning (Li and Liang, 2021) typically involves 0.1% to 1% additional parameters, while its successor prompt-tuning (Lester et al., 2021) reaches under 0.01% for most model sizes.
As the trainable parameters in recent PELT methods are almost negligible, combining multiple methods does not lead to significant losses in parameter efficiency. UNIPELT still has <1% trainable parameters in total, where its submodules prefix-tuning and adapter-tuning use 0.17% and 0.81%, respectively. The number can be further reduced (to e.g., <0.1%) if one uses more parameter-efficient variants of the two methods, which can be readily incorporated into UNIPELT.
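The trainable-parameter fractions discussed above can be computed with a simple helper. This is a sketch; `trainable_fraction` is a hypothetical name, not part of any library:

```python
import torch.nn as nn


def trainable_fraction(model: nn.Module) -> float:
    """Fraction of model parameters that are trainable (requires_grad=True)."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable / total


# Toy check: freeze one of two identical layers -> exactly half trainable.
m = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 8))
for p in m[0].parameters():
    p.requires_grad = False
frac = trainable_fraction(m)
```

On a real PLM with PELT submodules attached, the same helper reports the <1% figure by counting only the unfrozen submodule and gate parameters against the full model.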

Related Work
Parameter-Efficient Tuning of PLMs. As it is infeasible to train and store a full copy of a large PLM for each downstream task in practice, how to efficiently tune the PLM with a small number of trainable parameters becomes critical. Existing PELT methods can be largely divided into two categories based on whether new trainable parameters are introduced. Specifically, one may either train a subset of the model parameters, such as the prediction head and the bias terms (Ben Zaken et al., 2021), or introduce task-specific parameters to different parts of the PLM, such as before multi-head attention (Li and Liang, 2021) or after the feedforward layer (Houlsby et al., 2019). As the number of PELT methods keeps increasing, the purpose of UNIPELT is to better understand and leverage the differences among existing methods instead of proposing yet another one.
Mixture-of-Experts. UNIPELT is also related to approaches that involve a high-capacity network and activate different parts of the network given different inputs. One notable example is Mixture-of-Experts (MoE) (Shazeer et al., 2017; Hazimeh et al., 2021), which maintains a set of experts (neural networks) and one or more trainable gates that select a combination of the experts specific to each input example. Despite being conceptually similar, UNIPELT differs from MoE in several ways: (1) The submodules in UNIPELT are not combined explicitly by summation as in MoE, but applied in sequential order, affecting each other implicitly.
(2) The "experts" are heterogeneous and diverse in UNIPELT while usually homogeneous or identical in MoE methods. (3) The importance of each submodule in UNIPELT is estimated individually instead of by a shared gate using the same representation.

Conclusion
In this paper, we propose a unified framework that incorporates different PELT methods as submodules and learns to automatically activate the most appropriate submodules for a given data or task setup. Our proposed framework consistently outperforms conventional fine-tuning as well as the submodules that it incorporates under different setups, and often surpasses the upper bound when taking the best performance of each submodule used individually on each task. Our findings suggest that a mixture of multiple PELT methods may be favorable in terms of both model effectiveness and robustness with negligible losses in parameter efficiency. For future work, we will conduct more analytical experiments on existing PELT methods and incorporate more of them into our framework. We will also try to better understand and explain the performance discrepancy of various PELT methods in different scenarios.