AdaMix: Mixture-of-Adaptations for Parameter-efficient Model Tuning

Standard fine-tuning of large pre-trained language models (PLMs) for downstream tasks requires updating hundreds of millions to billions of parameters, and storing a large copy of the PLM weights for every task, resulting in increased costs for storing, sharing and serving the models. To address this, parameter-efficient fine-tuning (PEFT) techniques were introduced, where small trainable components are injected in the PLM and updated during fine-tuning. We propose AdaMix as a general PEFT method that tunes a mixture of adaptation modules, given the underlying PEFT method of choice, introduced in each Transformer layer while keeping most of the PLM weights frozen. For instance, AdaMix can leverage a mixture of adapters like Houlsby or a mixture of low-rank decomposition matrices like LoRA to improve downstream task performance over the corresponding PEFT methods for fully supervised and few-shot NLU and NLG tasks. Further, we design AdaMix such that it matches the same computational cost and the same number of tunable parameters as the underlying PEFT method. By only tuning 0.1-0.2% of PLM parameters, we show that AdaMix outperforms SOTA parameter-efficient fine-tuning and full model fine-tuning for both NLU and NLG tasks.


Introduction
Standard fine-tuning of large pre-trained language models (PLMs) (Devlin et al., 2019; Liu et al., 2019; Brown et al., 2020; Raffel et al., 2019) to downstream tasks requires updating all model parameters. Given the ever-increasing size of PLMs (e.g., 175 billion parameters for GPT-3 (Brown et al., 2020) and 530 billion parameters for MT-NLG (Smith et al., 2022)), even the fine-tuning step becomes expensive as it requires storing a full copy of model weights for every task. To address these challenges, recent works have developed parameter-efficient fine-tuning (PEFT) techniques. These approaches typically underperform standard full model fine-tuning, but significantly reduce the number of trainable parameters. There are many varieties of PEFT methods, including prefix-tuning (Li and Liang, 2021) and prompt-tuning (Lester et al., 2021) that condition frozen language models via natural language task descriptions, low-dimensional projections using adapters (Houlsby et al., 2019; Pfeiffer et al., 2020, 2021), and more recently low-rank approximation (Hu et al., 2021). Figure 1 shows the performance of some popular PEFT methods with varying numbers of tunable parameters. We observe a significant performance gap with respect to full model tuning where all PLM parameters are updated.

Figure 1: Performance of different parameter-efficient fine-tuning methods on the GLUE development set with a RoBERTa-large encoder, following a setup similar to (Houlsby et al., 2019) for fair comparison. We report the performance of Pfeiffer (Pfeiffer et al., 2021), Houlsby (Houlsby et al., 2019) and LoRA (Hu et al., 2021) with their default number of fine-tuned parameters as well as the number of fine-tuned parameters used in AdaMix with a mixture of adaptations. The red dash shows the performance of full model fine-tuning (88.9).
In this paper, we present AdaMix, a mixture-of-adaptations approach, and show that it outperforms SOTA PEFT methods and also full model fine-tuning while tuning only 0.1-0.2% of PLM parameters.
In contrast to traditional PEFT methods that use a single adaptation module in every Transformer layer, AdaMix uses several adaptation modules that learn multiple views of the given task. To design this mixture of adaptations, we take inspiration from sparsely-activated mixture-of-experts (MoE) models. In traditional dense models (e.g., BERT (Devlin et al., 2019), GPT-3 (Brown et al., 2020)), all model weights are activated for every input example. MoE models induce sparsity by activating only a subset of the model weights for each incoming input.
Consider adapters (Houlsby et al., 2019), one of the most popular PEFT techniques, to illustrate our method. A feedforward layer (FFN) is introduced to down-project the hidden representation to a low dimension d (also called the bottleneck dimension), followed by another up-projection FFN to match the dimensionality of the next layer. Instead of using a single adapter, we introduce multiple project-up and project-down FFNs in each Transformer layer. We route input examples to one of the project-up and one of the project-down FFNs, resulting in the same computational cost (FLOPs) as using a single adapter. For methods like LoRA (Hu et al., 2021) that decompose the update to the pre-trained weights into low-rank matrices (A and B), we introduce multiple low-rank decompositions and route input examples to them in the same manner as for adapters.
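The routing idea above can be sketched in a few lines. The following is a minimal, illustrative NumPy mock-up (all dimensions, seeds and variable names are ours, not the paper's); it shows that only one project-down/project-up pair is exercised per step, so the per-step FLOPs match a single adapter:

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, bottleneck, M = 8, 2, 4  # M adaptation modules per layer (toy sizes)

# M project-down and M project-up matrices: the "mixture" of adapters
W_down = [rng.standard_normal((d_model, bottleneck)) * 0.1 for _ in range(M)]
W_up = [rng.standard_normal((bottleneck, d_model)) * 0.1 for _ in range(M)]

def adapter_forward(x, j, k):
    """Route through the j-th project-down and k-th project-up module.
    Only ONE pair is used per step, so FLOPs match a single adapter."""
    h = np.maximum(0.0, x @ W_down[j])   # down-projection + ReLU nonlinearity
    return x + h @ W_up[k]               # up-projection with residual connection

x = rng.standard_normal((3, d_model))    # a batch of 3 token representations
j, k = rng.integers(M), rng.integers(M)  # stochastic routing: pick a random pair
out = adapter_forward(x, j, k)
```

For LoRA, the same routing pattern would select one (A, B) low-rank pair per step instead of a projection pair.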
We discuss different routing mechanisms and show that stochastic routing yields good performance while eliminating the need to introduce any additional parameters for module selection. To alleviate training instability that may arise from the randomness in selecting different adaptation modules at different training steps, we leverage consistency regularization and the sharing of adaptation modules during stochastic routing.
The introduction of multiple adaptation modules results in an increased number of adaptation parameters. This does not increase computational cost but does increase storage cost. To address this, we develop a merging mechanism that combines the weights from the different adaptation modules into a single module in each Transformer layer. This allows us to keep the number of adaptation parameters the same as that of a single adaptation module. Our merging mechanism is inspired by model weight averaging in model soups (Wortsman et al., 2022) and MultiBERTs (Sellam et al., 2022). Weight averaging of models with different random initializations has been shown to improve model performance in recent works (Matena and Raffel, 2021; Neyshabur et al., 2020; Frankle et al., 2020) that show the optimized models lie in the same basin of the error landscape. While the above works are geared towards fine-tuning independent models, we extend this idea to parameter-efficient fine-tuning with randomly initialized adaptation modules and a frozen language model. Overall, our work makes the following contributions: (a) We develop a new method, AdaMix, as a mixture of adaptations for parameter-efficient fine-tuning (PEFT) of large language models. Given any PEFT method of choice, like adapters and low-rank decompositions, AdaMix improves downstream task performance over the underlying PEFT method. (b) AdaMix is trained with stochastic routing and adaptation module merging to retain the same computational cost (e.g., FLOPs, #tunable adaptation parameters) and benefits of the underlying PEFT method. To better understand how AdaMix works, we demonstrate its strong connections to Bayesian Neural Networks and model ensembling. (c) By tuning only 0.1-0.2% of a pre-trained language model's parameters, AdaMix is the first PEFT method to outperform full model fine-tuning for all NLU tasks on GLUE, and it outperforms other competing methods for NLG and few-shot NLU tasks.

Practical benefits of PEFT methods. The most significant benefit of PEFT methods comes from the reduction in memory and storage usage. For a Transformer, the VRAM consumption can be significantly reduced as we do not need to keep track of optimizer states for the frozen parameters. PEFT methods also allow multiple tasks to share the same copy of the full (frozen) PLM. Hence, the storage cost for introducing a new task can be reduced by up to 444x (from 355MB to 0.8MB with the RoBERTa-large encoder in our setting).
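The 444x figure follows directly from the checkpoint sizes quoted above; a quick sanity check:

```python
# Storage for a new task: a full fine-tuned RoBERTa-large copy vs. only the
# tunable adaptation parameters (sizes taken from the paper's setting above)
full_copy_mb = 355.0
adapter_mb = 0.8

reduction = full_copy_mb / adapter_mb
print(round(reduction))  # ~444x storage reduction per additional task
```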
We present background on Mixture-of-Experts (MoE) and adapters in Section A of Appendix.

Mixture-of-Adaptations
Consider a set of M adaptation modules injected in each Transformer layer, where $A_{ij}$ ($j \in \{1, \dots, M\}$) represents the $j$-th adaptation module in the $i$-th Transformer layer. For illustration, we will consider adapters (Houlsby et al., 2019).

We adopt the widely used Transformer architecture (Vaswani et al., 2017) consisting of L repeated Transformer blocks, where each block consists of a self-attention sub-layer, a fully connected feed-forward network (FFN), and residual connections around the sub-layers followed by layer normalization. Each adaptation module $A_{ij}$ corresponding to adapters (Houlsby et al., 2019) consists of a feedforward-up projection matrix $W^{up}_{ij}$ and a feedforward-down projection matrix $W^{down}_{ij}$.

Routing Policy
Recent work like THOR (Zuo et al., 2021) has demonstrated stochastic routing policies like random routing to work as well as classical routing mechanisms like Switch routing (Fedus et al., 2021), with the following benefits. Since input examples are randomly routed to different experts, there is no requirement for additional load balancing, as each expert has an equal opportunity of being activated, simplifying the framework. Further, there are no added parameters, and therefore no additional computation, at the Switch layer for expert selection. The latter is particularly important in our parameter-efficient fine-tuning setting, to keep the parameters and FLOPs the same as those of a single adaptation module. To analyze how AdaMix works, we demonstrate connections of stochastic routing and model weight averaging to Bayesian Neural Networks and model ensembling in Section 2.5.
In the stochastic routing policy for AdaMix with adapters, at any training step we randomly select a pair of feedforward-up and feedforward-down projection matrices in the $i$-th Transformer layer, i.e., $A_i = \{W^{up}_{ij}, W^{down}_{ik}\}$ with $j, k$ sampled uniformly from $\{1, \dots, M\}$. Given this selection of adaptation modules in each Transformer layer at every step, all inputs in a given batch are processed through the same set of modules. Given an input representation $x$ in a given Transformer layer, the selected pair of modules performs the transformation

$x \leftarrow x + f(x\,W^{down}_{ik})\,W^{up}_{ij}$ (1)

where $f(\cdot)$ is the nonlinear activation. Such stochastic routing enables the adaptation modules to learn different transformations during training and obtain multiple views of the task. However, it also creates a challenge at inference time: which modules should be used, given the random routing protocol during training? We address this challenge with the following two techniques, which further allow us to collapse the adaptation modules and obtain the same computational cost (FLOPs, #tunable adaptation parameters) as that of a single module.

Consistency regularization
Consider $\mathcal{A} = \{A_i\}_{i=1}^{L}$ and $\mathcal{B} = \{B_i\}_{i=1}^{L}$ to be the sets of adaptation modules (e.g., projection matrices) activated during two stochastic forward passes through the network for an input $x$ across the L layers of the Transformer. The objective of consistency regularization is to enable the adaptation modules to share information and prevent divergence. To this end, we add the following consistency loss as a regularizer to the task-specific optimization loss:

$\mathcal{L} = -\sum_{c} I(x, c)\,\log\big(\mathrm{softmax}(z_{\mathcal{A}}(x))_c\big) + \mathrm{KL}\big(\mathrm{softmax}(z_{\mathcal{A}}(x))\,\|\,\mathrm{softmax}(z_{\mathcal{B}}(x))\big)$ (2)

where $I(x, c)$ is a binary indicator (0 or 1) of whether class label $c$ is the correct classification for $x$, and $z_{\mathcal{A}}(x)$ and $z_{\mathcal{B}}(x)$ are the predicted logits while routing through the two sets of adaptation modules $\mathcal{A}$ and $\mathcal{B}$ respectively, with KL denoting the Kullback-Leibler divergence. Here $x$ is the input representation from the PLM with frozen parameters; only the parameters of the modules $\{W^{up}, W^{down}\}$ are updated during training.
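The KL consistency term above can be sketched numerically. The following NumPy snippet (logit values are hypothetical, and a symmetric KL is one plausible instantiation; the paper's exact weighting of the regularizer may differ) shows the regularizer penalizing disagreement between the two stochastically routed views:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def kl(p, q):
    # KL(p || q) for rows of probability vectors
    return np.sum(p * (np.log(p) - np.log(q)), axis=-1)

# z_A, z_B: logits from two stochastic forward passes through different
# randomly-routed adaptation modules (made-up values for illustration)
z_A = np.array([[2.0, 0.5, -1.0]])
z_B = np.array([[1.8, 0.7, -0.9]])

# Symmetric KL consistency regularizer keeping the two views from diverging
consistency = 0.5 * (kl(softmax(z_A), softmax(z_B)) +
                     kl(softmax(z_B), softmax(z_A)))
```

The regularizer is zero exactly when the two passes produce identical predictive distributions, and grows as the views diverge.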

Adaptation module merging
While the above regularization mitigates the inconsistency of random module selection during inference, hosting several adaptation modules still increases the serving cost. Prior works on fine-tuning language models for downstream tasks have shown that averaging the weights of different models fine-tuned with different random seeds outperforms a single fine-tuned model. Recent work (Wortsman et al., 2022) has also shown that differently fine-tuned models from the same initialization lie in the same error basin, motivating the use of weight aggregation for robust task summarization. We adopt and extend these techniques for language model fine-tuning to our parameter-efficient training of multi-view adaptation modules.
In contrast to the aforementioned techniques, stochastic routing and consistency regularization, which are applied in the training phase, we employ adaptation merging only during inference. Given the set of adaptation modules $W^{up}_{ij}$ and $W^{down}_{ik}$, we simply average the weights of all the corresponding modules (e.g., project-up or project-down matrices) in every Transformer layer to collapse them into a single module $\{W'^{up}_i, W'^{down}_i\}$, where:

$W'^{up}_i = \frac{1}{M}\sum_{j=1}^{M} W^{up}_{ij}, \qquad W'^{down}_i = \frac{1}{M}\sum_{k=1}^{M} W^{down}_{ik}$

Adaptation module sharing
While stochastic routing to multi-view adaptation modules increases the model capacity, it can also hurt downstream tasks with smaller amounts of labeled data, which may be insufficient for tuning several sets of adaptation modules. To address this challenge, we use another mechanism that shares some of the adaptation modules (e.g., the project-down or the project-up operations) to improve training efficiency. In the standard setting for adapters, we share only the feedforward project-up matrices, i.e., $W^{up}_{ij} = W^{up}_i$. We investigate these design choices via ablation studies in our experiments in Section 3.3 and Section C in the Appendix.
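Sharing amounts to tying one set of weights across all M modules while keeping the others module-specific. A small NumPy sketch of the default configuration described above (shared project-up, per-module project-down; sizes and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
M, d_model, bottleneck = 4, 8, 2

# One shared project-up matrix across all M modules; M distinct project-downs
W_up_shared = rng.standard_normal((bottleneck, d_model)) * 0.1
W_down = [rng.standard_normal((d_model, bottleneck)) * 0.1 for _ in range(M)]

def shared_adapter(x, j):
    # Module choice j only affects the down-projection; the up-projection is tied
    return x + np.maximum(0.0, x @ W_down[j]) @ W_up_shared

x = rng.standard_normal((2, d_model))
outs = [shared_adapter(x, j) for j in range(M)]
```

Tying the project-up weights cuts the number of trainable parameters roughly in half relative to a fully independent mixture, which matters most in low-resource tasks.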

Connection to Bayesian Neural Networks and Model Ensembling
A Bayesian Neural Network (BNN) (Gal and Ghahramani, 2015) replaces a deterministic model's weight parameters with a distribution over the parameters. For inference, a BNN averages over all possible weights, also referred to as marginalization. Consider $f^{W}(x) \in \mathbb{R}^d$ to be the $d$-dimensional output of such a neural network, where the model likelihood is given by $p(y \mid f^{W}(x))$. In our setting, $W = \langle W^{up}, W^{down} \rangle$, along with the frozen PLM parameters, which are dropped from the notation for simplicity. For classification, we can further apply a softmax likelihood to the output to obtain $p(y = c \mid x, W) = \mathrm{softmax}(f^{W}(x))$. Given an instance $x$, the probability distribution over the classes is given by marginalization over the posterior distribution: $p(y = c \mid x) = \int_{W} \mathrm{softmax}(f^{W}(x))\, p(W \mid D)\, dW$. This requires averaging over all possible model weights, which is intractable in practice. Therefore, several approximation methods have been developed based on variational inference and stochastic regularization techniques using dropout. In this work, we leverage another form of stochastic regularization, namely random routing. The objective is to find a surrogate distribution $q_\theta(W)$ in a tractable family of distributions that can replace the true model posterior, which is hard to compute. The ideal surrogate is identified by minimizing the Kullback-Leibler (KL) divergence between the candidate and the true posterior.
Consider $q_\theta(W)$ to be the stochastic routing policy, which samples $T$ masked model weights $\{\widehat{W}_t\}_{t=1}^{T} \sim q_\theta(W)$. For classification tasks, the approximate posterior can now be obtained by Monte-Carlo integration (Gal et al., 2017): $p(y = c \mid x) \approx \frac{1}{T}\sum_{t=1}^{T} \mathrm{softmax}\big(f^{\widehat{W}_t}(x)\big)$. However, computing this approximate posterior in our setting requires storing all the stochastic model weights $\widehat{W}_t$, which increases the serving cost during inference. To reduce this cost, we resort to the other technique of weight averaging via adaptation module merging during inference.
Let $\mathcal{L}^{AM}_{\overline{W}} = \mathbb{E}_{x,y}\,\mathcal{L}\big(\mathrm{softmax}(f^{\overline{W}}(x)), y\big)$ denote the expected loss with merging of the stochastic adaptation weights, where $\overline{W} = \frac{1}{T}\sum_t \widehat{W}_t$ and $\mathcal{L}$ denotes the cross-entropy loss.
Let $\mathcal{L}^{Ens}_{W} = \mathbb{E}_{x,y}\,\mathcal{L}\big(\frac{1}{T}\sum_{t=1}^{T}\mathrm{softmax}(f^{\widehat{W}_t}(x)), y\big)$ denote the expected loss from logit-level stochastic model ensembling.
Prior work (Wortsman et al., 2022) shows that averaging the weights of multiple models fine-tuned with different hyper-parameters improves model performance. They analytically show the similarity in loss between weight averaging ($\mathcal{L}^{AM}_{\overline{W}}$ in our setting) and logit ensembling ($\mathcal{L}^{Ens}_{W}$ in our setting) as a function of the flatness of the loss and the confidence of the predictions. While the above analysis is geared towards averaging of multiple independently fine-tuned model weights, we can apply a similar analysis in our setting to the averaging of multiple stochastically obtained adaptation weights, obtaining a favorable loss $\mathcal{L}^{AM}_{\overline{W}}$. Further, adaptation merging reduces the serving cost during inference since we need to retain only one copy of the merged weights, as opposed to logit ensembling, which requires copies of all the adaptation weights.
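The contrast between the two expected losses can be made concrete with a toy linear scorer. In the NumPy sketch below (T, dimensions, and sampled weights are hypothetical), logit-level ensembling keeps all T weight copies and runs T forward passes, while merging averages the weights once and runs a single pass:

```python
import numpy as np

rng = np.random.default_rng(2)
T, d, C = 4, 8, 3  # T stochastic passes, input dim d, C classes (toy sizes)
x = rng.standard_normal(d)

# T stochastically sampled adaptation weight matrices (hypothetical values)
Ws = [rng.standard_normal((d, C)) for _ in range(T)]

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Logit-level ensembling: average probabilities over T passes (stores T copies)
p_ens = np.mean([softmax(x @ W) for W in Ws], axis=0)

# Adaptation merging: average weights once, then a single forward pass
W_bar = np.mean(Ws, axis=0)
p_merged = softmax(x @ W_bar)
```

The two predictions agree closely when the sampled weights lie near one another (the same error basin, per the discussion above), though they need not be identical in general; merging trades a small approximation for a T-fold reduction in inference cost and stored weights.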

Experimental Setup
Dataset. We perform experiments on a wide range of tasks including eight natural language understanding (NLU) tasks in the General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2019) and three natural language generation (NLG) tasks, namely, E2E (Novikova et al., 2017), WebNLG (Gardent et al., 2017) and DART (Nan et al., 2020). For the NLU and NLG tasks, we follow the same setup as (Houlsby et al., 2019) and (Li and Liang, 2021; Hu et al., 2021), respectively.

Baselines. We compare AdaMix to full model fine-tuning and several state-of-the-art parameter-efficient fine-tuning (PEFT) methods, namely, Pfeiffer Adapter (Pfeiffer et al., 2021), Houlsby Adapter (Houlsby et al., 2019), BitFit (Zaken et al., 2021), Prefix-tuning (Li and Liang, 2021), UNIPELT (Mao et al., 2021) and LoRA (Hu et al., 2021). We use BERT-base (Devlin et al., 2019) and RoBERTa-large (Liu et al., 2019) as encoders for NLU tasks (results in Table 1 and Table 2), and GPT-2 (Radford et al., 2019) for NLG tasks (results in Table 3).

AdaMix implementation details. We implement AdaMix in PyTorch and use Tesla V100 GPUs for experiments, with detailed hyper-parameter configurations presented in Section E in the Appendix. AdaMix with adapters uses an adapter dimension of 16 and 48 with BERT-base and RoBERTa-large encoders, respectively, following the setup of (Hu et al., 2021; Mao et al., 2021) for fair comparison. AdaMix with LoRA uses rank r = 4 following the setup of (Hu et al., 2021) to keep the same number of adaptation parameters during inference. The number of adaptation modules in AdaMix is set to 4 for all tasks and encoders unless otherwise specified. The impact of the adapter dimension and the number of adaptation modules for NLU tasks is investigated in Tables 9 and 10. For most of the experiments and ablation analysis, we report results from AdaMix with adapters for NLU tasks; to demonstrate the generalizability of our framework, we report results from AdaMix with LoRA (Hu et al., 2021) as the underlying PEFT mechanism for NLG tasks.

NLU Tasks
Tables 1 and 2 show the performance comparison among PEFT models with RoBERTa-large and BERT-base encoders, respectively. Fully fine-tuned RoBERTa-large and BERT-base provide the ceiling performance. We observe AdaMix with a mixture-of-adapters to significantly outperform other state-of-the-art baselines on most tasks with different encoders. AdaMix with adapters is the only PEFT method which outperforms full model fine-tuning on all the tasks and on the average score.

NLG Tasks
AdaMix leverages a mixture of adaptations to improve over the underlying PEFT method, as demonstrated in Table 3 for E2E NLG, i.e., AdaMix with LoRA and AdaMix with adapters outperform LoRA (Hu et al., 2021) and adapters (Houlsby et al., 2019), respectively. We report results on DART and WebNLG in Tables 4 and 5 in the Appendix.

Few-shot NLU
In contrast to the fully supervised setting in the above experiments, we also perform few-shot experiments on six GLUE tasks following the same setup (e.g., shots, train and test splits) and evaluation as in (Wang et al., 2021). The detailed experimental configuration is presented in Section B of the Appendix. AdaMix uses a mixture-of-adapters with prompt-based fine-tuning (Gao et al., 2021).
Table 6 shows the performance comparison among different PEFT methods with |K| = 30 labeled examples with RoBERTa-large as the frozen encoder. We observe a significant performance gap between most PEFT methods and full model prompt-based fine-tuning, i.e., with all model parameters being updated. AdaMix with adapters outperforms full model tuning for few-shot NLU, similar to the fully supervised setting. Note that AdaMix and LiST (Wang et al., 2021) use a similar adapter design with prompt-based fine-tuning.

Ablation Study
We perform all the ablation analysis on AdaMix with adapters for parameter-efficient fine-tuning.

Analysis of adaptation merging. In this ablation study, we do not merge adaptation modules and consider two different routing strategies at inference time: (a) random routing, where input is routed to any adaptation module, and (b) fixed routing, where we route all input to the first adaptation module in AdaMix. From Table 7, we observe AdaMix with adaptation merging to perform better than any of the other variants without the merging mechanism. Notably, all of the AdaMix variants outperform full model tuning. Moreover, Figure 4 shows that the performance of the merging mechanism is consistently better than the average performance of random routing and comparable to the best performance of random routing.

Averaging weights vs. ensembling logits. We compare AdaMix with a variant of logit ensembling, denoted as AdaMix-Ensemble. To this end, we make four random routing passes through the network for every input (T = 4) and average the logits from the different passes as the final predicted logit. Inference time for this ensembling method is 4x that of AdaMix. We run repeated experiments with three different seeds and report mean performance in Table 8.

This is further demonstrated in Figure 7 in the Appendix, which shows faster convergence and lower training loss for AdaMix with sharing compared to without, given the same number of training steps. We explore which adaptation module to share (project-up vs. project-down) in Table 11 in the Appendix, which shows similar results.
Impact of the number of adaptation modules.
In this study, we vary the number of adaptation modules in AdaMix as 2, 4 and 8 during training.
Table 9 shows diminishing returns on aggregate task performance with an increasing number of modules. As we increase sparsity and the number of tunable parameters by increasing the number of adaptation modules, low-resource tasks like RTE and SST-2, with a limited amount of labeled data for fine-tuning, degrade in performance compared to high-resource tasks like MNLI and QNLI.
Impact of adapter bottleneck dimension. Table 10 shows the impact of the bottleneck dimension of adapters with different encoders in AdaMix. The model performance improves as the bottleneck dimension, and hence the number of trainable parameters, increases, with diminishing returns after a certain point.

Related Work
Parameter-efficient fine-tuning of PLMs. Recent works on parameter-efficient fine-tuning (PEFT) can be roughly categorized into two categories: (1) tuning a subset of the existing parameters, including head fine-tuning (Lee et al., 2019) and bias-term tuning (Zaken et al., 2021); and (2) tuning newly-introduced parameters, including adapters (Houlsby et al., 2019), prompt-tuning (Lester et al., 2021), prefix-tuning (Li and Liang, 2021) and low-rank approximation (Hu et al., 2021). In contrast, we study parameter-efficient adaptation of pre-trained language models by tuning only a very small number of sparse adapter parameters.

Averaging model weights.
Recent explorations (Szegedy et al., 2016; Matena and Raffel, 2021; Wortsman et al., 2022; Izmailov et al., 2018) study model aggregation by averaging all the model weights. Matena and Raffel (2021) propose to merge pre-trained language models which are fine-tuned on various text classification tasks. Wortsman et al. (2022) explore averaging model weights from various independent runs on the same task with different hyper-parameter configurations. In contrast to the above works on full model fine-tuning, we focus on parameter-efficient fine-tuning. We explore weight averaging for merging the weights of adaptation modules consisting of small tunable parameters that are updated during model tuning, while keeping the large model parameters fixed.

Conclusions
We develop a new framework, AdaMix, for parameter-efficient fine-tuning (PEFT) of large pre-trained language models (PLMs). AdaMix leverages a mixture of adaptation modules to improve downstream task performance without increasing the computational cost (e.g., FLOPs, parameters) of the underlying adaptation method. We demonstrate AdaMix to work with and improve over different PEFT methods like adapters and low-rank decompositions across NLU and NLG tasks.
By tuning only 0.1-0.2% of PLM parameters, AdaMix outperforms full model fine-tuning, which updates all the model parameters, as well as other state-of-the-art PEFT methods.

Limitations
The proposed AdaMix method is somewhat compute-intensive as it involves fine-tuning large-scale language models. The training cost of AdaMix is higher than that of standard PEFT methods since the training procedure involves multiple copies of adapters. Based on our empirical observation, the number of training iterations for AdaMix is usually between 1 and 2 times that of standard PEFT methods. This has a negative impact on the carbon footprint of training the described models.
AdaMix is orthogonal to most existing parameter-efficient fine-tuning (PEFT) studies and can potentially improve the performance of any PEFT method. In this work, we explore two representative PEFT methods, adapters and LoRA, but we did not experiment with other combinations like prompt-tuning and prefix-tuning. We leave those studies to future work.

Acknowledgment
The authors would like to thank the anonymous referees for their valuable comments and helpful suggestions, and would like to thank Guoqing Zheng and Ruya Kang for their insightful comments on the project. This work is supported in part by the US National Science Foundation under grants NSF-IIS 1747614 and NSF-IIS-2141037. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
The predominant methodology for task adaptation is to tune all of the trainable parameters of the PLM for every task. This raises significant resource challenges both during training and deployment. A recent study (Aghajanyan et al., 2021) shows that PLMs have a low intrinsic dimension that can match the performance of the full parameter space.
To adapt PLMs for downstream tasks with a small number of parameters, adapters (Houlsby et al., 2019) have recently been introduced as an alternative approach for lightweight tuning.
The adapter tuning strategy judiciously introduces new parameters into the original PLM. During fine-tuning, only the adapter parameters are updated while the remaining parameters of the PLM are kept frozen. Adapters usually consist of two fully connected layers, as shown in Figure 5, where the adapter layer uses a down-projection $W^{down} \in \mathbb{R}^{d \times r}$ to project the input representation $x$ to a low-dimensional space $r$ (referred to as the bottleneck dimension), with $d$ being the model dimension, followed by a nonlinear activation function $f(\cdot)$, and an up-projection $W^{up} \in \mathbb{R}^{r \times d}$ to project the low-dimensional features back to the original dimension. The adapters are further surrounded by residual connections.
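The conventional adapter described above can be written out directly; a minimal NumPy sketch with toy dimensions (ReLU stands in for the generic nonlinearity $f$):

```python
import numpy as np

rng = np.random.default_rng(3)
d, r = 8, 2  # model dimension d and bottleneck dimension r (toy sizes)

W_down = rng.standard_normal((d, r)) * 0.1  # down-projection, d x r
W_up = rng.standard_normal((r, d)) * 0.1    # up-projection, r x d

def adapter(x):
    # down-project, nonlinearity f (ReLU here), up-project, residual connection
    return x + np.maximum(0.0, x @ W_down) @ W_up

x = rng.standard_normal((5, d))  # a batch of 5 token representations
y = adapter(x)
```

Only W_down and W_up (2dr parameters per adapter) are trained; the PLM weights producing x stay frozen.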
Given the above adapter design with parameters $\psi$, the dataset $D$, and a pre-trained language model encoder $\mathrm{enc}$ with frozen parameters $\Theta_{PLM}$, where $|\Theta_{PLM}| \gg |\psi|$, we perform the following optimization for efficient model adaptation: $\min_{\psi} \sum_{(x, y) \in D} \mathcal{L}\big(y, \mathrm{enc}(x;\ \Theta_{PLM}, \psi)\big)$.

B Few-shot NLU Datasets

Data. In contrast to the fully supervised setting in the above experiments, we also perform few-shot experiments following the prior study (Wang et al., 2021) on six tasks including MNLI (Williams et al., 2018), RTE (Dagan et al., 2005; Bar Haim et al., 2006; Giampiccolo et al., 2007; Bentivogli et al., 2009), QQP and SST-2 (Socher et al.). The results are reported on their development sets following (Zhang et al., 2021). MPQA (Wiebe et al., 2005) and Subj (Pang and Lee, 2004) are used for polarity and subjectivity detection, where we follow (Gao et al., 2021).

D Detailed Results on NLU Tasks
The results on NLU tasks are included in Table 1 and Table 13. AdaMix with the RoBERTa-large encoder achieves the best performance in terms of different task metrics in the GLUE benchmark. AdaMix with adapters is the only PEFT method which outperforms full model fine-tuning on all the tasks and on the average score. Additionally, the improvement brought by AdaMix is more significant with BERT-base as the encoder, demonstrating 2.2% and 1.2% improvement over the performance of full model fine-tuning and the best performing baseline UNIPELT, respectively, with BERT-base. The improvement is observed to be as consistent as that with RoBERTa-large on every task. The NLG results are included in Tables 4 and 5.

E Hyper-parameter
Detailed hyper-parameter configurations for the different tasks are presented in Table 15 and Table 16.

Table 13: Main results on the GLUE development set with BERT-base encoder. The best result on each task is in bold and "-" denotes a missing measure. † and ⋄ denote that the reported results are taken from (Mao et al., 2021; Zaken et al., 2021). The average performance is calculated based on F1 of QQP and MRPC. #Param. refers to the number of updated parameters in the inference stage.

Figure 5 :
Figure 5: Conventional adapter design in standard Transformer architecture.

Figure 6 :
Figure 6: Violin plot of AdaMix-RandomRouting performance distribution with BERT-base and RoBERTa-large encoders. The red dot denotes the performance of AdaMix.

Figure 7 :
Figure 7: Convergence analysis demonstrating the impact of adapter sharing design in AdaMix.

Table 1 :
Results for NLU tasks on the GLUE development set with RoBERTa-large encoder. The best result on each task is in bold and "-" denotes a missing measure. AdaMix with a mixture of adapters outperforms all competing methods as well as the fully fine-tuned large model with only 0.23% tunable parameters. † denotes results reported from (Hu et al., 2021). Mcc refers to Matthews correlation coefficient, and Pearson refers to Pearson correlation. #Param. denotes the number of tunable adaptation parameters used during inference.

Table 2 :
Results on the GLUE development set with BERT-base encoder and AdaMix with a mixture-of-adapters. The best result on each task is in bold. † and ⋄ denote results reported from (Mao et al., 2021; Zaken et al., 2021). Detailed task-specific results are reported in Table 13 of the Appendix. #Param. refers to the number of tunable adaptation parameters during inference.

Table 3 :
Results on the E2E NLG Challenge with GPT-2 medium backbone. The best result on each task is in bold. We report AdaMix results with both adapters and LoRA as the underlying PEFT method. AdaMix outperforms all competing methods as well as the fully fine-tuned large model with only 0.1% tunable parameters. † denotes results reported from (Hu et al., 2021) and repr. denotes reproduced results. #Param. denotes the number of tunable adaptation parameters used during inference. Results on DART and WebNLG are presented in Tables 4 and 5 in the Appendix.

Table 4 :
Results on DART with GPT-2 backbone encoder. The best result on each task is in bold. We report AdaMix results with both adapters and LoRA as the underlying PEFT method. AdaMix outperforms all competing methods as well as the fully fine-tuned large model with only 0.1% tunable parameters. † denotes results reported from (Hu et al., 2021) and repr. denotes reproduced results. #Param. denotes the number of tunable adaptation parameters used during inference.

Table 5 :
Results on WebNLG with GPT-2 medium backbone. The results are based on all categories in the test set of WebNLG. The best result on each task is in bold.

Analysis of adaptation module sharing. We remove adaptation module sharing in AdaMix for ablation and keep four different copies of project-down and four project-up FFN layers. From Table 8, we observe the performance gap between AdaMix and AdaMix without sharing to increase as the dataset size decreases, demonstrating the importance of parameter sharing for low-resource tasks (e.g., RTE, MRPC).

Table 6 :
Mean and standard deviation of several parameter-efficient fine-tuning strategies based on RoBERTa-large with |K| = 30 training labels. The best performance is shown in bold. Prompt-tuning, Head-only and BitFit tune 1M model parameters during inference. Houlsby Adapter, LiST Adapter and AdaMix Adapter tune 14M model parameters. * denotes that the results are taken from (Wang et al., 2021).

Table 7 :
AdaMix without adaptation merging, with different routing and ensembling strategies. Average results are presented on the GLUE development set with BERT-base encoder. Detailed task results are in Table 14 of the Appendix for BERT-base and RoBERTa-large encoders.

Figure 4: Violin plot of AdaMix-RandomRouting performance distribution with RoBERTa-large encoders. The red dot denotes the performance of AdaMix.

Table 8 :
Ablation study demonstrating the impact of consistency regularization and sharing in AdaMix.

Table 10 :
Varying the bottleneck dimension of adapters in AdaMix with RoBERTa-large encoder. * denotes the bottleneck dimension used in AdaMix with adapters. Results with BERT-base encoder are in Table 12 in the Appendix.

Table 11 :
Ablation study demonstrating the impact of parameter sharing in the AdaMix adapter framework.

Following (Wang et al., 2021), the few-shot model only has access to |K| labeled samples for any task. Following the true few-shot learning setting (Perez et al., 2021; Wang et al., 2021), we do not use any additional validation set for hyper-parameter tuning or early stopping. The performance of each model is reported after a fixed number of training epochs. For a fair comparison, we use the same set of few-shot labeled instances for training as in (Wang et al., 2021). We train each model with 5 different seeds and report average performance with standard deviation across the runs. In the few-shot experiments, we follow (Wang et al., 2021) to train AdaMix via the prompt-based fine-tuning strategy. In contrast to (Wang et al., 2021), we do not use any unlabeled data.

Table 12 :
Varying the bottleneck dimension of adapters in AdaMix with BERT-base and RoBERTa-large encoder.
* denotes the bottleneck dimension used in AdaMix with adapters.

Table 14 :
Comparing the impact of different routing and ensembling strategies with AdaMix. Results are presented on the GLUE development set with BERT-base and RoBERTa-large encoders. Average results are calculated following Table 1 and Table 2 for consistency. The best result on each task is in bold and "-" denotes a missing measure.

Table 16 :
Hyper-parameter configurations for GPT-2 Medium on NLG tasks. We retain all other default training and generation-specific hyper-parameters from LoRA (Hu et al., 2021).