Learned Adapters Are Better Than Manually Designed Adapters


Figure 1: Overall comparison between our Learned Adapter framework and baselines. The x-axis is the number of tunable parameters, and the y-axis is the average performance on the GLUE benchmark with the RoBERTa-large backbone. The details can be found in Section 5.
Recently, parameter-efficient tuning (PETuning) has attracted much attention in the research field, since it trains only a small portion of a PTM's parameters and keeps the vast majority frozen, thus alleviating the computation costs of full fine-tuning. A series of studies (Houlsby et al., 2019; Pfeiffer et al., 2021; Mahabadi et al., 2021; Ben-Zaken et al., 2021; Hu et al., 2021; Guo et al., 2021a; Li and Liang, 2021; Lester et al., 2021) has verified that PETuning can achieve competitive performance compared to conventional fine-tuning with very few trainable parameters, resulting in a considerable reduction in model adaptation costs. Adapter-based methods (Houlsby et al., 2019; Pfeiffer et al., 2021; Mahabadi et al., 2021; He et al., 2021) inject newly introduced layers after or around the attention or feed-forward modules of the Transformer block, and yield promising results while tuning only a small number of parameters relative to the PTM.
A branch of recent research has advanced the understanding of adapter-based tuning and refined the adapters' architectures to further improve parameter efficiency.
Adaptable adapters (Moosavi et al., 2022) propose that adapters at different layers should have different activation functions, and thus fit rational activation functions to downstream tasks during parameter tuning. AdapterDrop (Rücklé et al., 2020) reduces the number of adapter parameters by not inserting adapters at the lower layers. He et al. (2021) bridge connections among different PETuning approaches to form a unified framework and further propose to insert adapters in parallel to the modules of the Transformer block. Jie and Deng (2022) and Sung et al. (2022) propose to add encoding operations between the projection layers of an adapter and achieve better PETuning performance. The above empirical evidence implies that altering the adapters' architecture design can improve the PETuning performance of adapters with even fewer tunable parameters. Predictably, such an optimal architecture is difficult to construct manually and may vary across different PTM backbones and tasks. Therefore, we propose to search for the optimal architecture of adapters automatically.
We present the Learned Adapter framework to search for the optimal architecture of adapters automatically. We first construct a unified search space (Figure 2) that considers various design choices of adapters, including the activation functions, encoding operations, and how the adapters are connected to the PTM backbone. For optimization over this search space, we make a simple yet effective modification to the optimization method in DARTS (Liu et al., 2019a), which is better at identifying the proper components for adapters at different intermediate layers.
We conduct extensive experiments to study the effectiveness of our Learned Adapter framework. The experimental results show that with 0.068% of the parameters, we can recover 99.5% of the fine-tuning performance on the GLUE (Wang et al., 2018) benchmark. Moreover, the searched architecture outperforms the manually designed PETuning baselines while tuning fewer parameters. Figure 1 depicts the overall comparison between our Learned Adapter and the baselines. Furthermore, the learned architectures of adapters are transferable across tasks, which significantly strengthens the usefulness of the searched structures. Further experiments demonstrate that our newly proposed search space for adapters is valid.

Related work
Adapter-based tuning. One of the most important research lines of PETuning is adapter-based tuning. Adapter (Houlsby et al., 2019) inserts adapter modules with a bottleneck architecture between consecutive Transformer (Vaswani et al., 2017) sublayers. AdapterFusion (Pfeiffer et al., 2021) only inserts sequential adapters after the feed-forward module. Adapter-based tuning methods achieve results comparable to full model tuning while tuning only a fraction of the backbone model's parameter count. Due to these strong PETuning results, a branch of literature has investigated the architecture of adapters in search of further improvements. He et al. (2021) analyze a wide range of PETuning methods, show that they are essentially equivalent, and propose a general architecture for PETuning. AdapterDrop (Rücklé et al., 2020) investigates the efficiency of removing adapters from the lower layers. Adaptive adapters (Moosavi et al., 2022) investigate the activation functions of adapters and propose to learn them by optimizing the parameters of rational functions as part of the model parameters. Compacter (Mahabadi et al., 2021) uses low-rank parameterized hypercomplex multiplication (Le et al., 2021) to compress the adapters' tunable parameters. Other work (Sung et al., 2022; Jie and Deng, 2022) adds encoding operations, such as self-attention and convolutions, between the bottleneck projections of adapters and achieves better performance.
Our work complements this branch of literature by investigating: (a) whether and how the adapter architectures affect the PETuning performances, and whether different layers of PTMs need different adapter architectures; (b) whether we can obtain better adapter architectures via neural architecture search.
Other PETuning methods. Another main research line of PETuning is prompt-based tuning, which inserts additional soft prompts into the hidden states instead of injecting new neural modules into PTMs. Prompt tuning (Lester et al., 2021) and P-tuning (Liu et al., 2022) insert a soft prompt into the word embeddings only, and can achieve competitive results when applied to supersized PTMs. Prefix-tuning (Li and Liang, 2021) and P-tuning v2 (Liu et al., 2021) insert prompts into every hidden layer of the PTM. IDPG (Wu et al., 2022) uses a prompt generator with parameterized hypercomplex multiplication (Le et al., 2021) to generate a soft prompt for every instance. There are also other popular PETuning methods, such as BitFit (Ben-Zaken et al., 2021), which only tunes the bias terms, and LoRA (Hu et al., 2021), which optimizes low-rank decomposition matrices of the weights within the self-attention layers.
Neural architecture search. Early attempts at neural architecture search (NAS) required massive computation, on the order of thousands of GPU days (Zoph and Le, 2017; Zoph et al., 2018; Liu et al., 2018). Recently, a particular group of one-shot NAS methods, led by the seminal work DARTS (Liu et al., 2019a), has attracted much attention. DARTS formulates the search space as a super-network that can adjust itself in a continuous space, so that the network and architectural parameters can be optimized alternately (bi-level optimization) using gradient descent. A series of works try to improve the performance and efficiency of DARTS, such as Xie et al. (2019), Chen et al. (2021), Chu et al. (2021), and Nayman et al. (2019). SNAS (Xie et al., 2019) reformulates DARTS as a credit assignment task while maintaining differentiability. Gao et al. (2020) penalize the entropy of the architecture parameters to encourage discretization of the hyper-network. P-DARTS (Chen et al., 2021) analyzes the issues in DARTS' bi-level optimization and proposes a series of modifications. PC-DARTS (Xu et al., 2021) reduces the memory cost during search by sampling a portion of the channels in the super-network. FairDARTS (Chu et al., 2021) changes the softmax operations in DARTS into sigmoid and introduces a zero-one loss to prune the architectural parameters. XNAS (Nayman et al., 2019) dynamically wipes out inferior architectures and enhances superior ones. NAS is widely applied in both computer vision and natural language processing, especially in knowledge distillation (Zhu, 2021a; Zhang et al., 2021).
Our work complements the literature by examining the optimization of DARTS on our search space and proposing a new training procedure that does not require re-training after discretization.
3 Search space of Learned Adapter

Pilot experiments and motivations
In this subsection, we conduct a series of experiments on the RTE (Dagan et al., 2005) and MRPC (Dolan and Brockett, 2005) datasets to demonstrate the necessity of investigating the architecture of adapters. The baseline model M is a RoBERTa-large model with a parallel adapter at the feed-forward module (FFN adapter) (He et al., 2021). The backbone model is frozen, and we only tune the adapters on the downstream tasks. The bottleneck dimension is 32 and the activation function is ReLU. The other experimental settings follow Appendix B. We now consider a series of simple modifications to the baseline model.
Modifying the activation function We replace the activation functions of the adapters from ReLU to GeLU, SWISH, or Tanh, while keeping the other settings unchanged. The three modified models are denoted as M gelu, M swish, and M tanh, respectively.
Adding encoding operations We add a self-attention operation (Vaswani et al., 2017) or a convolutional operation with kernel size 3 after the down-projection and before the activation function. The two variants of model M are denoted as M sa and M conv, respectively. Since the extra operations introduce more parameters, we reduce the bottleneck dimension of M sa and M conv to 24 to ensure a fair comparison.
Alternative adapter placements Instead of inserting the adapter around the FFN module of the transformer block, we now consider: (1) M attn, which inserts the adapters at the attention modules (Attn adapter); (2) M block, which inserts the adapters around the entire transformer block (Block adapter). Note that the setting of block adapters is theoretically supported by the general framework of PETuning in He et al. (2021) but has not been considered by previous work. In this work, we will demonstrate the usefulness of block adapters via experiments.
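For concreteness, the sketch below shows roughly where the three placements attach inside a single pre-LN transformer block. The class, the `adapter_fn` factory, and the unscaled parallel form are illustrative assumptions rather than the exact formulation used in this paper or in He et al. (2021).

```python
import torch
import torch.nn as nn

class BlockWithParallelAdapter(nn.Module):
    """Illustration only: where the FFN, Attn, and Block adapter placements attach
    inside one pre-LN transformer block (rough parallel-adapter form, no scaling)."""

    def __init__(self, d_model, n_heads, adapter_fn, placement="ffn"):
        super().__init__()
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.adapter = adapter_fn(d_model)   # a small bottleneck module
        self.placement = placement           # "ffn", "attn", or "block"

    def forward(self, x):
        a = self.attn(self.ln1(x), self.ln1(x), self.ln1(x))[0]
        if self.placement == "attn":    # Attn adapter: parallel to self-attention
            a = a + self.adapter(self.ln1(x))
        h = x + a
        f = self.ffn(self.ln2(h))
        if self.placement == "ffn":     # FFN adapter: parallel to the FFN module
            f = f + self.adapter(self.ln2(h))
        out = h + f
        if self.placement == "block":   # Block adapter: parallel to the whole block
            out = out + self.adapter(x)
        return out
```

For instance, `adapter_fn` could be `lambda d: nn.Sequential(nn.Linear(d, 32), nn.ReLU(), nn.Linear(32, d))` to reproduce a plain bottleneck adapter in this sketch.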
Table 1 reports the experimental results of the above models. The evaluation metrics for the RTE and MRPC tasks are described in Appendix A.2. We can see that four simple modifications to the baseline model, M gelu, M sa, M conv, and M block, slightly outperform M, demonstrating that the adapter architecture is essential for adapter tuning and that it is promising to design better adapter architectures for better adapter-tuning performance.
The pilot experiments raise a vital research question: what are the optimal architectures for adapters? Obviously, such an optimal architecture will differ across tasks and PTMs, and even across different intermediate layers of a PTM, making it impractical to design manually. We are thus motivated to investigate the problem of optimizing the architectures of adapters via neural architecture search.

General architecture of adapters
As depicted in Figure 2, we now construct the search space of the Learned Adapter. The adapter is a bottleneck architecture with bottleneck dimension r, consisting of a down-projection layer MLP_d, an activation function g_1, an encoder layer Enc, another activation function g_2, and finally an up-projection layer MLP_u. Formally, the hidden states h_x go through the adapter and become h_adapter = MLP_u(g_2(Enc(g_1(MLP_d(h_x))))). Following He et al. (2021), the hidden representation h_x also goes through the backbone's corresponding encoding module BEnc, and the adapted hidden states become h'_x = BEnc(h_x) + h_adapter. Following Wu et al. (2022), Mahabadi et al. (2021), and Le et al. (2021), we employ the parameterized hypercomplex multiplication (PHM) layer (Le et al., 2021) with parameter n to reduce the parameters of MLP_d and MLP_u. The PHM layer has a parameter complexity of O(rd/n), reducing the parameters of the projection layers by a factor of at most 1/n.
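The following is a minimal PyTorch sketch of the bottleneck adapter described above, using a simplified PHM-style projection built from a sum of Kronecker products. The class names, the initialization, and the identity defaults for Enc and g_2 are assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class PHMLinear(nn.Module):
    """Simplified PHM-style projection: the weight is a sum of n Kronecker
    products, giving roughly O(in_dim * out_dim / n) parameters."""
    def __init__(self, in_dim, out_dim, n=8):
        super().__init__()
        assert in_dim % n == 0 and out_dim % n == 0
        self.n = n
        self.A = nn.Parameter(torch.randn(n, n, n) * 0.02)                       # "rule" matrices
        self.B = nn.Parameter(torch.randn(n, in_dim // n, out_dim // n) * 0.02)  # block matrices
        self.bias = nn.Parameter(torch.zeros(out_dim))

    def forward(self, x):
        # W = sum_i kron(A_i, B_i) has shape (in_dim, out_dim)
        W = sum(torch.kron(self.A[i], self.B[i]) for i in range(self.n))
        return x @ W + self.bias

class BottleneckAdapter(nn.Module):
    """down-projection -> g_1 -> Enc -> g_2 -> up-projection, as in Figure 2."""
    def __init__(self, d_model, r=32, n=8, g1=None, g2=None, encoder=None):
        super().__init__()
        self.down = PHMLinear(d_model, r, n)     # MLP_d
        self.g1 = g1 or nn.ReLU()
        self.enc = encoder or nn.Identity()      # Enc: e.g. conv, MHA, or skip
        self.g2 = g2 or nn.Identity()
        self.up = PHMLinear(r, d_model, n)       # MLP_u

    def forward(self, h_x):
        return self.up(self.g2(self.enc(self.g1(self.down(h_x)))))
```

As a rough check of the saving, with d = 1024, r = 32, and n = 8, each PHM projection in this sketch holds n^3 + d*r/n ≈ 4.6k parameters instead of the 32.8k of a dense linear layer, which is where the roughly 1/n reduction comes from.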

Search space
We now formally introduce the search space of our Learned Adapter framework. The whole search space contains three types of search cells, as shown in Figure 2.

Activation Search Cell The Activation Search Cell is designated to choose the proper activation functions g_1 and g_2 from several candidates. Similar to So et al. (2019), the collection of candidate activation functions is: (a) ReLU (Agarap, 2018); (b) GeLU (Hendrycks and Gimpel, 2016); (c) SWISH (Ramachandran et al., 2017); (d) Tanh (Krizhevsky et al., 2012); (e) NullAct, which means no activation function, i.e., the input is passed through unchanged.

Encoder Search Cell As shown in Figure 2, different from (Wang et al., 2020; Zhu et al., 2021b), we construct our encoder cell as a simple DAG with a single edge. Our collection of encoder operations consists of the following four groups: (a) 1-d convolutional layers, with stride 1, same padding, output filters equal to the input's dimension, and kernel size equal to 1, 3, 5, or 7 (denoted as conv_k, k = 1, 3, 5, 7); (b) multi-head self-attention (MHA) layers (Vaswani et al., 2017), with head size equal to 2 or 8 (denoted as mha_k, k = 2, 8); (c) skip connection (He et al., 2015), denoted as skip-connect; (d) the null encoding operation that multiplies zero tensors to the input (denoted as null).

Adapter Placement Search Cell This search cell is designated to decide the placement of the adapter in an intermediate transformer block. We consider three candidate placements, shown in Figure 2: (a) FFN adapter, that is, inserting the adapter in parallel to the feed-forward module; (b) Attn adapter, in parallel to the self-attention module; (c) Block adapter, which inserts the adapter in parallel to the whole transformer block. This placement option is supported by the theoretical analysis of He et al. (2021) but has not been considered in the literature. In the experiments, we will show that including the above three choices for adapter placements is necessary.

Note that the above three search cells are single-edge DAGs. Following Pham et al. (2018), Wang et al. (2020), Zhu et al. (2021b), Zhu (2021a), and Zhu et al. (2021d), we consider the macro search space, that is, different adapter architectures are learned for different intermediate layers. Intuitively, the macro search space allows for idiosyncratic architectures at different intermediate layers, leading to easier model adaptation. Despite the simple structures of the search cell DAGs compared to the general NAS literature, our macro search space results in 6.38e+90 combinations of adapter architectures across the intermediate layers of the PTM backbones. Note that our search space contains the adapter architectures from Sung et al. (2022) and Jie and Deng (2022) as special cases.
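To make the candidate sets concrete, a rough sketch of how they might be instantiated in PyTorch follows. The dictionary keys mirror the names above, while the constructors, the shaping assumptions for Conv1d/MHA inputs, and the reading of the mha value as the number of heads are illustrative choices, not the paper's code.

```python
import torch.nn as nn

class Zero(nn.Module):
    """The null encoding operation: multiplies the input by zero (drops the adapter)."""
    def forward(self, x):
        return x * 0.0

def candidate_ops(dim):
    """Candidate sets for the three search cells (dim = bottleneck dimension r)."""
    activations = {
        "relu": nn.ReLU(), "gelu": nn.GELU(), "swish": nn.SiLU(),
        "tanh": nn.Tanh(), "null_act": nn.Identity(),
    }
    encoders = {
        # Conv1d expects (batch, channels, seq); a transpose wrapper is assumed upstream
        **{f"conv_{k}": nn.Conv1d(dim, dim, kernel_size=k, padding=k // 2)
           for k in (1, 3, 5, 7)},
        **{f"mha_{h}": nn.MultiheadAttention(dim, num_heads=h, batch_first=True)
           for h in (2, 8)},
        "skip_connect": nn.Identity(),
        "null": Zero(),
    }
    placements = ["ffn_adapter", "attn_adapter", "block_adapter"]
    return activations, encoders, placements
```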

4 Search Method

Preliminaries on DARTS
Assume there is a pre-defined space of operations denoted by O, where each element o(·) denotes a neural network operation, such as a convolution, self-attention, or an activation function. DARTS (Liu et al., 2019a) initializes a hyper-network in which each block is a search cell, that is, a fully connected directed acyclic graph (DAG) with N nodes. Let (i, j) denote a pair of nodes in the DAG. The core idea of DARTS is to use a weighted sum to include all |O| operations on each edge:

f_{i,j}(z_i) = \sum_{o \in O} \frac{\exp(\alpha^{o}_{i,j})}{\sum_{o' \in O} \exp(\alpha^{o'}_{i,j})} \, o(z_i),

where z_i denotes the output of the i-th node, and \alpha^{o}_{i,j} is the architectural parameter that represents the weight (or importance score) of o(·) on edge (i, j). The output of a node is the sum of all incoming flows, i.e., z_j = \sum_{i<j} f_{i,j}(z_i). The output of the entire cell is formed by summing the outputs of the last two nodes.
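A minimal sketch of this continuous relaxation is given below; `MixedOp` is a hypothetical name, and its candidate list would come from the search cells above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """Relaxed edge f_{i,j}: a softmax-weighted sum over all candidate operations."""
    def __init__(self, ops):
        super().__init__()
        self.ops = nn.ModuleList(ops)                      # the candidate set O
        self.alpha = nn.Parameter(torch.zeros(len(ops)))   # architectural parameters

    def forward(self, z_i):
        weights = F.softmax(self.alpha, dim=0)
        return sum(w * op(z_i) for w, op in zip(weights, self.ops))

    def discretize(self):
        # keep only the operation with the largest architectural weight
        return self.ops[int(self.alpha.argmax())]
```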
This design makes the entire framework differentiable with respect to both the layer weights and the architectural parameters \alpha^{o}_{i,j}, so that architecture search can be performed in an end-to-end fashion. After the search process is completed, the discretization procedure extracts the final sub-network by selecting the best operation on each edge and dropping the lower-scoring operations. The final sub-network is then trained on the original training set with randomly initialized parameters.

Discussion on the search method
The standard optimization method for the above framework is the bi-level optimization proposed in DARTS (Liu et al., 2019a). However, recent works argue that single-level optimization can also work for the DARTS framework. As pointed out by Bi et al. (2019) and Bi et al. (2020), bi-level optimization suffers from considerable inaccuracy in gradient estimation, and the potential instability can increase with the complexity of the search space. Bi et al. (2020) conduct experiments demonstrating that single-level optimization performs comparably to bi-level optimization but with better efficiency. Their experiments are conducted mainly on computer vision benchmarks such as CIFAR-10 (Krizhevsky et al., 2012). In this work, we investigate which optimization method is better under our framework.
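For reference, a first-order sketch of one bi-level step is shown below (single-level optimization would simply update both parameter groups on the same training batch). The optimizer split and the `loss_fn` signature are assumptions, and DARTS' second-order gradient correction is omitted.

```python
def bilevel_step(model, w_optimizer, a_optimizer, train_batch, val_batch, loss_fn):
    """One alternating first-order bi-level step: architectural parameters are
    updated on a validation batch, then network weights on a training batch."""
    # update the architectural parameters alpha using the validation loss
    a_optimizer.zero_grad()
    loss_fn(model, val_batch).backward()
    a_optimizer.step()
    # update the ordinary network weights using the training loss
    w_optimizer.zero_grad()
    loss_fn(model, train_batch).backward()
    w_optimizer.step()
```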
Note that the original DARTS requires re-training the learned network from scratch after the search procedure, which introduces additional computation costs. In this work, we propose to gradually discretize the hyper-network and obtain a sub-network without re-training. We refer to this method as gradually discretizing neural architecture search (GDNAS). We first train the complete hyper-network M for K_1 epochs. Then we select an edge e in the search space to discretize (for example, the edge in the encoder cell). Discretization simply means selecting the operation o*_e with the highest architectural parameter and dropping the other operations. We now obtain a new, reduced hyper-network M'. The discretized edge may cause the performance of the hyper-network to drop significantly, so we further fine-tune the hyper-network M' for K_2 epochs, and repeat this procedure until every edge is discretized.
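Under the same assumptions as the `MixedOp` sketch above, the GDNAS procedure could look roughly as follows; `train_fn`, `mixed_edges`, and the in-place pruning details are hypothetical.

```python
import torch
import torch.nn as nn

def gdnas_search(hyper_network, mixed_edges, train_fn, K1, K2):
    """Gradual discretization (Algorithm 1 sketch): no re-training from scratch.

    `mixed_edges` are MixedOp-style modules as in the earlier sketch, and
    `train_fn(model, epochs=...)` is assumed to run (bi-level or single-level)
    training for the given number of epochs."""
    # 1) train the complete hyper-network for K1 epochs
    train_fn(hyper_network, epochs=K1)
    for edge in mixed_edges:
        # 2) keep only the operation with the highest architectural weight
        best_idx = int(edge.alpha.argmax())
        edge.ops = nn.ModuleList([edge.ops[best_idx]])
        edge.alpha = nn.Parameter(torch.zeros(1), requires_grad=False)
        # 3) briefly fine-tune the reduced hyper-network to recover from the cut
        train_fn(hyper_network, epochs=K2)
    return hyper_network  # the final sub-network, ready for evaluation
```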
In addition to not requiring re-training of the learned network, GDNAS retains the knowledge in the hyper-network and obtains performance gains over a re-trained sub-network. This is analogous to the model pruning literature, where a network pruned from a larger one is usually better than the same network trained from scratch (Liang et al., 2021).

5 Experiments

Evaluation datasets
We evaluate the performance of the methods on the GLUE (Wang et al., 2018) benchmark. This benchmark covers multiple tasks: paraphrase detection (MRPC, QQP), sentiment classification (SST-2), natural language inference (MNLI, RTE, QNLI), and linguistic acceptability (CoLA). Following Devlin et al. (2019) and Raffel et al. (2019), as a common practice, we do not experiment with the WNLI task (Levesque et al., 2011) due to its adversarial nature with respect to the training set. Since the original test sets of the GLUE benchmark are not publicly available, we follow Zhang et al. (2020) and Mahabadi et al. (2021) to construct the train/dev/test splits as follows to ensure a fair comparison: (a) for datasets with fewer than 10k samples (RTE, MRPC, STS-B, CoLA), we divide the original validation set in half, using one half for validation and the other for testing; (b) for larger datasets, we split 1k samples from the training set as the development set and use the original development set as the test set. The detailed statistics and evaluation metrics of the GLUE benchmark are presented in Table 7 of Appendix A.
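As an illustration of this split protocol, a rough sketch using the HuggingFace `datasets` library is given below; the function name, threshold handling, and the assumption of a single `validation` split (MNLI actually has matched/mismatched splits) are simplifications.

```python
from datasets import load_dataset

def make_splits(task, small_threshold=10_000, dev_size=1_000, seed=42):
    """Rebuild train/dev/test splits for a GLUE task as described above."""
    ds = load_dataset("glue", task)
    train, val = ds["train"], ds["validation"]
    if len(train) < small_threshold:
        # small tasks: split the original validation set in half for dev / test
        val = val.shuffle(seed=seed)
        half = len(val) // 2
        dev, test = val.select(range(half)), val.select(range(half, len(val)))
    else:
        # large tasks: hold out 1k training samples as dev, use original dev as test
        train = train.shuffle(seed=seed)
        dev = train.select(range(dev_size))
        train = train.select(range(dev_size, len(train)))
        test = val
    return train, dev, test
```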

Experiment Settings
We run all the experiments on NVIDIA V100 32GB GPUs. We mainly evaluate our method on the GLUE benchmark with the RoBERTa-large (Liu et al., 2019c) backbone model, and also evaluate our framework with the DeBERTa-large (He et al., 2020) and GPT2-large (Radford et al., 2019) backbones. We use HuggingFace Transformers (Wolf et al., 2020) to implement all the methods. Unless otherwise specified, GDNAS adopts the bi-level optimization method of DARTS. For GDNAS' discretization procedure, we set K_1 = 5 and K_2 = 0.5 on large datasets (SST-2, QNLI, QQP and MNLI), and K_1 = 20 and K_2 = 2 on low-resource datasets. The batch size is set to 128 for datasets with more than 10k training samples and 32 otherwise. For Learned Adapter, we set the bottleneck dimension r to 32 and select at most one adapter at each transformer layer. For the PHM layers, we use the PyTorch implementation of Le et al. (2021) and set n to 8. We run each task with 5 different random seeds and report the average performance and standard deviation. More details of the experimental settings are given in Appendix B.

Baselines
We compare our Learned Adapter framework with the current SOTA baseline methods.
Fine-tune The traditional fine-tuning method that trains all parameters in the PTM backbone.

Adapter-based tuning For adapter-based tuning methods, we compare with: (1) Adapter (Houlsby et al., 2019); (2) Compacter (Mahabadi et al., 2021); (3) Parallel Adapter (He et al., 2021) added on the FFN module; (4) LST (Sung et al., 2022). We re-implement Parallel Adapter with PHM projection layers (n = 8).

Prompt-based tuning For prompt-based tuning methods, we compare with: (1) Prompt Tuning (Lester et al., 2021); (2) P-tuning v2 (Liu et al., 2021). The number of prompt tokens in these methods is set to 20.

We implement Adapter, BitFit, and LoRA using the OpenDelta library (https://github.com/thunlp/OpenDelta). Other baselines are implemented using their open-sourced code with default settings. For a fair comparison, we do not use supplementary training as in Wu et al. (2022) to enhance performance.

Results on the GLUE benchmark
Table 2 shows the results on GLUE with the RoBERTa-large backbone. Our Learned Adapter framework outperforms the previous PETuning methods and notably preserves 99.4% of the performance of full-model fine-tuning while only tuning 240K to 300K parameters.
We can observe from Table 2 that: (a) our Learned Adapter framework obtains further improvements by automatically designing adapter architectures for different intermediate layers of the PTM; (b) although we add encoding operations in the adapters, the total number of tunable parameters of the Learned Adapter in the macro setting is smaller than that of Compacter, since our framework can automatically drop adapters on certain layers when necessary.

Further analysis
Explanations of the searched architectures To understand the searched adapter architectures under our Learned Adapter framework, we present the learned adapter architectures on the RTE and SST-2 tasks in Tables 9 and 10 of Appendix D, respectively. From the learned adapter architectures, we can observe that: (a) the adapter architecture varies across layers, showing that different layers require different adapter architectures; (b) on each task, Learned Adapter chooses the null encoding operation on 3-5 intermediate layers, i.e., it drops the adapters on these layers; (c) regarding the adapter placement choices, we find that on each task, all three placement candidates, FFN adapter, Attn adapter, and Block adapter, are selected, demonstrating that introducing block adapters into our search space is necessary; (d) most adapters select convolutional operations, and multi-head self-attention operations tend to occur in adapters of deeper layers; (e) around half of the learned adapters choose NullAct for the second activation function g_2. Furthermore, we observe adapters on the deeper transformer layers that require no activation function but do use an encoder operation, revealing novel design patterns for adapters.
Exploring the limit of parameter efficiency To explore the limit of parameter efficiency, we train the Learned Adapter and Compacter (Mahabadi et al., 2021) with different rank parameters n ∈ {1, 2, 4, 8, 16, 32}. Note that in the main experiments, we set n equal to 4. With larger n, the parameters of the adapters increase proportionally. In Figure 3, we present the results on the RTE and SST-2 tasks. We can see that the advantages of our Learned Adapter framework become more prominent under lower tunable-parameter budgets. The results demonstrate that our framework can effectively deliver the most appropriate architectures under a given parameter budget and boost the performance of adapter-based tuning.

Architectures' Transferability We now evaluate the transferability of the structures searched by the Learned Adapter. The RTE and SST-2 datasets are used as source and target datasets. We search the adapter architectures on the source dataset, train the searched architectures from scratch on the target task, and report the average and standard deviation of scores over 5 random seeds in Table 3. We can see from Table 3 that the searched architectures are highly transferable: the transferred architectures already achieve comparable or better performance than most baseline models (Table 2). This transferability guarantees the reusability of the searched adapter architectures.

Ablation studies on the search space We now conduct an ablation study of our search space by reducing our search space S to a singleton step by step: (a) reduce the activation cells by only keeping the ReLU activation for g_1 and NullAct for g_2 (S_1); (b) further reduce the encoder cell to only include skip-connect (S_2); (c) further reduce the adapter placement cell to only include the FFN adapter, so that the search space only contains Parallel Adapter (He et al., 2021). Table 4 reports the search results on the different search spaces, showing that dropping any component of the whole search space results in performance losses. The results demonstrate that each search cell in our search space design is necessary and beneficial.

Working with other PTM backbones To verify the general applicability of our Learned Adapter framework, we also conduct experiments on two other widely used PTM backbones, DeBERTa-large (He et al., 2020) and GPT2-large (Radford et al., 2019). The results are shown in Table 5.
Our Learned Adapter successfully outperforms the adapter-based tuning baselines on both pre-trained backbones. This result enhances the reliability of our framework.

Discussions on the search method
Search efficiency of GDNAS We use the RTE task to demonstrate the search efficiency. Running the RTE task with DARTS takes 1.5h (70.5 min for bi-level optimization over 25 epochs and 21.6 min for re-training over 25 epochs). Since GDNAS does not require re-training, it takes 1.2h (73.3 min for training the hyper-network for K_1 + 3 * K_2 = 26 epochs). Our method consumes around three times the training time of Parallel Adapter (He et al., 2021), which is affordable compared to manually designing different architectures and running numerous evaluations.

Ablation study of search methods We now run Learned Adapter with GDNAS under single-level optimization, the original DARTS (Liu et al., 2019a), and ENAS (Cai et al., 2018). The results are shown in Table 6. They demonstrate that our GDNAS is effective in discovering better adapter architectures. In addition, the results show that bi-level optimization obtains slightly better results.

Performance on a NAS benchmark To further validate GDNAS in the general NAS setting, we conduct experiments on the CIFAR-10 benchmark; the details and results are reported in Appendix C.

Conclusion
In this work, we propose the Learned Adapter framework, which automatically optimizes adapter architectures. First, we design a unified search space for adapters, taking into account recent work on manual adapter design. Second, in light of the issues in the DARTS method, we propose a novel GDNAS method that delivers better adapter architectures and requires no re-training of the learned architectures. We run extensive experiments and analyses on the GLUE benchmark, demonstrating that our Learned Adapter framework achieves better tuning performance than the baselines while maintaining parameter efficiency.

Limitations
We showed that our proposed method can greatly improve the performance of parameter-efficient tuning on diverse NLU tasks and three different pre-trained models (i.e., RoBERTa-large, DeBERTa-large and GPT2-large). However, we acknowledge the following limitations: (a) super-sized pre-trained models with tens of billions of parameters or more were not studied due to limited computation resources; (b) other tasks in natural language processing, such as text generation tasks, were not considered. Nevertheless, our framework can be easily transferred to other backbone architectures and task types, and it would be of interest to investigate whether the superiority of our method holds in those settings. We will explore this in future work.
B Experimental settings

We train the hyper-network on the training set D_train following Liu et al. (2019a). For the training epochs, we set K_1 = 5 and K_2 = 1 on large datasets (SST-2, QNLI, QQP and MNLI), and K_1 = 20 and K_2 = 5 on low-resource datasets. We run the search procedure once for each task.
After the hyper-network is fully discretized, instead of retraining from scratch, we further train the remaining network for K_2 epochs; we evaluate the model on the dev set and save a model checkpoint every I_eval steps. The best checkpoint on the dev set is used to run predictions on the test set. We report the average scores on the test set and the standard deviations across 5 random seeds.

Other hyper-parameters We run pilot experiments on SST-2 using learning rates in {2e-5, 5e-5, 1e-4, 2e-4}, and find that 1e-4 performs best. For fine-tuning, we try learning rates in {1e-5, 2e-5, 5e-5} and find that 2e-5 performs best. The number of training epochs for the baselines is set to K = 5 on large datasets (SST-2, QNLI, QQP and MNLI) and K = 20 on smaller datasets. We apply these hyper-parameters to all baselines, and no further hyper-parameter tuning is conducted, so the comparison is fair for all methods.

C Experimental results on the CIFAR-10 task
To further validate that our GDNAS method can obtain better search performance than the DNAS baselines, we now conduct experiments in the general NAS setting. Following DARTS (Liu et al., 2019a), we conduct neural architecture search on the CIFAR-10 dataset (Krizhevsky, 2009) using the DARTS search space, keeping all search settings identical to DARTS. We first train the hyper-network with frozen architectural weights for 50 epochs. After selecting the operation on an edge, we tune the hyper-network for 8 epochs to let the modified hyper-network adjust. Following DARTS (Liu et al., 2019a), we run the search and architecture selection phase with four random seeds and report both the best and average test errors of the obtained architectures. The results are reported in Table 8, which compares GDNAS with the DNAS baseline methods. Our GDNAS method achieves a 2.52% test error with a manageable search cost of 0.6 GPU days. The results of GDNAS are comparable to those of methods with more complex procedures, such as P-DARTS (Chen et al., 2021).

D Learned architectures on the GLUE tasks
In this section, we present the learned adapter architectures on the RTE and SST-2 tasks. The learned adapter architectures are presented in Tables 9 and 10, respectively.

Figure 2 :
Figure 2: The overall framework of our Learned Adapter.

Algorithm 1: GDNAS
Input: a hyper-network M, all edges E of the hyper-network M
Output: the set of selected operations {o*_e}, e ∈ E
Data: training set D_train, a batch of validation data B_val
1. Train the hyper-network M on the training set D_train for K_1 epochs;
2. for each edge e in E do
3.   Select the best operation o*_e ← argmax_o α^o_e;
4.   Discretize edge e of the hyper-network M by keeping only o*_e;
5.   Further train the hyper-network M on D_train for K_2 epochs.

Figure 3 :
Figure 3: Performance under different PHM ranks n. The x-axis represents the number of tunable parameters, and the y-axis represents the performance. The performance of full-model fine-tuning is shown as a dotted horizontal line.

Table 1 :
Results of the pilot experiments.

Table 2 :
The overall comparison on the GLUE benchmark with the RoBERTa-large backbone. We report the mean and standard deviation of performance over 5 random seeds. Bold and underline indicate the best and the second-best results. The metric for each task is explained in Appendix A.2.

Table 3 :
Architecture transfer from source datasets to target datasets. The target datasets are in the column names, and the source datasets are in the row names.

Source \ Target | RTE | SST-2
RTE | 80.4 (1.3) | 94.6 (0.1)
SST-2 | 80.3 (1.4) | 94.9 (0.3)

Table 4 :
Experimental results for the ablation study of our Learned Adapter search space.

Table 5 :
Results on two GLUE tasks using DeBERTa-large and GPT2-large models as the backbone. Bold indicates the best PETuning results.

Table 6 :
Experimental results for the ablation study of the search methods.

Table 8 :
The search results on the CIFAR-10 task.

Table 9 :
The learned adapter architectures on the RTE task when the PTM backbone is RoBERTa-large. Each row lists the layer index, adapter placement, and activations g_1 and g_2. If an adapter's architecture contains only "-", our Learned Adapter framework chose the null encoder operation, equivalently dropping this layer's adapter.

Table 10 :
The learned adapter architectures on the SST-2 task when the PTM backbone is RoBERTa-large. Each row lists the layer index, adapter placement, and activations g_1 and g_2. If an adapter's architecture contains only "-", our Learned Adapter framework chose the null encoder operation, equivalently dropping this layer's adapter.