AdapterDrop: On the Efficiency of Adapters in Transformers

Transformer models are expensive to fine-tune, slow for inference, and have large storage requirements. Recent approaches tackle these shortcomings by training smaller models, dynamically reducing the model size, and by training light-weight adapters. In this paper, we propose AdapterDrop, removing adapters from lower transformer layers during training and inference, which incorporates concepts from all three directions. We show that AdapterDrop can dynamically reduce the computational overhead when performing inference over multiple tasks simultaneously, with minimal decrease in task performances. We further prune adapters from AdapterFusion, which improves the inference efficiency while maintaining the task performances entirely.


Introduction
While transfer learning has become the go-to method for solving NLP tasks (Pan and Yang, 2010; Torrey and Shavlik, 2010; Ruder, 2019; Howard and Ruder, 2018; Peters et al., 2018), transformer-based models are notoriously deep, requiring millions or even billions of parameters (Radford et al., 2018; Devlin et al., 2019; Radford et al., 2019; Liu et al., 2019; Brown et al., 2020). This results in slow inference and large storage requirements.
We close this gap and establish the computational efficiency of two adapter architectures at training and inference time. We investigate different strategies to further improve the efficiency of adapter-based models by incorporating ideas from all three directions mentioned above. Our strategies rely on dropping adapters from the transformer at training and inference time, resulting in models that can be adjusted dynamically to the available computational resources. Our approaches are agnostic to the pre-trained transformer model (e.g., base, large), which makes them broadly applicable.

Contributions:
1. We are the first to establish the computational efficiency of adapters compared to full fine-tuning. We show that training steps of adapters can be up to 60% faster than full model fine-tuning with common hyperparameter choices, while being 4-6% slower at inference. Hence, adapters are a suitable choice for researchers interested in achieving faster training times, or when extensive hyperparameter tuning is required.

2. We propose AdapterDrop, the efficient and dynamic removal of adapters with minimal impact on the task performances. We show that dropping adapters from lower transformer layers considerably improves the inference speed in multi-task settings. For example, with adapters dropped from the first five layers, AdapterDrop is 39% faster when performing inference on 8 tasks simultaneously. This can be beneficial for researchers working on models that need to make multiple predictions on each input.

3. We prune adapters from adapter compositions in AdapterFusion (Pfeiffer et al., 2021a) and retain only the most important adapters after transfer learning, resulting in faster inference while maintaining the task performances entirely. This is suitable for settings with little labeled training data, where AdapterFusion can achieve ample improvements over standard single-task models.

Table 1: Relative speed of adapters compared to fully fine-tuned models. For example, 1.6 for training with the Pfeiffer adapter means that we can perform 1.6 training steps with this adapter in the time of one training step with full model fine-tuning.

Efficiency of Adapters
We first establish the computational efficiency of adapters without AdapterDrop. As illustrated in Figure 1, the forward and backward pass differ considerably when fine-tuning adapters compared to fully fine-tuning the model. In the forward pass, adapters add overhead through their additional components; in the backward pass, however, it is not necessary to backpropagate through the entire model. We compare the training and inference speed of full model fine-tuning against the adapter architectures of Houlsby et al. (2019) and Pfeiffer et al. (2021a) (depicted in Figure 1) using the AdapterHub.ml framework (Pfeiffer et al., 2020a). We conduct our measurements with the transformer configuration of BERT base and verify them with different GPUs. We provide measurements corresponding to common experiment configurations in Table 1.
Training. Adapters can be considerably faster to train than full model fine-tuning: up to 60% faster in some configurations. The two adapter architectures differ only marginally in training efficiency; due to its simpler architecture, training steps of the Pfeiffer adapter are slightly faster. The magnitude of the differences depends on the input size; the available CUDA cores are the primary bottleneck. We do not observe any notable differences between adapters and full fine-tuning regarding training convergence. The training speedup can be explained by the reduced overhead of gradient computation: most parameters are frozen when using adapters, and it is not necessary to backpropagate through the first components (see Figure 1).
Inference. The two adapter architectures are 94-96% as fast as fully fine-tuned models, which varies depending on the input size. This can have a considerable impact when deployed at scale.

AdapterDrop
We have established that adapters are more efficient in terms of training time; however, there is a perpetual need for sustainable and efficient models (Strubell et al., 2019). Backpropagating through as few layers as possible would further improve the efficiency of training adapters. Inference efficiency can be improved by sharing representations at lower transformer layers when simultaneously performing inference for multiple tasks, i.e., when performing multiple independent classifications on the same input. We establish this in Table 2, finding that models are up to 8.4% faster with every shared layer (16 tasks).
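The arithmetic behind this shared-representation setup can be sketched with a simple cost model (a hypothetical helper that only counts transformer-layer forward passes; the real speedups in Table 2 also depend on batch size and GPU utilization):

```python
def multitask_layer_passes(num_tasks, shared_layers, total_layers=12):
    """Count layer forward passes when the first `shared_layers` layers
    carry no task-specific adapters: they are computed once and their
    representations shared, while the remaining layers run per task."""
    return shared_layers + num_tasks * (total_layers - shared_layers)

# 8 tasks on the same input, sharing the first five layers:
baseline = multitask_layer_passes(8, 0)  # 96 passes
shared = multitask_layer_passes(8, 5)    # 5 + 8 * 7 = 61 passes
saving = 1 - shared / baseline           # roughly a third fewer passes
```

Every additionally shared layer removes `num_tasks - 1` passes, which is why the per-layer speedup grows with the number of simultaneous tasks.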
Motivated by these observations, we propose AdapterDrop: dynamically removing adapters from lower transformer layers (depicted in Figure 1). AdapterDrop is similar to dropping entire transformer layers (Fan et al., 2020), but specialized to adapter settings, where lower layers often have a small impact on the task performances (Houlsby et al., 2019).
We study two training methods for AdapterDrop: (1) Specialized AdapterDrop: removing adapters from the first n transformer layers, where n is fixed during training. This yields separate models for each possible n. (2) Robust AdapterDrop: drawing the integer n randomly from [0, 11] for each training batch.

Table 2 (excerpt): speedup for each shared layer with 2 / 4 / 8 / 16 simultaneous tasks: 4.3% / 6.6% / 7.8% / 8.4%.

Figure 2 shows that specialized AdapterDrop maintains good results even with several dropped layers. With the first five layers dropped, specialized AdapterDrop maintains 97.1% of the original performance (averaged over all eight GLUE tasks; see Table 8). Moreover, robust AdapterDrop achieves comparable results: with five layers dropped, it maintains 95.4% of the original performance (on average). The advantage of robust over specialized AdapterDrop is that the robust variant can be scaled dynamically: depending on the currently available computational resources, it can (de)activate layers with the same set of parameters, whereas specialized AdapterDrop needs to be trained explicitly for every setting.
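The two training variants can be sketched as follows (a minimal illustration assuming a 12-layer model; the function name and sampling granularity are our own, not the AdapterHub API):

```python
import random

NUM_LAYERS = 12

def active_adapter_layers(rng, robust=True, n_fixed=0):
    """Indices of transformer layers whose adapters remain active.

    Specialized AdapterDrop: adapters are removed from the first
    `n_fixed` layers, fixed for the whole training run.
    Robust AdapterDrop: n is drawn uniformly from [0, 11] anew for
    each training batch, so a single model covers every setting of n.
    """
    n = rng.randint(0, NUM_LAYERS - 1) if robust else n_fixed
    return list(range(n, NUM_LAYERS))

rng = random.Random(0)
batch_layers = active_adapter_layers(rng)  # robust: varies per batch
specialized = active_adapter_layers(rng, robust=False, n_fixed=5)
```

At inference time, the robust model can then serve any requested n with one set of weights, which is what makes it dynamically scalable.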
The efficiency gains can be large. When performing inference for multiple tasks simultaneously, we measure inference speedups of 21-42% with five dropped layers, depending on the number of simultaneous tasks (Table 2). Training of our robust adapters is also more efficient, increasing the speed of training steps by 26%. We also explored dropping adapters from randomly chosen layers (instead of early layers); this generally performs worse and requires selecting a suitable dropout rate. The detailed setup is listed in Appendix A.2.

Efficiency of AdapterFusion
AdapterFusion (Pfeiffer et al., 2021a) leverages the knowledge of several adapters from different tasks and learns an optimal combination of the adapters' output representations for a single target task (see Figure 3). AdapterFusion (AF) is particularly useful for small training sets, where learning adequate models is difficult. Despite its effectiveness, AF is computationally expensive because all included adapters are passed through sequentially. Table 3 shows that the differences can be substantial for both training and inference. For instance, compared to a fully fine-tuned model, AF with eight adapters is around 47% slower at training time and 62% slower at inference.

AdapterDrop for AdapterFusion
There exists considerable potential for improving the efficiency of AF, especially at inference time. We address this with two variants of AdapterDrop for AF: (1) removing entire AF layers; (2) pruning the least important adapters from AF models. For more details on the multi-task measurements, see Appendix G.2; every dropped adapter improves the speed of training steps by 4.7%, and we drop 5.5 adapters on average when training robust adapter models (more hyperparameter settings and details are given in Appendix G.2). We also tested AF with parallel operations and found no efficiency gains (see Appendix H). All measurements use the Pfeiffer adapter and depend on the input size; we provide more measurements in Appendix G.3.

Removing AdapterFusion Layers
We fuse the adapters from all eight GLUE tasks and observe the largest gains of AF on RTE and CoLA. We additionally train robust AF models with the same procedure as in §3. We investigate from how many lower layers we can remove AF at test time while still outperforming the corresponding single-task adapter (without AdapterDrop).
Figure 4 shows that AF performs better than the single-task adapter on RTE until removing AF from the first five layers. This improves the inference efficiency by 26%. On CoLA, we observe a different trend: removing AF from the first layer results in more noticeable performance decreases, achieving lower task performances than the single-task adapter. This is in line with recent work showing that some linguistic tasks heavily rely on information from the first layers (Vulić et al., 2020). We explicitly highlight that AdapterDrop might not be suitable for all tasks; however, Figure 13 shows that CoLA represents the most extreme case. Nevertheless, our results suggest that researchers need to be cautious when removing AdapterFusion layers, as there may exist a considerable performance/efficiency tradeoff.

Figure 5: AF is trained with eight adapters, and we gradually remove the least important adapters from the model.

AdapterFusion Pruning
The inference efficiency of AF largely depends on the number of fused adapters, see Table 3. We can, therefore, achieve efficiency improvements by pruning adapters from the trained AF models (depicted in Figure 3). Our hypothesis is that we can safely remove adapters if they are not usually activated by AF, which means that they do not contribute much to the output representations. In each fusion layer, we record the average adapter activations-their relative importance-using all instances of the respective AF training set. We then remove the adapters with lowest activations. Figure 5 demonstrates that we can remove most adapters in AF without affecting the task performance. With two remaining adapters, we achieve comparable results to the full AF models with eight adapters and improve the inference speed by 68%.
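The pruning procedure reduces to ranking adapters by their recorded activations and keeping the top ones; a minimal sketch (adapter names and activation values below are hypothetical, and the helper is ours, not the AdapterHub API):

```python
def prune_fusion_adapters(avg_activation, keep=2):
    """Rank adapters by their average fusion activation (recorded over
    the AF training set) and keep only the `keep` most important ones."""
    ranked = sorted(avg_activation, key=avg_activation.get, reverse=True)
    return ranked[:keep]

# Hypothetical average activations of one fusion layer over eight adapters:
activations = {"mnli": 0.31, "qqp": 0.22, "sst2": 0.12, "qnli": 0.10,
               "rte": 0.09, "mrpc": 0.07, "cola": 0.05, "stsb": 0.04}
kept = prune_fusion_adapters(activations, keep=2)  # ["mnli", "qqp"]
```

In practice this is done per fusion layer, so different layers may retain different adapters.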
We therefore recommend performing AdapterFusion pruning before deploying these models in practice. This is a simple yet effective technique for achieving efficiency gains, even when aiming to maintain performance entirely.

Conclusion
Adapters have emerged as a suitable alternative to full model fine-tuning, and their most widely claimed computational advantage is the small model size. In this work, we have demonstrated that the advantages of adapters go far beyond mere parameter efficiency. Even without our extensions, the training steps of two common adapter architectures are up to 60% faster. However, these improvements come at the cost of 4-6% slower inference speed. Thus, if training is more important, adapters can be advantageous over full model fine-tuning.
AdapterDrop expands these advantages by dropping a variable number of adapters from lower transformer layers. We dynamically reduce the computational overhead at run-time when performing inference over multiple tasks and maintain task performances to a large extent. This benefits researchers working on models that need to make multiple independent predictions on a single input.
Finally, we also investigated the computational efficiency of AdapterFusion models. We find that dropping entire AdapterFusion layers comes at a considerable performance/efficiency tradeoff, whereas pruning of the least activated adapters in each layer can improve the model efficiency while maintaining performance entirely.
We believe that our work can be widely extended and that there exist many more directions for obtaining efficient adapter-based models. For instance, we could explore more efficient pre-trained adapters [11], sharing the adapter weights across layers [12], or pruning adapters from AdapterFusion at training time [13]. In the Appendix to this paper, we present preliminary results for several related ideas, which may serve as a starting point for future work.

Acknowledgments
This work has received financial support from multiple sources. (1) The German Federal Ministry of

[11] In Appendix B, we evaluate MLM pre-trained adapters. Our results suggest that different strategies are necessary for adapters as compared to fully fine-tuned transformers, which can serve as a starting point for further experiments.
[12] Appendix D shows that an adapter with shared weights across layers achieves comparable results to a standard adapter while drastically reducing the number of parameters.
[13] Appendix E shows that we can randomly drop out 75% of the adapters during AdapterFusion training with minimal impact on the task performance.

A.1 Computational Efficiency
We use Python 3.6, PyTorch 1.5.1, and CUDA 10.1 for all measurements. We repeat them with two different GPUs: an NVIDIA Tesla V100 PCIe (32GB) and an NVIDIA Titan X Pascal (12GB). We make use of the torch.cuda.Event class and torch.cuda.synchronize to measure only the exact duration of a training (or inference) step. For both inference and training, we repeat the respective step 300 times. We report the median to mitigate the impact of outliers caused by GPU warmup.
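The measurement loop can be sketched as follows (a CPU stand-in using time.perf_counter; on GPU, the paper instead brackets each step with torch.cuda.Event and calls torch.cuda.synchronize, since CUDA kernels execute asynchronously):

```python
import statistics
import time

def median_step_time(step_fn, repeats=300):
    """Run `step_fn` `repeats` times and report the median duration,
    which mitigates outliers caused by warmup."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        step_fn()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)
```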
Relative speed. We define the relative speed of an adapter compared to full model fine-tuning as S_f / S_a, where S_a and S_f are the time of one step with the adapter model and the fully fine-tuned model, respectively. For example, a relative speed of 1.5 means that the adapter model can perform 1.5 steps in the time the fully fine-tuned model performs one step.
Speedup. Speedup describes the positive change in relative speed of an adapter model when using AdapterDrop (or another method). A speedup of p% means that the adapter model with AdapterDrop requires only (1 − p/100) times the runtime of the adapter model without AdapterDrop. The speedups of AdapterDrop (and AdapterFusion) are additive: if dropping one layer results in a p% speedup, dropping two layers results in a 2p% speedup, and so on.
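These definitions translate directly into code (illustrative helpers, not from the paper's codebase):

```python
def relative_speed(t_adapter, t_full):
    """Steps the adapter model performs in the time of one full
    fine-tuning step; 1.5 means 1.5 adapter steps per full step."""
    return t_full / t_adapter

def runtime_with_drop(t_base, p_per_layer, layers_dropped):
    """Additive speedup: dropping k layers at p% speedup each leaves
    (1 - k * p / 100) of the original runtime."""
    return t_base * (1 - layers_dropped * p_per_layer / 100)
```

For example, at 4.7% speedup per dropped layer, dropping five layers leaves 76.5% of the original runtime.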

A.2 Task Performances
We study the task performances of adapter models on the popular GLUE benchmark (Wang et al., 2018). Following Devlin et al. (2019), we exclude WNLI because of its problematic data construction. We perform our analyses with RoBERTa base (Liu et al., 2019) as the pre-trained model and report the mean and standard deviation over three runs of the best development performance, evaluated after every epoch. We train the larger data sets (SST-2, MNLI, QNLI, and QQP) for 10 epochs and the remaining data sets for 20 epochs. We use a batch size of 32 and, if not otherwise noted, the default hyperparameters for adapter fine-tuning as in Pfeiffer et al. (2021a).

B Adapter Initialization and Convergence
Besides measuring training and inference time, we are interested in (1) how using adapters compares to standard RoBERTa-base with regard to downstream task convergence, and (2) whether initializing adapters with weights pre-trained via masked language modeling leads to faster convergence. First, we compare RoBERTa-base with adapter models using the architecture proposed by Pfeiffer et al. (2021a). Second, we pre-train an adapter with masked language modeling (MLM) using documents from the English Wikipedia. The results for both experiments are visualized in Figure 12. When comparing RoBERTa-base with randomly initialized adapters, we find that adapters do not come at the cost of requiring more training steps for convergence (1). For several of the eight GLUE tasks, we observe similar convergence behavior for the standard RoBERTa-base model and its counterpart using adapters.
Further, we observe across all tasks that initializing the adapter weights with MLM pre-training does not have a substantial impact on the downstream task convergence (compared to a randomly initialized adapter). Thus, we find no evidence that pre-training of adapters with our masked language modeling objective leads to better convergence performance in our experiments (2).

C Detailed Results: AdapterDrop Task Performances
We plot the detailed task performances of AdapterDrop with the different training strategies in Figure 13. The relative differences of AdapterDrop to a standard adapter without AdapterDrop are given in Table 8.

D Adapter with Cross-Layer Parameter Sharing
We can further reduce the number of parameters required for each task by sharing the weights of the adapters across all transformer layers. This is similar to weight sharing in ALBERT (Lan et al., 2020), but specialized to adapters, and can therefore be applied to a wide range of pre-trained models. We use the Pfeiffer adapter architecture in our experiments with the same hyperparameters as in Appendix A.2. Because cross-layer parameter sharing reduces the capacity of adapter models, we study the impact of the adapter compression rate. The compression rate refers to the down-projection factor in the adapter's bottleneck layer and thus determines its capacity (it specifies by how much 'FF Down' in Figure 1 compresses the representations). The standard compression rate is 16; smaller values result in a larger model capacity. Table 6 shows that cross-layer parameter sharing with the same compression rate of 16 largely maintains the performance compared to separate weights, with an average difference of 2.35%. With a smaller compression rate of 4, we close this gap by more than 50% while still requiring 66% fewer parameters. The resulting models are lightweight: our shared adapter with a compression rate of 16 requires only 307KB of storage space.
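A rough parameter count illustrates the effect of sharing (a sketch counting only the two projection matrices and biases; the exact numbers in Table 6 may differ slightly because layer norms and other components are ignored here):

```python
def adapter_param_count(hidden=768, compression=16, layers=12, shared=False):
    """Approximate adapter parameters: 'FF Down' projects hidden ->
    hidden/compression, 'FF Up' projects back. With cross-layer
    sharing, a single adapter is reused in every transformer layer."""
    bottleneck = hidden // compression
    per_adapter = (hidden * bottleneck + bottleneck      # FF Down
                   + bottleneck * hidden + hidden)       # FF Up
    return per_adapter if shared else layers * per_adapter
```

Sharing cuts per-task parameters by a factor of 12 at the same compression rate, and a shared adapter with compression 4 still needs roughly a third of the parameters of 12 separate compression-16 adapters, consistent with the "66% fewer parameters" figure above.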

E Training AdapterFusion with Dropout
We investigate the random dropout of adapters from AdapterFusion during training (using our eight task adapters as in §4) to improve the speed of training steps. Each layer randomly selects different adapters to drop out. This means that the model as a whole may still use the knowledge from all tasks, although not in each layer individually. Table 7 shows the results for the four smallest GLUE tasks in terms of training data size. The speedup that we achieve with AdapterFusion dropout can be substantial: with a dropout rate of 75% (i.e., dropping out 6 of our 8 adapters), each training step is 74% faster on average (with a sequence length of 128 and a batch size of 32). We observe no clear trend in terms of task performances. Fusion dropout leads to consistent decreases on RTE and CoLA, has only a small impact on STS-B (no difference when dropping out 25% of adapters), and yields improvements on MRPC.
The effectiveness of fusion dropout thus depends on the individual downstream task. Nevertheless, we believe that this method could be suitable, e.g., for resource-constrained settings.
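The per-layer random selection described above can be sketched as follows (names are hypothetical; this is not the AdapterHub API):

```python
import random

def fusion_dropout(adapter_names, rate, rng):
    """Select the adapters that stay active in one fusion layer for the
    current training step. Each layer draws independently, so the model
    as a whole can still use knowledge from every task adapter."""
    n_drop = round(len(adapter_names) * rate)
    dropped = set(rng.sample(adapter_names, n_drop))
    return [name for name in adapter_names if name not in dropped]

rng = random.Random(0)
tasks = ["mnli", "qqp", "sst2", "qnli", "rte", "mrpc", "cola", "stsb"]
active = fusion_dropout(tasks, 0.75, rng)  # keeps 2 of the 8 adapters
```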

F Detailed Results: Removing AdapterFusion Layers
The computational overhead of AF can be reduced during inference by decreasing the number of adapters. We investigate how dropping AF layers impacts the performance on the four smallest GLUE tasks (MRPC, STS-B, CoLA, RTE) and visualize the results in Figure 7.
In this experiment, we compare the performance of AF with and without AdapterDrop during training. For both, we use standard adapters as well as adapters created via AdapterDrop as the basis for AF. Unsurprisingly, the performance of AF without AdapterDrop in either the adapters or the fusion drops fastest on all four datasets. Using AdapterDrop when creating the adapters, applying AdapterDrop on AF, or the combination of both significantly reduces the performance drop when omitting fusion layers during inference. On RTE and MRPC, multiple AF layers can be omitted while still performing on par with or better than a single-task adapter. We further find this robustness to be task-dependent: even AF with AdapterDrop shows a steep fall in performance on RTE and CoLA, while being relatively stable on MRPC and STS-B, even with most layers omitted.

G Detailed Efficiency Measurements
In this section, we present detailed results of our efficiency measurements for V100 and TitanX GPUs.

G.1 Adapters
We present the efficiency results for adapters and fully fine-tuned models in Figure 6, where we plot the required time (absolute numbers) during training and inference. The relative speed of adapters compared to fully fine-tuned models is given in Table 9.

G.2 AdapterDrop
Multi-task inference. In Figure 8, we plot the speed of adapters in a multi-task setting compared to fully fine-tuned models with sequential processing of inputs. In Table 11, we present the relative speed of adapters in this setting and show the speedup gained with AdapterDrop for each dropped layer. The average speedup in Table 2 is calculated as the average speedup over the batch sizes 16, 32 and 64 in Table 11.
Training adapters with dropped layers. Table 5 shows the speedup of AdapterDrop when training a single adapter. The average speedup for training with AdapterDrop is 4.7% per layer for the V100 and 4.5% for the TitanX. This is the average over batch sizes 16, 32, and 64 and sequence lengths 64, 128, and 256 (see Table 5).

G.3 AdapterFusion
We plot the speed of AdapterFusion with different numbers of included adapters in Figure 9. In Table 10, we present the relative speed of AdapterFusion compared to a fully fine-tuned model and a model with one adapter. This also shows the computational overhead (slowdown) that results from adding more adapters to AdapterFusion. Table 4 shows the speedup gained with AdapterDrop for AdapterFusion during training and inference. Figure 10 shows the required time as a function of the dropped layers.

H Parallel Implementation of AdapterFusion
AdapterHub's implementation of AdapterFusion passes through each task adapter sequentially. We hypothesized that better efficiency can be achieved by processing the adapters in parallel. We implement the parallel computation of the different adapters by reformulating the linear layers as two convolutions.
The first convolution has a kernel size equal to the hidden dimension of the transformer and a number of output channels equal to the number of adapters times the down-projection dimension of the adapters. The second convolution is a grouped convolution that processes the channels in blocks the size of the down-projection dimension; it outputs a number of channels equal to the number of adapters times the hidden dimension.
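The equivalence behind this reformulation can be illustrated in NumPy (a simplified sketch with plain matrix products instead of convolution modules; shapes and variable names are ours): the first operation mixes all hidden channels into N * down outputs, and the second processes each adapter's block of `down` channels independently, exactly like a grouped convolution.

```python
import numpy as np

rng = np.random.default_rng(0)
N, hidden, down = 4, 16, 4                       # adapters, hidden dim, bottleneck
W_down = rng.standard_normal((N, hidden, down))  # per-adapter down-projections
W_up = rng.standard_normal((N, down, hidden))    # per-adapter up-projections
x = rng.standard_normal((8, hidden))             # a batch of token representations

# Sequential: one pass per adapter, as in AdapterHub's implementation.
sequential = np.stack([x @ W_down[i] @ W_up[i] for i in range(N)], axis=1)

# "Parallel": one wide projection (kernel spanning the hidden dimension,
# N * down output channels), then a grouped projection mapping each
# block of `down` channels back to the hidden dimension.
h = x @ W_down.transpose(1, 0, 2).reshape(hidden, N * down)
parallel = np.einsum("bnd,ndh->bnh", h.reshape(-1, N, down), W_up)

assert np.allclose(sequential, parallel)
```

Both paths produce identical outputs; the question studied in Figure 11 and Table 12 is only which one the GPU executes faster.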

Table 4: Speedup of AdapterDrop for AdapterFusion for each dropped layer, during inference and training.

Adapters | Inference (V100) | Inference (TitanX) | Training (V100) | Training (TitanX)
2  | 3.0% | 3.1% | 6.3% | 6.4%
4  | 4.0% | 4.1% | 6.8% | 6.8%
8  | 5.2% | 5.2% | 7.3% | 7.3%
16 | 6.3% | 6.3% | 7.8% | -

We show in Figure 11 and Table 12 that the iterative implementation is faster than the parallel implementation for larger input sizes (e.g., larger batch sizes). This indicates that once the input can no longer be processed entirely in parallel on the GPU (due to limited CUDA cores), the iterative implementation is more efficient.

Table 6: Task performance scores of the standard approach with separate adapter weights vs. cross-layer parameter sharing. The compression rate denotes the factor by which 'FF Down' in Figure 1 compresses the representations. The number of parameters is given without classification heads.

Table 11: The relative inference speed of simultaneous processing of multiple tasks with adapters compared to sequential processing of tasks with fully fine-tuned models. Gray columns show the speedup of AdapterDrop for every additional dropped layer. All measurements use a sequence length of 128. Batch size 1 for the V100 is an outlier in both speedup and relative speed compared to the other results, due to the small input size (compare with Figure 8).

Figure 8: The absolute time required for performing inference for multiple tasks on the same input. The measurements are conducted with a sequence length of 128. N FF models denotes N fully fine-tuned models, executed sequentially. Parallelized denotes the time required by N fully fine-tuned models running fully parallelized. Batch size 1 on the V100 is an outlier compared to the other results, with a smaller speedup for each dropped layer but a higher relative speed compared to the fine-tuned models, due to the small input size.

Table 12: Negative values indicate that the iterative implementation is faster. We calculate the difference as t_i − t_p, where t_i and t_p are the times for the iterative and parallel implementations, respectively.
In Figure 11 (a), the parallel implementation is faster if the input is sufficiently small, as the GPU is not yet working at capacity and can exploit the parallelism.

Figure 12: Evaluation performance of fine-tuning RoBERTa-base in comparison with different initialization strategies for adapters (randomly initialized vs. pre-trained on the masked language modeling task). Training was conducted for 10k steps with a learning rate of 5e-05 for RoBERTa-base and 0.0001 for adapters, respectively.