Parameter-Efficient Transfer Learning with Diff Pruning

The large size of pretrained networks makes them difficult to deploy for multiple tasks in storage-constrained settings. Diff pruning enables parameter-efficient transfer learning that scales well with new tasks. The approach learns a task-specific “diff” vector that extends the original pretrained parameters. This diff vector is adaptively pruned during training with a differentiable approximation to the L0-norm penalty to encourage sparsity. As the number of tasks increases, diff pruning remains parameter-efficient, as it requires storing only a small diff vector for each task. Since it does not require access to all tasks during training, it is attractive in on-device deployment settings where tasks arrive in stream or even from different providers. Diff pruning can match the performance of finetuned baselines on the GLUE benchmark while only modifying 0.5% of the pretrained model’s parameters per task and scales favorably in comparison to popular pruning approaches.


Introduction
Task-specific finetuning of pretrained deep networks is the dominant paradigm in contemporary NLP, achieving state-of-the-art results across a suite of natural language understanding tasks (Devlin et al., 2019; Liu et al., 2019c; Yang et al., 2019; Lan et al., 2020). While straightforward and empirically effective, this approach is difficult to scale to multi-task, memory-constrained settings (e.g. for on-device applications), as it requires shipping and storing a full set of model parameters for each task. Inasmuch as these models are learning generalizable, task-agnostic language representations through self-supervised pretraining, finetuning the entire model for each task seems especially profligate.
Code: https://github.com/dguo98/DiffPruning
A popular approach to parameter-efficiency is to learn smaller compressed models for each task (Gordon et al., 2020; Zhao et al., 2020; Sanh et al., 2020). Such approaches face a steep sparsity/performance tradeoff and keep a substantial number of nonzero parameters per task (e.g. 10%-30%). Multi-task learning and feature-based transfer allow for more parameter-efficient transfer learning per task (Liu et al., 2019b; Clark et al., 2019; Stickland & Murray, 2019; Reimers & Gurevych, 2019). These methods train a small number of additional parameters (e.g. a linear layer) on top of a shared model. However, multi-task learning generally requires access to all tasks during training to prevent catastrophic forgetting (French, 1999), while feature-based transfer learning (e.g. based on task-agnostic sentence representations) is typically outperformed by finetuning (Howard & Ruder, 2018).
An appealing middle ground is to finetune an extension of the base model for specific tasks. This approach captures the training benefits of finetuning while maintaining the task modularity of feature-based transfer. For example, Adapters (Rebuffi et al., 2018) use smaller, task-specific modules that are inserted between layers of a model. This approach does not require access to all tasks during training, targeting realistic settings where new tasks arrive in stream (Houlsby et al., 2019; Pfeiffer et al., 2020a,b,c). Houlsby et al. (2019) find that adapter layers can match the performance of fully finetuned BERT on the GLUE benchmark while requiring 3.6% additional parameters (on average) per task.
Diff pruning is a new extension to pretrained models with the goal of even more parameter-efficient transfer learning. Instead of modifying the architecture of the model, diff pruning extends the base model through a task-specific difference vector.
In order to learn this vector, we reparameterize the task-specific model parameters as $\theta_{\text{task}} = \theta_{\text{pretrained}} + \delta_{\text{task}}$, where the pretrained parameter vector $\theta_{\text{pretrained}}$ is fixed and the task-specific diff vector $\delta_{\text{task}}$ is finetuned. The diff vector is regularized with a differentiable approximation to the L0-norm penalty (Louizos et al., 2018) to encourage sparsity.
Diff pruning can become extremely parameter-efficient, as it only requires storing the nonzero positions and weights of the diff vector for each task. The cost of storing the shared pretrained model remains constant and is amortized across multiple tasks. On the GLUE benchmark (Wang et al., 2019a), diff pruning can match the performance of the fully finetuned BERT baselines while finetuning only 0.5% of the pretrained parameters per task. As the number of tasks increases, diff pruning outperforms popular pruning-based methods in the amount of storage required.

Background: Transfer Learning
Transfer learning in NLP mostly uses a pretrain-and-finetune paradigm, which initializes a subset of the model parameters for all tasks from a pretrained model and then finetunes on a task-specific objective. Pretraining objectives include context prediction (Mikolov et al., 2013), autoencoding (Dai & Le, 2015), machine translation (McCann et al., 2017), and more recently, variants of language modeling (Peters et al., 2018; Radford et al., 2018; Devlin et al., 2019).
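Here we consider applying transfer learning to multiple tasks. We consider a setting with a potentially unknown set of tasks (which may arrive in stream), where each task $\tau \in \mathcal{T}$ has an associated training set $\mathcal{D}_\tau = \{x_\tau^{(n)}, y_\tau^{(n)}\}_{n=1}^{N}$. For all tasks, the goal is to produce (possibly tied) model parameters $\theta_\tau$ to minimize the empirical risk,

$$\min_{\theta_\tau} \; \frac{1}{N} \sum_{n=1}^{N} C\big(f_\tau(x_\tau^{(n)}; \theta_\tau),\, y_\tau^{(n)}\big) + \lambda R(\theta_\tau),$$

where $f_\tau(\cdot\,; \theta_\tau)$ is a parameterized function over the input (e.g. a neural network), $C(\cdot, \cdot)$ is a loss function (e.g. cross-entropy), and $R(\cdot)$ is an optional regularizer with hyperparameter $\lambda$.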
We can use the pretrain-finetune approach by simply learning independent parameters for each task. However, the large size of pretrained models makes this approach exceedingly parameter-inefficient. For example, widely adopted models such as BERT$_{\text{BASE}}$ and BERT$_{\text{LARGE}}$ have 110M and 340M parameters respectively, while their contemporaries have parameter counts in the billions (Raffel et al., 2020; Shoeybi et al., 2019; Rajbhandari et al., 2019). Storing the fully finetuned models therefore becomes difficult even for a moderate number of tasks. A classic approach to tackling this parameter-inefficiency is to train a single shared model (along with a task-specific output layer) against multiple tasks through joint training (Caruana, 1997). However, the usual formulation of multi-task learning requires the set of tasks $\mathcal{T}$ to be known in advance in order to prevent catastrophic forgetting (French, 1999), making it unsuitable for applications in which the set of tasks is unknown or when tasks arrive in stream.

Diff Pruning
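Diff pruning formulates task-specific finetuning as learning a diff vector $\delta_\tau$ that is added to the pretrained model parameters $\theta$, which remain fixed. We first reparameterize the task-specific model parameters,

$$\theta_\tau = \theta + \delta_\tau,$$

which results in the following empirical risk minimization problem,

$$\min_{\delta_\tau} \; L(\mathcal{D}_\tau, f_\tau, \theta + \delta_\tau) + \lambda R(\theta + \delta_\tau),$$

where for brevity we define $L(\mathcal{D}_\tau, f_\tau, \theta_\tau)$ as

$$L(\mathcal{D}_\tau, f_\tau, \theta_\tau) = \frac{1}{N} \sum_{n=1}^{N} C\big(f_\tau(x_\tau^{(n)}; \theta_\tau),\, y_\tau^{(n)}\big).$$

This trivial reparameterization shows that the cost of storing the pretrained parameters $\theta$ is amortized across tasks, and the only marginal cost for new tasks is the diff vector. If we can regularize $\delta_\tau$ to be sparse such that $\|\delta_\tau\|_0 \ll \|\theta\|_0$, then this approach can become more parameter-efficient as the number of tasks increases. We can specify this goal with an L0-norm penalty on the diff vector,

$$R(\theta + \delta_\tau) = \|\delta_\tau\|_0 = \sum_{i=1}^{d} \mathbb{1}\{\delta_{\tau,i} \neq 0\}.$$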
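To make the amortization argument concrete, the following is a minimal sketch of how a sparse diff could be stored and applied. It uses plain Python lists in place of model tensors; the function names `extract_sparse_diff` and `apply_sparse_diff` and the toy numbers are illustrative assumptions, not part of the released code.

```python
from typing import Dict, List


def extract_sparse_diff(pretrained: List[float], task: List[float]) -> Dict[int, float]:
    """Return {position: value} for the nonzero entries of delta = task - pretrained."""
    diff = {}
    for index, (base, tuned) in enumerate(zip(pretrained, task)):
        if tuned != base:
            diff[index] = tuned - base
    return diff


def apply_sparse_diff(pretrained: List[float], diff: Dict[int, float]) -> List[float]:
    """Rebuild theta_tau = theta + delta_tau; the shared vector is copied, not modified."""
    task_params = list(pretrained)
    for index, value in diff.items():
        task_params[index] += value
    return task_params


# Toy numbers chosen for illustration only (hypothetical parameters).
theta = [0.5, -1.0, 0.25, 2.0, 0.0, 1.5]
theta_task = [0.5, -1.0, 0.75, 2.0, 0.0, 1.0]

delta_task = extract_sparse_diff(theta, theta_task)
restored = apply_sparse_diff(theta, delta_task)
print(len(delta_task), "of", len(theta), "entries stored for this task")
print(restored == theta_task)
```

Only the dictionary needs to be kept per task; the shared vector `theta` is stored once and reused for every task.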

Differentiable approximation to the L0-norm
This regularizer is difficult to optimize as it is non-differentiable. In order to approximate this L0 objective, we follow an approach for gradient-based learning with L0 sparsity using a relaxed mask vector (Louizos et al., 2018). This approach involves relaxing a binary vector into continuous space, and then multiplying it with a dense weight vector to determine how much of the weight vector is applied during training. After training, the mask is made deterministic, and a large portion of the diff vector is zero. To apply this method we first decompose $\delta_\tau$ into a binary mask vector multiplied with a dense vector,

$$\delta_\tau = z_\tau \odot w_\tau, \qquad z_\tau \in \{0, 1\}^d, \quad w_\tau \in \mathbb{R}^d.$$

We now lower bound the true objective and optimize an expectation with respect to $z_\tau$, whose distribution $p(z_\tau; \alpha_\tau)$ is initially Bernoulli with introduced parameters $\alpha_\tau$,

$$\min_{\alpha_\tau, w_\tau} \; \mathbb{E}_{z_\tau \sim p(z_\tau; \alpha_\tau)} \big[ L(\mathcal{D}_\tau, f_\tau, \theta + \delta_\tau) + \lambda \|\delta_\tau\|_0 \big].$$

This objective is still complicated by the discrete nature of the $z_\tau$'s, but the expectation provides some guidance for empirically effective relaxations. We follow prior work (Louizos et al., 2018; Wang et al., 2019b) and relax $z_\tau$ into continuous space $[0, 1]^d$ with a stretched Hard-Concrete distribution (Jang et al., 2017; Maddison et al., 2017), which allows for the use of pathwise gradient estimators. Specifically, $z_\tau$ is now defined to be a deterministic and (sub)differentiable function of a sample $u$ from a uniform distribution,

$$u \sim U(0, 1)^d, \qquad s_\tau = \sigma\big(\log u - \log(1 - u) + \alpha_\tau\big), \qquad \bar{s}_\tau = s_\tau \times (r - l) + l, \qquad z_\tau = \min(1, \max(0, \bar{s}_\tau)).$$

Here $l < 0$ and $r > 1$ are two constants used to stretch $s_\tau$ into the interval $(l, r)^d$ before it is clamped to $[0, 1]^d$ with the $\min(1, \max(0, \cdot))$ operation. In this case we have a differentiable closed-form expression for the expected L0-norm,

$$\mathbb{E}\big[\|\delta_\tau\|_0\big] = \sum_{i=1}^{d} \sigma\Big(\alpha_{\tau,i} - \log\frac{-l}{r}\Big).$$

Thus the final optimization problem is given by,

$$\min_{\alpha_\tau, w_\tau} \; \mathbb{E}_{u \sim U[0,1]} \big[ L(\mathcal{D}_\tau, f_\tau, \theta + z_\tau \odot w_\tau) \big] + \lambda \sum_{i=1}^{d} \sigma\Big(\alpha_{\tau,i} - \log\frac{-l}{r}\Big),$$

and we can now utilize pathwise gradient estimators to optimize the first term with respect to $\alpha_\tau$ since the expectation no longer depends on it.
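The relaxed mask can be illustrated with a short, framework-free sketch. It assumes the standard stretched Hard-Concrete form of Louizos et al. (2018): a logistic sample `s = sigmoid(log u - log(1 - u) + alpha)` is stretched to `(l, r)` and clamped to `[0, 1]`, and the expected number of nonzero entries is `sum(sigmoid(alpha_i - log(-l / r)))`. The helper names and the clipping of `u` away from 0 and 1 are assumptions made here for numerical safety; gradients are not computed.

```python
import math
import random

# Stretch constants with l < 0 and r > 1 (the values reported later in the paper).
L_STRETCH, R_STRETCH = -1.5, 1.5


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def sample_mask(alpha, rng):
    """Draw one relaxed mask z in [0, 1]^d from the stretched Hard-Concrete."""
    z = []
    for a in alpha:
        u = rng.uniform(1e-6, 1.0 - 1e-6)  # keeps log(u) and log(1 - u) finite
        s = sigmoid(math.log(u) - math.log(1.0 - u) + a)
        s_bar = s * (R_STRETCH - L_STRETCH) + L_STRETCH  # stretch to (l, r)
        z.append(min(1.0, max(0.0, s_bar)))              # clamp to [0, 1]
    return z


def expected_l0(alpha):
    """Closed-form expected number of nonzero mask entries."""
    shift = math.log(-L_STRETCH / R_STRETCH)
    return sum(sigmoid(a - shift) for a in alpha)


rng = random.Random(0)
alpha = [5.0, 0.0, -5.0]
z = sample_mask(alpha, rng)
print(z)                   # entries lie in [0, 1]; many are exactly 0 or 1
print(expected_l0(alpha))  # close to 1.5 for this alpha
```

Because the clamp maps every stretched sample below zero to exactly zero, the mask produces true zeros rather than merely small values, which is what makes the resulting diff vector sparse.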
After training we obtain the final diff vector $\delta_\tau$ by sampling $u$ once to obtain $z_\tau$ (which is not necessarily a binary vector but has a significant number of dimensions equal to exactly zero due to the clamping function), then setting $\delta_\tau = z_\tau \odot w_\tau$.

L0-ball projection with magnitude pruning for sparsity control
Differentiable L0 regularization allows us to achieve a high sparsity rate. However, it would be ideal to set an exact sparsity rate, especially considering applications which require parameter budgets. As the regularization coefficient $\lambda$ is a Lagrangian multiplier for the constraint $\mathbb{E}[\|\delta_\tau\|_0] < \eta$ for some $\eta$, this could be achieved in principle by searching over different values of $\lambda$. However, we found it more efficient and empirically effective to achieve an exact sparsity rate by projecting onto a target L0-ball after training. Specifically, we use magnitude pruning on the diff vector $\delta_\tau$ and target a sparsity rate $t\%$ by only keeping the top $t\% \times d$ values in $\delta_\tau$. Note that unlike standard magnitude pruning, this is based on the magnitude of the diff vector values and not the model parameters. We found it important to further finetune $\delta_\tau$ with the nonzero masks fixed to maintain good performance, as is often the case in magnitude pruning (Han et al., 2016). Since this type of parameter-efficiency through projection onto the L0-ball can be applied without adaptive diff pruning, such an approach will serve as one of our baselines in the empirical study.
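The projection step can be sketched as a top-k selection on the absolute values of the diff vector. This is an illustrative sketch on plain Python lists; the function name `project_to_l0_ball`, the rounding of `keep_rate * d` to the nearest integer, and the tie-breaking by index order are assumptions, not details given in the text.

```python
from typing import List, Tuple


def project_to_l0_ball(delta: List[float], keep_rate: float) -> Tuple[List[float], List[bool]]:
    """Keep the k largest-magnitude entries of delta and zero out the rest.

    keep_rate is the fraction of entries allowed to stay nonzero
    (0.005 would correspond to a 0.5% target).
    """
    d = len(delta)
    k = max(0, min(d, int(round(keep_rate * d))))
    # Rank positions by the magnitude of the diff values, not of the model weights.
    ranked = sorted(range(d), key=lambda i: abs(delta[i]), reverse=True)
    keep = set(ranked[:k])
    mask = [i in keep for i in range(d)]
    pruned = [value if kept else 0.0 for value, kept in zip(delta, mask)]
    return pruned, mask


# Toy diff vector for illustration only.
delta = [0.02, -0.9, 0.0, 0.4, -0.05, 0.0, 0.3, -0.01]
pruned, mask = project_to_l0_ball(delta, keep_rate=0.25)
print(pruned)
print(mask)
```

The returned `mask` is the part that would stay fixed during the additional finetuning described above: only positions where it is true would continue to receive updates.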

Structured Diff Pruning
To allow diff pruning to adapt to the model architecture, we consider a structured extension which incorporates dependence between dimensions. We hypothesize that this approach can allow the model to learn to modify parameters in local regions, as opposed to treating each parameter independently.
We modify the regularizer to first partition the parameter indices into $G$ groups $\{g(1), \dots, g(G)\}$, where $g(j)$ is the subset of parameter indices belonging to group $j$. We then introduce a scalar $z_\tau^j$ (with the associated parameter $\alpha_\tau^j$) for each group $g(j)$, and decompose the task-specific parameter for index $i \in g(j)$ as

$$\delta_{\tau,i}^{j} = z_{\tau,i} \cdot z_\tau^{j} \cdot w_{\tau,i}.$$

We can train with gradient-based optimization as before. Parameters in a group are encouraged by the regularizer to be removed jointly.
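A small sketch may help to show how a group-level mask interacts with per-parameter masks. It assumes each diff entry is the product of its own mask, its group's scalar mask, and its dense weight, and that the two masks are independent stretched Hard-Concrete variables, so the probability that an entry is nonzero is the product of the two individual probabilities. The function names and toy values are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def structured_diff(w, z, z_group, groups):
    """delta_i = z_i * z_group[j] * w_i for every index i in groups[j]."""
    delta = [0.0] * len(w)
    for j, indices in enumerate(groups):
        for i in indices:
            delta[i] = z[i] * z_group[j] * w[i]
    return delta


def expected_l0_structured(alpha, alpha_group, groups, l=-1.5, r=1.5):
    """Expected nonzero count, assuming independent per-entry and per-group masks."""
    shift = math.log(-l / r)
    total = 0.0
    for j, indices in enumerate(groups):
        p_group = sigmoid(alpha_group[j] - shift)
        total += sum(p_group * sigmoid(alpha[i] - shift) for i in indices)
    return total


# Two hypothetical groups, e.g. one small bias vector and one small matrix.
groups = [[0, 1], [2, 3, 4]]
w = [0.2, -0.4, 0.1, 0.3, -0.6]
z = [1.0, 0.5, 1.0, 1.0, 0.0]
z_group = [1.0, 0.0]  # the second group is switched off as a whole

print(structured_diff(w, z, z_group, groups))
print(expected_l0_structured([0.0] * 5, [0.0, 0.0], groups))
```

Setting a single group scalar to zero removes every entry of that group at once, which is the mechanism by which the regularizer can encourage parameters in a group to be removed jointly.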

Model and datasets
For evaluation we use the GLUE benchmark (Wang et al., 2019a) as well as the SQuAD extractive question answering dataset (Rajpurkar et al., 2016). Following Adapters (Houlsby et al., 2019), we test our approach on the following subset of the GLUE tasks: Multi-Genre Natural Language Inference (MNLI), where the goal is to predict whether the relationship between two sentences is entailment, contradiction, or neutral (we test on both MNLI$_m$ and MNLI$_{mm}$, which respectively test on matched/mismatched domains); Quora Question Pairs (QQP), a classification task to predict whether two questions are semantically equivalent; Question Natural Language Inference (QNLI), which must predict whether a sentence is a correct answer to the question; Stanford Sentiment Treebank (SST-2), a sentence classification task to predict the sentiment of movie reviews; Corpus of Linguistic Acceptability (CoLA), where the goal is to predict whether a sentence is linguistically acceptable or not; Semantic Textual Similarity Benchmark (STS-B), which must predict a similarity rating between two sentences; Microsoft Research Paraphrase Corpus (MRPC), where the goal is to predict whether two sentences are semantically equivalent; Recognizing Textual Entailment (RTE), which must predict whether a second sentence is entailed by the first. The benchmark uses Matthews correlation for CoLA, Spearman correlation for STS-B, F1 score for MRPC/QQP, and accuracy for MNLI/QNLI/SST-2/RTE. For the main experiments and analysis, we use the BERT$_{\text{LARGE}}$ model from Devlin et al. (2019) to compare against the adapter-based approach of Houlsby et al. (2019). Our implementation is based on the Hugging Face Transformers library.

Baselines
We compare both structured and non-structured variants of diff pruning against the following baselines: Full finetuning, which fully finetunes BERT$_{\text{LARGE}}$ as usual; Last layer finetuning, which only finetunes the penultimate layer (along with the final output layer); Adapters from Houlsby et al. (2019), which train task-specific bottleneck layers between each layer of a pretrained model, where parameter-efficiency can be controlled by varying the size of the bottleneck layers; and Non-adaptive diff pruning, which performs diff pruning just based on magnitude pruning (i.e., we obtain $\theta_\tau$ through usual finetuning, set $\delta_\tau = \theta_\tau - \theta$, and then apply magnitude pruning followed by additional finetuning on $\delta_\tau$). For diff pruning we set our target sparsity rate to 0.5% and investigate the effect of different target sparsity rates in section 6.1.

Implementation details and hyperparameters
Diff pruning introduces additional hyperparameters $l, r$ (for stretching the Hard-Concrete distribution) and $\lambda$ (for weighting the approximate L0-norm penalty). We found $l = -1.5$, $r = 1.5$, $\lambda = 1.25 \times 10^{-7}$ to work well across all tasks. We also initialize the weight vector $w_\tau$ to 0, and $\alpha_\tau$ to a positive vector (we use 5) to encourage $z_\tau$ to be close to 1 at the start of training. 11 While we mainly experiment with BERT models to facilitate comparison against existing work, in preliminary experiments we found these hyperparameters to work for finetuning RoBERTa (Liu et al., 2019c) and XLNet (Yang et al., 2019) models as well.
For all tasks we initially train for 3 epochs and perform a hyperparameter search over batch size $\in \{5, 8, 12, 16\}$ and learning rate $\in \{1 \times 10^{-5}, 2 \times 10^{-5}, 5 \times 10^{-5}\}$. 12 Finetuning with the fixed mask after projecting onto the L0-ball with magnitude pruning is done for 3 epochs with a learning rate of $5 \times 10^{-5}$ for all datasets except for the MRPC/STS-B/RTE/SST-2 datasets, where we finetune for 5 epochs. The exact hyperparameters for each task are given in section A.1 of the appendix. Grouping for the structured version of diff pruning is based on the matrix/bias vectors (i.e. parameters that belong to the same matrix or bias vector are assumed to be in the same group), which results in 393 groups. 13
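As a quick sanity check on the initialization, the reported values $l = -1.5$, $r = 1.5$ and $\alpha = 5$ can be plugged into the stretched Hard-Concrete. The sketch below assumes the usual form, in which a logistic sample `s` is stretched to `(l, r)` and clamped to `[0, 1]`, so the mask is nonzero when `s > -l / (r - l)` and exactly one when `s >= (1 - l) / (r - l)`. The variable names are illustrative.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def logit(p: float) -> float:
    return math.log(p / (1.0 - p))


l, r = -1.5, 1.5   # stretch constants reported in the text
alpha_init = 5.0   # initial value of every entry of alpha

# Thresholds on the unstretched sample s in (0, 1).
zero_threshold = -l / (r - l)        # s above this gives z > 0
one_threshold = (1.0 - l) / (r - l)  # s at or above this gives z == 1

# P(s >= t) = sigmoid(alpha - logit(t)) when s = sigmoid(logit(u) + alpha), u ~ U(0, 1).
p_nonzero = sigmoid(alpha_init - logit(zero_threshold))
p_exactly_one = sigmoid(alpha_init - logit(one_threshold))

print(round(p_nonzero, 4), round(p_exactly_one, 4))
```

Under these assumptions, roughly 99.3% of mask entries start nonzero and about 96.7% start at exactly 1. Since $w_\tau$ is initialized to zero, the diff vector itself is zero at the start, so training begins from the pretrained model with almost all positions free to move.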

Results on GLUE
Our main results on the GLUE benchmark are shown in Table 1. Structured diff pruning can match the performance of a fully finetuned BERT$_{\text{LARGE}}$ model while only requiring 0.5% additional parameters per task. Diff pruning without structured sparsity also performs well, though slightly worse than the structured approach. Non-adaptive diff pruning, which magnitude prunes the diff vector without learning the binary mask $z_\tau$, performs significantly worse, indicating the importance of learning the masking vector. Compared to Adapters, diff pruning obtains similar performance while requiring many fewer parameters per task, making it a potential alternative for parameter-efficient transfer learning. 14
11 These values were found via a light hyperparameter search on the SST-2 validation set.
12 However, we found the default settings used for regular finetuning as suggested in the original BERT paper to work well for most tasks.
13 This definition of groups is implementation-specific since it depends on how one concatenates the input vector before each affine layer. Our grouping is based on Hugging Face's BERT implementation at commit 656e1386a296d696327a9db37de2ccccc79e2cc7. We found this simple definition to work well compared to alternative definitions (e.g. based on individual neurons).

Results on SQuAD
To demonstrate the effectiveness of our approach beyond the GLUE tasks, we additionally experiment on SQuAD (Rajpurkar et al., 2016), an extractive question answering dataset where the model has to select the answer span to a question given a Wikipedia paragraph. To make direct comparisons with Houlsby et al. (2019), we run all experiments on SQuAD v1.1. For diff pruning, we use the same general hyperparameters as our full finetuning baseline (see section A.1). As shown in Figure 1 (right), diff pruning is able to achieve comparable or better performance with only 1.0% additional parameters. Interestingly, diff pruning measurably improves upon the full finetuning baseline while modifying fewer parameters, which indicates that diff pruning can have a useful regularization effect on top of parameter-efficiency.
14 Comparing storage costs is a bit more challenging as it is implementation-specific. Diff pruning incurs additional storage cost due to storing the nonzero positions of the diff vector. See section 6.6 for storage comparison against Adapters assuming float32 for weights and int32 for positions.

Analysis

Varying the target sparsity
In Figure 1 (left), we plot results on the GLUE validation set averaged across all tasks at target sparsity rates of 0.1%, 0.25%, 0.5%, 1.0% for the different baselines. Structured diff pruning consistently outperforms non-structured and non-adaptive variants across different sparsity rates. The advantage of adaptive methods becomes more pronounced at extreme sparsity rates. In Table 2, we report the breakdown of accuracy of structured diff pruning across different tasks and sparsity rates, where we observe that different tasks have different sensitivity to target sparsity rates. This suggests that we can obtain even greater parameter-efficiency through targeting task-specific sparsity rates in the diff vector.

Structured vs. Non-structured Diff Pruning
Structured diff pruning introduces an additional mask per group, which encourages pruning of entire groups. This is less restrictive than traditional group sparsity techniques that have been used with L0-norm relaxations, which force all parameters in a group to share the same mask (Louizos et al., 2018; Wang et al., 2019b). However, we still expect entire groups to be pruned out more often, which might bias the learning process towards either eliminating nonzero diffs completely or clustering them together. In Table 3, we indeed find that structured diff pruning leads to finetuned models that are much more likely to leave entire groups unchanged from their pretrained values (zero diffs).
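The statistic discussed here, the share of groups whose diff is entirely zero, is straightforward to compute once a diff vector and a grouping are available. The sketch below uses a toy vector and a hypothetical four-group partition; the function name is an assumption made for illustration.

```python
from typing import List, Sequence


def fraction_of_zero_groups(delta: Sequence[float], groups: List[List[int]]) -> float:
    """Share of groups in which every diff entry is exactly zero."""
    zero_groups = sum(
        1 for indices in groups if all(delta[i] == 0.0 for i in indices)
    )
    return zero_groups / len(groups)


# Toy diff vector split into four hypothetical groups of two entries each.
delta = [0.0, 0.0, 0.3, 0.0, 0.0, 0.0, -0.1, 0.0]
groups = [[0, 1], [2, 3], [4, 5], [6, 7]]

print(fraction_of_zero_groups(delta, groups))  # two of the four groups are untouched
```

With the paper's grouping, `groups` would hold one index list per matrix or bias vector, and a higher value would indicate that more of the pretrained model is left exactly unchanged.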

Task-specific Sparsity
Different layers of pretrained models have been argued to encode different information (Liu et al., 2019a; Tenney et al., 2019). Given that each task will likely recruit different kinds of language phenomena embedded in the hidden layers, we hypothesize that diff pruning will modify different parts of the pretrained model through task-specific finetuning. Figure 2 shows the percentage of nonzero diff parameters attributable to the different layers for each task. We find that different tasks indeed modify different parts of the network, although there are some qualitative similarities between some tasks, for example between QNLI & QQP (both must encode questions), and MRPC & STS-B (both must predict similarity between sentences). The embedding layer is very sparsely modified for all tasks. While some of the variation in the sparsity distributions is due to simple randomness, we do observe some level of consistency over multiple runs of the same task, as shown in section A.2 of the appendix. The ability to modify different parts of the pretrained model for each task could explain the improved parameter-efficiency of our approach compared to the Adapters of Houlsby et al. (2019). This potentially suggests that Adapters with more fine-grained access into model internals (e.g. Adapters for key/value/query transformations) might result in even greater parameter-efficiency. While left as future work, we also note that diff pruning can be applied in conjunction with Adapters, which might further improve results.

Non-structured: 6.2% 6.1% 6.0% 6.4% 6.1% 6.4% 7.1% 6.1% 6.3%
Structured: 37.7% 64.6% 28.8% 20.8% 13.2% 12.2% 12.7% 34.9% 28.1%
Table 3: Percentage of groups where all of the parameters in the group are fully zero for structured vs. non-structured diff pruning at 0.5% target sparsity. We group based on each matrix/bias vector, resulting in 393 groups in total.

Figure 2: Percentage of modified parameters attributable to each layer for different tasks at 0.5% target sparsity. The layers are ordered from earlier to later (i.e. the embedding layer is shown at the top). The x-axis for each plot goes from 0% to 20%.

Effect of L0-ball projection
Applying magnitude pruning to project onto the L0-ball was crucial in achieving exact sparsity targets. As shown in Table 4, we observed little loss in performance through this approach. We reiterate that it was crucial to finetune with a fixed mask, even for the approach which does not apply magnitude pruning. 16

Comparison against BERT compression
Direct BERT compression methods also provide a straightforward approach to parameter-efficient transfer learning. Here we compare diff pruning against existing BERT compression methods, in particular DistilBERT, MobileBERT (Sun et al., 2020b) and TinyBERT (Jiao et al., 2020). In these experiments we apply diff pruning on the smaller BERT$_{\text{BASE}}$ model as these works typically utilize BERT$_{\text{BASE}}$ as the baseline. As shown in Table 5, we observe that diff pruning is more parameter-efficient when considering all GLUE tasks while maintaining better performance. Of course, BERT compression methods typically have faster inference time (e.g. TinyBERT$_4$ is 9.4× faster than BERT$_{\text{BASE}}$). However, we note that diff pruning can be applied on these methods, which may further improve parameter-efficiency while maintaining fast inference.
16 Without fixed-mask finetuning, GLUE performance decreases from 84.9 to 81.4.

Storage cost
Finally, Table 6 shows the actual memory requirements for diff pruning compared to Adapters for a Python implementation. While diff pruning requires storing positions in addition to the weights (unlike Adapters which can just store the weights), diff pruning is still more storage-efficient due to the greater parameter-efficiency.
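A back-of-the-envelope calculation shows why storing positions does not erase the advantage. The sketch below assumes float32 weights and int32 positions (4 bytes each), the 340M parameter count quoted earlier for BERT$_{\text{LARGE}}$, a 0.5% diff, and the 3.6% average adapter size quoted in the introduction. These are illustrative assumptions and not a reproduction of Table 6; the function names are hypothetical.

```python
BYTES_FLOAT32 = 4
BYTES_INT32 = 4


def diff_bytes_per_task(num_params: int, nonzero_rate: float) -> int:
    """Sparse diff: one float32 value plus one int32 position per nonzero entry."""
    nonzero = round(num_params * nonzero_rate)
    return nonzero * (BYTES_FLOAT32 + BYTES_INT32)


def dense_bytes(num_params: int, fraction: float = 1.0) -> int:
    """Dense float32 weights only (full model, or an add-on module of a given relative size)."""
    return round(num_params * fraction) * BYTES_FLOAT32


bert_large = 340_000_000

full_model = dense_bytes(bert_large)               # one full copy per task
adapter_task = dense_bytes(bert_large, 0.036)      # assumed 3.6% extra weights
diff_task = diff_bytes_per_task(bert_large, 0.005)  # 0.5% values plus positions

for name, size in [("full", full_model), ("adapter", adapter_task), ("diff", diff_task)]:
    print(f"{name:8s} {size / 1e6:10.2f} MB per task")
```

Under these assumptions the position overhead doubles the per-entry cost of the diff, yet the per-task total remains several times smaller than the adapter estimate because far fewer entries are stored. An int32 index is sufficient here since 340M is below $2^{31}$.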

Discussion and caveats
For training, our approach requires more memory than usual finetuning due to additionally optimizing α τ and w τ . Since the majority of GPU memory is typically utilized by a minibatch's intermediate layers, this did not present a significant challenge for pretrained models that we experimented with in this study. However, this could present an issue as model sizes get larger and larger. After training, storing the task-specific diff vector requires storing a compressed version with both the nonzero positions and weights, which incurs additional storage requirements. Finally, while training efficiency was not a primary concern of this work, diff pruning was also approximately 1.5× to 2× slower to train per minibatch than regular finetuning.

Related Work
Multi-task learning. Multi-task learning (Caruana, 1997), broadly construed, aims to learn models and representations that can be utilized across a diverse range of tasks, and offers a natural approach to training parameter-efficient deep models. Several works have shown that a single BERT model can obtain good performance across multiple tasks when jointly trained (Liu et al., 2019b; Clark et al., 2019; Stickland & Murray, 2019). An alternative approach to multi-task learning that does not require access to all tasks during training involves training smaller task-specific layers that interact with a fixed pretrained model (Rebuffi et al., 2018; Zhang et al., 2020a). In particular, Adapters (Rebuffi et al., 2018), which learn to read and write to layers of a shared model, have been applied to obtain parameter-efficient BERT models (Houlsby et al., 2019; Pfeiffer et al., 2020a,b,c). In recent work, Li & Liang (2021) and Qin & Eisner (2021) explore the use of learned prompts on top of pretrained models to obtain task-specific models. Yet another line of work targets extreme parameter-efficiency through task-agnostic sentence representations that can be used without finetuning for downstream tasks (Le & Mikolov, 2014; Kiros et al., 2015; Wieting et al., 2016; Hill et al., 2016; Arora et al., 2017; Conneau et al., 2017; Cer et al., 2018; Zhang et al., 2018; Subramanian et al., 2018; Reimers & Gurevych, 2019; Zhang et al., 2020b). These feature-based transfer learning methods are however generally outperformed by fully finetuned models (Howard & Ruder, 2018).
Model compression. There has been much recent work on compressing pretrained models trained with self-supervision (see Ganesh et al. (2020) for a recent survey). A particularly promising line of work focuses on obtaining smaller pretrained models (for subsequent finetuning) through weight pruning (Gordon et al., 2020) and/or knowledge distillation (Sun et al., 2019; Turc et al., 2019; Jiao et al., 2020; Sun et al., 2020b). It would be interesting to see whether our approach can be applied on top of these smaller pretrained models for even greater parameter-efficiency.
Learning to mask. Our work is closely related to the line of work on learning to mask parts of deep networks with differentiable relaxations of binary masks for model pruning and parameter sharing (Wang et al., 2019b; Zhao et al., 2020; Sanh et al., 2020; Radiya-Dixit & Wang, 2020; Mallya et al., 2018; Guo et al., 2019; Sun et al., 2020a; Cao et al., 2021). While these works also enable parameter-efficient transfer learning, they generally apply the masks directly on the pretrained parameters instead of on the difference vector as in the present work.
Regularization towards pretrained models. Finally, diff pruning is also related to works which regularize the learning process towards pretrained/shared models for continual learning (Rusu et al., 2016; Kirkpatrick et al., 2017; Schwarz et al., 2018), domain adaptation (Wiese et al., 2017; Miceli Barone et al., 2017), and stable finetuning. These works typically do not utilize sparse regularizers and target a different goal than parameter-efficiency.

Conclusion
We propose diff pruning as a simple approach for parameter-efficient transfer learning with pretrained models. Experiments on standard NLP benchmarks and models show that diff pruning can match the performance of fully finetuned baselines while requiring only a few additional parameters per task, and can sometimes have a regularization effect and improve upon regular finetuning. We also propose a structured variant of diff pruning which provides further improvements. Avenues for future work include (i) injecting parameter-efficiency objectives directly into the pretraining process (to pretrain models that are better suited towards sparse transfer learning), and (ii) combining diff pruning with other techniques (e.g. adapters, model compression) to achieve even greater parameter-efficiency.

Table 7 shows the hyperparameters we used for training on the GLUE tasks. For the SQuAD v1.1 experiments, we ran distributed training across 8 GPUs, and used a per-GPU batch size of 3, maximum sequence length 384, document stride 128, learning rate $3 \times 10^{-5}$, 2 initial training epochs and 2 finetuning epochs.

A.2 Consistency of Nonzero Parameters
Figure 3 shows the percentage of modified parameters attributable to each layer across 5 runs of SST-2. We find that there is nontrivial variation in sparsity across runs, but also a degree of consistency. For example, the first layer is modified considerably more than other layers across all runs.