Pruning Pre-trained Language Models Without Fine-Tuning

To overcome the overparameterization problem in Pre-trained Language Models (PLMs), pruning is widely used as a simple and straightforward compression method that directly removes unimportant weights. Previous first-order methods successfully compress PLMs to extremely high sparsity with little performance drop. These methods, such as movement pruning, use first-order information to prune PLMs while fine-tuning the remaining weights. In this work, we argue that fine-tuning is redundant for first-order pruning, since first-order pruning alone is sufficient to converge PLMs to downstream tasks. Under this motivation, we propose Static Model Pruning (SMP), which only uses first-order pruning to adapt PLMs to downstream tasks while achieving the target sparsity level. In addition, we design a new masking function and training objective to further improve SMP. Extensive experiments at various sparsity levels show that SMP achieves significant improvements over first-order and zero-order methods. Unlike previous first-order methods, SMP is also applicable at low sparsity, where it outperforms zero-order methods. Meanwhile, SMP is more parameter efficient than other methods since it does not require fine-tuning.


Introduction
Pre-trained Language Models (PLMs) like BERT (Devlin et al., 2019) have shown powerful performance in natural language processing by transferring knowledge from large-scale corpora to downstream tasks. These models also require large-scale parameters to cope with the large-scale corpora used in pre-training. However, such large-scale parameters are excessive for most downstream tasks, which results in significant overhead for transferring and storing them.
To compress PLMs, pruning is widely used: unimportant weights are removed and set to zero. By using sparse subnetworks instead of the original complete network, existing pruning methods can maintain the original accuracy even after removing most weights. Magnitude pruning (Han et al., 2015a), a common method, uses zeroth-order information to make pruning decisions based on the absolute value of weights. However, in the process of adapting to downstream tasks, the weight values in PLMs are largely predetermined by their pre-trained values. To overcome this shortcoming, movement pruning (Sanh et al., 2020) uses first-order information to select weights based on how they change during training rather than their absolute values. To adapt PLMs to downstream tasks, most methods like movement pruning perform pruning and fine-tuning together by gradually increasing the sparsity during training. With the development of the Lottery Ticket Hypothesis (LTH) (Frankle and Carbin, 2018) in PLMs, some methods (Chen et al., 2020; Liang et al., 2021) find certain subnetworks in the PLM by pruning, and then fine-tune these subnetworks from the pre-trained weights. Moreover, if the fine-tuned subnetwork can match the performance of the full PLM, this subnetwork is called a winning ticket.
In this work, we propose a simple but efficient first-order method. Contrary to previous pruning methods, ours adapts PLMs by pruning alone, without fine-tuning. It makes pruning decisions based on the movement trend of weights, rather than the actual movement used in movement pruning. To improve the performance of our method, we propose a new masking function that better allocates the remaining weights according to the architecture of PLMs. We also avoid fine-tuning the weights in the task-specific head by using our head initialization method. By keeping the PLM frozen, we save half of the trainable parameters compared to other first-order methods, and introduce only a binary mask as the new parameter for each downstream task at various sparsity levels. Extensive experiments across a wide range of sparsity levels demonstrate that our method strongly outperforms state-of-the-art pruning methods. Contrary to previous first-order methods (Sanh et al., 2020), which show poor performance at low sparsity, our method is also applicable at low sparsity and achieves better performance than zero-order methods.

Related Work
Compressing PLMs for transfer learning is a popular area of research. Many compression methods have been proposed to address the overparameterization problem in PLMs, such as model pruning (Han et al., 2015b; Molchanov et al., 2017; Xia et al., 2022), knowledge distillation (Jiao et al., 2020), quantization (Shen et al., 2020; Qin et al., 2022), and matrix decomposition (Lan et al., 2020). Among them, pruning methods have been widely studied as the most intuitive approach.
Pruning methods focus on identifying and removing unimportant weights from the model. Zero-order and first-order methods are widely used to prune PLMs. Among zero-order methods, magnitude pruning (Han et al., 2015a) simply prunes weights based on their absolute values. Among first-order methods, which rely on a first-order Taylor expansion to make pruning decisions, L0 regularization (Louizos et al., 2017) adds an L0-norm penalty that reduces the number of remaining weights by sampling masks from a hard-concrete distribution. Movement pruning (Sanh et al., 2020) uses the straight-through estimator (Bengio et al., 2013) to compute first-order information.
Building on pruning methods, Frankle and Carbin (2018) propose the Lottery Ticket Hypothesis (LTH). LTH states that there exist sparse subnetworks (i.e., winning tickets) that can achieve almost the same performance as the full model when trained in isolation. With the development of LTH, many works focusing on PLMs have emerged. Chen et al. (2020) find that BERT contains winning tickets at sparsity levels of 40% to 90%, and that the winning ticket found on the masked language modeling task transfers to other downstream tasks. Recent works also try to leverage LTH to improve the performance and efficiency of PLMs. Liang et al. (2021) find that the generalization performance of winning tickets first improves and then deteriorates beyond a certain sparsity threshold. Leveraging this phenomenon, they show that LTH can successfully improve performance on downstream tasks.

Background
Let a = Wx refer to a fully-connected layer in PLMs, where W ∈ R^{n×n} is the weight matrix, and x ∈ R^n and a ∈ R^n are the input and output respectively. Pruning can be represented as a = (W ⊙ M)x, where M ∈ {0, 1}^{n×n} is the binary mask and ⊙ is the element-wise product.
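For illustration, the masked forward pass can be sketched in plain Python (a dense matrix-vector product with an element-wise mask; a real implementation would use a tensor library, and the function name is ours):

```python
def masked_linear(W, M, x):
    """Compute a = (W ⊙ M) x for a single fully-connected layer.

    W: n x n weight matrix (list of rows), M: binary mask of the same
    shape, x: input vector of length n.  Masked-out weights (M=0)
    contribute nothing to the output.
    """
    n = len(W)
    return [sum(W[i][j] * M[i][j] * x[j] for j in range(n)) for i in range(n)]
```

In practice the mask is applied once per forward pass, so pruned weights are simply skipped rather than physically removed.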
We first review two common pruning methods for PLMs: magnitude pruning (Han et al., 2015b) and movement pruning (Sanh et al., 2020). Magnitude pruning relies on zeroth-order information to decide M by keeping the top v percent of weights according to their absolute values: M = Top_v(S). The importance scores S ∈ R^{n×n} are

S^{(T)}_{i,j} = |W^{(T)}_{i,j}| = |W^{(0)}_{i,j} - α_W Σ_{t<T} (∂L/∂W_{i,j})^{(t)}|

after T update steps, where L and α_W are the learning objective and the learning rate of W_{i,j}. Magnitude pruning thus selects weights with high absolute values during fine-tuning.
Movement pruning relies on first-order information by learning the importance scores S with gradients. The gradient of S is approximated with the straight-through estimator (Bengio et al., 2013), which directly reuses the gradient from M. Following Sanh et al. (2020), the importance scores S are

S^{(T)}_{i,j} = -α_S Σ_{t<T} (∂L/∂W_{i,j})^{(t)} W^{(t)}_{i,j},

where α_S is the learning rate of S. Compared to magnitude pruning, movement pruning selects the weights that are increasing their absolute values.
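The score update implied by the straight-through estimator can be sketched as a plain SGD step (names ours; a real implementation would use autograd):

```python
def update_movement_scores(S, W, grad_W, alpha_s):
    """One SGD step on movement-pruning importance scores.

    The straight-through estimator approximates the gradient w.r.t.
    S[i][j] by grad_W[i][j] * W[i][j], where grad_W is dL/d(W ⊙ M).
    Accumulated over T steps this yields
    S = -alpha_s * sum_t (dL/dW)^(t) * W^(t):
    a score grows when the gradient pushes the weight away from zero.
    """
    n = len(S)
    for i in range(n):
        for j in range(n):
            S[i][j] -= alpha_s * grad_W[i][j] * W[i][j]
    return S
```

A weight moving toward larger magnitude (gradient opposite in sign to the weight) accumulates a positive score and survives pruning.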
To achieve the target sparsity, a common method is automated gradual pruning (Zhu and Gupta, 2018). The sparsity level v is gradually increased with a cubic sparsity scheduler starting from training step t_0:

v^{(t)} = v_f + (v_0 - v_f)(1 - (t - t_0)/(N Δt))^3,

where v_0 and v_f are the initial and target sparsity, N is the overall number of pruning steps, and Δt is the pruning frequency.
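The schedule can be sketched directly from the formula (function name ours; v is treated as the sparsity level, clamped to v_f once the N·Δt pruning steps have elapsed):

```python
def cubic_sparsity(t, t0, v0, vf, N, dt):
    """Cubic sparsity schedule (Zhu and Gupta, 2018):
    v_t = v_f + (v_0 - v_f) * (1 - (t - t0) / (N * dt))**3
    for t0 <= t < t0 + N*dt, held at v_f afterwards."""
    if t < t0:
        return v0
    if t >= t0 + N * dt:
        return vf
    return vf + (v0 - vf) * (1 - (t - t0) / (N * dt)) ** 3
```

The cubic shape removes weights quickly early on, then slows down as the target is approached, giving the surviving weights time to adapt.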
During training, these methods update both W and S to perform pruning and fine-tuning simultaneously. Since fine-tuned weights stay close to their pre-trained values (Sanh et al., 2020), the importance scores of magnitude pruning are influenced by the pre-trained values, which limits its performance at high sparsity. However, magnitude pruning still outperforms movement pruning at low sparsity.

Static Model Pruning
In this work, we propose a simple first-order pruning method called Static Model Pruning (SMP). It freezes W to make pruning PLMs more efficient and transferable. Based on movement pruning (Sanh et al., 2020), our importance scores S are

S^{(T)}_{i,j} = -α_S Σ_{t<T} (∂L/∂W_{i,j})^{(t)} W^{(0)}_{i,j},

where W^{(0)}_{i,j} is the frozen pre-trained weight. Since W is never updated, S depends only on the accumulated gradient direction, i.e., the movement trend of each weight rather than its actual movement.
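A minimal sketch of this frozen-weight score accumulation (plain Python, names ours; grads stands for the per-step gradients dL/d(W ⊙ M) from backpropagation):

```python
def smp_scores(W0, grads, alpha_s):
    """Accumulate SMP importance scores over T steps with W frozen:
    S^(T)[i][j] = -alpha_s * sum_t grads[t][i][j] * W0[i][j].

    Because W0 never changes, a weight scores highly exactly when its
    gradients consistently point away from zero relative to its
    pre-trained value, i.e. when it *would* grow in magnitude if it
    were fine-tuned (the movement trend)."""
    n = len(W0)
    S = [[0.0] * n for _ in range(n)]
    for g in grads:                       # one gradient matrix per step
        for i in range(n):
            for j in range(n):
                S[i][j] -= alpha_s * g[i][j] * W0[i][j]
    return S
```

Only S is trainable here; the model weights W0 are read, never written, which is what makes the method static.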

Masking Function
To get masks M based on S, we consider two masking functions according to the pruning structure: local and global.
For the local masking function, we simply apply the Top_v function to each matrix: M = Top_v(S), which selects the v% most important weights according to S, matrix by matrix.
For the global masking function, ranking all importance scores together (around 85M in BERT-base) is computationally inefficient and even harms the final performance, as shown in Section 6.1. To this end, we propose a new global masking function that assigns a sparsity level to each weight matrix based on its overall importance score. Consider the architecture of BERT, which has L transformer layers, each containing a self-attention layer and a feed-forward layer. In the l-th self-attention layer, W^l_Q, W^l_K, W^l_V, and W^l_O are the weight matrices we need to prune; likewise, W^l_U and W^l_D are the matrices to be pruned in the l-th feed-forward layer. Instead of ranking all parameters of the network, we first calculate the sparsity level of each weight matrix. The sparsity level v^l_{(·)} of each weight matrix is computed as

v^l_{(·)} = ( Σ_{i,j} σ(S^l_{(·),i,j}) / Σ_{l'=1}^{L} Σ_{i,j} σ(S^{l'}_{(·),i,j}) ) · L · v,

where S^l_{(·)} is the importance scores of weight matrix W^l_{(·)}, σ is the sigmoid function, and (·) is one of {Q, K, V, O, U, D}. The sparsity level of a matrix is thus determined by the proportion of its importance scores relative to matrices of the same type in other layers.
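The score-proportional allocation can be sketched as follows (a sketch under stated assumptions: the sigmoid squashing and exact normalization mirror the reconstruction above, and the function names are ours). Each matrix of one type, e.g. all W_Q across the L layers, receives a keep ratio proportional to its total squashed score, normalized so the average over layers equals v:

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def assign_matrix_sparsity(scores_per_layer, v):
    """Assign a per-layer keep ratio for one matrix type (Q, K, V, O,
    U, or D).  scores_per_layer[l] is the score matrix S^l; the ratio
    for layer l is v * L * total_l / sum_l' total_l', so the mean keep
    ratio over all L layers is exactly v."""
    totals = [sum(sigmoid(s) for row in S for s in row) for S in scores_per_layer]
    grand = sum(totals)
    L = len(scores_per_layer)
    return [v * L * t / grand for t in totals]
```

A real implementation would additionally clip each ratio into [0, 1], since a strongly skewed score distribution could push one layer's allocation above 100%.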

Task-Specific Head
Instead of training the task-specific head from scratch, we initialize it from BERT token embeddings and keep it frozen during training. Inspired by current prompt tuning methods, we initialize the task-specific head with the BERT token embeddings of the corresponding label words, following Gao et al. (2021). For example, we use the token embeddings of "great" and "terrible" to initialize the classification head for SST-2, and the predicted score of the positive label is e_great^T h, where h is the final hidden state of the special token [CLS] and e_great is the token embedding of "great".
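A small sketch of this label-word initialization and scoring (names and the toy embedding table are ours; a real head would slice rows out of BERT's embedding matrix):

```python
def init_classification_head(token_embeddings, label_words):
    """Build a frozen classification head: one row per class, taken
    directly from the token embedding of that class's label word
    (e.g. "great" / "terrible" for SST-2)."""
    return [token_embeddings[w] for w in label_words]

def head_scores(head, h):
    """Class scores are dot products e_label . h, where h is the final
    hidden state of the [CLS] token."""
    return [sum(e_i * h_i for e_i, h_i in zip(e, h)) for e in head]
```

Because the head is copied from pre-trained embeddings and then frozen, it adds no trainable parameters for the downstream task.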

Training Objective
To prune the model, we use cubic sparsity scheduling (Zhu and Gupta, 2018) without warmup steps. The sparsity v^{(t)} at step t is

v^{(t)} = v_f - v_f (1 - t/N)^3 for t ≤ N, and v^{(t)} = v_f afterwards;

that is, we gradually increase sparsity from 0 to the target sparsity v_f over the first N steps. After N steps, we keep the sparsity at v_t = v_f. During this stage, the number of remaining weights stays constant, but remaining weights can still be swapped with removed weights according to the importance scores. We evaluate our method with and without knowledge distillation. For the settings without knowledge distillation, we optimize the following loss:

L = L_CE + λ_R R(S), with R(S) = Σ_{i,j} σ(S_{i,j}),

where L_CE is the classification loss for the task and R(S) is the regularization term with hyperparameter λ_R. Soft-movement pruning (Sanh et al., 2020) uses such a regularization term to decrease S and thereby increase sparsity under the thresholding masking function; we find the regularization term is also important in our method. Since λ_R is large enough in our method, most of the importance scores in S are below zero when the current sparsity level v^{(t)} is close to v_f. Because ∂R(S)/∂S_{i,j} increases with S_{i,j} when S_{i,j} < 0, the scores of the remaining weights receive a larger penalty than those of removed weights. This encourages M to keep changing when v^{(t)} is close to or has reached v_f.
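The penalty and the gradient property used in the argument above can be checked directly (a sketch assuming the sigmoid form of R(S) reconstructed above; names ours):

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def regularizer(S):
    """R(S) = sum_{i,j} sigmoid(S[i][j]): pushing scores down lowers
    the penalty, which drives sparsity up."""
    return sum(sigmoid(s) for row in S for s in row)

def regularizer_grad(s):
    """dR/ds = sigmoid(s) * (1 - sigmoid(s)), which is increasing for
    s < 0.  Hence remaining weights (larger, less-negative scores) get
    a larger penalty gradient than removed weights (very negative
    scores), encouraging the mask to keep changing near the target
    sparsity."""
    return sigmoid(s) * (1.0 - sigmoid(s))
```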
For the settings with knowledge distillation, we simply add a distillation loss L_KD to L, following Sanh et al. (2020) and Xu et al. (2022):

L_KD = D_KL(p_s || p_t),

where D_KL is the KL-divergence, and p_s and p_t are the output distributions of the student model and the teacher model.
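The divergence itself is standard; a minimal sketch (function name ours, and the argument order between student and teacher distributions follows whichever distillation convention is adopted):

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) = sum_i p_i * log(p_i / q_i) for two discrete
    output distributions.  Terms with p_i = 0 contribute zero by the
    usual convention."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
```

In practice the distributions are softmax outputs of the student and a fine-tuned teacher, often softened with a temperature.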

Datasets
To show the effectiveness of our method, we use three common benchmarks: natural language inference (MNLI) (Williams et al., 2018), question similarity (QQP) (Aghaebrahimian, 2017), and question answering (SQuAD) (Rajpurkar et al., 2016), following Sanh et al. (2020). Moreover, we also use the GLUE benchmark (Wang et al., 2019) to validate the performance of our method at low sparsity.

Experiment Setups
Following previous pruning methods, we use bert-base-uncased to perform task-specific pruning and report the ratio of remaining weights in the encoder. For the task-specific head, we initialize it according to the label words of each task, following Gao et al. (2021). For SQuAD, we use the "yes" and "no" token embeddings as the weights for classifying the start and end of answers. We freeze all weights of BERT, including the task-specific head, and fine-tune only the mask. The optimizer is Adam with a learning rate of 2e-2. The hyperparameter λ_R of the regularization term is 400. We train for 12 epochs on MNLI and QQP, and 10 epochs on SQuAD, with batch size 64. For tasks at low sparsity (more than 70% remaining weights), we set N in the cubic sparsity scheduling to 7 epochs. For tasks at high sparsity, we set N to 3500 steps. We also report the performance of bert-base-uncased and roberta-base with 80% remaining weights on all GLUE tasks, using the same batch size and learning rate as above. For sparsity scheduling, we use the same scheduling for bert-base-uncased and a linear scheduling for roberta-base; N in the sparsity scheduling is 3500. For the large tasks (MNLI, QQP, SST-2, and QNLI), we use 12 epochs. For the small tasks (MRPC, RTE, STS-B, and CoLA), we use 60 epochs. Note that these epoch counts include the pruning steps; for example, around 43 epochs are used to reach the target sparsity on MRPC. We search the pruning structure over the local and global masking functions.

Baseline
We compare our method with magnitude pruning (Han et al., 2015b), L0 regularization (Louizos et al., 2018), movement pruning (Sanh et al., 2020), and CAP (Xu et al., 2022). We also compare our method with direct fine-tuning and super tickets (Liang et al., 2021) on GLUE. The super tickets work finds that PLMs contain subnetworks which can outperform the full model when fine-tuned. Table 1 shows the results of SMP and other pruning methods at high sparsity. We implement SMP with the local masking function (SMP-L) and with our proposed masking function (SMP-S).

Experimental Results
SMP-S and SMP-L consistently achieve better performance than other pruning methods without knowledge distillation. Although movement pruning and SMP-L use the same local masking function, SMP-L achieves improvements of more than 2.0 points on all tasks and sparsity levels in Table 1. Moreover, the gains are more significant at 3% remaining weights. Soft-movement pruning, which assigns the remaining weights across matrices non-uniformly like SMP-S, even underperforms SMP-L.
Following previous works, we also report the results with knowledge distillation in Table 1. The improvement brought by knowledge distillation is also evident for SMP-L and SMP-S; for example, it improves the F1 on SQuAD by 3.3 and 4.1 for SMP-L and SMP-S respectively. With only 3% remaining weights, SMP-S even outperforms soft-movement pruning at 10% on MNLI and QQP. Compared with CAP, which adds contrastive learning objectives from teacher models, our method consistently yields significant improvements without auxiliary learning objectives. At 50% remaining weights, SMP-S achieves 85.7 accuracy on MNLI compared to 84.5 for full-model fine-tuning, while keeping all weights of BERT unchanged.
Our method is also parameter efficient. Compared with other first-order methods, we can save half of the trainable parameters by keeping the whole BERT and the task-specific head frozen. The number of new parameters per task is also an important factor in the cost of transferring and storing subnetworks. Our method introduces only a binary mask θ_M as new parameters for each task at different sparsity levels, while other methods need to save both θ_M and the subnetwork. With remaining weights of 50%, 10%, and 3%, we save 42.5M, 8.5M, and 2.6M parameters respectively compared with other pruning methods.

Table 1: Performance at high sparsity. SMP-L and SMP-S refer to our method with the local masking function and with our proposed masking function, respectively. θ_M is the size of the binary mask M, which is around 2.7M parameters and can be further compressed. Since other pruning methods freeze the embedding modules of BERT (Sanh et al., 2020), the trainable parameters of first-order methods are the sum of the BERT encoder (85M), the importance scores S (85M), and the task-specific head (less than 0.01M). For zero-order pruning methods like magnitude pruning, the trainable parameters are 85M, excluding S. Our results are averaged over five random seeds.

Figure 1 shows more results from 3% to 80% remaining weights, comparing our method with the first-order methods movement pruning and soft-movement pruning, and with the zero-order method magnitude pruning. We report the results of our method at 3%, 10%, 30%, 50%, and 80% remaining weights. Previous first-order methods such as movement pruning underperform magnitude pruning above 25% remaining weights on MNLI and SQuAD. Even at a high sparsity level such as 20% remaining weights, magnitude pruning still strongly outperforms both movement pruning and soft-movement pruning in Figure 1 (c). This shows the limitation of current first-order methods: they perform well only at very high sparsity compared to zero-order methods. In contrast, SMP-L and SMP-S, as first-order methods, consistently outperform magnitude pruning at low sparsity. For the results without knowledge distillation, SMP-S and SMP-L achieve performance similar to soft-movement pruning with much fewer remaining weights. Considering previous LTH results in BERT, we find that SMP-S can outperform full-model fine-tuning at certain ratios of remaining weights in Figure 1 (a), (b), and (c), indicating that BERT contains subnetworks that outperform the original performance without fine-tuning.

Figure 1: Results from 3% to 80% remaining weights. MaP, MvP, and SMvP refer to magnitude pruning, movement pruning, and soft-movement pruning, respectively. We report the results of our method at 3%, 10%, 30%, 50%, 70%, and 80% remaining weights. Our method consistently outperforms other methods from low sparsity to high.

Table 2: Performance on the GLUE development set. Our results are averaged over five random seeds. The results of SuperT are from Liang et al. (2021); the remaining weights and new parameters per task for SuperT are averaged over all tasks. Note that all results use the setting without knowledge distillation for a fair comparison.

For the results with knowledge distillation, SMP-S and SMP-L benefit from knowledge distillation at all sparsity levels. Even after removing 70% of the weights from the encoder, our method still strongly outperforms full-model fine-tuning.
We also validate our method on GLUE and report the results at 80% remaining weights in Table 2. Compared to full-model fine-tuning, our method achieves better performance on two PLMs by removing only 20% of the encoder parameters while keeping the remaining parameters unchanged. Compared to SuperT (Liang et al., 2021), which searches 8 different sparsity levels for each task, our method achieves better performance while using the same sparsity level across tasks. In addition, our method also saves more than 98M new parameters per task compared to SuperT.

Masking Function
In this section, we discuss the influence of different masking functions. Table 3 shows the results of different masking functions in our method without knowledge distillation. Contrary to previous pruning methods, the thresholding masking function T fails to converge in our method due to the difficulty of controlling sparsity during training. The global masking function G sorts all 85M BERT encoder weights and keeps the top v% in each training step.

Table 3: Influence of different masking functions. We report results on MNLI and SQuAD with 80%, 10%, and 3% remaining weights. N/A indicates failure to converge in our setting. A masking function transforms S^l_{(·)} into the binary mask M^l_{(·)} of W^l_{(·)}. T refers to the thresholding masking function following Sanh et al. (2020), with threshold τ. G and L are the global and local masking functions, where S_v is the smallest value in the top v% after sorting all of S together. S refers to our proposed masking function, with v^l_{(·)} from Eq. 4.

Compared to the local masking function L, G takes more than twice the training time due to the computational cost of sorting 85M weights. Although it takes the longest to train, it still underperforms L at 10% and 3% remaining weights. In contrast to G, our proposed masking function S outperforms L without additional training time, since S directly assigns the remaining weights of each matrix. More results for masking functions S and L are available in Table 1 and Figure 1. Figure 2 displays the distribution of remaining weights across layers on MNLI with 10% remaining weights. We find that G assigns too many remaining weights to W_U and W_D, which are four times larger than the other matrices. This causes the other weight matrices, such as W_Q, to be more sparse than under S and L. Following previous studies (Sanh et al., 2020; Mallya and Lazebnik, 2018), we also find that overall sparsity tends to increase with the depth of the layer.
However, only W_U and W_D follow this pattern under all three masking functions. Since W_U and W_D occupy more than 60% of the weights in each layer, the overall distribution of each layer follows their trend as well.
Figure 2: Distribution of remaining weights in each layer. Overall refers to the overall remaining weights of each layer; W_(·) is the remaining weights of each weight matrix in the BERT encoder. L, G, and S in the figures refer to the masking functions in Table 3.

To understand the behavior of attention heads, we also display the remaining weight ratio of each head in Figure 3. Each row represents a matrix containing 12 heads. Due to space limitations and the similar distributions of W_Q and W_K, we only show W_Q and W_V. Rather than sparsity being assigned uniformly to each head, the per-head sparsity is non-uniform under all three masking functions, with most heads retaining below 1% of their weights. Furthermore, the three masking functions show similar patterns despite assigning remaining weights in different ways. Our masking function S assigns more remaining weights to important heads than L, and some heads in W_Q retain more than 60% of their weights at the 9th layer. For the global masking function G, because most remaining weights are assigned to W_U and W_D, the average remaining weight ratios of W_Q and W_V under G are only 3.2% and 2.8%, which causes G to underperform the other masking functions. At these sparsity levels, most heads are masked.

Task-Specific Head
To validate the effectiveness of our task-specific head initialization, we compare it with training the head from scratch.

Table 4: Influence of different task-specific head methods. "From scratch" refers to training the head from scratch following previous pruning methods; "Initialization" refers to our initialization method.

Table 4 shows the results of SMP-L on MNLI and SQuAD with 80%, 10%, and 3% remaining weights. For training from scratch, we randomly initialize the head and fine-tune it with a learning rate of 3e-5, following previous pruning methods. The results show that our method achieves better performance with the task-specific head frozen.

Training Objective
The regularization term in the training objective is a key factor for our method. As Table 5 shows, our method struggles to converge at high sparsity without the regularization term R. As sparsity increases, the performance gap between training with and without R grows sharply; SMP-L without R even fails to converge at 10% and 3% remaining weights on SQuAD. As analyzed in section 4.3, we find that the remaining weights in attention heads are more uniformly distributed without R. For example, the standard deviation of the remaining weight ratio across attention heads is 3.75, compared to 12.4 for SMP-L with R, on MNLI with 10% remaining weights. In other words, without R, the method cannot assign more remaining weights to important heads as in Figure 3.

Conclusion
In this paper, we propose a simple but effective task-specific pruning method called Static Model Pruning (SMP). In contrast to previous methods, which perform both pruning and fine-tuning to adapt PLMs to downstream tasks, we find that fine-tuning can be redundant, since first-order pruning alone already converges PLMs. Based on this, our method focuses on using first-order pruning to replace fine-tuning. Without fine-tuning, our method strongly outperforms other first-order methods. Extensive experiments also show that our method achieves state-of-the-art performance at various sparsity levels. Regarding the lottery ticket hypothesis in BERT, we find that BERT contains sparse subnetworks that achieve the original performance without being trained, and these subnetworks at 80% remaining weights even outperform fine-tuned BERT on GLUE.

Table 7: Standard deviation of Table 2.