Block Pruning For Faster Transformers

Pre-training has improved model accuracy for both classification and generation tasks at the cost of introducing much larger and slower models. Pruning methods have proven to be an effective way of reducing model size, whereas distillation methods are proven for speeding up inference. We introduce a block pruning approach targeting both small and fast models. Our approach extends structured methods by considering blocks of any size and integrates this structure into the movement pruning paradigm for fine-tuning. We find that this approach learns to prune out full components of the underlying model, such as attention heads. Experiments consider classification and generation tasks, yielding among other results a pruned model that is a 2.4x faster, 74% smaller BERT on SQuAD v1, with a 1% drop on F1, competitive both with distilled models in speed and pruned models in size.


Introduction
Pre-trained transformer models are the standard for both classification and generation tasks in NLP (Devlin et al., 2019; Lewis et al., 2020). The recent trend is for models to continue to grow in size while yielding improved performance on standard benchmarks (Rosset, 2020). This development highlights the need to reduce the storage size and increase the efficiency of pre-trained models.
Pruning methods have been shown to be extremely effective at reducing the storage size of models fine-tuned for a specific task. Approaches such as magnitude pruning (Han et al., 2015), L0 regularization (Louizos et al., 2018), the lottery ticket hypothesis (Frankle and Carbin, 2018), diff pruning (Guo et al., 2020), and movement pruning (Sanh et al., 2020) have demonstrated remarkable reductions in model size. Movement pruning produces 77% savings in parameter storage for a 1% drop in accuracy on SQuAD v1.1. However, these models yield very few actual efficiency benefits, as running them on standard hardware often requires reconstructing the original dense shape.
On the other hand, distillation methods have been more effective at producing faster models, as shown by DistilBERT (Sanh et al., 2019), TinyBERT (Jiao et al., 2019) and MobileBERT (Sun et al., 2020). These approaches use targeted distillation to produce smaller models with a dense structure that is fast on standard hardware. However, without careful engineering and size selection, these models are much larger than pruned ones.
In this work, we aim to close this gap through block pruning. Unlike pruning individual parameters, this approach encourages pruning that can be optimized on dense hardware. It is a less rigid approach than the row or column-based pruning typically used in structured approaches (McCarley, 2019), which have been difficult to apply effectively to transformers. We integrate this approach with movement pruning (Sanh et al., 2020), a simple method for pruning pre-trained models during fine-tuning. The final method¹ has few additional hyperparameters or training requirements.
Experiments consider a large variety of benchmark datasets, comparing accuracy and efficiency. We find a surprising result: despite using sub-row square blocks during training, the approach learns to eliminate full components of the model, effectively dropping a large number of attention heads. This effect allows the model to achieve speedups even beyond standard structured pruning of feed-forward layers. Results show a 2.4x speedup on SQuAD v1.1 with a 1% drop of F1, and a 2.3x speedup on QQP with a 1% loss of F1. Experiments on summarization also show a 1.39x speedup for an average drop of 2 points on all ROUGE metrics on CNN/DailyMail, with a 3.5x reduction of decoder weights.
¹ Available at https://github.com/huggingface/nn_pruning

Related Work
There has been a growing interest in the compression of pre-trained language models. We consider three varieties of methods: distillation, pruning, and structured pruning.
Knowledge distillation, introduced by Hinton et al. (2015), is a popular compression technique. Researchers have applied this method to a variety of NLP models (Tang et al., 2019; Sun et al., 2019; Turc et al., 2019). Distillation has been used to obtain significantly smaller BERT models that achieve competitive performance. Sanh et al. (2019) distill BERT into shallower students during the pre-training stage and optionally during the fine-tuning stage. MobileBERT (Sun et al., 2020) and TinyBERT (Jiao et al., 2019) are obtained through a layer-wise distillation strategy. While the distillation of the former is task-agnostic, the one used to obtain the latter is task-specific.
Other previous work has focused on unstructured pruning (LeCun et al., 1989; Han et al., 2015; Frankle and Carbin, 2018). When targeting transformer models, it is typical to select the weights to prune based on their magnitude (Gordon et al., 2020), or by computing an importance score using a first-order method (Sanh et al., 2020). While these methods allow for a significant reduction in model size, specialized hardware is required to make use of the resulting unstructured sparse matrices in order to speed up inference.
In contrast, structured pruning removes coherent groups of weights (Murray and Chiang, 2015; See et al., 2016; Joulin et al., 2016; Fan et al., 2020; Sajjad et al., 2020). Recent works (Michel et al., 2019; Voita et al., 2019) show that some heads can be removed without significant degradation in performance, leading to the conclusion that most heads provide redundant information. Other authors have worked on combining matrix factorization and weight pruning. While Mao et al. (2020) combine SVD-based matrix factorization with unstructured pruning, other work uses structured pruning in order to reduce the rank. Related to our approach, Kim and Awadalla (2020) and McCarley (2019) both apply structured pruning on the heads of the multi-head attention (MHA) and on the inner-layer nodes of the feed-forward network (FFN). The former uses predefined pruning ratios, shared across all layers, in order to select the modules to prune after sorting them by an importance score. McCarley (2019) compares different methods to compute the prunable module masks and finds L0 regularization to perform the best.

Background
Starting with a transformer model with parameters $\theta$, our goal is to produce a set of parameters $\theta'$ that are both fine-tuned for a specific end-task and smaller in such a way that inference can be efficiently computed on parallel hardware.
The two largest lines in the transformer parameter budget are the feed-forward network sub-layer (FFN) and the multi-head attention sub-layer (MHA). The FFN parameters consist of two matrices ($W_1$ and $W_2$) of transposed shapes $\mathbb{R}^{d_{model} \times d_{ff}}$ and $\mathbb{R}^{d_{ff} \times d_{model}}$, where $d_{model}$ is the hidden size and $d_{ff} \geq d_{model}$ is the inner size. These are used in the standard fashion by the network. The MHA parameters consist of 4 projection matrices ($W_q$, $W_k$, $W_v$ and $W_o$) of size $\mathbb{R}^{d_{model} \times d_{model}}$ (query, key, value, out). These are used to project the hidden vector to and from the component attention parts. In implementations, this projection is made with the matrices in their folded tensor form $\mathbb{R}^{n_{heads} \times \frac{d_{model}}{n_{heads}} \times d_{model}}$, where $n_{heads}$ is the number of attention heads.
In standard fine-tuning, starting from $\theta$, we optimize the loss $\mathcal{L}$ (for instance, cross-entropy for classification):
$$\arg\min_{\theta'} \mathcal{L}(\theta')$$
Score-based pruning methods introduce score parameters $S$ for each parameter matrix $W$ and replace it with a masked version $W' = W \odot M(S)$. Movement pruning (Sanh et al., 2020) is a score-based pruning approach that encourages the model to optimize these score parameters. Specifically, we focus on the soft-movement variant of movement pruning that sets $M(S) = \mathbb{1}(S > \tau)$ for a threshold parameter $\tau$, and optimizes a regularized objective,
$$\arg\min_{\theta', S} \mathcal{L}(\theta') + \lambda \lVert \sigma(S) \rVert$$
where $\lambda$ is a hyper-parameter, $\lVert A \rVert = \sum_{i,j} A_{i,j}$ and $\sigma$ is the sigmoid function.
This pruning objective encourages the model to fine-tune the parameters while lowering the scores of unimportant parameters and thus encouraging more sparsity. In order to train through the threshold, a straight-through estimator (Bengio et al., 2013) is used.
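To make the soft-movement quantities concrete, here is a minimal forward-pass sketch of the thresholded mask and the sigmoid penalty. The function names and the toy values are illustrative assumptions, not taken from the authors' code, and the straight-through estimator used for training is deliberately left out since no gradients are computed here.

```python
import math


def sigmoid(x):
    """Logistic function, written sigma in the text."""
    return 1.0 / (1.0 + math.exp(-x))


def threshold_mask(scores, tau):
    """M(S) = 1(S > tau), applied element-wise to a score matrix."""
    return [[1.0 if s > tau else 0.0 for s in row] for row in scores]


def apply_mask(weights, mask):
    """Element-wise product of a weight matrix with its mask."""
    return [[w * m for w, m in zip(w_row, m_row)]
            for w_row, m_row in zip(weights, mask)]


def sparsity_penalty(scores, lam):
    """lam * sum over i,j of sigmoid(S[i][j]): the regularization term."""
    return lam * sum(sigmoid(s) for row in scores for s in row)


# Toy 2x2 example (hypothetical values, for illustration only).
W = [[0.5, -1.2], [0.3, 0.8]]
S = [[2.0, -1.0], [0.0, 0.5]]
W_masked = apply_mask(W, threshold_mask(S, tau=0.0))
penalty = sparsity_penalty(S, lam=0.1)
```

Lowering a score below the threshold removes the corresponding weight, and the penalty shrinks as scores decrease, which is how the objective trades task loss against sparsity.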
Movement pruning, combined with distillation, has been shown to be a very effective method to reduce the number of parameters in an existing model, yielding 94% pruning in our tests for an F1 of 87.5 on SQuAD v1.1 (BERT-base is 88.5). This results in significantly smaller models than distillation alone. However, even with this sparsity level, the model is not substantially faster when run on most standard hardware, which cannot take significant advantage of this style of sparse matrix-vector product.

Model: Block Movement Pruning
In this work, we extend movement pruning to work on blocks of local parameters. Specifically, each matrix in the transformer is partitioned into fixed-sized blocks. This setting goes beyond the arbitrary pruning of unstructured methods, with the goal of encouraging the data locality closer to what would be needed for efficiency.²
Our approach is extremely simple. For each parameter matrix $W \in \mathbb{R}^{M \times N}$, we assume a fixed-sized block structure $(M', N')$. Each of these blocks acts as an individual group in the regularization with a shared score parameter derived from the corresponding score matrix $S \in \mathbb{R}^{M/M' \times N/N'}$. Computing the masked weight is done by expanding the thresholded values, i.e.
$$W'_{i,j} = W_{i,j} \cdot M(S)_{\lceil i/M' \rceil, \lceil j/N' \rceil}$$
As in past work, this model is trained with distillation to match the performance of a teacher model.
Unlike other distillation approaches that require fully specifying the new model structure, our method only requires the size and shapes of the blocks, i.e. the set of $(M', N')$ for each parameter matrix in the model. If blocks are too large, then they are difficult to prune, but if they are too small they do not support efficient inference.
² Linear algebra libraries perform matrix multiplication using large blocks, typically 128*64. At a micro level those machines are typically 32-way SIMD, and memory is loaded in large contiguous chunks to maximize bandwidth. Unstructured sparsity is hard to implement with dense algebra performance on GPUs. Data locality is important on CPU too, but in a more limited way.
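The block-level masking described in this section can be sketched as follows: one score per block is thresholded, and the decision is repeated over every weight in that block. This is an illustrative sketch with hypothetical helper names, using plain nested lists and 0-based indices, not the authors' implementation.

```python
def expand_block_mask(scores, block_rows, block_cols, tau):
    """Expand a (M/M') x (N/N') score matrix into a full M x N 0/1 mask."""
    mask = []
    for score_row in scores:
        # One thresholded decision per block, repeated across the block width.
        wide_row = []
        for s in score_row:
            wide_row.extend([1.0 if s > tau else 0.0] * block_cols)
        # The same row pattern is repeated down the block height.
        for _ in range(block_rows):
            mask.append(list(wide_row))
    return mask


def apply_mask(weights, mask):
    """Element-wise product of a weight matrix with its mask."""
    return [[w * m for w, m in zip(w_row, m_row)]
            for w_row, m_row in zip(weights, mask)]


# Toy 4x4 matrix with (2, 2) blocks, so the score matrix is 2x2.
W = [[float(4 * i + j + 1) for j in range(4)] for i in range(4)]
S = [[1.0, -1.0], [-0.5, 0.3]]
mask = expand_block_mask(S, block_rows=2, block_cols=2, tau=0.0)
W_masked = apply_mask(W, mask)
```

Here the top-right and bottom-left blocks are zeroed as whole units, which is the property that makes the resulting sparsity easier to exploit than element-wise pruning.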
To reduce the search space, we will limit ourselves to testing $(M', N')_{att}$ and $(M', N')_{ff}$: the same block size will be used for all layers for attention weights $W_q$, $W_k$, $W_v$ and $W_o$ on one hand, and for the feed-forward weights $W_1$ and $W_2$ on the other hand. We split the movement pruning regularization term into:
$$\lambda_{att} \lVert \sigma(S_{att}) \rVert + \lambda_{ffn} \lVert \sigma(S_{ffn}) \rVert$$
This allows us to take into account the difference in terms of gradient received by the score parameters.
To reduce further the search space, we will test on two kinds of blocks:
• (32, 32): square blocks (Block)
• dimension pruning on paired FFN rows and columns (Dim)
These block sizes allow for efficient models: blocks of size at least (16, 16) are efficient to compute with appropriate GPU kernels, whereas full rows, columns or heads can be entirely removed from the matrix: the remaining matrix is then dense.
We also include two additional baseline block types used to verify the approach:
• $(2^n, 2^n)$, $n \in [2, 5]$: smaller power-of-two square block sizes to study the impact of size on performance (Block)
• $(d_{model} / n_{heads}, d_{model})$: for attention heads (Heads)
The first considers small blocks, and the second considers very large functional blocks.
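As a rough illustration of what these block shapes mean in practice, the sketch below counts how many blocks (and hence score parameters) each shape yields for one attention matrix. It assumes the standard BERT-base dimensions (hidden size 768, inner size 3072, 12 heads, 12 layers), which are not stated in this section, and the helper name is hypothetical.

```python
# Assumed BERT-base dimensions (standard values, not given in the text above).
D_MODEL, D_FF, N_HEADS, N_LAYERS = 768, 3072, 12, 12


def n_blocks(shape, block):
    """Number of blocks, i.e. score parameters, for a matrix of a given shape."""
    rows, cols = shape
    block_rows, block_cols = block
    if rows % block_rows or cols % block_cols:
        raise ValueError("block shape must divide the matrix shape")
    return (rows // block_rows) * (cols // block_cols)


att_shape = (D_MODEL, D_MODEL)
square_32 = n_blocks(att_shape, (32, 32))                   # 24 * 24 blocks
square_4 = n_blocks(att_shape, (4, 4))                      # smallest baseline
per_head = n_blocks(att_shape, (D_MODEL // N_HEADS, D_MODEL))  # one per head

# Parameters held by the linear layers of all transformer layers.
linear_params = N_LAYERS * (4 * D_MODEL * D_MODEL + 2 * D_MODEL * D_FF)
```

Under these assumptions the linear layers hold about 85M parameters, consistent with the figure quoted for BERT-base in the experimental setup, and the head-sized block leaves only 12 pruning decisions per attention matrix against 576 for (32, 32) blocks.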

Experimental Setup
We conduct experiments on five (English) tasks commonly used to evaluate pre-trained language models: question answering (SQuAD v1.1 and SQuAD v2), natural language inference (MNLI), sentence similarity (QQP), sentiment classification (SST-2) and abstractive summarization (CNN/DailyMail). These datasets respectively contain 87k, 130k, 392k, 363k, 67k and 287k training examples, and are downloaded from the Hugging Face datasets hub. SQuAD is formulated as a span-extraction task, MNLI and QQP are sentence-pair classification tasks, SST-2 is a sentence classification task and CNN/DailyMail ("CNN") is formulated as a conditional generation task. We report the performance on the development set as measured by the accuracy for MNLI and SST-2, F1 for QQP, the exact match (EM) and F1 for SQuAD and ROUGE for CNN/DailyMail.
We experiment with task-specific pruning of transformer language models. We use BERT (Devlin et al., 2019) (an encoder-only Transformer language model with 110M parameters, among which 85M are part of the linear layers present in the Transformer layers) for sentence classification and question answering (340M and 227M respectively for BERT-large), and BART (Lewis et al., 2020) (an encoder-decoder language model with 139M parameters, among which 99M are part of the linear layers present in the Transformer layers) for summarization (406M and 353M for BART-large).
We compare against several baselines. Movement pruning is a fully unstructured approach and gives an upper bound on the sparsity trade-offs we hope to achieve, even if it provides little speed benefit. We also compare our results against state-of-the-art approaches developed for fast inference of transformer-based language models. DistilBERT (Sanh et al., 2019) is obtained by distilling through pre-training a pre-trained BERT into a smaller model. TinyBERT (Jiao et al., 2019) distills a fine-tuned model while using data augmentation. MobileBERT (Sun et al., 2020) is the result of a large architecture search. dBART (Shleifer and Rush, 2020) is obtained by arbitrarily copying equally spaced layers of a large model to a smaller one.
To measure inference speed on GPU, we use a 24GB 3090 RTX and an Intel i7 CPU, using a large batch size (128) for evaluation and using PyTorch CUDA timing primitives. We measure the speed of other models in this same setup. Results may be different from original papers, as latency and throughput characteristics are different for each platform. We also provide the number of parameters in the linear layers of the Transformer layers for each of our models and for the reference ones: as the linear layers represent most of the FLOPs, this is a good proxy for the computation required and to some extent for the compute time, when the model characteristics are equivalent.

Resources and Reproducibility
We use a minimal set of hyperparameters. The ratio of $\lambda_{att}$ and $\lambda_{ffn}$ is fixed by the relative sizes. We performed a few experiments with different values fixed manually for these parameters, but their influence is minor.
The main hyperparameter is the number of training epochs. For SQuAD v1.1, we use 20 epochs instead of the typical 2 for BERT models. This means that fine-tuning takes about 12h with our method instead of 45 min with a standard fine-tuning setup. The number of epochs has to be large enough to let pruning happen slowly enough for a given task. A warm-up phase and a post-pruning cool-down phase are helpful, but their exact length does not have a large impact on final performance. We believe the training time is less important than the inference time for energy considerations, as inference is performed repeatedly. Our method optimizes inference by a large factor: the training energy is potentially recouped by a large margin with inference savings.
Finally, the checkpoints created during the experiments are available on an AWS S3 bucket, with their metadata and training parameters, totaling 3TB of data, to facilitate reproduction of our results and to make it possible to study further the behavior of those models. Code for experiments, analysis, and tools to prepare the present paper are available on GitHub (see Appendix A).

Pruning Methods
The pruning approaches are shown in Table 1.
Block pruning uses square block sizes throughout all the linear layers, as an extension of the original movement pruning for which the block size is 1.
Hybrid pruning jointly removes hidden dimensions in feed-forward layers $W_1$ and $W_2$, using movement pruning to create the dimension mask. This corresponds to full rows or columns in the parameter matrices. The pruned $W_1$ and $W_2$ can then be "compacted" to become fully dense: we perform dense operations on cropped matrices. For the attention layers, pruning only some rows or columns in $W_q$, $W_k$, $W_v$ and $W_o$ cannot be practically exploited. This is because the structure of the computation makes the additional cost of resizing the tensor inefficient. We therefore use square block pruning on the attention layers, with a block size of (32, 32), which showed the best trade-off between performance and accuracy.
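The "compaction" of pruned feed-forward matrices can be illustrated with a small sketch. It assumes a bias-free FFN with a ReLU activation (BERT itself uses biases and GeLU, so this is a simplification), follows the shape convention of the Background section ($W_1$ is $d_{model} \times d_{ff}$), and uses hypothetical helper names.

```python
def vec_mat(x, W):
    """Row vector x (length n) times matrix W (n x m)."""
    return [sum(x[i] * W[i][j] for i in range(len(x)))
            for j in range(len(W[0]))]


def relu(v):
    return [max(0.0, a) for a in v]


def ffn(x, W1, W2):
    """Simplified bias-free feed-forward block: relu(x W1) W2."""
    return vec_mat(relu(vec_mat(x, W1)), W2)


def compact_ffn(W1, W2, keep):
    """Drop pruned hidden dimensions: columns of W1 and matching rows of W2."""
    W1_small = [[row[k] for k in keep] for row in W1]
    W2_small = [W2[k] for k in keep]
    return W1_small, W2_small


# d_model = 2, d_ff = 4; hidden dimensions 1 and 3 have been pruned to zero.
W1 = [[1.0, 0.0, -2.0, 0.0],
      [0.5, 0.0, 1.0, 0.0]]
W2 = [[1.0, 2.0],
      [0.0, 0.0],
      [3.0, -1.0],
      [0.0, 0.0]]
dim_mask = [1, 0, 1, 0]
keep = [k for k, m in enumerate(dim_mask) if m]

W1_small, W2_small = compact_ffn(W1, W2, keep)
x = [1.0, 3.0]
full_out = ffn(x, W1, W2)
small_out = ffn(x, W1_small, W2_small)
```

The cropped matrices are half the size yet produce the same output, which is why paired row and column pruning translates directly into dense, faster computation.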
Struct pruning uses the same methods for FFN layers but aims to remove model attention heads directly. To do so, we choose a block size on attention that equals the head size while still using the same soft movement pruning strategy. For this approach, we use a $\lambda_{att}$ equal to 1/32, as there are 32 times more parameters in an attention block than in a feed-forward dimension.
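One way to reproduce the factor of 32 is the following back-of-the-envelope check. It assumes BERT-base dimensions (hidden size 768, 12 heads) and assumes that a "feed-forward dimension" means one column of $W_1$ together with the paired row of $W_2$; both are our reading rather than something stated explicitly above.

```python
# Assumed BERT-base dimensions (not stated in this section).
D_MODEL = 768
N_HEADS = 12

# One head-sized attention block: (d_model / n_heads) x d_model weights.
head_block_params = (D_MODEL // N_HEADS) * D_MODEL

# One feed-forward dimension: a W1 column plus the paired W2 row (assumption).
ffn_dim_params = 2 * D_MODEL

ratio = head_block_params // ffn_dim_params
```

With these numbers a head block holds 49,152 weights and a feed-forward dimension 1,536, giving the ratio of 32 that motivates scaling the attention regularization by 1/32.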
When Block Pruning does not fully remove a component such as an attention head, as shown in Figure 1, we cannot speed up the model. But we can reclaim some of the performance at no speed cost and at marginal cost on sparsity by making use of those zero weights.
Hybrid Filled pruning allows the model to reinitialize these reclaimed weights uniformly at random and continue fine-tuning the smaller model for a few steps. We also explore "rewinding" (Frankle and Carbin, 2018) by identifying weights that should not be pruned (because they are part of a non-empty attention head) and re-fine-pruning the pre-trained model: the first run marks the attention heads that were not pruned, and the second uses this information to create a positive mask of weights that are protected from pruning. We did not find a significant difference between the two methods. The results presented here do not use rewinding.
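The mask side of the "filling" step can be sketched as below: any head that still contains a non-zero block is treated as alive and its whole mask is set to one, while empty heads stay removed. The sketch makes two simplifying assumptions: it looks at a single attention matrix whose rows are grouped by head, and it leaves out the random re-initialization and the extra fine-tuning steps. Names are hypothetical.

```python
def fill_nonempty_heads(mask, head_size):
    """Return a mask where every partially pruned head becomes fully dense.

    mask: 0/1 matrix whose rows are grouped into heads of `head_size` rows.
    """
    filled = []
    for start in range(0, len(mask), head_size):
        head_rows = mask[start:start + head_size]
        head_alive = any(v != 0 for row in head_rows for v in row)
        for row in head_rows:
            filled.append([1 if head_alive else 0 for _ in row])
    return filled


def reclaimed_positions(mask, filled):
    """Positions that were pruned but are given back to the model."""
    return [(i, j)
            for i, row in enumerate(filled)
            for j, v in enumerate(row)
            if v == 1 and mask[i][j] == 0]


# Two heads of two rows each: head 0 is partially pruned, head 1 is empty.
mask = [[1, 0],
        [0, 0],
        [0, 0],
        [0, 0]]
filled = fill_nonempty_heads(mask, head_size=2)
reclaimed = reclaimed_positions(mask, filled)
```

The reclaimed positions are the weights that cost nothing in speed, because their head has to be computed anyway, and can therefore be re-initialized and trained again.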

Experiments
Main Results We begin by observing the high-level impact of the different pruning methods. Figure 1 shows the effect on attention and feed-forward layers for the different block pruning methods. We find that all the different block sizes learn to prune out entire dimensions in the FFN layers. Interestingly, we find that the block methods can also learn to remove entire heads from the MHA. This pruning pattern makes it possible to remove entire heads from the model during inference. For this reason, we focus on the Hybrid approach as our main method, which can both eliminate feed-forward dimensions while using blocks to remove attention heads gradually.
Results on SQuAD are shown in Figure 2, which compares our approach for speed and density to baseline BERT-Base tuned models such as TinyBERT-6 and DistilBERT (MobileBERT is discussed below). The main result is that the Hybrid Pruning model is as fast as the baseline and approaches the same accuracy while at the same time producing significantly smaller models in terms of density. Moving to the Hybrid Filled model leads to a further gain in speed at a small cost in model density. For instance, for the same F1 performance of 87.5, Hybrid Filled models display a 2.5x speedup against 1.88x for TinyBERT. TinyBERT and DistilBERT have 50% of BERT's encoder parameters, whereas Hybrid Filled models have 25% of BERT's parameters for the same level of accuracy.
The figures also include two intrinsic baselines: our reimplementation of Movement pruning and pure Block pruning. We find that our implementation of Movement pruning is highly effective at producing sparse models (even leading to a small increase in accuracy) but does not produce significant speedups. Square Block pruning does better, but not as well as hybrid blocks.
Table 2 gives a full comparison of models with different compression rates. As linear layers represent a very large part of the FLOPs of a transformer model, this compression rate is actually a good measure of the maximum achievable speedup. This number is much higher than the actually measured speedup. This indicates that our setup for measur… BERT performance on other tasks.
Comparison with MobileBERT All methods can be improved further using a larger teacher model. For these experiments, we compare with MobileBERT, which uses a BERT-large teacher and reaches an F1 of 90.0 on SQuAD v1.1 on its fastest version. It should be noted that MobileBERT makes use of additional optimizations not present in the original BERT-large we are using: LayerNorms are replaced by purely linear NoNorms, and GeLUs are replaced by ReLUs. For these experiments, we use a BERT-large teacher to perform meaningful comparisons, using our best method Hybrid Filled. Figure 2 shows that we have comparable results on SQuAD v1.1, with a simpler optimization approach: we get a slightly better model (F1=90.3) for the same speedup of 1.6x, and we get a speedup of 2.2x at BERT-base accuracy (F1=88.5). We observe that using a large teacher is beneficial even at high levels of pruning: up to 80% of sparsity, the resulting student model has better accuracy for the same number of parameters when using a BERT-large teacher instead of a base one. This trend reverses after this point: a larger teacher is detrimental to accuracy when the student is very heavily pruned.
Encoder-Decoder Finally, we apply these methods to two encoder-decoder architectures, BART-base and BART-large, for the task of summarization. For these architectures, the decoder parameters are responsible for a majority of the computational costs, so these are our main focus. Voita et al. (2019) observed that for machine translation models, encoder heads were much easier to prune than decoder ones. We found similar results, e.g. for identical $\lambda_{att}$ and $\lambda_{ffn}$, the encoder was systematically more pruned than the decoder, for both MHA and FFN sub-layers. In order to increase speedup gain, we applied twice as much weight on the decoder compression, which resulted in even pruning ratios among the encoder and decoder. Table 4 shows the main results. We see that Hybrid pruning leads to large decoder compression ratios (3.4 on BART-base and 3.5 on BART-large) with only a small drop in ROUGE score. Speedups reach 1.4 times the original speed. (Given the large decoder compression rates, we would expect larger speedups to be possible with further engineering of the inference.)
There is less comparable work for pre-trained encoder-decoders. We compare our approach with a distillation-based approach, dBART (Shleifer and Rush, 2020). This approach yields a similar speedup gain with a smaller drop in performance but less sparsity. For models of comparable sizes (158M for our Hybrid NT vs 176M for dBART-6-6), we observe a drop of 0.7 in R2 and 0.4 in RL against 0.9 in R2 and 1.3 in RL for dBART-6-6. As with encoder-only models, the two approaches could likely be combined to yield even faster, smaller models.

Analysis
Large Model Pruning To test that this approach scales to large models, we apply Hybrid pruning on BERT-large on SQuAD v1.1. We observe similar results: an 18% dense BERT-large has an F1 of 90.2, with a speedup of 3.2x compared to BERT-large with an F1 of 93.2. This pruned model is actually faster than a BERT-base model. This matches the observation of Li et al. (2020): the larger the model, the more effective pruning is. When pruning a larger model, the final model is actually better than a smaller one with the same absolute number of parameters.
Block Size Influence Figure 3 shows the impact of different block sizes on Block pruning: pruning is done on attention layers and FFNs with the same square block size, from (4, 4) to (32, 32), with a BERT-base teacher. We can see that we reach the …
We also note that, with the original Movement Pruning method, we see some speedup due to full dimension pruning. This likely comes from our improved set of hyper-parameters (compared to the original paper), allowing us to remove some empty rows and columns in the FFN layers. However, we see that using blocks leads to a significant speed improvement compared to Movement Pruning.
Quantization Quantization is often of critical importance for practical applications. We, therefore, wanted to check that our networks could be subjected to quantization without significant loss of accuracy, especially when considering the issues that could arise with the high level of sparsity of some FFNs. Table 6 shows the results of full 8-bit quantization tests on our models. These indicate that the method is compatible with quantization, and the models using quantization on top of our pruning method achieve very high gains in terms of size (as well as speed).

Impact of Distillation
We report experimental results with the addition of a teacher distillation step, as previous work showed that this boosts movement pruning at little cost. In this section, we conduct an ablation study to evaluate the impact of distillation using a BERT-base teacher. As shown in Table 7, combining hybrid pruning with distillation always performs better than pruning alone, but it is not critical for the approach to work. The distillation effect is larger for smaller datasets such as SST-2, which are prone to over-fitting. We believe that the regularization brought by pruning and distillation counters the over-fitting caused by the additional number of steps needed for pruning.

Conclusion
We have shown that we can extract small pruned models that are equivalent to or better than distilled networks. This approach can be applied during fine-tuning rather than pre-training. The method does not resort to techniques such as data augmentation or architecture search, and it works on a diverse set of tasks and base models. As better and larger models are published at an increasing pace, we can rely on a simple and robust method to accelerate them on specific tasks and to distribute them easily while keeping most of the original model accuracy.

Impact
We expect the method presented here to contribute to the reduction of the compute resources and energy needed to perform natural language tasks, while preserving the original model performance. It will contribute additionally to alleviating privacy concerns: smaller models running on user devices instead of server-side allow more information to stay private. This is especially relevant when considering the large anticipated demand for such NLP applications in the near future.

A Reproducibility & Hyper-Parameters

Code
The complete code to run the experiments, analyze the results and finally create the figures and tables in this paper is available on the Hugging Face nn_pruning repository, at https://github.com/huggingface/nn_pruning.

Hyperparameters
The hyperparameters of the experiments are available as JSON files (one file per task) in the same repository: each entry contains all the information to fine-tune and prune the model, its evaluation results, and detailed statistics about its final sparsity.
For example, the SQuAD V1 checkpoints referenced in this paper are listed with the hyperparameters and related information.

Checkpoints
Some of the models we produced during this research can be used directly from the Hugging Face model hub.
The other models and the checkpoints, including the intermediary ones that were saved during training, are available on Amazon S3.

B Additional Data

Block Shape & Head pruning
We show here the effect of the pattern on the head number reduction: using block instead of row/column pruning leads to a much larger number of pruned heads while improving accuracy, here on the SST-2 task.
We are using Block Movement pruning for each model, with different block patterns, pruning only the attention layers. Compression measures the reduction of the number of non-zero parameters in attention linear layers, whereas head compression measures the reduction of the number of complete non-zero heads.
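The two measures can be made concrete with a small sketch. It assumes that each measure is the ratio of the original count to the remaining count (total over non-zero parameters, total over non-empty heads), and that the rows of a single attention mask are grouped by head; both the definitions and the names are illustrative assumptions.

```python
def compression(mask):
    """Total number of parameters divided by the number of non-zero ones."""
    total = sum(len(row) for row in mask)
    nonzero = sum(1 for row in mask for v in row if v != 0)
    return float("inf") if nonzero == 0 else total / nonzero


def head_compression(mask, head_size):
    """Total number of heads divided by the number of non-empty heads."""
    n_heads = len(mask) // head_size
    alive = sum(
        1
        for start in range(0, len(mask), head_size)
        if any(v != 0 for row in mask[start:start + head_size] for v in row)
    )
    return float("inf") if alive == 0 else n_heads / alive


# Two heads of two rows each; only part of head 0 survives pruning.
mask = [[1, 1, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0]]
param_ratio = compression(mask)
head_ratio = head_compression(mask, head_size=2)
```

In this toy case the parameter compression is 8 while the head compression is only 2, which shows why a high sparsity level does not by itself imply that many heads can be dropped at inference time.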

Pruning Methods Comparison
We select speed as our main metric to compare with other techniques, as it is the major practical measure of inference efficiency. On this metric, we decided to compare our models to the best models available, i.e. the distilled models (MobileBERT, TinyBERT), even though the method is different, as they are the strongest "speed/accuracy" baselines available.
In Table 9 we compare with TinyBERT (Jiao et al., 2019) and MobileBERT (Sun et al., 2020). We also compare Hybrid pruning, with and without a teacher, with the unstructured methods from Sanh et al. (2020) (the original Movement Pruning method we are using) and Gordon et al. (2020), and with Sajjad et al. (2020)