Structured Pruning Learns Compact and Accurate Models

The growing size of neural language models has led to increased attention on model compression. The two predominant approaches are pruning, which gradually removes weights from a pre-trained model, and distillation, which trains a smaller compact model to match a larger one. Pruning methods can significantly reduce model size but rarely achieve speedups as large as distillation does, while distillation methods require large amounts of unlabeled data and are expensive to train. In this work, we propose a task-specific structured pruning method, CoFi (Coarse- and Fine-grained Pruning), which delivers highly parallelizable subnetworks and matches distillation methods in both accuracy and latency, without resorting to any unlabeled data. Our key insight is to jointly prune coarse-grained (e.g., layers) and fine-grained (e.g., heads and hidden units) modules, controlling the pruning decision of each parameter with masks of different granularity. We also devise a layerwise distillation strategy to transfer knowledge from the unpruned to the pruned model during optimization. Our experiments on GLUE and SQuAD datasets show that CoFi yields models with over 10× speedups and only a small accuracy drop, demonstrating its effectiveness and efficiency compared to previous pruning and distillation approaches.


Introduction
Pre-trained language models (Devlin et al., 2019; Liu et al., 2019a; Raffel et al., 2020, inter alia) have become the mainstay of natural language processing. Their high costs in storage, memory, and computation time have motivated a large body of work on model compression to make them smaller and faster to use in real-world applications (Ganesh et al., 2021).

Table 1: A comparison of state-of-the-art distillation and pruning methods. U and T denote whether Unlabeled data and Task-specific data are used for distillation or pruning. Inference speedups are reported against a BERT base model; we evaluate all the models on an NVIDIA V100 GPU (§4.1). The models labeled with ‡ use a different teacher model and are not a direct comparison.

The two predominant approaches to model compression are pruning and distillation (Table 1).
Pruning methods search for an accurate subnetwork in a larger pre-trained model. Recent work has investigated how to structurally prune Transformer networks (Vaswani et al., 2017), from removing entire layers (Fan et al., 2020; Sajjad et al., 2020), to pruning heads (Michel et al., 2019; Voita et al., 2019), intermediate dimensions (McCarley et al., 2019; Wang et al., 2020b), and blocks in weight matrices (Lagunas et al., 2021). The trend in structured pruning leans towards removing fine-grained units to allow for flexible final structures. However, thus far, pruned models rarely achieve large speedups (2-3× at most).
By contrast, distillation methods usually first specify a fixed model architecture and perform a general distillation step on an unlabeled corpus, before further fine-tuning or distillation on task-specific data (Sanh et al., 2019; Turc et al., 2019; Sun et al., 2019; Jiao et al., 2020). Well-designed student architectures achieve compelling speedup-performance tradeoffs, yet distilling into these randomly initialized student networks on large unlabeled data is prohibitively slow. For instance, TinyBERT (Jiao et al., 2020) is first trained on 2,500M tokens for 3 epochs, which takes 3.5 days on 4 GPUs (Figure 1). In this work, we propose a task-specific, structured pruning approach called CoFi (Coarse- and Fine-grained Pruning) and show that structured pruning can achieve highly compact subnetworks, obtaining large speedups and accuracy competitive with distillation approaches while requiring much less computation. Our key insight is to jointly prune coarse-grained units (e.g., self-attention or feed-forward layers) and fine-grained units (e.g., heads, hidden dimensions) simultaneously. Different from existing works, our approach controls the pruning decision of every single parameter with multiple masks of different granularity. This is the key to large compression, as it allows the greatest flexibility of pruned structures and eases optimization compared to pruning only small units.
It is known that pruning with a distillation objective can substantially improve performance (Sanh et al., 2020; Lagunas et al., 2021). Unlike with a fixed student architecture, however, pruned structures are unknown prior to training, and it is challenging to distill between intermediate layers of the unpruned and pruned models (Jiao et al., 2020). Hence, we propose a layerwise distillation method that dynamically learns the layer mapping between the two structures. We show that this strategy leads to performance gains beyond simple prediction-layer distillation.
Our experiments show that CoFi delivers more accurate models at all levels of speedups and model sizes on the GLUE (Wang et al., 2019) and SQuAD v1.1 (Rajpurkar et al., 2016) datasets, compared to strong pruning and distillation baselines. Concretely, it achieves over 10× speedups and a 95% sparsity across all the datasets while preserving more than 90% of accuracy. Our results suggest that task-specific structured pruning is an appealing solution in practice, yielding smaller and faster models without requiring additional unlabeled data for general distillation.

Transformers
A Transformer network (Vaswani et al., 2017) is composed of L blocks, and each block consists of a multi-head self-attention (MHA) layer and a feed-forward (FFN) layer. An MHA layer with N_h heads takes an input X and outputs:

MHA(X) = Σ_{i=1}^{N_h} Att(W_Q^(i), W_K^(i), W_V^(i), W_O^(i), X),

where W_Q^(i), W_K^(i), W_V^(i), W_O^(i) ∈ R^{d×d_h} denote the query, key, value and output matrices of the i-th head respectively, and Att(·) is an attention function. Here d denotes the hidden size (e.g., 768) and d_h = d/N_h denotes the output dimension of each head (e.g., 64).
Next comes a feed-forward layer, which consists of an up-projection and a down-projection, parameterized by W_U ∈ R^{d×d_f} and W_D ∈ R^{d_f×d}:

FFN(X) = gelu(X·W_U)·W_D.

Typically, d_f = 4d. A residual connection and a layer normalization operation follow each MHA and FFN layer.
MHAs and FFNs account for 1/3 and 2/3 of the model parameters in Transformers respectively (embeddings excluded). According to Ganesh et al. (2021), MHAs and FFNs take similar time on GPUs, while FFNs become the bottleneck on CPUs.
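The 1/3 vs. 2/3 split follows directly from the per-block weight shapes given above; a quick check with BERT-base sizes (biases ignored):

```python
# Per-Transformer-block parameter counts (biases ignored), BERT-base sizes.
d, d_f = 768, 4 * 768          # hidden size and FFN intermediate size

mha_params = 4 * d * d         # W_Q, W_K, W_V, W_O together: 4 matrices of d x d
ffn_params = 2 * d * d_f       # W_U (d x d_f) and W_D (d_f x d)

total = mha_params + ffn_params
print(mha_params / total)      # 1/3 of non-embedding block parameters
print(ffn_params / total)      # 2/3
```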

Distillation
Knowledge distillation (Hinton et al., 2015) is a model compression approach that transfers knowledge from a larger teacher model to a smaller student model. General distillation (Sanh et al., 2019; Sun et al., 2020; Wang et al., 2020a) and task-specific distillation (Sun et al., 2019) exploit unlabeled data and task-specific data respectively for knowledge transfer, and a combination of the two leads to increased performance (Jiao et al., 2020). General distillation, i.e., pre-training the student network on an unlabeled corpus, is essential for retaining performance but computationally expensive (Turc et al., 2019; Jiao et al., 2020).

Figure 1: Comparison of (a) TinyBERT (Jiao et al., 2020) and (b) our pruning approach CoFi. TinyBERT trains a randomly-initialized network through two-step distillation: (1) general distillation on a large unlabeled corpus, which takes 3.5 days to finish on 4 GPUs, and (2) task-specific distillation on the task dataset. CoFi directly prunes the fine-tuned BERT model and jointly learns five types of mask variables (i.e., z_FFN, z_int, z_MHA, z_head, z_hidn) to prune different types of units (§3.1). CoFi takes at most 20 hours to finish on 1 GPU for any of the GLUE datasets (smaller datasets need < 3 hours).

Different distillation objectives have also been explored. Besides standard distillation from the prediction layer (Hinton et al., 2015), transferring knowledge layer by layer from hidden representations (Jiao et al., 2020; Sun et al., 2020) and multi-head attention matrices (Wang et al., 2020a; Jiao et al., 2020; Sun et al., 2020) leads to significant improvements. Most distillation approaches assume a fixed student structure prior to training. Hou et al. (2020) attempt to distill to a dynamic structure with specified widths and depths, and later work adopts a one-shot Neural Architecture Search solution to search for student architectures.

Pruning
Pruning gradually removes redundant parameters from a teacher model, mostly producing task-specific models. Previous works focus on pruning different components of Transformer models, from coarse-grained units, e.g., dropping entire layers (Fan et al., 2020; Sajjad et al., 2020), to fine-grained units.

Head pruning
Michel et al. (2019) and Voita et al. (2019) show that only a small subset of heads is important and the majority can be pruned. We follow these works and mask heads by introducing variables z_head^(i) ∈ {0, 1} into multi-head attention:

MHA(X) = Σ_{i=1}^{N_h} z_head^(i) Att(W_Q^(i), W_K^(i), W_V^(i), W_O^(i), X).

Layer pruning
Only removing heads does not lead to large latency improvements: Li et al. (2021) demonstrate merely a 1.4× speedup even with only one remaining head per layer.

FFN pruning
The other major component, the feed-forward layers (FFNs), is also known to be overparameterized. Strategies to prune an FFN layer for an inference speedup include pruning an entire FFN layer (Prasanna et al., 2020; Chen et al., 2020b) and, at a more fine-grained level, pruning intermediate dimensions (McCarley et al., 2019; Hou et al., 2020) by introducing z_int ∈ {0, 1}^{d_f}:

FFN(X) = gelu(X·W_U)·diag(z_int)·W_D.

Block and unstructured pruning More recently, pruning a smaller unit, blocks, from MHAs and FFNs has been explored (Lagunas et al., 2021). However, models with blocks pruned are hard to optimize thus far: attempts to run block-pruned models with the block-sparse MatMul kernel provided by Triton (Tillet et al., 2019) report results that are not competitive. Similarly, unstructured pruning aims to remove individual weights and has been extensively studied in the literature (Chen et al., 2020a; Huang et al., 2021). Though the sparsity reaches up to 97% (Sanh et al., 2020), it is hard to obtain inference speedups on current hardware.

Combination with distillation Pruning is commonly combined with a prediction-layer distillation objective (Sanh et al., 2020; Lagunas et al., 2021). Yet it is not clear how to apply layerwise distillation strategies, as the pruned student model's architecture evolves during training.
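Intermediate-dimension pruning translates into real latency wins because masked dimensions can be physically sliced out of W_U and W_D after training, shrinking both matmuls. A minimal numpy sketch (all names and sizes are illustrative, not the released implementation; ReLU stands in for gelu):

```python
import numpy as np

# Sketch: pruning FFN intermediate dimensions with a binary mask z_int.
# Masked dimensions can be physically removed by slicing the weight
# matrices, so the smaller matmuls give a real latency improvement.
d, d_f = 8, 32
rng = np.random.default_rng(0)
W_U, W_D = rng.standard_normal((d, d_f)), rng.standard_normal((d_f, d))
z_int = rng.random(d_f) > 0.75        # keep roughly 25% of intermediate dims

def ffn(X, W_U, W_D):
    h = np.maximum(X @ W_U, 0.0)      # ReLU stand-in for gelu
    return h @ W_D

X = rng.standard_normal((4, d))
masked = ffn(X, W_U * z_int, W_D)               # mask applied, full-size matmul
pruned = ffn(X, W_U[:, z_int], W_D[z_int, :])   # dims sliced out, smaller matmul
print(np.allclose(masked, pruned))              # True: identical outputs
```

The elementwise nonlinearity commutes with column slicing, which is why the masked and sliced versions agree exactly.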

Method
We propose a structured pruning approach, CoFi, which jointly prunes Coarse-grained and Fine-grained units (§3.1) with a layerwise distillation objective that transfers knowledge from the unpruned to the pruned model (§3.2). The combination of the two leads to highly compressed models with large inference speedups.

Coarse- and Fine-Grained Pruning
Recent trends in structured pruning move towards pruning smaller units for model flexibility. Pruning fine-grained units naturally entails pruning coarse-grained units; for example, pruning all N_h (e.g., 12) heads is equivalent to pruning one entire MHA layer. However, we observe that this rarely happens in practice, which makes optimization difficult, especially in the high-sparsity regime.
To remedy the problem, we present a simple solution: we allow pruning MHA and FFN layers explicitly, along with the fine-grained units shown in §2.3, by introducing two additional masks z_MHA and z_FFN for each layer. The multi-head self-attention and feed-forward layers then become:

MHA(X) = z_MHA · Σ_{i=1}^{N_h} z_head^(i) Att(W_Q^(i), W_K^(i), W_V^(i), W_O^(i), X),
FFN(X) = z_FFN · gelu(X·W_U)·diag(z_int)·W_D.

With these layer masks, we can explicitly prune an entire layer, instead of pruning all the heads in one MHA layer (or all the intermediate dimensions in one FFN layer). Different from the layer dropping strategies of Fan et al. (2020) and Sajjad et al. (2020), we drop MHA and FFN layers separately, instead of pruning them as a whole. Furthermore, we also consider pruning the output dimensions of MHA(X) and FFN(X), referred to as 'hidden dimensions' in this paper, to allow for more flexibility in the final model structure. We define a set of masks z_hidn ∈ {0, 1}^d, shared across layers, because each dimension of a hidden representation is connected to the same dimension in the next layer through a residual connection. These mask variables are applied to all the weight matrices in the model, e.g., diag(z_hidn)·W_Q. Empirically, we find that only a small number of hidden dimensions are pruned (e.g., 768 → 760), but this still improves performance significantly (§4.3).
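To make the composition of the five mask types concrete, here is an illustrative numpy sketch of one masked Transformer block, following our reading of the equations above (the array names and the random stand-ins for Att(·) and the FFN hidden states are ours, not CoFi's implementation):

```python
import numpy as np

# Illustrative composition of CoFi's five masks in one Transformer block.
# head_out stands in for the per-head attention outputs Att_i(X).
rng = np.random.default_rng(1)
n_heads, seq, d, d_f = 12, 4, 768, 3072

head_out = rng.standard_normal((n_heads, seq, d))   # stand-in for Att_i(X)
ffn_hidden = rng.standard_normal((seq, d_f))        # stand-in for gelu(X W_U)
W_D = rng.standard_normal((d_f, d))

z_MHA, z_FFN = 1.0, 0.0                  # layer-level masks
z_head = rng.random(n_heads) > 0.5       # per-head masks
z_int = rng.random(d_f) > 0.9            # intermediate-dimension masks
z_hidn = rng.random(d) > 0.02            # shared hidden-dimension mask

# MHA(X) = z_MHA * sum_i z_head[i] * Att_i(X), with z_hidn on the outputs
mha = z_MHA * np.einsum("h,hsd->sd", z_head.astype(float), head_out) * z_hidn
# FFN(X) = z_FFN * gelu(X W_U) diag(z_int) W_D, z_hidn likewise
ffn = z_FFN * ((ffn_hidden * z_int) @ W_D) * z_hidn

print(mha.shape, ffn.shape)     # (4, 768) (4, 768); ffn is all zeros (layer pruned)
```

Setting z_FFN = 0 zeroes the whole FFN output at once, which is exactly the coarse-grained decision that pruning all d_f intermediate masks individually rarely reaches in practice.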
CoFi differs from previous pruning approaches in that multiple mask variables jointly control the pruning decision of a single parameter. For example, a weight in an FFN layer is pruned when the entire FFN layer, or its corresponding intermediate dimension, or its corresponding hidden dimension is pruned. In comparison, the recent Block Pruning approach (Lagunas et al., 2021) adopts a hybrid strategy that applies a distinct pruning scheme to MHAs and FFNs separately.
To learn these mask variables, we use l_0 regularization modeled with hard concrete distributions, following Louizos et al. (2018). We also follow Wang et al. (2020b) in replacing the vanilla l_0 objective with a Lagrangian multiplier to better control the desired sparsity of pruned models. We adapt the expected-sparsity function to accommodate pruning masks of different granularity:

ŝ = 1 − (1/M) · ( 4·d_h · Σ_k z_hidn^(k) · Σ_{i=1}^{L} z_MHA^(i) Σ_{j=1}^{N_h} z_head^(i,j) + 2 · Σ_k z_hidn^(k) · Σ_{i=1}^{L} z_FFN^(i) Σ_{j=1}^{d_f} z_int^(i,j) ),

where ŝ is the expected sparsity and M denotes the full model size. All masking variables are learned as real numbers in [0, 1] during training; at inference time, we map masking variables below a threshold to 0 to obtain the final pruned structure, where the threshold is determined by the expected sparsity of each weight matrix (see Appendix B for more details).
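The expected sparsity can be computed directly from the mask values, since a parameter survives only if every mask covering it survives. A sketch under our reading of the text (shapes and the 4·d_h / 2 per-unit parameter counts follow §2; variable names are ours):

```python
import numpy as np

# Expected-sparsity sketch: remaining parameters are products of the
# expected values of all masks that cover them. A head contributes
# 4 * d_h parameters per kept hidden dimension; an FFN intermediate
# dimension contributes 2 parameters per kept hidden dimension.
L, n_heads, d, d_f = 12, 12, 768, 3072
d_h = d // n_heads
M = L * (4 * d * d + 2 * d * d_f)        # full model size, embeddings excluded

rng = np.random.default_rng(2)
z_MHA, z_FFN = rng.random(L), rng.random(L)      # expected mask values in [0, 1]
z_head = rng.random((L, n_heads))
z_int = rng.random((L, d_f))
z_hidn = rng.random(d)

remaining = (4 * d_h * z_hidn.sum() * (z_MHA[:, None] * z_head).sum()
             + 2 * z_hidn.sum() * (z_FFN[:, None] * z_int).sum())
s_hat = 1.0 - remaining / M
print(0.0 <= s_hat <= 1.0)     # True: sparsity is a valid fraction
```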

Distillation to Pruned Models
Previous work has shown that combining distillation with pruning improves performance, where the distillation objective involves only a cross-entropy loss between the pruned student's and the teacher's output probability distributions p_s and p_t (Sanh et al., 2020; Lagunas et al., 2021):

L_pred = D_KL(p_s ∥ p_t).

In addition to prediction-layer distillation, recent works show great benefits from distilling intermediate layers (Sun et al., 2019; Jiao et al., 2020).
In distillation approaches, the architecture of the student model is pre-specified, so it is straightforward to define a layer mapping between the student and teacher models. For example, the 4-layer TinyBERT_4 model distills from the 3rd, 6th, 9th and 12th layers of a 12-layer teacher model. However, distilling intermediate layers during pruning is challenging because the model structure changes throughout training.
We propose a layerwise distillation approach for pruning that makes the best use of signals from the teacher model. Instead of pre-defining a fixed layer mapping, we dynamically search for a layer mapping between the full teacher model and the pruned student model. Specifically, let T denote the set of teacher layers used to distill knowledge into the student model, and let m(i) denote the student layer that distills from teacher layer i. The hidden-layer distillation loss is defined as

L_layer = Σ_{i∈T} MSE(W_layer·H_s^{m(i)}, H_t^i),

where W_layer ∈ R^{d×d} is a linear transformation matrix, initialized as an identity matrix, and H_s^{m(i)}, H_t^i are hidden representations from the m(i)-th student FFN layer and the i-th teacher FFN layer. The layer mapping function m(·) is dynamically determined during training to match each teacher layer to its closest layer in the student model:

m(i) = arg min_j MSE(W_layer·H_s^j, H_t^i).

Calculating the distances between the two sets of layers is highly parallelizable and introduces minimal training overhead. To address layer mismatch, which mostly happens on small datasets, e.g., RTE and MRPC, we add a constraint to only allow matching a teacher layer to a lower student layer than the previously matched student layer. When pruning with larger datasets, layer mismatch rarely happens, showing the benefit of dynamic matching: layers between the student and teacher models are matched in the way that most benefits the pruning process. Finally, we combine layer distillation with prediction-layer distillation:

L_distil = λ·L_pred + (1 − λ)·L_layer,

where λ controls the contribution of each loss.

Training setup In our experiments, sparsity is computed as the number of pruned parameters divided by the full model size (embeddings excluded). Following Wang et al. (2020b) and Lagunas et al. (2021), we first finetune the model with the distillation objective, then continue training with the pruning objective, using a scheduler to linearly increase the sparsity to the target value.
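The dynamic matching step amounts to an argmin over student layers for each teacher layer. A minimal sketch, with random arrays standing in for the hidden states and the W_layer transform at its identity initialization (all names are ours):

```python
import numpy as np

# Minimal sketch of dynamic teacher-student layer matching: each teacher
# layer in T is matched to the student FFN layer whose hidden states are
# closest in MSE after the learned transform W_layer.
rng = np.random.default_rng(3)
seq, d = 16, 64
teacher_T = [rng.standard_normal((seq, d)) for _ in range(4)]   # H_t, i in T
student_H = [rng.standard_normal((seq, d)) for _ in range(6)]   # student FFN layers
W_layer = np.eye(d)                                             # identity init

def mse(a, b):
    return float(((a - b) ** 2).mean())

def match(teacher_T, student_H, W_layer):
    # m(i) = argmin_j MSE(W_layer H_s^j, H_t^i)
    return [min(range(len(student_H)),
                key=lambda j: mse(student_H[j] @ W_layer, H_t))
            for H_t in teacher_T]

m = match(teacher_T, student_H, W_layer)
print(m)   # one student layer index per teacher layer in T
```

The monotonicity constraint described above would be a small post-processing step on this list; it is omitted here for brevity.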
We finetune the pruned model until convergence (see Appendix A for more training details).
We train models with target sparsities of {60%, 70%, 75%, 80%, 85%, 90%, 95%} on each dataset. For all the experiments, we start from the BERT base model and freeze embedding weights, following Sanh et al. (2020). We report results on the development sets of all datasets.
Baselines We compare with MobileBERT (Sun et al., 2020) and AutoTinyBERT in Appendix F, as they are not directly comparable to CoFi. For TinyBERT and DynaBERT, the released models are trained with task-specific augmented data; for a fair comparison, we train these two models with the released code without data augmentation. For Block Pruning, we train models from their released checkpoints on GLUE tasks and use the SQuAD results from the paper.
Speedup evaluation The speedup rate is a primary measurement throughout the paper, as the compression rate does not necessarily reflect the actual improvement in inference latency. We use an unpruned BERT base model as the baseline and evaluate all the models with the same hardware setup, a single NVIDIA V100 GPU, to measure inference speedup. The input sequence length is 128 for GLUE tasks and 384 for SQuAD, and we use a batch size of 128. Note that the results might differ from the original papers, as the evaluation environment differs across platforms.
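The measurement protocol above can be sketched as a small timing harness; here plain callables stand in for the baseline and pruned models (in practice these would be torch modules run under `torch.no_grad()`), and the warm-up count is an assumption:

```python
import time

# Hedged sketch of the speedup measurement: time fixed-shape batches
# through both models after warm-up, and report the latency ratio.
def measure_latency(model, batch, n_warmup=2, n_iters=10):
    for _ in range(n_warmup):          # warm-up (caches, kernel autotuning)
        model(batch)
    start = time.perf_counter()
    for _ in range(n_iters):
        model(batch)
    return (time.perf_counter() - start) / n_iters

baseline = lambda b: sum(x * x for x in b)   # stand-in for BERT base
pruned = lambda b: sum(b)                    # stand-in for the pruned model
batch = list(range(128 * 128))               # batch size 128, sequence length 128

speedup = measure_latency(baseline, batch) / measure_latency(pruned, batch)
print(speedup > 0)
```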

Main Results
Overall performance In Figure 2, we compare the accuracy of CoFi models to other methods in terms of both inference speedup and model size. CoFi delivers more accurate models than distillation and pruning baselines at every speedup level and model size. Block Pruning (Lagunas et al., 2021), a recent method that shows strong performance against TinyBERT_6, is unable to achieve speedups comparable to TinyBERT_4. In contrast, CoFi has the option to prune both layers and heads & intermediate units, and achieves comparable or higher performance than TinyBERT_4 and all the other models. Additionally, DynaBERT performs much worse speed-wise because it is restricted to removing at most half of the MHA and FFN layers.
Comparison with TinyBERT 4 In Table 2, we show that CoFi produces models with over 10× inference speedup and achieves comparable or even better performance than TinyBERT_4. General distillation (GD), which distills information from a large corpus, is essential for training distillation models, especially on small datasets (e.g., TinyBERT_4 w/o GD performs poorly on CoLA, RTE and STS-B). While general distillation can take up to hundreds of GPU hours for training, CoFi trains for at most 20 hours on a task-specific dataset with a single GPU. We argue that pruning approaches trained with distillation objectives, like CoFi, are a more economical and efficient way to obtain compressed models.
We further compare CoFi with TinyBERT_4 under the data augmentation setting in Table 3. As the augmented dataset is not publicly released, we follow the TinyBERT GitHub repository to create our own augmented data. We train CoFi with the same set of augmented data and find that it still outperforms TinyBERT_4 on most datasets. We only conduct experiments with data augmentation on four datasets because training on augmented data is very expensive; for example, training on the augmented dataset for MNLI takes more than 200 GPU hours in total (see Appendix E for more details).

Ablation Study
Pruning units We first conduct an ablation study to investigate how the additional pruning units in CoFi, i.e., MHA layers, FFN layers, and hidden units, affect model performance and inference speedup beyond the standard practice of pruning heads and FFN intermediate dimensions. We show results in Table 4 for models of similar sizes. Removing the option to prune hidden dimensions (z_hidn) leads to a slightly faster model with a performance drop across the board; we find that this variant removes more layers than CoFi and does not reach optimal performance under a given sparsity constraint. In addition, removing the layer masks (z_MHA, z_FFN) brings a significant drop in speedup for highly compressed models (95%, 5M). This result shows that even with the same number of parameters, different model configurations can lead to drastically different speedups. However, it does not affect the lower-sparsity regime (60%, 34M). In short, by placing masking variables at different levels, the optimization procedure is incentivized to prune units accordingly under the sparsity constraint while maximizing model performance.
Distillation objectives We also ablate the distillation objectives to see how each part contributes to the performance of CoFi (Table 5). We first observe that removing distillation entirely leads to a performance drop of 1.9-6.8 points across datasets, showing the necessity of combining pruning and distillation to maintain performance. The proposed hidden-layer distillation objective dynamically matches layers of the teacher model to layers of the student model. We also experiment with a simple alternative, "Fixed Hidden Distillation", which matches each teacher layer to the corresponding layer in the student model; if a layer is already pruned, its distillation objective is not added. We find that fixed hidden distillation underperforms the dynamic layer matching used in CoFi. Interestingly, the dynamic layer matching objective consistently converges to a specific alignment between teacher and student layers. For example, on QNLI the training process dynamically matches layers 3, 6, 9, 12 of the teacher model to layers 1, 2, 4, 9 of the student model.

Structures of Pruned Models
Finally, we study the pruned structures produced by CoFi. We characterize the pruned models of sparsities {60%, 70%, 80%, 90%, 95%} on five datasets, running CoFi three times for each setting. Figure 3 shows the number of remaining heads and intermediate dimensions of the pruned models at different sparsities; we show more layer analysis in Appendix H. Interestingly, we discover common structural patterns in the pruned models. We show example structures in Table 6 for highly compressed models (sparsity = 95%). Although all the models are roughly of
the same size, they present different patterns across datasets, suggesting that a different optimal sub-network exists for each dataset. For instance, the first MHA layer is preserved on SST-2 and QNLI but can be removed on QQP and SQuAD. We also observe that some layers are particularly important across all datasets: the first and second MHA layers are preserved most of the time, while middle layers are often removed. Generally, the pruned models contain more MHA layers than FFN layers (see Appendix H), which suggests that MHA layers are more important for solving downstream tasks. Similar to Press et al. (2020), we find that although standard Transformer networks interleave FFN and MHA layers, adjacent FFN/MHA layers in our pruned models can lead to better performance.

Related Work
Structured pruning has been widely explored in computer vision, where channel pruning (He et al., 2017; Luo et al., 2017; Liu et al., 2017, 2019c; Molchanov et al., 2019; Guo et al., 2020) is a standard approach for convolutional neural networks. These techniques can be adapted to Transformer-based models, as introduced in §2.3. Unstructured pruning is another major research direction, gaining particular popularity with the Lottery Ticket Hypothesis (Frankle and Carbin, 2019; Zhou et al., 2019; Renda et al., 2020; Chen et al., 2020a). Unstructured pruning produces models with high sparsities (Sanh et al., 2020; Xu et al., 2021; Huang et al., 2021), yet hardly brings actual inference speedups. Developing computing platforms for efficient sparse tensor operations is an active research area: DeepSparse (https://github.com/neuralmagic/deepsparse) is a CPU inference engine that leverages unstructured sparsity for speedups, and Huang et al. (2021) measure the real inference speedup induced by unstructured pruning on Moffett AI's hardware platform ANTOM. We do not directly compare to these methods because the evaluation environments are different. While all the aforementioned methods produce task-specific models through pruning, several works explore upstream pruning, where a large pre-trained model is pruned on the masked language modeling task. Chen et al. (2020a) show that a 70%-sparsity model produced by iterative magnitude pruning retains MLM accuracy. Zafrir et al. (2021) show the potential advantage of upstream unstructured pruning over downstream pruning. We consider applying CoFi to upstream pruning a promising future direction for producing task-agnostic models with flexible structures.
Besides pruning, many other techniques have been explored to speed up inference for Transformer models, including distillation as introduced in §2.2, quantization (Shen et al., 2020; Fan et al., 2021), dynamic inference acceleration, and matrix decomposition (Noach and Goldberg, 2020). We refer readers to Ganesh et al. (2021) for a comprehensive survey.

Conclusion
We propose CoFi, a structured pruning approach that incorporates all levels of pruning for Transformer-based models, including MHA/FFN layers, individual heads, intermediate dimensions, and hidden dimensions. Coupled with a distillation objective tailored to structured pruning, CoFi compresses models into structures rather different from those of standard distillation models, yet still achieves competitive results with more than 10× speedup. We conclude that task-specific structured pruning from large models can be an appealing replacement for distillation in achieving extreme model compression, without resorting to expensive pre-training or data augmentation. Though CoFi can in principle be applied to structured pruning of task-agnostic models, we limit the scope of this work to task-specific pruning due to the complexity of the design choices for upstream pruning. We hope that future research continues this line of work, given that pruning from a large pre-trained model can incur less computation than general distillation and leads to more flexible model structures.

A Reproducibility & Hyperparameters
We report the hyperparameters used in our experiments in Table 7. For the four relatively larger GLUE datasets (MNLI, QNLI, SST-2, and QQP) and SQuAD, we train the model for 20 epochs in total and finetune the finalized sub-network for another 20 epochs. In the first 20 epochs, following Lagunas et al. (2021) and Wang et al. (2020b), we first finetune the model with the distillation objective for 1 epoch, and then start pruning with a linear schedule to reach the target sparsity within 2 epochs. For the four small GLUE datasets, we train the model for 100 epochs in total and finetune for 20 epochs; we finetune with the distillation objective for 4 epochs and prune to the target sparsity within the next 20 epochs. Note that even after the final sparsity is reached, the pruning process keeps searching for better-performing structures in the remaining training epochs. In addition, we find that finetuning the final sub-network is essential for high-sparsity models. Hyperparameters such as λ, batch size, and learning rate do not generally affect performance much.

B Optimization Details
Louizos et al. (2018) propose l_0 regularization for model compression, where the masks are modeled with hard concrete distributions:

u ∼ U(0, 1),
s = sigmoid((log u − log(1 − u) + log α) / β),
s̃ = s · (r − l) + l,
z = min(1, max(0, s̃)),

where U(0, 1) is a uniform distribution on the interval [0, 1]; l < 0 and r > 0 are two constants that stretch the sigmoid output into the interval (l, r); β is a hyperparameter that controls the steepness of the sigmoid function; and log α is the main learnable parameter. We learn the masks by updating the learnable parameters of the distributions from which the masks are sampled in the forward pass.
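A minimal sketch of hard concrete sampling following Louizos et al. (2018); the default constants (β = 2/3, l = −0.1, r = 1.1) are the commonly used values from that paper, and the tiny clipping of u is our numerical guard:

```python
import math, random

# Hard concrete sampling: a stretched, clipped sigmoid of logistic noise;
# log_alpha is the learnable parameter of the distribution.
def hard_concrete_sample(log_alpha, beta=2/3, l=-0.1, r=1.1, u=None):
    if u is None:
        u = min(max(random.random(), 1e-6), 1 - 1e-6)   # u ~ U(0, 1), clipped
    s = 1 / (1 + math.exp(-(math.log(u) - math.log(1 - u) + log_alpha) / beta))
    s_bar = s * (r - l) + l                             # stretch to (l, r)
    return min(1.0, max(0.0, s_bar))                    # clip to [0, 1]

z = [hard_concrete_sample(log_alpha=0.0) for _ in range(5)]
print(all(0.0 <= zi <= 1.0 for zi in z))                # True
```

Stretching past [0, 1] before clipping is what puts nonzero probability mass exactly at 0 and 1, so masks can become exactly zero (pruned) during training.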
In our preliminary experiments, we find that optimizing the vanilla λ·||z||_0 objective with different learning rates and pruning schedules may converge to models of drastically different sizes. Hence, we follow Wang et al. (2020b) and add a Lagrangian term, which imposes an equality constraint ŝ = t by introducing a violation penalty:

L_c = λ_1 · (ŝ − t) + λ_2 · (ŝ − t)²,

where ŝ is the expected model sparsity calculated from z, t is the target sparsity, and λ_1, λ_2 are jointly updated Lagrangian multipliers.
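The penalty term is straightforward; a sketch of our reconstruction (the multiplier values are arbitrary for illustration):

```python
# Lagrangian sparsity penalty (our reconstruction of the constraint term):
# lambda1 and lambda2 are jointly optimized multipliers; the penalty is
# zero exactly when the expected sparsity s_hat hits the target t.
def lagrangian_penalty(s_hat, t, lambda1, lambda2):
    return lambda1 * (s_hat - t) + lambda2 * (s_hat - t) ** 2

print(lagrangian_penalty(0.95, 0.95, 3.0, 5.0))   # 0.0 at the target sparsity
```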

D Data Statistics
We show train sizes and metrics for each dataset we use in Table 8.

E TinyBERT 4 w/ Data Augmentation
We conduct task-specific distillation with the script provided by the TinyBERT repository. However, our reproduced results are slightly lower than the results reported by Jiao et al. (2020). The difference may stem from the augmented data or the teacher models; the authors of TinyBERT did not release the augmented dataset, so we run their code to obtain augmented datasets. We compare CoFi and TinyBERT under the same setting, with the same teacher model and the same set of augmented data.

F.1 Comparison to Movement Pruning

We compare CoFi with a state-of-the-art unstructured pruning method, Movement Pruning (Sanh et al., 2020), in Figure 4. As Movement Pruning is trained with prediction-layer (logit) distillation only, we also show results of CoFi trained with the same distillation objective. We observe that CoFi largely outperforms Movement Pruning even without layerwise distillation on MNLI, and is comparable on SQuAD for models with more than 10M parameters. CoFi, as a structured pruning method, is less performant at the most extreme sparsities (around 95% and beyond), as its pruning flexibility is largely restricted by the smallest pruning unit. However, pruned models from CoFi achieve 2-11× inference speedups, whereas Movement Pruning yields no speedup gains.

F.2 Comparison to Block Pruning
In Figure 6, we compare CoFi with Block Pruning while unifying the distillation objective. Even without the layer distillation objective, CoFi still outperforms or is on par with Block Pruning. Block Pruning never achieves a 10× speedup even when the pruned model is of a similar size to CoFi's (SST-2), backing up our argument that pruning layers is the key to high speedups for high-sparsity models.

F.3 More Baselines
We show additional pruning and distillation methods that are not directly comparable to CoFi in Table 10. CoFi still largely outperforms these baselines even though these methods hold an inherent advantage due to a stronger teacher or base model.

G.1 Layer Alignment
We find that the alignment between the layers of the student model and the teacher model shifts during the course of training. Taking SST-2 as an example, as training goes on, the model learns to match layers 7, 9, 10, 11 of the student model to layers 3, 6, 9, 12 of the teacher model. For QQP, the model eventually learns to map student layers 2, 5, 8, 11 to the four teacher layers. The final alignments show that our dynamic layer matching distillation objective can find task-specific alignments and improve performance.

G.2 Ablation on Distillation Objectives
In Table 11, we show ablation studies on adding the dynamic layer distillation on top of prediction distillation across all sparsities. Using the layer distillation loss clearly improves performance at all sparsity rates and on different tasks.

H Layer Analysis

Figure 5 shows the average number of FFN layers and MHA layers in the models pruned by CoFi, across sparsities {60%, 70%, 80%, 90%, 95%}. As sparsity increases, the pruned models become shallower (i.e., they have fewer layers). Furthermore, the pruned models usually retain more MHA layers than FFN layers, which may indicate that MHA layers are more important than FFN layers for solving these downstream tasks.

I RoBERTa Pruning
We show CoFi results with RoBERTa in Figure 7 across sparsities from 60% to 95%. As with BERT, models with 60% of weights pruned maintain the performance of the full model. Pruning from RoBERTa outperforms pruning from BERT at sparsities below 90%, but as sparsity increases further, BERT surpasses RoBERTa. Similar patterns are observed in DynaBERT (Hou et al., 2020).

J Training Time Measurement
We use NVIDIA RTX 2080Ti GPUs to measure the training time of TinyBERT. For TinyBERT's general distillation step, we measure the training time on a small corpus (10.6M tokens) on 4 GPUs and estimate the training time on the original corpus (2,500M tokens) by scaling with the corpus-size ratio. Specifically, one epoch on 10.6M tokens takes 430s on 4 GPUs, and we estimate that three epochs on 2,500M tokens would take 338 GPU hours (or 3.5 days with 4 GPUs).
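The scaling arithmetic above can be checked directly (all numbers come from the text):

```python
# Sanity check of the training-time extrapolation described above.
seconds_per_epoch_small = 430        # one epoch on 10.6M tokens, 4 GPUs
gpus = 4
small_tokens, full_tokens = 10.6e6, 2500e6
epochs = 3

# Wall-clock seconds -> GPU-hours per epoch, scaled by corpus-size ratio.
gpu_hours = (seconds_per_epoch_small * gpus / 3600
             * (full_tokens / small_tokens) * epochs)
print(round(gpu_hours))              # 338 GPU hours, i.e. ~3.5 days on 4 GPUs
```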