Structured Pruning for Efficient Generative Pre-trained Language Models

The increasing sizes of large generative Pre-trained Language Models (PLMs) hinder their deployment in real-world applications. To obtain efficient PLMs, previous studies mostly focus on pruning the attention heads and feed-forward networks (FFNs) of the Transformer. Nevertheless, we find that in generative PLMs, the hidden dimension shared by many other modules (e.g., embedding layer and layer normalization) contains persistent outliers regardless of the network input. In this study, we propose SIMPLE, a new structured pruning framework for generative PLMs that comprehensively investigates all the above compressible components. To identify redundant network structures, we assign learnable masks over compressible components followed by sparse training. Various sizes of PLMs can be flexibly extracted via different thresholds, and are then task-specifically fine-tuned for further improvement. Extensive experiments on language modeling, summarization and machine translation validate the effectiveness of the proposed method. For example, the pruned BART brings 1.51x/6.96x inference speedup on GPU/CPU with 67% size reduction, and can be further combined with quantization for more than 25x compression.


Introduction
Large-scale generative pretrained language models (PLMs) (Radford and Narasimhan, 2018; Brown et al., 2020; Lewis et al., 2020; Raffel et al., 2020) show remarkable performance on various tasks. However, their increasing sizes also lead to expensive memory and computation costs, hindering their deployment in real applications.
Recent attempts (Tao et al., 2022; Frantar et al., 2022; Dettmers et al., 2022; Xiao et al., 2022; Wang et al., 2021) propose to compress generative PLMs by quantization. However, hardware-dependent low-bit kernels need to be specially developed for real inference speedup. Compared to quantization, structured pruning methods prune parts of the model structure without requiring extra operators to achieve inference speedup and run-time memory saving. Recently, Anonymous (2023) show that the feed-forward networks (FFNs) of GPT models can be pruned to smaller widths, and Li et al. (2022) compress BART models by combining layer pruning and model quantization for a higher compression rate. However, these methods consider only limited components for pruning, i.e., the FFNs or Transformer layers, which can be restrictive for various deployment requirements.
In this work, we propose a new structured pruning framework named SIMPLE (Sparsity-Induced Mask learning for Pruning generative pre-trained Language modEls), which offers a wider range of compressible components. Aside from the attention heads and the width of FFNs commonly used for structured pruning of discriminative PLMs, we also propose to prune the hidden dimension, to further push the trade-off between performance and model size. This is motivated by the observation that persistent outliers exist in the hidden dimension of both decoder-only and encoder-decoder generative PLMs. The observation implies that the hidden dimension may be slimmed sharply with only a slight performance drop. Additionally, as the dimension of the hidden state is shared by many modules, e.g., the embedding layer, attention heads, FFN and layer normalization, the model size can be collectively slimmed for further compression.
The crux of pruning lies in ranking the importance of different compressible components. Towards that end, we assign learnable masks over the outputs of all compressible components. These masks are optimized in an end-to-end fashion together with a sparsity-inducing penalty. Unlike conventional pruning criteria based on magnitudes (He et al., 2018) or gradients (Voita et al., 2019), these masks can be mathematically interpreted to prune away components shrinking towards zero during training. Moreover, the learning of masks is one-shot, i.e., one can flexibly obtain various pruned models via different thresholds over the learned mask values. To mitigate the accumulated compression error, a causal distillation objective is proposed to fine-tune the pruned networks for further improvement on downstream tasks.
We verify the efficacy of SIMPLE on various generation tasks (i.e., language modeling, summarization and machine translation) and network architectures (i.e., GPT-2 and BART/mBART). Empirical results show that the proposed SIMPLE outperforms other pruning methods, significantly speeds up inference, and can be combined with quantization for more aggressive compression.
Related Work

Structured Pruning of Transformers
Pruning away unimportant parameters is a widely used approach to compress neural networks. Unlike unstructured pruning, which removes individual weights, structured pruning removes parts of network structures and achieves inference speedup and run-time memory saving without designing extra operators. Two commonly pruned structured components in Transformers are attention heads and FFN neurons (Michel et al., 2019; Hou et al., 2020; McCarley, 2019; Anonymous, 2023). Depending on how the importance of the compressible components is determined, current structured pruning methods for Transformers can be divided into two categories. The first category uses heuristics to calculate the importance. For instance, the magnitude-based method (He et al., 2018) uses a component's magnitude as its importance, while the loss-aware method (Voita et al., 2019) measures the importance of a component by the variation in the training loss if it is removed (Hou et al., 2020; Michel et al., 2019). The other category considers the changes in a component's weights during training, and the importance is not determined until training is finished. For instance, movement pruning (Sanh et al., 2020; Lagunas et al., 2021) captures the changes in the weights during fine-tuning as a dynamic importance metric and prunes weights that shrink during fine-tuning.

Compression of Generative PLMs
In contrast to the extensive research on compressing discriminative PLMs like BERT (Sanh et al., 2019; Lan et al., 2019; Jiao et al., 2020), the study of compressing generative PLMs is still at an early stage. Early attempts compress generative PLMs by tensor decomposition (Edalati et al., 2021) and knowledge distillation (Song et al., 2020), but suffer from severe performance drops. Some recent attempts apply weight quantization (Tao et al., 2022; Frantar et al., 2022; Dettmers et al., 2022; Xiao et al., 2022) to reduce the storage and run-time memory of generative PLMs, but require hardware-dependent low-bit quantization kernels to achieve inference speedup. Recently, DQ-BART (Li et al., 2022) combines weight quantization and layer pruning to compress BART models on abstractive summarization tasks. GUM (Anonymous, 2023) prunes the neurons in the feed-forward network of GPT models based on importance measured by uniqueness and sensitivity. In addition to these neurons, our proposed SIMPLE also prunes the attention heads and the hidden state dimension, to better explore the trade-off between performance and model size.

Methodology
In this section, we elaborate on the proposed SIMPLE for structurally pruning generative PLMs. In Section 3.1, we include the hidden dimension as a compressible component in addition to the attention heads and FFN neurons, due to the existence of persistent outliers. Then, we introduce the sparse learning of masks together with its mathematical interpretation in Section 3.2. The structurally pruned models are followed by task-specific fine-tuning, as detailed in Section 3.3. The overall framework of SIMPLE is depicted in Figure 1.

Preliminary on Pruning PLMs
A standard Transformer layer has a multi-head attention (MHA) layer and a feed-forward network (FFN). Previous works (McCarley, 2019; Hou et al., 2020) show that the width of a Transformer layer can be reduced by pruning the heads of MHA and the neurons in the intermediate layer of FFN.
Specifically, suppose the input of MHA is X ∈ R^{n×d}, where n and d are the sequence length and hidden state size, respectively. The computation of the MHA can be reformulated as the summation of all N_H attention heads (Hou et al., 2020). For the h-th head, denote its attention computation (including its projection matrices) as Att_h(X). Suppose the mask of attention heads is m^A_l ∈ R^{N_H} for the l-th layer; the output of the MHA is:

MHAttn(X) = Σ_{h=1}^{N_H} (m^A_l)_h · Att_h(X).   (1)

We use m^A = {m^A_l}_{l=1}^L to denote the masks of MHA over all L Transformer layers.
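As a toy illustration of Eq. (1), the sketch below (in plain Python, with hypothetical per-head output vectors standing in for the actual attention computation Att_h) shows how scalar head masks weight and sum the per-head outputs; a head whose mask is zero contributes nothing and is effectively pruned:

```python
# Toy sketch of Eq. (1): the MHA output as a mask-weighted sum of per-head
# outputs. The per-head vectors below are illustrative stand-ins for Att_h(X).

def mha_with_head_masks(head_outputs, head_masks):
    """head_outputs: list of N_H per-head output vectors (each of length d);
    head_masks: list of N_H scalar mask values (m^A_l)_h."""
    d = len(head_outputs[0])
    out = [0.0] * d
    for h_out, m in zip(head_outputs, head_masks):
        for i in range(d):
            out[i] += m * h_out[i]  # a head with mask 0 contributes nothing
    return out

# Two heads, d = 3; pruning head 1 (mask 0) leaves only head 0's output.
heads = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
print(mha_with_head_masks(heads, [1.0, 0.0]))  # -> [1.0, 2.0, 3.0]
```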
For FFN, denote d_ff as the number of neurons in the intermediate layer of FFN, and the weights and biases of the two linear layers as W_1 ∈ R^{d×d_ff}, b_1 ∈ R^{d_ff} and W_2 ∈ R^{d_ff×d}, b_2 ∈ R^d. The output of FFN can also be reformulated as the summation of the computations of the d_ff neurons. With a slight abuse of notation, we still use X ∈ R^{n×d} to denote the input of FFN. Suppose the mask of these neurons is m^F_l ∈ R^{d_ff} for the l-th Transformer layer; the output of the FFN is computed as:

FFN(X) = Σ_{j=1}^{d_ff} (m^F_l)_j · σ(X W_1[:, j] + (b_1)_j) W_2[j, :] + b_2.   (2)

We use m^F = {m^F_l}_{l=1}^L to denote the masks of FFNs over all L Transformer layers.
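Eq. (2)'s per-neuron decomposition of the FFN can be sketched the same way; the function below is an illustrative plain-Python version (ReLU is used as the activation here, and the tiny hand-picked weights are only for demonstration):

```python
# Toy sketch of Eq. (2): the FFN output as a mask-weighted sum over the d_ff
# intermediate neurons. Zeroing neuron j's mask removes its contribution.

def relu(v):
    return max(0.0, v)

def ffn_with_neuron_masks(x, W1, b1, W2, b2, neuron_masks):
    """x: input vector (length d); W1: d x d_ff; W2: d_ff x d;
    neuron_masks: d_ff scalar mask values (m^F_l)_j."""
    d, d_ff = len(x), len(b1)
    out = list(b2)
    for j in range(d_ff):
        a_j = relu(sum(x[i] * W1[i][j] for i in range(d)) + b1[j])
        for k in range(d):
            out[k] += neuron_masks[j] * a_j * W2[j][k]  # masked neuron j
    return out

x = [2.0, 3.0]
W1 = [[1.0, 0.0], [0.0, 1.0]]; b1 = [0.0, 0.0]
W2 = [[1.0, 1.0], [1.0, -1.0]]; b2 = [0.0, 0.0]
print(ffn_with_neuron_masks(x, W1, b1, W2, b2, [1.0, 1.0]))  # -> [5.0, -1.0]
print(ffn_with_neuron_masks(x, W1, b1, W2, b2, [1.0, 0.0]))  # -> [2.0, 2.0]
```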

Persistent Outliers and Hidden Dimension Pruning
In McCarley (2019), it is shown that for discriminative BERT-like models, the cross-layer coupling caused by skip connections makes the hidden dimension d far more difficult to prune than the attention heads or the FFN neurons.
However, as shown in Figure 2 and Figure 7 in Appendix B.3, for both decoder-only and encoder-decoder generative PLMs, large-magnitude outliers appear only in a small fraction of the hidden dimensions. Moreover, if one dimension has an outlier, it almost persistently appears for all tokens, i.e., the variance over all tokens at a particular index is often smaller than the variance among different indices. Similar observations are also found in (Xiao et al., 2022; Dettmers et al., 2022), and are used to guide weight quantization, i.e., assigning the data in outlier dimensions a high bit-width. If the magnitude is used as the importance metric (i.e., magnitude-based pruning methods) (Gordon et al., 2020; Shen et al., 2022), the high sparsity of these outliers indicates that a large fraction of the hidden dimensions can be pruned away.
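As an illustration of how such persistent outlier dimensions might be identified, the sketch below flags dimensions whose mean magnitude across tokens is far above the median dimension's; the 6x threshold and the function name are illustrative choices, not the paper's:

```python
import statistics

def find_persistent_outliers(acts, scale=6.0):
    """acts: hidden states for n tokens, as an n x d list of lists.
    A dimension is flagged as a persistent outlier when its mean |magnitude|
    across tokens exceeds `scale` times the median dimension's mean magnitude.
    The 6x scale is an illustrative heuristic."""
    n, d = len(acts), len(acts[0])
    mean_mag = [sum(abs(acts[t][i]) for t in range(n)) / n for i in range(d)]
    med = statistics.median(mean_mag)
    return [i for i in range(d) if mean_mag[i] > scale * med]

# Toy activations: dimension 2 is large for every token (persistent outlier).
acts = [[0.1, 0.2, 8.0, 0.1],
        [0.2, 0.1, 9.0, 0.3],
        [0.1, 0.1, 7.5, 0.2]]
print(find_persistent_outliers(acts))  # -> [2]
```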
Similar to the attention heads and the FFN neurons, we also set a mask on the hidden dimension. Since the hidden dimensions at adjacent residual blocks are connected with an identity skip connection, we use a shared mask m^H ∈ R^d across all Transformer layers. For ease of notation, we use h as a general term denoting the output of MHAttn(X) in Eq. (1), FFN(X) in Eq. (2), the embedding layer and the layer normalization layer. The output ĥ ∈ R^d is then computed as:

ĥ = m^H ⊙ h.   (3)

Sparsity-induced Mask Learning
After setting these masks over the compressible components (i.e., attention heads, FFN neurons and hidden dimension), determining which components to prune away is equivalent to ranking their learned mask values. To reduce the accuracy drop caused by pruning, it is desired to learn sparse masks. In this section, we propose a sparsity-inducing objective to learn the masks. For the pruned student model, we use the cross-entropy loss (Hinton et al., 2015) between the teacher and student models' output logits to guide the learning of these masks. Specifically, for the i-th token t_i, suppose the logits of the student and teacher networks are z^s_i and z^t_i; the logits distillation loss is:

L_logit = Σ_i CE(z^t_i, z^s_i).   (4)

Denote the learnable masks as m = {m^A, m^F, m^H}. Inspired by (Liu et al., 2017; Chavan et al., 2022), we use the ℓ1 regularizer over m to promote sparsity, and the optimization objective is:

L_mask = L_logit + λ ||m||_1,   (5)

where λ > 0 is a penalty parameter that controls the weights of the two loss terms. The learnable mask values are initialized to 1 and are updated with gradient descent during training. After learning, these masks are binarized according to a threshold determined by a given sparsity (Section 3.3). The learning stage is computation-efficient as we only need to learn the masks once; we can then extract sub-networks at any required sparsity by simply binarizing the masks according to the desired threshold.

Figure 2: Boxplots of the magnitudes of the hidden dimensions in Layers 1 and 6 of a 12-layer GPT-2 fine-tuned on the PTB dataset. For each layer, we show both (i) the magnitudes of all 768 dimensions; and (ii) the top-30 magnitudes. The other layers follow similar patterns.
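A minimal sketch of this stage, assuming a toy differentiable task loss in place of the distillation loss and illustrative values for λ and the learning rate: masks start at 1, are updated by (sub)gradient descent on the ℓ1-penalized objective, and are then binarized by a sparsity-determined threshold:

```python
# Toy sketch of sparsity-induced mask learning and one-shot extraction.
# `task_grads` is a stand-in for the gradient of the distillation loss
# w.r.t. each mask; lam, lr and the step count are illustrative values.

def learn_masks(task_grads, n_masks, lam=0.05, lr=0.1, steps=200):
    m = [1.0] * n_masks          # masks are initialized to 1
    for _ in range(steps):
        g = task_grads(m)
        for i in range(n_masks):
            # subgradient of lam * |m_i| added to the task gradient
            sub = lam if m[i] > 0 else (-lam if m[i] < 0 else 0.0)
            m[i] -= lr * (g[i] + sub)
    return m

def binarize(mask_values, sparsity):
    """Zero out the `sparsity` fraction of masks with smallest magnitude."""
    k = int(len(mask_values) * sparsity)
    if k == 0:
        return [1.0] * len(mask_values)
    thresh = sorted(abs(v) for v in mask_values)[k - 1]
    return [0.0 if abs(v) <= thresh else 1.0 for v in mask_values]

# Mask 0 is "useful" (task gradient pulls it toward 1), mask 1 is redundant
# (zero task gradient), so the l1 penalty shrinks mask 1 toward 0.
learned = learn_masks(lambda m: [2 * (m[0] - 1.0), 0.0], n_masks=2)
print(binarize(learned, sparsity=0.5))  # -> [1.0, 0.0]
```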
Method Interpretation. Intuitively, consider one attention head with output ĥ = m·h, where m is its mask. The gradient of L_mask with respect to m can be computed as

∂L_mask/∂m = Σ_i (∂L_logit/∂ĥ_i)·h_i + λ·sign(m).

For m > 0, the magnitude of m is increasing when ∂L_mask/∂m < 0; while for m < 0, the magnitude of m is increasing when ∂L_mask/∂m > 0. Thus the magnitude of m increases if

m·Σ_i (∂L_logit/∂ĥ_i)·h_i < −λ|m|,   (6)

which means that ∂L_logit/∂h_i · h_i < 0 dominates in the term Σ_i (∂L_logit/∂h_i)·h_i. This covers two cases: (i) ∂L_logit/∂h_i < 0 and h_i > 0; or (ii) ∂L_logit/∂h_i > 0 and h_i < 0. Under gradient descent, these two cases mean h_i is increasing while being positive, or decreasing while being negative. Thus (6) is equivalent to saying that m is increasing when most entries in the output h are moving away from 0. Conversely, m is decreasing when most entries in h are shrinking towards 0. A similar analysis also holds for the masks m^F for the neurons in the intermediate layer of FFN and m^H for the hidden dimension. In addition, (6) also shows that when m gets larger, Σ_i (∂L_logit/∂h_i)·h_i also needs to be further away from 0 to make m even larger.
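The gradient identity underlying this interpretation (for ĥ = m·h, the task-loss gradient w.r.t. m is Σ_i (∂L/∂ĥ_i)·h_i) can be checked numerically; the quadratic loss below is only a toy stand-in:

```python
# Finite-difference check of the gradient used in the interpretation above:
# with h_hat = m * h, dL/dm = sum_i (dL/dh_hat_i) * h_i.

def loss(m, h, target):
    # Toy loss: L = sum_i (m * h_i - target_i)^2, so dL/dh_hat_i = 2*(m*h_i - t_i)
    return sum((m * hi - ti) ** 2 for hi, ti in zip(h, target))

h, target, m = [1.0, -2.0, 0.5], [0.5, 0.5, 0.5], 0.8

analytic = sum(2 * (m * hi - ti) * hi for hi, ti in zip(h, target))

eps = 1e-6
numeric = (loss(m + eps, h, target) - loss(m - eps, h, target)) / (2 * eps)
print(analytic, numeric)  # the two agree up to finite-difference error
```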
As can be seen from (5), the gradient w.r.t. m keeps track of the output h during the learning stage. This is similar to movement pruning (Sanh et al., 2020), which considers the changes in the weights during fine-tuning. The difference lies in two aspects. 1) Movement pruning assigns high importance to weights of high magnitude, while SIMPLE considers the output h of each module, which is directly connected to the final output. 2) Instead of learning a specific sparse network with a given sparsity as movement pruning does, SIMPLE learns the importance of each compressible component in a separate mask learning stage; therefore numerous sub-networks can be obtained for the fine-tuning stage, given different pruning configurations.

Fine-tuning
After learning the importance metric during the first stage, we fine-tune the model to the required sparsity with the masks fixed.Since generative models compute tokens in left-to-right order, the compression error incurred in previous tokens will pass on to future tokens, making the learning signal noisier over time.
To alleviate this problem, we design a novel causal loss to guide the fine-tuning. Specifically, denote the input sequence of the self-attention layer in an auto-regressive Transformer as {x_i}_{i=1}^n. At the i-th time step, given the query q_i = W^Q x_i, the model first accesses the current key memory K_i = [W^K x_1; ...; W^K x_i] and generates the output y_i by retrieving a convex combination of the value memory slots V_i = [W^V x_1; ...; W^V x_i]:

y_i = softmax(q_i K_i^⊤ / √d) V_i.

Therefore, the historical key and value memory slots K_{i−1}, V_{i−1} affect the generation of the current token. Thus we propose a causal loss L_causal to align the distributions of the keys and values in the teacher and student models. With a slight abuse of notation, for the h-th head, denote K^s_h, V^s_h and K^t_h, V^t_h as the key and value memory slots over all tokens of the pruned student and full teacher models, respectively. The causal distillation loss for each Transformer layer is computed over the remaining heads as:

L_causal = Σ_h [MSE(K^s_h, K^t_h) + MSE(V^s_h, V^t_h)],

where MSE(·) is the mean squared error.
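A sketch of the causal distillation term, with the per-head key/value slots represented as flattened lists; this data layout (dicts keyed by head index) is an illustrative simplification:

```python
# Sketch of the per-layer causal distillation loss: MSE between student and
# teacher key/value memory slots, summed over the heads that survive pruning.

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def causal_loss(keys_s, vals_s, keys_t, vals_t, kept_heads):
    """keys_*/vals_*: dict mapping head index -> flattened K or V slots."""
    loss = 0.0
    for h in kept_heads:
        loss += mse(keys_s[h], keys_t[h]) + mse(vals_s[h], vals_t[h])
    return loss

K_s = {0: [1.0, 2.0], 1: [9.0, 9.0]}
K_t = {0: [1.0, 2.0], 1: [0.0, 0.0]}
V_s = {0: [0.5, 0.5], 1: [1.0, 1.0]}
V_t = {0: [0.5, 1.5], 1: [0.0, 0.0]}
# Only head 0 survives pruning, so head 1's mismatch does not contribute.
print(causal_loss(K_s, V_s, K_t, V_t, kept_heads=[0]))  # -> 0.5
```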
Besides the causal distillation loss, we also adopt the conventional logits distillation loss in (4) and the hidden state distillation loss L_hidden (Jiao et al., 2019). The overall objective during fine-tuning is:

L_ft = L_logit + λ_1 L_causal + λ_2 L_hidden.

To make the pruning process smooth, progressive pruning is applied during fine-tuning. The sparsity is linearly increased to the target value in the first half of the training steps and then fixed. Finally, we remark that SIMPLE can be easily extended to other pruning granularities such as block pruning and unstructured pruning by simply adjusting the corresponding masks (Appendix B.1).
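The progressive sparsity schedule described above (a linear ramp to the target over the first half of training, then held fixed) can be written directly:

```python
def sparsity_at_step(step, total_steps, target_sparsity):
    """Linearly ramp the sparsity to the target over the first half of
    training, then hold it fixed (the progressive pruning schedule)."""
    ramp_steps = total_steps // 2
    if step >= ramp_steps:
        return target_sparsity
    return target_sparsity * step / ramp_steps

print([round(sparsity_at_step(s, 100, 0.67), 3) for s in (0, 25, 50, 100)])
# -> [0.0, 0.335, 0.67, 0.67]
```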

Setup
Tasks and Models. We evaluate the effectiveness of the proposed method on three types of generative tasks, i.e., language modeling, abstractive summarization and machine translation, with two types of network architectures, i.e., the decoder-only model GPT-2 and the encoder-decoder model BART. More datasets and details can be found in Appendix A.1 and Appendix A.3, respectively.

Implementation Details. Note that once the masks are learned, the network can be pruned to any desired sparsity level. We empirically evaluate the method by reporting results at three different sparsity levels. For all compressible components, we set the compression ratio as 1.2x, 1.5x, and 2x, respectively. For each compression ratio r, we retain N_H/r attention heads and d_ff/r neurons in the width of FFN in each Transformer layer, and reduce the hidden dimension to d/r. These compression ratios result in 26%, 48%, and 67% total parameter reduction, respectively. We replace the original GeLU activation with ReLU, which speeds up inference with comparable performance.
Compared Methods. To comprehensively evaluate SIMPLE across different tasks and network architectures, given the lack of baselines, we re-implement three state-of-the-art pruning techniques: magnitude-based pruning (He et al., 2018), loss-aware pruning (Voita et al., 2019), and movement pruning (Sanh et al., 2020). For a fair comparison, we keep the same experimental setting, including the distillation loss, progressive pruning, and hidden dimension pruning, where the importance of each hidden dimension is averaged over all Transformer layers. We also compare with existing records when their tasks and models are aligned with ours. For instance, GUM (Anonymous, 2023) prunes the FFNs in GPT models on language modeling, and DQ-BART (Li et al., 2022) combines layer pruning and quantization for abstractive summarization.

Language Modeling
We perform language modeling on WikiTxt2 (Merity et al., 2016), PTB (Mikolov and Zweig, 2012) and WikiTxt103 (Merity et al., 2016). This task predicts the probability distribution of a sequence of tokens. Perplexity (PPL) is used to evaluate performance. The results under three different sparsity levels are shown in Table 1. For all pruning methods and datasets, the performance drops as the sparsity level increases. Among the three comparison methods, loss-aware pruning performs better than magnitude-based and movement pruning when the compression ratio is relatively small. However, when the compression becomes more aggressive, its performance becomes similar to or even worse than that of the other two methods. In contrast, our proposed SIMPLE consistently outperforms all three baseline pruning methods at each of the three sparsity levels.
In Table 2, we compare our proposed SIMPLE method with GUM, the latest work that considers pruning the width of FFN in GPT-like models.
GUM does not prune other modules like the attention heads or the hidden dimension, and its reduction in model size is only 22.7% even when half of the FFN neurons are pruned away. For a fair comparison, we also prune away half of the FFN neurons.

As shown, the proposed SIMPLE improves the language modeling performance over GUM by a significant margin. In addition, also considering the attention heads and the hidden state dimension as compressible components enables SIMPLE to achieve a better trade-off between model size and performance (refer to Section 4.3.1).

Abstractive Summarization
In our experiments, we use the XSum (Narayan et al., 2018) and CNN/DailyMail (See et al., 2017) datasets to evaluate the summarization performance of the pruned BART model. The ROUGE-1, ROUGE-2 and ROUGE-L metrics are used for evaluation. The results are presented in Table 3.
As can be seen, our proposed SIMPLE achieves the best performance on both datasets under all compared sparsity levels. The performance gain over the other pruning baselines becomes more pronounced as the sparsity level increases. In particular, SIMPLE removes 68% of the parameters with less than a 2.5-point ROUGE-1 drop on the XSum dataset. This demonstrates the effectiveness and efficiency of our proposed pruning method for abstractive summarization tasks.
Comparison with DQ-BART (Li et al., 2022). DQ-BART applies weight quantization and layer pruning to compress the BART model. To compare with it, we also apply quantization to the pruned model. Specifically, we first prune 48% of the parameters from the BART model with our proposed method, and then quantize the pruned model with the same quantization method as DQ-BART. Table 4 shows the comparison between our proposed method and DQ-BART in terms of the performance of the quantized models. As can be seen, our method outperforms DQ-BART under both 8-bit and 2-bit quantization settings by a significant margin, under similar model sizes. This indicates that the models pruned by SIMPLE are also quantization-friendly. In particular, the ROUGE-1 score decreases by only 3 points with over 25x model size reduction.

Machine Translation
The WMT16 En-Ro dataset (Bojar et al., 2016), for machine translation from English to Romanian, is used to evaluate the performance of the model pruned from a 24-layer mBART model (Liu et al., 2020). BLEU is used as the evaluation metric. The results are shown in Table 5. Our proposed SIMPLE preserves good translation performance at different sparsity levels, demonstrating its effectiveness in compressing and accelerating Transformer-based networks for machine translation tasks.

Pruning the Hidden Dimension
In Figure 4(a), we study the effect of pruning the hidden dimension (Hid) on GPT-2 using the PTB dataset. The mask learning stage is shared by all these pruned sub-networks. Under the same sparsity level, pruning that considers the hidden dimension achieves lower perplexities. As the compression ratio increases, the performance gain becomes even more pronounced. This indicates that the hidden dimension of generative PLMs contains a non-negligible amount of redundancy. Only by jointly pruning the hidden dimension and the other components (attention heads, width of FFN) can the model achieve a much higher compression rate under a similar performance drop.
Note that the importance of the different compressible components only needs to be computed once for our proposed method and for the magnitude-based and loss-aware methods. To demonstrate that our sparsity-induced mask learning offers a good model initialization and importance calculation, we compare the initial perplexity before fine-tuning in Figure 4(b). Movement pruning is not compared, as its metric is evaluated during fine-tuning. As can be seen, the proposed method has the lowest perplexity, since SIMPLE captures the dynamics of activations when evaluating the importance metric. In contrast to movement pruning, which converges to a certain sparsity, SIMPLE provides a good sparsity-agnostic model initialization.

Fine-tuning with Causal Distillation
Table 6 illustrates the effect of the causal loss on fine-tuning the pruned model for language modeling tasks. As can be seen, the proposed causal loss consistently reduces the perplexity across different datasets. In Figure 5, we visualize the mean absolute error (MAE) of the keys and values in the last Transformer layer between the full model and the pruned model on the PTB dataset. The error is computed along the sequence length. The errors of the keys and values both increase with the sequence length, which illustrates that the information missed by the pruned generative model accumulates along the sequence, in line with the auto-regressive nature of GPT. However, by adopting the causal loss, the errors on the keys and values are decreased, which improves the causality in the compressed generative model. It is worth noting that maintaining causality in compressed generative pre-trained language models is a topic that deserves further research.

Inference Speedup
Finally, we study the practical speedup of the pruned models on both CPU and GPU in Figure 3. The batch size is set to 4 and 32 for GPT and BART, respectively. The length of the source text is fixed at 512. We vary the length of the generated text in {1, 8}.
Single-token generation can be used in scenarios like deterministic tasks (Conneau et al., 2018). When the generation length is 1, the speedup of GPT-2 is at most 2.76x/2.72x on GPU/CPU, and the speedup of BART is at most 2.06x/9.92x on GPU/CPU. When the generation length is larger than 1 (e.g., 8), the keys and values of the source text and the generated history are cached, which gradually moves the inference from computation-bound to memory-bound. The proportion of time spent on data movement becomes greater, resulting in a lower speedup than that of single-token generation. To further speed up the inference, one could combine the proposed method with other memory-saving techniques (Dao et al., 2022).

Conclusions
We propose SIMPLE, an efficient method to structurally prune generative pre-trained language models. SIMPLE includes the attention heads, the neurons in the intermediate layer of the feed-forward network, and the hidden dimension as compressible components. By learning the masks of these compressible components through a sparsity-induced objective, pruned models of different sizes can be obtained and further fine-tuned with a causal distillation objective for better performance. Empirical evaluations on different generative tasks, model architectures, and pruning configurations demonstrate the efficacy of the proposed method.

Limitations
Although SIMPLE achieves great performance in terms of size reduction and generation speedup on various generative language tasks, it would be interesting to explore moving the mask learning stage into pre-training. Then one pre-trained model could be adapted to downstream tasks at any required sparsity with a single stage of fine-tuning. In addition, pruning larger generative pre-trained language models during fine-tuning is also worth exploring. In the future, we would like to investigate the generation ability of the compressed models with more pre-training data and larger models.

Ethics Statement
During the pre-training process, the training corpus may contain improper expressions such as violence or discrimination. In addition, the compression of generative pre-trained language models may result in relatively weak sentence generation.
Hence, the trained models are likely to be confronted with the potential risks of large language models mentioned in (Weidinger et al., 2021). With the tools proposed in (Thoppilan et al., 2022), harmful training data can be removed to make the trained model conform to the norms of society. It is also noteworthy that a safety check is necessary before deploying generative language models.

A Implementation Details
A.1 Dataset Splits

The train/val/test splits for the different datasets are shown in Table 7. We adopt the default data splits for these datasets.

A.2 Dataset Description
WikiTxt2 is a compilation of corpus sourced from Wikipedia's verified Good and Featured articles.
Penn Treebank (PTB) is a widely recognized and utilized corpus for evaluating the capability of language modeling.
WikiTxt103 is another compilation of corpus originating from Wikipedia, and contains more data than WikiTxt2.
XSum is used to evaluate abstractive single-document summarization systems. The objective is to generate a concise, one-sentence summary that accurately captures the main idea of the article.
CNN/DailyMail is another dataset for abstractive summarization. The articles to be summarized are from the stories on the CNN and Daily Mail websites.
WMT16 En-Ro is a translation dataset with English as the source language and Romanian as the target language. It is released by the Workshops on Statistical Machine Translation (WMT) (Bojar et al., 2016).

A.3 Hyper-parameters Setting
For all tasks, the coefficients of the mask regularizer are set as 2e-4, 5e-5 and 1e-4 for m^A, m^F and m^H, respectively.
Language Modeling. We use the 12-layer GPT-2 (Radford et al., 2019) as the backbone. The sequence length is set as 512. We initialize the learning rate as 5e-4 and linearly decay it to 0 with the AdamW optimizer (Loshchilov and Hutter, 2017). The overall batch size is 32. λ_1 and λ_2 are both set to 0.001. In addition, a language modeling loss with coefficient 1 is added for WikiTxt103 during fine-tuning, which improves performance.
Summarization. We use the 12-layer BART (Lewis et al., 2020) as the backbone. For the XSum and CNN/DailyMail datasets, the length of the source sequence is set as 512, and the lengths of the target sequences are set as 64 and 128, respectively. Beam search with beam sizes 4 and 6 is used to generate summaries for XSum and CNN/DailyMail, respectively, following the default hyperparameters in BART. Both λ_1 and λ_2 are set to 0.1. The learning rate is initialized as 2e-4 for XSum and 2e-5 for CNN/DailyMail with a linear scheduler. The AdamW optimizer is used with batch size 96.
We set the learning epochs as 3 and fine-tuning epochs as 12 for both XSum and CNN/DailyMail datasets.
Machine Translation. We use the 24-layer multilingual BART (mBART) (Liu et al., 2020) as the backbone, due to the lack of a 12-layer open-sourced pre-trained mBART. For the WMT16 En-Ro dataset, the maximum length of both the source and target sequences is set as 128.
The beam size is set as 5, and λ_1 and λ_2 are set to 0.05 and 0.001, respectively. In addition, a language modeling loss with coefficient 1 is added during fine-tuning, which boosts translation performance. The learning rate is initialized as 5e-5 with a linear scheduler. We run the mask learning stage for 1 epoch and fine-tune for 3 epochs, with batch size 32.

A.4 Training Budget
In stage 1, the masks only need to be learned once, in a relatively short time. For example, it requires only 1.0/2.3 hours for GPT-2 on the PTB/WikiTxt103 dataset with 8 GPUs. This cost can be amortized over N different pruned networks, since all the pruned networks share the masks learned in stage 1. Then, the model is fine-tuned with the masks fixed in stage 2.

A.5 Compression Ratio vs. Total Parameter Reduction
In practice, for simplicity we assign the same compression ratio to all the compressible components: the attention heads, the width of FFN and the hidden dimension. The aforementioned compression ratio is not equal to the ratio of total parameter reduction. For instance, with a compression ratio of 2x, consider a weight matrix M of shape m × (n·d_h): reducing both the hidden dimension (m) and the number of attention heads (n) to m/2 and n/2, respectively, leads to a 75% reduction. On the other hand, for the embedding layer and layer normalization, a compression ratio of 2x leads to a 50% parameter reduction. Overall, a compression ratio of 2x results in approximately 67% total parameter reduction.
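This arithmetic can be checked with a back-of-the-envelope parameter count using standard 12-layer GPT-2 shapes (biases and LayerNorm parameters are ignored for simplicity):

```python
# Back-of-the-envelope check of "2x compression ratio -> ~67% parameter
# reduction" for 12-layer GPT-2 shapes (d=768, d_ff=3072, vocab 50257).
# Weight matrices shrink quadratically (both dims halved -> 75% saved);
# the embedding shrinks only along the hidden dimension (50% saved).

def gpt2_params(d, d_ff, n_layers, vocab):
    attn = 4 * d * d          # Q, K, V and output projections
    ffn = 2 * d * d_ff        # two linear layers (biases/LayerNorm ignored)
    return vocab * d + n_layers * (attn + ffn)

full = gpt2_params(768, 3072, 12, 50257)
pruned = gpt2_params(768 // 2, 3072 // 2, 12, 50257)
reduction = 1 - pruned / full
print(f"{reduction:.0%}")  # roughly 67%
```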

B.1 Extension to Block Pruning and Unstructured Pruning
Extension to Block Pruning. Block pruning (Lagunas et al., 2021) is an extension of common structured pruning, which balances fine-grained model sparsification and dense hardware optimization. In block pruning, the weights are divided into different blocks that can be computed efficiently via appropriate GPU kernels. Pruning algorithms are expected to keep the important blocks and remove the redundant ones.
The proposed SIMPLE can be easily combined with block pruning by adapting the size of the learnable masks according to the block size. Assume the weight matrix is to be divided into multiple blocks with block size (M, N); the shapes of m^H, m^A and m^F are then R^{d/M}, R^{N_H × d/(N_H·N)} and R^{d_ff/N}, respectively. The learnable masks score these blocks and select the important ones.

Extension to Unstructured Pruning. In SIMPLE, the granularity of pruning can be easily controlled by the shape of the mask. For example, the shape of the attention mask m^A can be set as R^{N_H × d} to perform unstructured pruning on the attention module. Here, unstructured pruning does not refer to removing part of the connections in a neuron, but to removing different dimensions in different attention heads, similar to ViT-slim (Chavan et al., 2022). However, this does not directly accelerate model inference. By default, we rank the learned masks in each Transformer layer locally to ensure that the number of attention heads in the multi-head attention (MHA) and the width of the feed-forward network (FFN) are equal in each Transformer layer. This is referred to as uniform structured pruning. If we instead rank the masks globally in each compressible component, then the number of attention heads in the MHA and the width of the FFN vary across Transformer layers, which is referred to as non-uniform structured pruning.
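For concreteness, the mask shapes under block pruning can be computed as below, using illustrative GPT-2-like dimensions (the function name is ours, not the paper's):

```python
def block_mask_shapes(d, n_heads, d_ff, M, N):
    """Shapes of the learnable masks when SIMPLE is extended to block pruning
    with block size (M, N), following the description above."""
    return {
        "m_H": (d // M,),
        "m_A": (n_heads, d // (n_heads * N)),
        "m_F": (d_ff // N,),
    }

# GPT-2-like dimensions: d=768, 12 heads, d_ff=3072, 32x32 blocks.
print(block_mask_shapes(768, 12, 3072, 32, 32))
# -> {'m_H': (24,), 'm_A': (12, 2), 'm_F': (96,)}
```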
In Table 8, we report the performance of different variants of SIMPLE obtained by changing the shape of the mask or the range over which the masks are ranked. The attention modules of the learned subnets are visualized in Figure 6. Interestingly, uniform structured pruning not only performs competitively against non-uniform and unstructured pruning, but also requires no specific hardware design for speedup. Additionally, with non-uniform structured pruning, attention heads in the shallow and deep layers are more likely to be retained, while those in the intermediate layers are considered relatively less important and are discarded. Furthermore, SIMPLE can be extended to block pruning by setting the shape of the trained masks to match the assigned block size. However, it should be noted that while block pruning can improve performance, it may not bring real speedup when attention heads are not entirely removed (Lagunas et al., 2021).

A small model pruned from a large model is expected to outperform a similar-sized unpruned model trained from scratch. To evaluate this, we compare a model pruned from a 24-layer GPT against a similar-sized unpruned 12-layer model in Table 9. As shown, the pruned model achieves lower perplexity than the unpruned pre-trained small model, demonstrating the effectiveness of SIMPLE. Additionally, by pretraining one large model and then compressing it to various sizes during task-specific fine-tuning, we save the effort of pretraining small models of various sizes.

B.3 Visualization of the Persistent Outliers on BART
In Figure 7, we visualize the distribution of the magnitudes of the hidden dimensions of the fine-tuned BART, averaged over the XSum dataset. As shown, a few hidden dimensions have large magnitudes while most hidden dimensions vary only slightly, regardless of the input data samples. Similar to the observation on GPT, this indicates that the hidden dimension of BART has redundancy, which motivates us to prune the hidden dimension at a sharp compression rate.
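A persistent-outlier check of this kind can be sketched as follows. The function name, the MAD-based threshold, and the synthetic data are our own illustration (the paper uses boxplots of the measured magnitudes, not this exact criterion):

```python
import numpy as np

def persistent_outlier_dims(hidden_states, z=10.0):
    """Flag hidden dimensions whose average magnitude is far above the rest.

    hidden_states: (num_samples, d) activations collected over a dataset.
    Returns indices of dimensions whose mean |activation| exceeds the median
    by more than `z` median absolute deviations, i.e. outlier dimensions
    that persist regardless of the input sample.
    """
    mag = np.abs(hidden_states).mean(axis=0)       # per-dimension magnitude
    med = np.median(mag)
    mad = np.median(np.abs(mag - med)) + 1e-12     # robust spread estimate
    return np.where(mag > med + z * mad)[0]

# Synthetic example: 768 dims, with dims 41 and 300 carrying a persistent offset.
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, (1000, 768))
x[:, [41, 300]] += 25.0
print(persistent_outlier_dims(x))  # recovers the planted outlier dimensions
```

Because the offset is added to every sample, the flagged dimensions are the same for any input, mirroring the "persistent outliers" observed on BART and GPT.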

Figure 1: The left part shows the overall framework of the proposed SIMPLE. Stripes of the same color represent the pruning dimension shared by the same mask, i.e., the hidden dimension (blue), the attention heads (yellow) and the intermediate dimension of the FFN layers (red). The right part demonstrates that after sparse learning of the masks, various Transformer sizes can be flexibly extracted using different pruning thresholds on the masks.
d_h = d/N_H. The softmax function is applied to the scaled dot product of queries and keys to obtain the attention scores Attn_h.

Figure 3: Throughput (sequences/second) comparison during generation. We vary the length of the generated text in {1, 8} with a beam size of 6. We use Intel Xeon Gold 6278C CPU and Nvidia V100 GPU devices, respectively.

Figure 4: (a) The effect of pruning the hidden dimension. (b) The initial perplexity before fine-tuning.

Figure 5: Visualization of the Mean Absolute Error (MAE) of the key and value in GPT.

Figure 6: Illustration of different pruning configurations obtained with SIMPLE, from the perspective of MHA: (a)-(f) dimension per attention head under different pruning configurations on the 12-layer GPT-2, which has 12 heads in each attention module with a head dimension of 64 before compression.

Figure 7: Boxplots of the magnitudes of the hidden dimensions in Layers 1 and 6 of a BART fine-tuned on the XSum dataset. For each layer, we show both (i) the magnitudes of all 768 dimensions; and (ii) the top-30 magnitudes. The other layers follow similar patterns.

Table 1: Results of language modeling on the test sets of the WikiText2, PTB and WikiText103 datasets with GPT-2. "↓" denotes the percentage of reduced parameters.

Table 2: Comparison with GUM, which prunes the FFN of GPT-2, on language modeling tasks.

Table 3: Results of abstractive summarization on the test sets of the XSum and CNN/DailyMail datasets with BART.

Table 4: Pruned BART combined with quantization on the test sets of the XSum and CNN/DailyMail datasets, compared with DQ-BART (Li et al., 2022). "W-E-A" denotes the bit-widths for the weights, embedding layer and activations. "E3D1" denotes a pruned subnet with 3 encoder layers and 1 decoder layer. Although DQ-BART uses a stronger full model as the teacher, the proposed SIMPLE outperforms DQ-BART by a clear margin.

Table 6: Ablation study on the causal .

Table 7: Data splits of the different datasets.

Table 8: Ablation of subnet configurations.

Table 9: Comparison between 1) compressing a large fine-tuned model with SIMPLE; and 2) pre-training a small unpruned model and fine-tuning it on the target dataset.