NASH: A Simple Unified Framework of Structured Pruning for Accelerating Encoder-Decoder Language Models

Structured pruning methods have proven effective in reducing the model size and accelerating inference speed in various network architectures such as Transformers. Despite the versatility of encoder-decoder models in numerous NLP tasks, structured pruning methods for such models are relatively less explored than those for encoder-only models. In this study, we investigate the behavior of structured pruning of encoder-decoder models from a decoupled perspective, pruning the encoder and decoder components separately. Our findings highlight two insights: (1) the number of decoder layers is the dominant factor of inference speed, and (2) low sparsity in the pruned encoder network enhances generation quality. Motivated by these findings, we propose a simple and effective framework, NASH, that narrows the encoder and shortens the decoder networks of encoder-decoder models. Extensive experiments on diverse generation and inference tasks validate the effectiveness of our method in both speedup and output quality.


Introduction
In recent years, pre-trained language models (LMs) have demonstrated their effectiveness in various downstream tasks, such as natural language understanding (NLU) and natural language generation (NLG). There have been three main types of LMs, namely encoder-only LMs (Devlin et al., 2019; He et al., 2023), decoder-only LMs (Touvron et al., 2023; OpenAI, 2023), and encoder-decoder LMs (Lewis et al., 2020; Raffel et al., 2020; Chung et al., 2022b; Tay et al., 2023), each aiming at its own area of expertise. Among these types of LMs, we focus on the widely studied and utilized encoder-decoder LMs due to their flexibility in application across a range of tasks (Guo et al., 2022; Wang et al., 2023b).
Beyond performance, the efficiency of LMs (e.g., computational and memory cost) has been intensively studied because of their huge computational requirements. This research direction is called model compression. Among the various model compression techniques (Jiao et al., 2020; Yao et al., 2022), pruning (Frankle and Carbin, 2018; Sanh et al., 2020; Wang et al., 2020c; Xia et al., 2022) is a promising method that aims to remove redundant weights from networks, resulting in improved efficiency by saving storage capacity and enhancing inference speed. Between structured and unstructured pruning approaches, structured pruning is typically preferred in practice because it is easier to deploy on various types of hardware platforms (Han et al., 2016; Gupta and Agrawal, 2020).
Therefore, we focus on structured pruning methods specifically tailored for encoder-decoder LMs. Despite the remarkable advancements in encoder-decoder models, little attention has been given to structured pruning methods for encoder-decoder LMs. This can be attributed to the inherent differences in the components that enhance pruning efficiency between encoder and decoder networks. Consequently, traditional structured pruning methods designed for encoder-only models may not effectively optimize encoder-decoder models. For instance, CoFi (Xia et al., 2022), one of the SoTA encoder-only pruning methods, demonstrates a maximum speedup of 1.53× on the CNN/DailyMail (See et al., 2017) dataset, with a ROUGE-L drop of 7.36%. This gain is considerably lower than the original result achieved with the encoder-only model on QQP, the worst case in GLUE (Wang et al., 2018), where the speedup reaches 11.0× with an accuracy drop of 1.20%. Thus, it becomes crucial to investigate structured pruning methods that are specifically tailored for the encoder and decoder networks.
To this end, in this paper, we pose the following question: How can we design a structured pruning method that effectively accelerates the encoder-decoder model while maintaining its performance? To the best of our knowledge, this study represents the first attempt to address this question. To accomplish this, we conduct systematic studies to examine the impact of structured pruning on the encoder and decoder networks, respectively.
Contribution. In this study, we propose an algorithm, NASH, which is strongly motivated by two findings derived from our preliminary experiments.
(1) The number of decoder layers is the primary factor for inference speedup.
(2) The sparsity of the encoder network is a key factor affecting the output quality of encoder-decoder LMs.
Based on these findings, we propose an algorithm, illustrated in Figure 1, that consists of two parts: the encoder network, which enhances output quality by gradually reducing the width of each layer, and the decoder network, which achieves faster inference speed by uniformly selecting layers to reduce depth.
We empirically evaluate the performance of NASH on various NLG datasets, including standard fine-tuning on a single task (Gliwa et al., 2019; Xiong et al., 2019), multi-task learning scenarios, and recent instruction-tuning datasets (Conover et al., 2023; Wang et al., 2023a). Notably, in our experiments using T5-base, NASH achieves a speedup of 2.5-4.2× while preserving 95% of the output quality. Our experimental results show that NASH can serve as a unified framework that works regardless of task difficulty and model type.

Preliminary
Transformers. We focus on the Transformer network (Vaswani et al., 2017), which consists of the encoder and decoder architecture. The encoder architecture is composed of $L$ blocks, and each block consists of a multi-head attention (MHA) layer and a feed-forward (FFN) layer. An MHA layer in the $i$-th Transformer layer with $N_h$ heads outputs:
$$\mathrm{MHA}^{(i)}(Q, K, V) = \sum_{j=1}^{N_h} \mathrm{Att}\left(Q W_Q^{(i,j)}, K W_K^{(i,j)}, V W_V^{(i,j)}\right) W_O^{(i,j)},$$
where Att represents a dot product attention head, and $Q$, $K$, and $V$ are the input sequences for query, key, and value, respectively. In self-attention layers, all of the keys, values, and queries come from the outputs of the previous layer. On the other hand, in cross-attention layers, the queries come from the previous decoder layer, while the memory keys and values come from the output of the encoder. It is important to note that the $j$-th head is parameterized by $W_Q^{(i,j)}, W_K^{(i,j)}, W_V^{(i,j)} \in \mathbb{R}^{d \times d_h}$ and $W_O^{(i,j)} \in \mathbb{R}^{d_h \times d}$, which represent the query, key, value, and output matrices, respectively. Here, $d$ and $d_h$ denote the hidden state dimension and attention head dimension, respectively.
The output of the MHA layer, denoted as $X$, is then fed into the FFN layer in the $i$-th Transformer layer:
$$\mathrm{FFN}^{(i)}(X) = \sigma(X W_1) W_2,$$
where $\sigma$ is a nonlinear activation function. Here, the two fully-connected layers are parameterized by $W_1 \in \mathbb{R}^{d \times d_f}$ and $W_2 \in \mathbb{R}^{d_f \times d}$, with $d_f$ representing the dimension of the FFN layer.
Structured Pruning. Structured pruning gradually removes unnecessary parameters from a model, targeting width-related components (e.g., MHA heads, FFN hidden units) and depth-related elements (e.g., Transformer layers) during training. Recent advancements have demonstrated significant speedups with minimal output quality reduction. For example, block pruning (Lagunas et al., 2021) and CoFi (Xia et al., 2022) have enhanced flexibility and optimization, and have enabled simultaneous pruning at multiple levels.
Pruning the components of the $i$-th layer related to MHA can be formulated as follows:
$$\mathrm{MHA}^{(i)}(Q, K, V) = z_{\mathrm{MHA}}^{(i)} \sum_{j=1}^{N_h} z_{\mathrm{head}}^{(i,j)} \, \mathrm{Att}\left(Q W_Q^{(i,j)}, K W_K^{(i,j)}, V W_V^{(i,j)}\right) W_O^{(i,j)},$$
where $z_{\mathrm{MHA}}^{(i)}, z_{\mathrm{head}}^{(i,j)} \in \{0, 1\}$ are mask variables for the MHA layer and for the individual heads of MHA, respectively.
The FFN layer, which is another major component of the Transformer network, is also known to be over-parameterized (Dong et al., 2021).
Various techniques have been employed to learn these mask variables used in structured pruning. For example, Wang et al. (2020c) and Xia et al. (2022) utilized L0 regularization to eliminate redundant parameters. On the other hand, Lagunas et al. (2021) adopted the movement score introduced by Sanh et al. (2020) as a measurement for their pruning approach.

Figure 2: Comparing model size (or speedup) vs. output performance for four pruning options: with or without depth pruning applied to either the encoder or decoder network individually. The results emphasize that (1) the number of layers in the decoder network is the primary factor contributing to speedup improvements, and (2) the sparsity of the encoder network is the key factor of output quality.

Experimental Motivations
In this section, we separately investigate the behavior of the encoder and decoder when depth pruning is applied or not, using CoFi-T5, the modified version of CoFi (Xia et al., 2022) tailored for T5 (Raffel et al., 2020). Particularly, in Figure 2, we study the results of four cases: encoder with depth-pruning (△), encoder without depth-pruning (⃝), decoder with depth-pruning (△), and decoder without depth-pruning (⃝). From these four types of cases, we aim to address the following questions: (1) Does depth pruning exhibit different phenomena in each case? (2) What is the key factor for accelerating inference speed while preserving sufficient output quality? We provide detailed answers to each question by training T5-Base with target sparsities of {60%, 70%, 80%, 90%, 95%} for the decoder cases, and {20%, 40%, 60%, 70%, 80%, 90%, 95%} for the encoder cases.
Before delving into the detailed answers, we briefly address the first question: the impact of depth pruning when applied to the encoder and decoder, respectively. As depicted in Figure 2, depth pruning exhibits a significant influence on the decoder (as indicated by △ and ⃝), while the encoder shows negligible effects (as observed in △ and ⃝). Consequently, the appropriate utilization of depth pruning becomes crucial. In the following paragraphs, we outline our key findings related to the second question to establish an effective structured pruning mechanism for encoder-decoder LMs.
Finding 3.1. The number of layers in the decoder network is the dominant factor affecting the inference speed, while the decoder width does not have a significant impact.
We evaluate the findings regarding the decoder network from two perspectives: (1) the effectiveness of layer-wise pruning and (2) the ineffectiveness of width pruning. Firstly, as demonstrated in the second and fourth plots of Figure 2, the decoder exhibits a significant speedup with minor degradation (when comparing △ and ⃝), whereas the encoder shows no such effect (when comparing △ and ⃝). This indicates that layer-wise pruning plays a dominant role in pruning the decoder. On the other hand, when comparing the model size and speedup (as shown in the first and second plots of Figure 2 with ⃝), width pruning reduces the model size but leads to both performance degradation and negligible speedup. This suggests that width pruning is not effective for the decoder.
To further examine Finding 3.1, we analyze the inference speed of Transformer layers, with a specific focus on understanding why width pruning is ineffective. This analysis involves two key observations: (1) finding the metric that synergizes with width pruning, and (2) identifying the component that predominantly consumes computational resources. According to Figure 3, width pruning can have a significant impact on the computational cost as the sequence length increases. However, due to the inherent nature of the autoregressive decoding process, the decoder network is constrained to a sequence length of 1. As a result, width pruning cannot effectively improve the speed of the decoder network. Furthermore, as illustrated in Figure 4, Layer Normalization (LN) and dropout (DO) collectively contribute approximately 20-25% to the overall inference time. Because the time spent on these operations stays constant regardless of the layer width, width pruning yields diminished gains in inference speed. In conclusion, width pruning is not an appropriate approach for optimizing the decoder.
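The sequence-length argument above can be illustrated with a rough operation count. The sketch below is not taken from the paper: the function names are hypothetical, the layer sizes are assumed T5-Base-like values (hidden size 768, 12 heads of size 64, FFN size 3072), and the count ignores cross-attention, cached keys and values, LN, and dropout. It only shows that the compute removed by halving the width is small at a sequence length of 1 and grows quickly with longer sequences.

```python
def block_macs(seq_len: int, d_model: int, n_heads: int, d_head: int, d_ff: int) -> int:
    """Approximate multiply-accumulates of one self-attention + FFN block."""
    d_attn = n_heads * d_head
    projections = 4 * seq_len * d_model * d_attn  # Q, K, V, and output projections
    attention = 2 * seq_len * seq_len * d_attn    # QK^T scores and weighted sum of V
    ffn = 2 * seq_len * d_model * d_ff            # two fully-connected layers
    return projections + attention + ffn


def macs_saved_by_halving_width(seq_len: int) -> int:
    """MACs removed when half of the heads and half of the FFN units are pruned."""
    # Assumed T5-Base-like sizes (illustrative, not measured in the paper).
    full = block_macs(seq_len, d_model=768, n_heads=12, d_head=64, d_ff=3072)
    half = block_macs(seq_len, d_model=768, n_heads=6, d_head=64, d_ff=1536)
    return full - half


if __name__ == "__main__":
    for n in (1, 128, 512):
        print(f"seq_len={n:4d}  MACs saved by halving width: {macs_saved_by_halving_width(n)}")
```

Under this simplified count, the work removed per decoding step is a small absolute amount, so fixed per-layer costs are left to dominate, whereas removing a whole layer removes those fixed costs as well.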
Finding 3.2. From the perspective of encoder pruning, while achieving high-level sparsity may not be desirable, attaining low-level sparsity not only slightly accelerates inference speed but also enhances performance.
By comparing the ⃝ points and ⋆ in the second and fourth plots of Figure 2, we observe that encoder pruning yields a slight speedup along with improved performance. However, when the encoder network is heavily pruned, it experiences significant performance degradation. These findings emphasize the significance of considering pruning in both the decoder and encoder networks. Furthermore, they provide insights into the necessity of employing distinct pruning strategies for these two networks, considering their unique characteristics.
Comparison with Prior Observations. Our key findings provide valuable insights: the appropriate strategy for encoder-decoder models involves using a small number of layers for the decoder and minimal pruning for the encoder networks. Importantly, our observations offer a more generalized understanding compared to previous works (Kasai et al., 2020; Tay et al., 2021). Unlike prior studies that manually determined model configurations for specific tasks such as machine translation (Kasai et al., 2020) or NLU (Tay et al., 2021), our conclusions are derived automatically through gradual structured pruning and have been validated across both NLG and NLU tasks. Furthermore, while the DeepNarrow strategy proposed by Tay et al. (2021) demonstrates effectiveness in NLU tasks with short output sequences, it exhibits computational inefficiency when applied to NLG tasks. Similarly, the contribution of processing time for encoder networks varies, necessitating the use of a narrower encoder architecture contrary to the approach proposed by Kasai et al. (2020).

Narrow Encoder and Shallow Decoder
Based on the findings presented in Section 3, we propose a structured pruning framework called NASH (Narrow encoder and Shallow decoder) that is specifically optimized for encoder-decoder LMs. Our approach focuses on enhancing inference speed by utilizing uniform layer selection in the decoder network, deviating from the gradual pruning technique commonly employed in encoder-only models. Additionally, we improve generation performance by applying gradual L0 regularization pruning specifically to the encoder network, inducing low sparsity instead of solely prioritizing inference speed improvement.

Shallow Decoder: Uniform Layer Selection for Decoder Networks
For a given number of selected layers $d_s$, we generate a sub-network of the decoder network from a set of $d_s$ layers selected uniformly across the depth of the original decoder. We then match the hidden states of the sub-network to those of the corresponding layers in the unpruned decoder network.
While uniformly selecting layers works well in various domains such as knowledge distillation (Jiao et al., 2019; Shleifer and Rush, 2020) or structured pruning of encoder-only models (Hou et al., 2020), our work is the first to propose uniform layer selection of the decoder network for structured pruning of encoder-decoder LMs.
The key philosophy of our proposed module is twofold: (1) As shown in Finding 3.1, the number of layers in the decoder network is the main factor affecting inference speedup. (2) Uniform selection is proven to be an effective approach for selecting layers (Hou et al., 2020). To verify this second statement, we compare various candidates, including uniform selection, selection with lower layers, selection with higher layers, and the L0 regularization-based approach (Louizos et al., 2018). Through our empirical evaluation, we confirm that uniform selection is the best approach among these candidates (see Section 5.3 and Table 3 for details). Based on this philosophy, we construct shallow decoder pruning by selecting the layers using uniform selection.
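As an illustration of uniform layer selection, the following minimal sketch picks evenly spaced decoder layers. The exact index formula used by the authors is not reproduced in this text, so the rule below (0-based indices, first and last layers always kept, nearest-integer spacing in between) is an assumption, and the function name `uniform_layer_selection` is hypothetical.

```python
from typing import List


def uniform_layer_selection(num_layers: int, num_selected: int) -> List[int]:
    """Return 0-based indices of `num_selected` evenly spaced layers.

    Assumed rule (not confirmed by the paper): keep the first and last layers
    and round the evenly spaced positions in between to the nearest integer.
    """
    if not 1 <= num_selected <= num_layers:
        raise ValueError("num_selected must be between 1 and num_layers")
    if num_selected == 1:
        return [num_layers - 1]  # assumption: a single kept layer is the last one
    gaps = num_selected - 1
    # Integer arithmetic for round-half-up of i * (num_layers - 1) / gaps.
    return [(i * (num_layers - 1) + gaps // 2) // gaps for i in range(num_selected)]


if __name__ == "__main__":
    for d_s in (2, 3, 4):
        print(d_s, uniform_layer_selection(12, d_s))
```

In this sketch, the selected layers of the fine-tuned decoder would initialize the shallow decoder, and the same index list would identify which unpruned-layer hidden states each kept layer is matched against.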

Narrow Encoder: Gradual L0-regularization with Low Sparsity
Among various structured pruning methods (Hou et al., 2020; Lagunas et al., 2021), we utilize the L0 regularization-based pruning method, which has shown state-of-the-art performance in encoder-only language models (Wang et al., 2020c; Xia et al., 2022). The application of L0 regularization in practice is achieved by enforcing an equality constraint between the target sparsity and the current sparsity:
$$R = \lambda_1 (\hat{s} - t) + \lambda_2 (\hat{s} - t)^2,$$
where $\lambda_1$, $\lambda_2$, $\hat{s}$, and $t$ denote the learnable Lagrange multipliers, current sparsity, and target sparsity, respectively. The detailed derivation of $R$ is described in Appendix B. The current sparsity, $\hat{s}$, is calculated from the mask variables in terms of $M$, $L$, $N_h$, and $d_f$, which indicate the number of model parameters, encoder layers, attention heads, and feed-forward layer dimensions, respectively. We only conduct the pruning of individual attention heads and intermediate layer dimensions by introducing mask variables for these components. We further use hidden states distillation by matching the hidden states of pruned and unpruned networks at the same layers.
As we demonstrated in Finding 3.2, structured pruning with low sparsity enables output quality enhancement rather than inference speedup gain. Motivated by this finding, unlike previous methods (Wang et al., 2020c; Xia et al., 2022) that mainly use L0 regularization to achieve high inference speedup, we use L0 regularization to improve output quality.
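The sparsity constraint can be sketched in a few lines. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the function names are hypothetical, each attention head is assumed to own four d × d_h matrices and each intermediate FFN unit one column of W_1 and one row of W_2, biases are ignored, and sparsity is taken as the pruned fraction of a parameter count M supplied by the caller.

```python
from typing import Sequence


def encoder_sparsity(
    head_masks: Sequence[Sequence[float]],  # one list of N_h head masks per layer
    ffn_masks: Sequence[Sequence[float]],   # one list of d_f unit masks per layer
    d_model: int,
    d_head: int,
    total_params: int,                      # M: parameter count used as denominator
) -> float:
    """Fraction of parameters removed, given (possibly relaxed) masks in [0, 1]."""
    kept = 0.0
    for layer_heads in head_masks:
        kept += 4 * d_model * d_head * sum(layer_heads)  # Q, K, V, O per kept head
    for layer_units in ffn_masks:
        kept += 2 * d_model * sum(layer_units)           # W_1 column + W_2 row per unit
    return 1.0 - kept / total_params


def lagrangian_penalty(s_hat: float, target: float, lam1: float, lam2: float) -> float:
    """R = lam1 * (s_hat - t) + lam2 * (s_hat - t)^2; zero when the target is met."""
    gap = s_hat - target
    return lam1 * gap + lam2 * gap * gap


if __name__ == "__main__":
    layers, n_heads, d_ff, d, d_h = 2, 4, 8, 16, 4
    M = layers * (4 * d * d_h * n_heads + 2 * d * d_ff)
    heads = [[1, 1, 0, 0]] * layers        # half of the heads pruned
    units = [[1] * 4 + [0] * 4] * layers   # half of the FFN units pruned
    s_hat = encoder_sparsity(heads, units, d, d_h, M)
    print(s_hat, lagrangian_penalty(s_hat, target=0.3, lam1=0.1, lam2=1.0))
```

In training, the multipliers would be learned jointly with the masks so that the current sparsity is driven toward the target; here they are plain numbers for readability.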

Training Loss Function of NASH
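We combine hidden states distillation with prediction-layer distillation by using the Kullback-Leibler divergence (KLD) function:
$$\mathcal{L}_{\mathrm{pred}} = \mathrm{KLD}\left(f(\cdot) \,\|\, g(\cdot)\right),$$
where $f(\cdot)$ and $g(\cdot)$ are the softmax outputs of the sub-network of the pruned model and of the unpruned model, respectively. Then, the total training objective for a pruned model is
$$\mathcal{L} = \mathcal{L}_{\mathrm{pred}} + \lambda_{\mathrm{enc}} \mathcal{L}_{\mathrm{hid}}^{\mathrm{enc}} + \lambda_{\mathrm{dec}} \mathcal{L}_{\mathrm{hid}}^{\mathrm{dec}} + R,$$
where $\mathcal{L}_{\mathrm{hid}}^{\mathrm{enc}}$ and $\mathcal{L}_{\mathrm{hid}}^{\mathrm{dec}}$ denote the hidden states distillation losses of the encoder and decoder networks, $R$ is the sparsity regularization term, and $\lambda_{\mathrm{enc}}$ and $\lambda_{\mathrm{dec}}$ are coefficients for controlling the contribution of hidden state distillation for the encoder and decoder network, respectively.

Table 1: The summary of Figure 5, which compares the generation quality and latency speedup of NASH against other acceleration methods on TweetQA (Xiong et al., 2019), XSum (Narayan et al., 2018), SAMSum (Gliwa et al., 2019), and CNN/DailyMail (See et al., 2017). The numbers of parameters for all models are around 60M, except for T5-Base. The best and second-best results for each dataset are highlighted in bold and underline, respectively.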
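To make the structure of the objective concrete, here is a small, dependency-free sketch for a single output position. Several details are assumptions, not statements about the authors' code: the helper names are hypothetical, hidden states are matched with a mean-squared error, the KL divergence takes the pruned model's distribution as its first argument, and the regularization term is passed in as a precomputed number.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def softmax(logits: Vector) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: Vector, q: Vector, eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def mse(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def total_loss(
    pruned_logits: Vector,
    unpruned_logits: Vector,
    enc_hidden_pairs: Sequence[Tuple[Vector, Vector]],  # (pruned, unpruned) per layer
    dec_hidden_pairs: Sequence[Tuple[Vector, Vector]],
    lam_enc: float,
    lam_dec: float,
    reg: float = 0.0,  # sparsity regularizer R, computed elsewhere
) -> float:
    pred = kl_divergence(softmax(pruned_logits), softmax(unpruned_logits))
    enc = sum(mse(s, t) for s, t in enc_hidden_pairs)
    dec = sum(mse(s, t) for s, t in dec_hidden_pairs)
    return pred + lam_enc * enc + lam_dec * dec + reg


if __name__ == "__main__":
    loss = total_loss(
        [2.0, 0.5, -1.0], [1.5, 0.7, -0.8],
        enc_hidden_pairs=[([0.1, 0.2], [0.1, 0.3])],
        dec_hidden_pairs=[([0.0, 1.0], [0.2, 0.9])],
        lam_enc=1.0, lam_dec=1.0,
    )
    print(loss)
```

A practical implementation would operate on batched tensors and average over positions; the scalar version is kept only to show how the terms combine.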

Experimental Setup
Dataset. We evaluate our proposed method on various tasks using the versatility of encoder-decoder LMs. For abstractive question answering, we conduct experiments on TweetQA (Xiong et al., 2019). For the text summarization task, we experiment on XSum (Narayan et al., 2018), SAMSum (Gliwa et al., 2019), and CNN/DailyMail (See et al., 2017). We evaluate the output quality using METEOR (Banerjee and Lavie, 2005) for abstractive question answering and ROUGE (Lin, 2004) for the summarization tasks. We also conduct experiments on a multi-task scenario that consists of SAMSum, TweetQA, and five tasks from the GLUE (Wang et al., 2018) and SuperGLUE (Wang et al., 2019) benchmarks. The detailed explanations for the datasets used are described in Appendix C.
Implementation. First, we fine-tune a model and perform uniform layer selection on the decoder network of the fine-tuned model to generate a sub-network model. Subsequently, we further train the sub-network model with the pruning objective, utilizing a scheduler to gradually increase the sparsity until reaching the desired target value. In our experiments, we calculate sparsity by dividing the number of pruned parameters (excluding the embedding) by the size of the full model. Following the approach of Xia et al. (2022), we continue fine-tuning the pruned model until convergence. We set the target sparsity of the encoder networks as 30% for all experiments. The reported results are based on the validation sets of all datasets. Additionally, our models are implemented using the Huggingface (Wolf et al., 2020) library.
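The sparsity scheduler mentioned above can be sketched as a linear warm-up, which Appendix A describes as the common choice for L0 regularization-based pruning. The function name, the step-based granularity, and the zero starting sparsity are assumptions made for illustration; only the 30% encoder target comes from the text.

```python
def scheduled_target_sparsity(
    step: int,
    warmup_steps: int,
    final_sparsity: float,
    start_sparsity: float = 0.0,
) -> float:
    """Linearly increase the target sparsity, then hold it at the final value."""
    if warmup_steps <= 0:
        return final_sparsity
    progress = min(max(step / warmup_steps, 0.0), 1.0)  # clamp to [0, 1]
    return start_sparsity + progress * (final_sparsity - start_sparsity)


if __name__ == "__main__":
    # Hypothetical 1000-step warm-up toward the 30% encoder target.
    for step in (0, 250, 500, 1000, 2000):
        print(step, scheduled_target_sparsity(step, warmup_steps=1000, final_sparsity=0.3))
```

The value returned at each step would play the role of the target t in the sparsity constraint, so the masks are pushed toward the final sparsity gradually rather than all at once.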

Main Results
Standard Fine-tuning. Table 1 and Figure 5 show that the proposed method outperforms CoFi-T5, 6-layer T5-Small (Raffel et al., 2020), and 4-layer T5-mini (Tay et al., 2021) in terms of output quality and inference speedup. Our results consistently demonstrate the superiority of NASH with three decoder layers over the baselines in both performance metrics, although there is a trade-off between output quality and inference speedup with our method. Moreover, our method is particularly effective in improving inference speed, especially for tasks involving longer sequence outputs, such as SAMSum or CNN/DailyMail. However, CoFi-T5 falls short in achieving comparable speedups to T5-Small while maintaining the same output quality. Figure 6 illustrates the pruned model structure trained on SAMSum, following the method described in Table 1, to provide a deeper understanding of the results. Despite CoFi-T5 removing more modules in the encoder network compared to NASH, it cannot remove as many layers in the decoder network as NASH does.
Multi-task Learning. We conducted an analysis to observe how the performance trend changes with varying numbers of tasks. As depicted in Figure 7, both fine-tuned T5-Base and our algorithm (NASH) exhibit fluctuations in accuracy according to the number of tasks. However, it is worth noting that our proposed method demonstrates significantly less variation in generation performance when compared to T5-Small. This robustness in multi-task learning is consistently observed across all datasets. Overall, NASH consistently outperforms T5-Small in terms of both generation performance and speedup, similar to standard fine-tuning. Through the results shown in complex scenarios, we verify that the generation performance of encoder-decoder models is robust to the number of decoder layers.
Instruction Tuning. To verify the versatility of the proposed NASH, we consider the instruction-tuning scenario with the databricks-dolly-15k (Conover et al., 2023) dataset. Following Gu et al. (2023), we split the total dataset into 14k train samples and 1k evaluation samples. We evaluate the trained model (with 14k train samples) on three instruction-following datasets to check the generalizability of the model across different datasets: (1) 1k dolly evaluation; (2) Self-Instruct (Wang et al., 2023a); (3) Vicuna evaluation (Chiang et al., 2023). Similar to previous instances of task-specific fine-tuning and multi-task scenarios, our algorithm with three decoder layers consistently outperforms T5-Small across various metrics, including GPT-4 evaluation, ROUGE-L, and speedup, as shown in Table 2. These results suggest that our proposed method is well-suited for developing general-purpose language models, a usage pattern widely adopted in recent large language models (Chung et al., 2022a; Tay et al., 2023).

Ablation Studies
Different Layer Selection. To validate the effectiveness of uniform layer selection in shrinking the decoder network, we investigate other layer selection methods in a two-layer selection problem. We compare four selection methods: lower selection (first 2 layers), higher selection (last 2 layers), L0 regularization-based automatic selection (Louizos et al., 2018; Xia et al., 2022), and our uniform selection. The results presented in Table 3 demonstrate that uniform selection consistently outperforms the other manual selection methods. The performance margin becomes more pronounced in NLG tasks. Notably, we observe that automatic selection fails to achieve the target sparsity consistently across all tasks, except for BoolQ. This instability in automatic selection aligns with our preliminary findings discussed in Appendix A.
Different Pruning on Encoder. To evaluate the effectiveness of our pruning strategy on the encoder network, we investigate the following strategies at different levels of sparsity: (1) without encoder pruning, (2) uniform layer selection (similar to the decoder network), and (3) the proposed L0 regularization approach. We prune the encoder network of the T5-Base, which has four decoder layers selected uniformly. The results presented in Table 4 clearly demonstrate that our chosen approach, with low sparsity, outperforms both the unpruned baseline and the uniform layer selection. We also observe that the advantage of this approach is only noticeable at low sparsity, as evidenced by the comparison between 30% and 60% sparsity.
NASH on Different LMs. We also conducted a deeper model experiment using T5-Smallefficient (Tay et al., 2021), which is a variant of T5-Small with up to four times more layers while maintaining the same configuration. This experiment aimed to determine the effectiveness of our method regardless of the model architecture. The results presented in Table 5 consistently demonstrate that NASH improves inference speed without significantly compromising the quality of generated outputs, regardless of the depth of the decoder networks. It is noteworthy that the acceleration compared to the original model increases as the number of decoder layers increases. Furthermore, NASH exhibits faster inference and higher output quality compared to CoFi-T5, which is consistent with the results presented in Table 1.
Comparison with Tao et al. (2023). We applied NASH to BART-Base (Lewis et al., 2020) using the CNN/DailyMail dataset, conducting a direct comparison with SIMPLE (Tao et al., 2023). SIMPLE introduced a structured pruning method for generative LMs, which is relevant to our work. Notably, NASH exhibits higher ROUGE-L scores than SIMPLE when both models are at 27% sparsity. Additionally, despite having more parameters, NASH outperforms SIMPLE with 50% sparsity in terms of speedup. Our approach achieves more than three times the speedup, while SIMPLE reaches a maximum of 1.5 times on the GPU.

Related Works
Language Model Compression. With the advancement of NLP, LMs have grown in size, making it difficult to deploy them on edge devices and resulting in slower inference speed. As a result, there has been active research on language model compression, which has three main approaches: quantization, knowledge distillation, and pruning. Quantization (He et al., 2016; Alom et al., 2018; Zafrir et al., 2019; Shen et al., 2020; Yao et al., 2022) minimizes the storage requirements for weight values by reducing the number of bits needed to represent them. Knowledge distillation (Sanh et al., 2019; Jiao et al., 2019; Sun et al., 2019, 2020; Wang et al., 2020b,a) transfers the knowledge of a large-scale teacher model with high performance to a smaller-scale student model, enabling the student model to replicate the behavior of the teacher model. Pruning (Chen et al., 2020; Sanh et al., 2020; Kwon et al., 2022; Frantar and Alistarh, 2023) reduces the size of a model by removing unnecessary parts of large networks such as neurons, weights, or layers.
Pruning. Pruning can be categorized into two parts: (1) unstructured pruning and (2) structured pruning. In unstructured pruning (Chen et al., 2020; Prasanna et al., 2020), weights, which are connections between neurons, are removed from the network based on various criteria. However, this line of methods produces sparse weight matrices, requiring specific hardware support. On the other hand, structured pruning (Xia et al., 2022; Kwon et al., 2022; Kurtic et al., 2023) prunes away structures such as neurons, weight matrix blocks, or layers. Most previous works on structured pruning have focused on encoder-based models (Xia et al., 2022; Kwon et al., 2022; Kurtic et al., 2023), removing attention heads, columns, and rows of weight matrices using different importance score metrics, including magnitudes or Hessians of weight matrices, and L0 loss. However, structured pruning on generative models has been significantly under-investigated, with only a few available works (Lagunas et al., 2021; Yang et al., 2022; Santacroce et al., 2023). Lagunas et al. (2021) extended movement pruning (Sanh et al., 2020) into structured pruning, but their method can only achieve up to 1.4× speedup for the encoder-decoder based BART (Lewis et al., 2020). Yang et al. (2022) released an open-source toolkit that combines structured pruning and vocabulary pruning for various pre-trained language models, but only vocabulary pruning is applicable to T5 and BART.

Conclusion
We propose NASH to address the lack of exploration in structured pruning of encoder-decoder LMs. To design a structured pruning method suitable for encoder-decoder models, we first examine the behavior of pruned models with different strategies, focusing on inference speed and generation performance. Our findings reveal that (1) the number of decoder network layers is the key factor in accelerating inference speed and (2) low sparsity pruning on the encoder network can enhance model performance. Based on these insights, we develop NASH, which constructs a narrow encoder and a shallow decoder network for encoder-decoder LMs through gradual L0 regularization pruning and uniform layer selection, respectively. We demonstrate the superiority of NASH in terms of speedup and output quality across various tasks. We believe this work lays a strong foundation for further investigation into effective pruning approaches for encoder-decoder LMs.

Limitations
Although we were unable to conduct research on unstructured pruning due to device limitations, collaboration with devices that support it could facilitate further performance enhancements. Furthermore, because the motivating analysis and the algorithm construction of this paper treat the encoder and decoder networks separately, further exploration of a co-optimized method is necessary, and there is potential for improvement in this aspect.

A Instability of CoFi-T5
In this section, we observe that pruning T5 yields more varied accuracy and pruned sparsity than pruning BERT under the same pruning conditions (e.g., target sparsity, warm-up epochs), which indicates the instability of encoder-decoder model pruning.
Experimental Setup. We study the instability of CoFi-T5 with T5-Base compared to the original CoFi with BERT-Base (Devlin et al., 2019).
To compare T5 and BERT, we conduct the experiments on the RTE task of the GLUE benchmark (Wang et al., 2018) with 90% of target sparsity² for both models. Additionally, we extend our investigation to SAMSum (Gliwa et al., 2019), which contains messenger-like conversations with summaries, utilizing T5-Base to observe the training instability of the encoder-decoder model on more challenging tasks. We prune with 20 random seeds to compare different settings.
Results. Figure 8 presents two main observations. Firstly, the output performance of pruned T5 models exhibits more variability than pruned BERT models at high-level target sparsity. Secondly, CoFi-T5 fails to achieve the target sparsity at high-level sparsity in both RTE and SAMSum. In the case of RTE, we observed a high proportion of over-pruning, while in the case of SAMSum, all experiments were significantly under-pruned. These results indicate the instability of CoFi-T5 in terms of sparsity and output quality, which aligns with the findings discussed by Zhang et al. (2021).
L0 regularization-based structured pruning methods (Wang et al., 2020c; Xia et al., 2022) commonly incorporate linear warm-up to gradually increase the target sparsity during the training process, aiming to ensure stable training of mask variables. Based on this understanding, we employ longer warm-up epochs for gradual structured pruning and observe that this approach partially mitigates the instability of CoFi-T5. While this mitigates the training instability to some extent, it does not completely address the challenge associated with CoFi-T5. These results motivate us to investigate the appropriate strategy for structured pruning in encoder-decoder models.³

² The sparsity is computed as the number of pruned parameters divided by the full model size (embeddings and classifier excluded).
³ As a longer warm-up epoch is shown to be effective, we applied it to both CoFi-T5 and NASH in our experiments.

Figure 8: The pruned sparsity (left) and output performance (right) distributions for 90% of target sparsity across 20 random trials on RTE (Wang et al., 2018) with BERT-Base, RTE with T5-Base, and SAMSum (Gliwa et al., 2019) with T5-Base. Unlike the results of BERT on RTE, which show that the accuracy and pruned sparsity are concentrated around the target sparsity, we observe that both output performance and pruned sparsity vary across different trials. This variation confirms the instability of pruning-aware training for encoder-decoder models. It is important to note that accuracy and ROUGE-L are used as output performance metrics for RTE and SAMSum, respectively.

B Pruning with L0 Regularization
In this section, we give the details of pruning with L0 regularization. Structured pruning through L0 regularization (Louizos et al., 2018) was proposed to construct sparse neural networks efficiently. In addition, this scheme is widely applied to prune LMs (Xia et al., 2022; Wang et al., 2020c). Given an LM $f(\cdot; \theta)$ parameterized by $\theta = \{\theta_j\}_{j=1}^{n}$, we can define binary mask variables $z = \{z_j\}_{j=1}^{n}$. Note that $\theta_j$ and $z_j$ denote a unit of weights (e.g., the weights of an attention head or a column of an MLP layer) and the mask variable corresponding to $\theta_j$, respectively.

In this formulation, a pruned LM is written as $f(\cdot; \tilde{\theta})$, where $\tilde{\theta} = \theta \odot z$, and pruning is the problem of finding the optimal mask variables and weights. In L0 regularization-based pruning, these masks and weights are learned based on the following objective function:

In the objective function above, every mask variable $z_j$ is chosen based on the prior distribution $q(z_j \mid \pi_j) = \mathrm{Bernoulli}(\pi_j)$. Considering the number of binary masks, $n$, the number of possible choices of $z$ is $2^n$. The discrete nature of the masks and this tremendous number of choices make mask optimization practically intractable. Louizos et al. (2018) mitigate this issue with a re-parameterization trick, enabling $z$ to be differentiable and updated together with the model parameters $\theta$. In detail, the masks $z$ are defined as continuous variables determined by $\min(1, \max(0, s))$, where the continuous random variable $s$ is sampled from the range of $[0, 1]$. Note that it is equivalent to sample $u$ from the uniform distribution $U(0, 1)$ and calculate $s$ as follows:

where $l$ and $r$ are constant values that satisfy $l < 0$ and $r > 0$, and $\alpha$ is a learnable parameter. From this formulation, the learning objective can be rewritten as:

This process obtains $z = \{z_j\}_{j=1}^{n}$ where every $z_j$ is in the range of $[0, 1]$. However, Wang et al. (2020c) observe that optimizing with this relaxed regularization makes models converge to different-size subnetworks depending on the learning rate and pruning schedule. To mitigate this problem, they suggest using a Lagrangian relaxation instead of the L0 regularizer $\lambda \|\tilde{\theta}\|_0$ as follows:

where $\lambda_1$ and $\lambda_2$ are learnable Lagrange multipliers, $\hat{s}$ represents the current sparsity calculated from the masks $z$, and $t$ represents the target sparsity. Motivated by these works, we also utilize the relaxed regularization term, $R$, for gradual structured pruning on the encoder network.
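The mask sampling and the Lagrangian term described above can be sketched as follows. This is a minimal illustration under stated assumptions: the sampling rule uses the hard-concrete form commonly associated with Louizos et al. (2018), s = sigmoid(log(u / (1 - u)) + α) · (r - l) + l; the penalty uses the form λ₁(ŝ - t) + λ₂(ŝ - t)² commonly associated with Wang et al. (2020c); the constants l = -0.1 and r = 1.1 and the unweighted sparsity estimate are illustrative choices, not values taken from this paper.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def sample_mask(alpha, l=-0.1, r=1.1, rng=random):
    """Sample one relaxed mask z in [0, 1] (assumed hard-concrete form).

    alpha is the learnable parameter; l < 0 and r > 0 stretch the sigmoid
    output to (l, r) before it is clamped to [0, 1].
    """
    u = rng.random()
    u = min(max(u, 1e-6), 1.0 - 1e-6)  # keep log(u / (1 - u)) finite
    s = sigmoid(math.log(u / (1.0 - u)) + alpha) * (r - l) + l
    return min(1.0, max(0.0, s))


def current_sparsity(masks):
    """Simplified estimate of s-hat: one minus the mean mask value.

    Assumption: every unit has the same size; a real model would weight
    each mask by the number of parameters it controls.
    """
    return 1.0 - sum(masks) / len(masks)


def lagrangian_penalty(s_hat, t, lambda_1, lambda_2):
    """Relaxed regularization term R (assumed form): linear plus quadratic gap."""
    gap = s_hat - t
    return lambda_1 * gap + lambda_2 * gap ** 2
```

A strongly positive `alpha` drives the sampled mask to 1 (unit kept) and a strongly negative one drives it to 0 (unit pruned); the penalty vanishes when the current sparsity equals the target.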

C Description of Datasets
Natural Language Generation Tasks. Since we study structured pruning for encoder-decoder models, a type of generative model, we conduct comprehensive experiments on NLG tasks. The NLG datasets used in our study encompass two tasks: summarization and abstractive question answering. We employed the XSum (Narayan et al., 2018), SAMSum (Gliwa et al., 2019), and CNN/DailyMail (See et al., 2017) datasets to assess the summarization capability of our proposed method. These datasets are widely used in evaluating the effectiveness of summarization techniques. Regarding abstractive question answering, we employed the TweetQA (Xiong et al., 2019) dataset to evaluate our method.
• XSum (Summarization): XSum (Narayan et al., 2018) comprises articles sourced from the BBC, along with corresponding single-sentence summaries. More specifically, each article begins with an introductory sentence that serves as a summary. These summaries are professionally written and are usually provided by the article's author.
• SAMSum (Summarization): SAMSum (Gliwa et al., 2019) consists of 16K messenger-like conversations, each annotated with a summary that provides a concise overview of the conversation's content in the third person. The conversations encompass a variety of styles and registers, ranging from informal to semi-formal and formal. Additionally, they may include slang words, emoticons, and typographical errors.
• CNN/DailyMail (Summarization): CNN/DailyMail (See et al., 2017) consists of over 300K English news articles that were originally designed for machine reading and comprehension as well as abstractive question answering, but it now also supports extractive and abstractive summarization. In this work, we utilize the 3.0.0 version.
• TweetQA (Abstractive Question Answering): TweetQA (Xiong et al., 2019) is the first large-scale dataset for question answering (QA) over social media data.

Natural Language Understanding Tasks. We use the GLUE (Wang et al., 2018) and SuperGLUE (Wang et al., 2019) benchmarks to evaluate on NLU tasks. While these benchmarks consist of classification datasets, we generate the phrase related to the label instead of the class index. The detailed descriptions and labels of each task are described in Table 7.
• Self-Instruct: The authors of Self-Instruct (Wang et al., 2023a) have introduced a dataset comprising 52,000 instructions matched with 82,000 instance inputs and outputs. This dataset serves as a resource for fine-tuning language models to improve their adherence to instructions. Additionally, they have provided 252 expert-created tasks and instructions designed for user-centric applications, which are used in the human evaluation section of their research. Furthermore, the Self-Instruct dataset includes 50,000 examples from the P3 and Super Natural Instructions datasets to facilitate comparisons with existing public datasets.
• Vicuna: Vicuna (Chiang et al., 2023) utilized approximately 70,000 multi-round conversations between users and ChatGPT collected from the ShareGPT website (Geng et al., 2023), which allows sharing of ChatGPT dialogues, as a dataset for fine-tuning.In this work, we utilize 80 challenging questions used in the Vicuna evaluation.

D Hyperparameters
In this section, we describe the hyperparameter setup of our experiments. We report the hyperparameters that we utilized in Tables 8 and 9 for NLG and NLU tasks, respectively. We use different hyperparameter sets for small NLG datasets (TweetQA, SAMSum) and large NLG datasets (XSum, CNN/DailyMail). Similarly, for NLU tasks, we use different hyperparameters depending on the dataset size. For the small-size NLU datasets (CB, COPA), whose number of samples is smaller than 1,000, we use the hyperparameters (small) described in Table 9. In particular, we train our model for 150 epochs on these datasets because the data size is too small to learn the weights for the L0 regularization. We use the hyperparameters (middle) described in Table 9 for the middle-size NLU datasets (MRPC, RTE, STS-B, CoLA, WiC, BoolQ), whose number of samples is more than 1,000 but less than 10,000. We use the hyperparameters (high) described in Table 9 for the large-size NLU datasets (MNLI, QQP, QNLI, SST-2), whose size is larger than 10,000.
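The size-based choice of NLU hyperparameter set can be written as a small lookup. This is only a sketch: the function name is hypothetical, the returned strings mirror the (small), (middle), and (high) labels of Table 9, and the handling of datasets with exactly 1,000 or 10,000 samples is an assumption, since the text does not specify those boundaries.

```python
def hyperparameter_tier(num_samples):
    """Map an NLU dataset size to the Table 9 hyperparameter set label.

    Assumption: exactly 1,000 samples falls into "middle" and exactly
    10,000 into "high"; the text leaves these boundary cases open.
    """
    if num_samples < 1_000:
        return "small"
    if num_samples < 10_000:
        return "middle"
    return "high"
```

For example, a dataset with a few hundred training samples maps to "small", which is the tier trained for 150 epochs.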

E Instruction-tuning Details
In the evaluation process of GPT-4, feedback is solicited by instructing the model to compare its generated responses with the authentic, reference answers and assign a numerical score ranging from 1 to 10 to each response. Drawing upon the methodology outlined by Gu et al. (2023), we calculate the ratio between the cumulative scores assigned to the model's responses and those of the ground truth answers. Further details regarding the specific prompt employed for this evaluation are presented in Figure 9.
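The score ratio described above can be sketched as follows, assuming each GPT-4 reply starts with a line containing the two scores separated by a space, as the prompt requests. The function names are hypothetical, and which assistant slot holds the model's response versus the reference answer is an assumption exposed as a parameter.

```python
def parse_scores(reply):
    """Read the two scores from the first line of a GPT-4 reply."""
    first_line = reply.strip().splitlines()[0]
    score_1, score_2 = first_line.split()
    return float(score_1), float(score_2)


def score_ratio(replies, model_is_first=True):
    """Ratio of cumulative model scores to cumulative reference scores.

    Assumption: by default Assistant 1 is the evaluated model and
    Assistant 2 is the ground-truth answer.
    """
    model_total = 0.0
    reference_total = 0.0
    for reply in replies:
        score_1, score_2 = parse_scores(reply)
        if model_is_first:
            model_total += score_1
            reference_total += score_2
        else:
            model_total += score_2
            reference_total += score_1
    return model_total / reference_total
```

A ratio of 1.0 would mean the model's responses were scored, in total, as highly as the reference answers.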

F Speedup Evaluation Metric
To measure the inference speed, we conducted inference predictions for each dataset and each examined configuration using the PyTorch (Paszke et al., 2019) compiled function. This was done on a single server equipped with an NVIDIA GeForce RTX 3090 GPU and an AMD EPYC 7502 32-Core Processor CPU. For each inference prediction, we utilized a batch size of 32. Additionally, we generated output sequences using a beam size of 4. The time taken for the measurements included all decoding steps until completion.

Figure 9: Prompt used for the GPT-4 evaluation.
We would like to request your feedback on the performance of two AI assistants in response to the user instruction and input displayed above.
Please rate the helpfulness, relevance, accuracy, and level of detail of their responses. Each assistant receives an overall score on a scale of 1 to 10, where a higher score indicates better overall performance.
Please first output a single line containing only two values indicating the scores for Assistant 1 and 2, respectively. The two scores are separated by a space.
In the subsequent line, please provide a comprehensive explanation of your evaluation, avoiding any potential bias and ensuring that the order in which the responses were presented does not affect your judgment.

We compare our algorithm, NASH, with models that are designed with a shallow decoder from the pre-training stage, as proposed by Tay et al. (2021). In our evaluation, we examine the performance of our method on two tasks, namely TweetQA and SAMSum, using 2, 4, 6, and 8 decoder layers. As shown in Table 10, NASH demonstrates superior performance in most cases. This result is noteworthy because our method can construct a small yet effective model without incurring the cost of pre-training a small language model.

G.2 Results on NLU Tasks
We also compare NASH to the baseline methods on the GLUE and SuperGLUE benchmarks, which focus on NLU tasks. Since these tasks involve relatively longer input sequences and shorter output sequences compared to NLG tasks, our proposed method is less effective. However, NASH still demonstrates superiority over the baselines, as depicted in Figure 10. It is important to note that the performance of our method remains robust across different compression rates. We also provide detailed performance results of our proposed method for the full GLUE and SuperGLUE benchmarks in Table 11. The results demonstrate the effectiveness of our proposed NASH method in achieving high output quality while significantly improving inference speed. The superior performance of NASH across both GLUE and SuperGLUE benchmarks highlights its potential as an efficient acceleration method for NLU tasks as well as for NLG tasks.

Figure 1: Brief illustration of the proposed algorithm, NASH: NArrow encoder and SHallow decoder. It is composed of two main components, width and depth pruning for the encoder and decoder, respectively.

Figure 3: Processing time per Transformer layer depending on the model configuration and the sequence length. As depicted in the sequence length 1 case, factors such as the number of attention heads and FFN dimensions do not affect the processing time.

Figure 4: Component-wise processing time of T5-Base. The layer normalization and dropout contribute 20-25% of the total inference time.

Figure 7: Evaluation of generation quality and latency speedup of NASH in a multi-task learning scenario. We compare NASH on 220M T5-Base against T5-Base and 60M T5-Small. On all datasets, NASH is able to outperform T5-Small.

Table 2: Evaluation of generation quality and latency speedup of NASH in an instruction-tuning scenario. Note that NASH-L indicates that we apply NASH with L decoder layers.

Table 3: The performance of various layer selection strategies on different datasets. "Gray" indicates a failure to achieve the target sparsity. Additionally, we report the number of remaining SA, CA, and FF layers for the automatic selection method.

Table 4: Comparison of pruning strategies on the encoder network. We conduct our method on T5-Base with a uniform selection of 4 decoder layers.

Table 5: Comparison of results achieved by CoFi-T5 and NASH on deeper models (Tay et al., 2021) and SAMSum.

Table 6: Comparison of results of NASH and Tao et al. (2023) on BART-Base and CNN/DailyMail. Results of Tao et al. (2023) are from the original work.

Table 7: Description and label of NLU datasets in GLUE and SuperGLUE benchmarks.

Table 11: Detailed results corresponding to Figure 10, which compares the understanding accuracy and latency speedup of NASH against other acceleration methods on the GLUE (Wang et al., 2018) and SuperGLUE (Wang et al., 2019) benchmarks. The number of parameters for all models is around 60M, except for T5-Base. For NASH, we apply two layers of decoder sub-network. The best results for each dataset are highlighted in bold.