Towards Robust Pruning: An Adaptive Knowledge-Retention Pruning Strategy for Language Models

The pruning objective has recently extended beyond accuracy and sparsity to robustness in language models. Despite this, existing methods struggle to enhance robustness against adversarial attacks as model sparsity increases, and they require a retraining process. As humans step into the era of large language models, these issues become increasingly prominent. This paper proposes that the robustness of language models is proportional to the extent of pre-trained knowledge they encompass. Accordingly, we introduce a post-training pruning strategy designed to faithfully replicate the embedding space and feature space of dense language models, aiming to conserve more pre-trained knowledge during the pruning process. In this setup, each layer's reconstruction error not only originates from itself but also includes the cumulative error from preceding layers, followed by an adaptive rectification. Compared to other state-of-the-art baselines, our approach demonstrates a superior balance between accuracy, sparsity, robustness, and pruning cost with BERT on the SST2, IMDB, and AGNews datasets, marking a significant stride towards robust pruning in language models.


Introduction
Pruning is a widely recognized compression method employed to decrease the model size and accelerate model inference (Frankle and Carbin, 2018; Chen et al., 2020; Prasanna et al., 2020; Chen et al., 2021). In the age of large language models (Andrew and Gao, 2007; Brown et al., 2020; Chowdhery et al., 2022; OpenAI, 2023; Touvron et al., 2023; Ouyang et al., 2022; Smith et al., 2022), the necessity of pruning has increased because it greatly reduces deployment costs (Frantar and Alistarh, 2023). In addition to the significant computation cost, the robustness of language models has emerged as a crucial factor that demands attention, primarily because models need to remain resilient against adversarial attacks, even in challenging real-world circumstances (Tran et al., 2022; Wang et al., 2023). Therefore, exploring robust pruning strategies against adversarial attacks in language models could potentially yield a substantial impact (Xu et al., 2021; Du et al., 2023).
Recent research has extended the pruning of language models beyond accuracy and sparsity, with an emphasis on the trade-off between accuracy, sparsity, robustness, and cost (Du et al., 2023; Xu et al., 2021; Liang et al., 2021; Xi et al., 2022). Zheng et al. (2022) propose a joint optimization objective to guide pruning and adversarial training simultaneously. Their approach views the identified subnetworks as robust tickets, which can be trained as normal and offer enhanced robustness. Despite achieving state-of-the-art results on target datasets, these methods still display vulnerabilities, as evidenced by a significant gap between clean accuracy and accuracy under attack. Moreover, performance rapidly declines once sparsity exceeds a moderate level. Expanding on this work, Xi et al. (2022) propose using robust early-bird tickets to reduce the computational cost of adversarial training. However, they face similar challenges regarding the trade-off between robustness and sparsity. In summary, existing robust pruning works often demonstrate limited sparsity, insufficient robustness, and expensive cost, indicating the ongoing challenge of balancing accuracy with the other three aspects.
To address this challenge, this paper investigates why language models are susceptible to adversarial attacks (Wang et al., 2021; Garg and Ramakrishnan, 2020; Jin et al., 2020). Previous studies have indicated that language models frequently capitalize on biases and artifacts inherent in datasets as predictive shortcuts, which impedes reasoning ability and the development of advanced semantic comprehension (Du et al., 2021; Niven and Kao, 2019; McCoy et al., 2020; Du et al., 2023). This reliance leads to a more severe loss of pre-trained knowledge during the pruning process. Furthermore, adversarial samples in Natural Language Processing (NLP) are crafted by replacing components of sentences with semantically similar counterparts, thereby retaining high semantic similarity across the entire sentence (Li et al., 2020a; Ren et al., 2019; Jin et al., 2020). Consequently, language models that depend on spurious features of particular words cannot defend against adversarial attacks constructed by replacing those words with semantically similar alternatives. Put more plainly, without pre-trained knowledge, a sparse language model treats a substitute word simply as an integer identifier. Based on these observations, we explore the following questions in this paper: Question 1. What is the core of defending against adversarial attacks for sparse language models?
This paper proposes that the robustness of sparse language models is directly proportional to the amount of pre-trained knowledge retained after pruning. Intuitively, the robustness of a sparse language model is fundamentally tied to its capability to distill advanced semantic features from input sentences. This capability is largely established during the pre-training phase of dense language models, emphasizing the pivotal role of acquired semantic knowledge. Our extensive experiments support this claim.
Question 2. How can we efficiently prevent the loss of pre-trained knowledge in pruning to preserve or even enhance robustness?
Previous research has demonstrated that pruning exacerbates a model's dependency on spurious features (Xu et al., 2021; Du et al., 2023). We further confirm that traditional pruning methods lead to a considerable loss of pre-trained knowledge and poor robustness. To prevent this, we propose a pruning approach that minimizes damage to the embedding space and feature space of dense language models, striving to replicate the features of each layer completely. Specifically, for each layer, we iteratively eliminate a single weight at a time and counterbalance the loss by updating the remaining weights based on the Hessian Matrix. In this setup, the reconstruction error at each layer arises not only from its own layer but also incorporates the accumulated error from preceding layers. This is achieved by adaptively updating the pruning-dependent information in accordance with the sparse output generated by previous layers, while the accumulated errors are corrected collectively. Moreover, our method, being a post-training approach, is cost-effective for current language models, as it circumvents rigorous retraining processes. Extensive experiments show that our approach achieves a better trade-off between accuracy, sparsity, robustness, and pruning cost on SST2, AGNews, and IMDB compared with other state-of-the-art methods.

Related Work
Textual Adversarial Attacks and Defense. Textual adversarial attacks pose a significant challenge to the robustness of language models. These attacks, formulated by carefully altering certain segments of sentences with semantically similar counterparts, aim to fool language models (Jin et al., 2020; Li et al., 2020a). To enhance the robustness of language models and defend against adversarial attacks, a range of potent defensive strategies, such as adversarial training, has been proposed (Madry et al., 2017; Zhu et al., 2019; Li and Qiu, 2021). Different from this line of research, which focuses on dense models, we explore robustness in the context of language model pruning.

Robust Model Pruning.
Prior studies indicate that sparse models tend to underperform on Compression Identified Examples (CIE), suggesting that the pruning process exacerbates the inherent algorithmic biases hidden within datasets (Hooker et al., 2020). In Computer Vision (CV), simultaneous optimization of model pruning and adversarial training has been advocated as an effective solution to this issue (Gui et al., 2019; Ye et al., 2019; Sehwag et al., 2020; Vemparala et al., 2021). In NLP, Du et al. (2023) propose to prevent model overfitting on easy samples by leveraging sample difficulty in the context of pruning. Concurrently, Xu et al. (2021) suggest the generation of robust subnetworks through Knowledge Distillation and Post-training Quantization. Taking a different approach, Liang et al. (2021) strive to enhance model generalizability by extracting super tickets, while Zheng et al. (2022) and Xi et al. (2022) seek to identify robust tickets. Despite recent advancements, achieving enhanced robustness alongside increased sparsity remains a challenge. This paper significantly promotes a better trade-off among accuracy, robustness, sparsity, and pruning cost.

Preliminary

Shortcut Learning and Mitigation
Recent studies provide evidence that language models are inclined to capitalize on inherent biases and spurious features present in datasets, using these as convenient predictive shortcuts (Niven and Kao, 2019; Du et al., 2021; McCoy et al., 2020). This tendency impedes the development of the more advanced semantic understanding and reasoning capacity necessary for NLU tasks. Various preliminary studies have begun to address this bias issue through, for example, adversarial training and posterior regularization (Stacey et al., 2020; Chen et al., 2021). From a unique perspective, we strengthen language models against adversarial attacks by mitigating this shortcut issue through weight averaging. This method is elaborated further in Section 4.2.

Pruning with Hessian Matrix
Drawing inspiration from (LeCun et al., 1989; Hassibi et al., 1993), previous work has provided mathematical formulations for effectively eliminating a single weight from a layer and updating the remaining weights to correct the resulting error according to information from the Hessian Matrix (Frantar and Alistarh, 2022). The equations are presented below:

$$w_p = \arg\min_{w_p} \frac{w_p^2}{[H^{-1}]_{pp}}, \qquad \delta w_r = -\frac{w_p}{[H^{-1}]_{pp}} \cdot H^{-1}_{:,p}$$

where H is the Hessian Matrix, w_p represents the single weight that will be pruned, and w_r denotes the remaining weights that will be updated. The notation [H^{-1}]_{pp} refers to the p-th diagonal entry of the inverse Hessian Matrix, and H^{-1}_{:,p} represents its p-th column. However, the inverse of the Hessian Matrix must be recomputed after each weight removal, which is exceedingly costly. Frantar and Alistarh (2022) observe that Hessian values across different weight matrix rows are independent, as removing a single weight only impacts the output of its respective row. Accordingly, they simplify the calculation of the Hessian Matrix H and leverage the Gaussian elimination technique to accelerate the update of H^{-1}, as described mathematically below:

$$H^{-1}_{-p} = \left( H^{-1} - \frac{1}{[H^{-1}]_{pp}} \, H^{-1}_{:,p} H^{-1}_{p,:} \right)_{-p}$$

Here, −p denotes the removal of the single weight at index p. A more detailed explanation can be found in the Appendix.
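To make the single-weight removal rule concrete, here is a minimal NumPy sketch of one pruning step for a single row of weights. The function name `prune_one_weight` is hypothetical; it implements the standard saliency criterion w_p²/[H⁻¹]_pp and the compensating update of the remaining weights, not the authors' full pipeline.

```python
import numpy as np

def prune_one_weight(w, H_inv):
    """Remove the single weight whose removal (with optimal
    compensation) increases the quadratic loss the least.

    w     : (d,) current weights of one row
    H_inv : (d, d) inverse Hessian for this row
    Returns the updated weights and the pruned index.
    """
    # Saliency of each weight: w_p^2 / [H^-1]_pp
    saliency = w ** 2 / np.diag(H_inv)
    p = int(np.argmin(saliency))
    # Optimal compensation: delta_w = -(w_p / [H^-1]_pp) * H^-1[:, p]
    w = w - (w[p] / H_inv[p, p]) * H_inv[:, p]
    w[p] = 0.0  # the update zeroes entry p; set it exactly
    return w, p
```

A useful sanity check is that the resulting loss increase for a quadratic loss ½ δwᵀHδw equals w_p²/(2[H⁻¹]_pp), matching the saliency formula up to the factor ½.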

Methodology
This section proposes a pruning method for language models that better balances accuracy, sparsity, robustness, and pruning cost. Figure 1 depicts the architecture of this method.

Rethink Robust Model Pruning
Given that the predominant challenge in robust pruning primarily centers on robustness and pruning cost, we mainly focus on these two aspects in this paper. To enhance robustness, we explore the root cause of the poor performance of sparse language models under adversarial attacks. We note that adversarial samples are often crafted by replacing certain words in a sentence with semantically similar substitutes. It is therefore essential to ensure that the representations of the original words and their substitutes remain similar in the embedding space and feature space even after pruning. Based on this observation, we propose to maintain a close alignment between the sparse and dense language models. In other words, robust pruning should seek sparse parameters Ŵ_l that minimize the discrepancy between the outputs of the dense and sparse layers. The problem can be formally expressed as follows:

$$\hat{W}_l = \arg\min_{\hat{W}_l} \; \big\| f_l(W_l, X_l) - f_l(\hat{W}_l, X_l) \big\| \quad \text{s.t.} \quad \|\hat{W}_l\|_0 \leq k$$

Here, each layer of the language model is represented by a mathematical function f_l(W_l, X_l), X_l denotes the inputs, and k designates the total number of weights that remain non-zero after the pruning process. Predominantly, the Mean Squared Error (MSE) is employed to measure the pruning error of each layer, so the preceding problem can be further reformulated using the MSE, as expressed in the subsequent equation:

$$\hat{W}_l = \arg\min_{\hat{W}_l} \; \| X_l W_l - X_l \hat{W}_l \|_2^2 \quad \text{s.t.} \quad \|\hat{W}_l\|_0 \leq k$$

To reduce the pruning cost, we adopt a post-training setting in our strategy. Specifically, we only utilize a small subset of data to calibrate the weights and generate sparse substitutes to replace them. In summary, our pruning method does not need a rigorous retraining process.

Weight Averaging for Robust Dense Model
We also realize that language models may rely on surface-level or spurious features in the data rather than capturing sophisticated semantic features. Thus, when sparse language models fail to defend against adversarial attacks, it becomes challenging to determine whether the failure stems from the pruning method or from inherent issues within the dense model. We circumvent this risk by constructing a robust and dense model before pruning.

Figure 1: A: First, we generate a robust and dense language model in two steps: (1) we fine-tune the pre-trained weights with various hyperparameters and settings, resulting in multiple models with different knowledge; (2) we then employ a greedy algorithm to average only the weights of models that contribute to the final performance. B: Second, (3) we apply our adaptive pruning method to generate robust and sparse language models in a layer-wise setting. Specifically, we optimize the original independent pruning process of each layer into an adaptive one, which requires subsequent layers to update the Hessian Matrix and the optimal dense weight according to the sparse outputs of preceding layers, thereby inheriting and correcting the accumulated error together.
Inspired by Croce et al. (2023) and Wortsman et al. (2022), we generate a robust language model via weight averaging. The key idea is to train multiple models with different hyperparameters and settings, allowing each model to capture distinct nuances of the data and generalize in diverse ways. By averaging their weights, we create a robust model that benefits from their collective knowledge. Specifically, we order these models in descending order based on accuracy under attack. Then, we selectively average the weights that contribute to the final robustness. Finally, we obtain a robust and dense model as the foundation of subsequent operations. This approach ensures that any vulnerabilities detected in sparse language models result from the pruning process rather than from spurious features. More details can be found in Algorithm 3.
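The greedy averaging procedure described above (sort models by robust accuracy, then keep only those whose inclusion helps) can be sketched as follows. This is a minimal illustration, not the paper's Algorithm 3: `greedy_weight_average` and the flat weight-vector representation are simplifying assumptions, and `evaluate` stands in for accuracy under attack.

```python
import numpy as np

def greedy_weight_average(models, evaluate):
    """Greedy soup-style averaging.

    models   : list of weight vectors (np.ndarray), sorted so the
               model with the best robust accuracy comes first
    evaluate : callable mapping a weight vector to a score
               (e.g., accuracy under attack); higher is better
    """
    soup = [models[0]]              # start from the best single model
    best = evaluate(models[0])
    for m in models[1:]:
        candidate = np.mean(soup + [m], axis=0)
        score = evaluate(candidate)
        if score >= best:           # keep m only if it helps
            soup.append(m)
            best = score
    return np.mean(soup, axis=0), best
```

By construction, the final score is never worse than that of the best single model, which is what makes the averaged model a safe foundation for subsequent pruning.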

Notation
To accurately replicate the dense model's behavior in the embedding space and feature space of each layer, we use the method described in Section 3.2 as the backbone. However, its layer-wise setting, which treats each layer as an independent pruning problem, limits its ability to reach a globally optimal solution. To elaborate, let us consider a single layer as an example in the following sections. We use X_l, W_l, and Y_l to represent the input, weight, and output of the layer, respectively, with the subscript l indicating the l-th layer. A hat, as in X̂_l, Ŵ_l, or Ŷ_l, denotes the input, weight, or output in the sparse context.

Adaptive Dense Weight
We also note that the loss incurred by removing a single weight depends on the current weight W_l of the corresponding layer, as derived from Equation 1. However, the original dense weight W_l is no longer optimal for the expected dense output Y_l once the preceding layers (0-th ... (l−1)-th) have been pruned. Given that the input X_l has been altered to X̂_l by the accumulated error, it would be suboptimal to continue using the original weight W_l to compute the pruning loss for the current layer. More precisely, the result of X̂_l W_l could substantially deviate from the original output Y_l, which is incompatible with our goal of producing an output Ŷ_l identical to the original Y_l during pruning. Thus, it is essential to update the dense weight so that X̂_l W̃_l approximates the original output Y_l more closely. Here, W̃_l denotes the updated dense weight, which we derive as follows:

$$\tilde{W}_l = (\hat{X}_l^T \hat{X}_l)^{-1} \hat{X}_l^T Y_l$$

where T represents the transpose operation and −1 denotes the matrix inverse. To ensure that X̂_l^T X̂_l is invertible, we add a small regularization term, such as 1e−4, to the diagonal entries of the matrix. Finally, we can compute the pruning loss more accurately with the updated weight W̃_l.
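The dense-weight update described above is an ordinary ridge-regularized least-squares solve, which can be sketched in a few lines of NumPy. The function name `update_dense_weight` is hypothetical; the shapes assume the layer computes Y = XW.

```python
import numpy as np

def update_dense_weight(X_hat, Y, eps=1e-4):
    """Recompute the dense weight so that X_hat @ W_tilde best
    matches the original dense output Y, with a small diagonal
    regularizer keeping the Gram matrix invertible.

    X_hat : (n, d_in)  input of this layer in the sparse context
    Y     : (n, d_out) original dense output to replicate
    """
    d = X_hat.shape[1]
    G = X_hat.T @ X_hat + eps * np.eye(d)   # regularized Gram matrix
    return np.linalg.solve(G, X_hat.T @ Y)  # (d_in, d_out)
```

When the input has not drifted (X̂ = X) and the regularizer is negligible, the solve recovers the original weight exactly, which is the expected behavior of this correction step.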
We also calibrate the optimal weights for non-pruned layers (such as the pooler layer and classification layer in BERT) with Equation 5, aligning the dense layers' output with the altered input. Algorithm 1 provides detailed steps for the code implementation, offering a comprehensive overview of our methodology. We also provide a comprehensive analysis of the computational complexity of our method in the Appendix.

Experiments
We first compare our method against several baseline methods, assessing accuracy, robustness, sparsity, and cost. Then, an ablation study is performed to elucidate the contributions of each part of our method. Finally, we augment our core findings with additional experiments and analyses to further illuminate our method.

Baselines and Datasets
Consistent with previous works (Devlin et al., 2018; Du et al., 2023; Xu et al., 2021; Zheng et al., 2022; Xi et al., 2022), BERT base serves as the foundational model for all our experiments. We compare our approach with various baselines, including: RobustT (Zheng et al., 2022), which optimizes the pruning mask and input perturbation simultaneously for robust tickets; Bag-of-Ticks (Xu et al., 2021), which improves sparse model robustness via Knowledge Distillation and Post-Training Quantization; RMC (Du et al., 2023), a technique preventing sparse language models from overfitting on easy samples using sample difficulty; SuperTicket (Liang et al., 2021), which identifies a super mask during pruning to reduce variance while preserving bias. Our evaluation primarily involves three text classification datasets: the Internet Movie Database (IMDB; Maas et al., 2011), SST2, and AGNews.

Table 1: The entry highlighted with an orange background denotes our robust and dense model, which serves as the initialization for all robust pruning methods except RobustT (RobustT is generated from the pre-trained weights). Our method consistently outperforms all baselines in terms of the Aua% and Asr% metrics. Regarding Acc%, there is a minor decrease in our method's performance at lower sparsity levels, yet it regains superiority at higher sparsity levels. The highest performance is highlighted in bold. The column Re-T indicates whether the method necessitates model retraining.
Consistent with previous research, we exclude embedding matrices from the calculation of parameter count.

Robustness Evaluation
We assess our model's effectiveness against adversarial attacks using TextFooler, which substitutes crucial words in sentences with semantically similar synonyms (Jin et al., 2020). Following previous works (Zheng et al., 2022; Xi et al., 2022), our evaluations utilize key metrics such as Clean Accuracy Acc% (accuracy on clean test data), Accuracy Under Attack Aua% (accuracy when subjected to adversarial attacks), and Attack Success Rate Asr% (ratio of successful text perturbations to total attempts). A robust method is expected to show higher clean accuracy and accuracy under attack coupled with a lower attack success rate. We also evaluate more attack methods in the Appendix.

Implementation Details
To begin with, we employ the technique described in Section 4.2 to generate a robust language model for each dataset. Subsequently, we use our method to prune these robust language models with a small calibration dataset. All experimental results are the average of five trials, each initiated with a different seed. Furthermore, we assess performance under three different levels of sparsity: 30%, 50%, and 87.5%. Additional implementation details can be found in the Appendix.

Main Result on Robustness Evaluation
Table 1 provides a comprehensive comparison of various robust pruning methods, evaluated across three distinct datasets: SST2, AGNEWS, and IMDB, and under varying degrees of model sparsity. Key observations can be made as follows: 1) Our strategy even enhances the robustness of language models after pruning; we believe this enhancement stems from the regularization effect of the sparse architecture. 2) Our strategy distinguishes itself by consistently surpassing other methods in Aua% and Asr%, regardless of the dataset or the level of sparsity. These results imply that our strategy effectively maintains robustness during the pruning of language models. 3) Impressively, our method achieves higher robustness even with fewer parameters compared to several other approaches, which further underscores the effectiveness of our robust pruning method. 4) Although the Acc% of our method is generally lower than other baselines at lower sparsity levels, the improvement in robustness (reflected in Aua% and Asr%) far outweighs the degree of accuracy degradation. 5) At higher levels of sparsity, our method outperforms other baselines across all metrics. 6) Our method does not require model retraining, confirming that our approach offers a better trade-off between accuracy, robustness, sparsity, and pruning cost. Beyond BERT base, our methodology was also extended to BERT large, a model encompassing 330M parameters. The resulting performance, as presented in Table 3, reaffirms the superiority of our method over the baselines. Moreover, we explore the effectiveness of our method within a structured pruning context, and once again, our approach outperforms the state-of-the-art method EarlyRobust (Xi et al., 2022). More details can be found in the Appendix.

Ablation Study
To elucidate the contributions of each part of our approach, we conduct an ablation study with the following settings: we replace our pruning technique with the methods known as LTH and IMP (Frankle et al., 2020; Frankle and Carbin, 2018) and supplement them with the additional adversarial training method FreeLB (Zhu et al., 2019). The results are presented in Table 2. From the results, we can make the following key observations: 1) Sparse language models generated by traditional pruning methods perform even worse than the vanilla fine-tuned dense model, which highlights the challenges associated with robust pruning. 2) Our approach consistently generates more robust sparse language models than conventional pruning methods, even when the latter are supplemented with adversarial training. 3) We conjecture that the limited effect of adversarial training here stems from the discrete nature of word tokens and the substantial loss of pre-trained knowledge during pruning.

Discussion
In this section, we design additional experiments to illustrate our robust pruning method further.

Pretrained Knowledge Detection
To demonstrate the effectiveness of our robust pruning mechanism in preserving pre-trained knowledge, we choose adversarial samples that are effectively defended by our method but not by others. We then visualize their attention scores in Figure 2. Our method demonstrates superior performance, as evidenced by more reasonable attention scores that align more closely with those of the robust and dense model. In addition, we visualize the distance between sentence representations from sparse language models and their dense counterparts in the feature space. As depicted in Table 4 and Figure 5, our method results in smaller distances between the dense and sparse representations. These findings indicate the superior ability of our robust pruning method to preserve semantic knowledge. In other words, our method outperforms others in maintaining robustness during pruning.

Impact of Calibration Data
The calibration data is crucial for our methodology because it directly affects the computation of the Hessian Matrix. As outlined in Algorithm 1, the Hessian Matrix can be derived as H = XᵀX.
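Since H = XᵀX is a sum of per-batch Gram matrices, it can be accumulated over the calibration set one batch at a time, so the full calibration data never has to be held in memory at once. A minimal sketch (the function name `calibration_hessian` is hypothetical):

```python
import numpy as np

def calibration_hessian(batches, d):
    """Accumulate H = X^T X over calibration batches.

    batches : iterable of (batch, d) arrays of layer inputs
    d       : input dimension of the layer
    """
    H = np.zeros((d, d))
    for X in batches:
        H += X.T @ X  # Gram matrices add across batches
    return H
```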
To further explore the impact of the number of data points, we designed experiments that gradually increase the number of data points used in our strategy. The results of these experiments are detailed in Figure 3. Our observations indicate that as the number of data points increases, the robustness and accuracy of the sparse language models increase, but only up to a certain threshold.
We hypothesize that the model can initially retain more general knowledge as data points increase. However, once a threshold is crossed where new data cannot provide additional information about general features, adding more data points from a similar distribution no longer contributes to model robustness and accuracy.

Impact of Sparsity
As illustrated in Figure 4, we explore the robustness and accuracy of our sparse language models across a range of sparsity levels. In a departure from previous studies (Zheng et al., 2022), our observations indicate that as sparsity increases, robustness decreases at a pace similar to accuracy. This trend suggests that the impact of increasing sparsity on model robustness might be less severe than previously assumed. This disparate pattern may stem from the post-training nature of our method. Furthermore, our observations regarding the trend in robustness align with the findings of previous studies by Zheng et al. (2022) and Liang et al. (2021).
We note that the robustness of our sparse language models initially improves as sparsity escalates up to a certain threshold.After crossing this threshold, the robustness begins to decline.However, it sustains a level of robustness that is higher than the peak value observed in other models and does not collapse even with 10x compression.This finding further highlights the outstanding performance of our method in robust pruning.

Conclusion
In this paper, we investigate the application of robust pruning methods for language models. We propose an adaptive pruning method and place a special emphasis on replicating the embedding and feature space of dense models to preserve as much pre-trained knowledge as possible. The effectiveness of this approach is confirmed through a series of extensive experiments.

Limitations
This work introduces a post-training method that can robustly prune language models without model retraining. Despite bypassing the rigorous retraining process, the computational cost of our method remains significant due to the calculation of the Hessian Matrix and its inverse. Consequently, this approach may not be feasible for language models comprising billions of parameters. As a next step, we aim to refine our technique to devise a more efficient strategy for replicating the feature space and embedding space of language models. Furthermore, these advancements could contribute to reducing the computational resources required for training and using large language models, which aligns with efforts to reduce the environmental impact of machine learning. However, the increased robustness of models against adversarial attacks could also be used maliciously if the technology falls into the wrong hands. Bad actors could potentially exploit robust models for the generation of disinformation or the manipulation of public sentiment, for instance. Furthermore, although our technique aims to faithfully replicate the feature space of dense models, bias present in the original training data could be preserved in the pruned models. Consequently, decisions made based on the output of these models could perpetuate these biases.
We encourage the use of our findings and methods for applications that promote the public good and contribute to human welfare.Further, we recommend that researchers and practitioners using this technique take into account potential biases in their training data and consider strategies for minimizing their impact.In the future, we hope to conduct more research on mitigating bias and other ethical issues associated with our pruning strategy.It is our belief that technology should be developed and used in a way that is transparent, fair, and beneficial to all.

A Appendix-A A.1 Pruning with Hessian Matrix
As described in Section 3.2, we prune each layer of the language model in a layer-wise setting. This involves an iterative procedure that removes a single weight per step and updates the remaining weights until the desired sparsity level is attained. While this approach yields a locally optimal solution, it involves a computationally expensive step: calculating the Hessian Matrix at each iteration. Note that storing a Hessian Matrix H requires d × d memory, and updating it has a computational complexity of O(d⁴), where d = d_row · d_col.

A.2 Accelerated Pruning with Hessian Matrix
Previous research highlights that the Hessian values across different rows of the weight matrix are independent, because removing a single weight in one row of the matrix only affects that row's output. Consequently, we can simplify the objective function to

$$\sum_{i=1}^{d_{row}} \| W_{i,:} X - \hat{W}_{i,:} X \|_2^2,$$

and a separate Hessian Matrix of size d_col × d_col for each row is sufficient to locate the optimal weight for removal. Additionally, since the output Y = WX of the dense layer remains fixed, and the objective function for each row takes the standard form of least squares, its Hessian Matrix can be calculated as H = 2XXᵀ (Frantar and Alistarh, 2022).
As the Hessian Matrix H no longer depends on the weights, we only need to compute H once. After each pruning step, the masked Hessian H_M (M denotes the operation of removing, or masking, a single weight) can be obtained by masking the value at the corresponding location. However, this trick cannot be applied to H⁻¹, since (H⁻¹)_M ≠ (H_M)⁻¹, making the computation still expensive. Frantar and Alistarh (2022) use the Gaussian elimination technique for a more efficient update of H⁻¹. A mathematical exposition of this technique is provided below:

$$H^{-1}_{-p} = \left( H^{-1} - \frac{1}{[H^{-1}]_{pp}} \, H^{-1}_{:,p} H^{-1}_{p,:} \right)_{-p}$$

where −p means removing the single weight at index p.
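This rank-one inverse update can be sketched and checked numerically in a few lines. The helper name `remove_from_inverse` is hypothetical; the key point is that it produces the inverse of H with row and column p deleted without re-inverting the matrix.

```python
import numpy as np

def remove_from_inverse(H_inv, p):
    """Given H^-1, return the inverse of H with row/column p
    removed, via the rank-one (Gaussian elimination) update:
    H^-1_{-p} = (H^-1 - H^-1[:,p] H^-1[p,:] / [H^-1]_pp)_{-p}.
    """
    Hi = H_inv - np.outer(H_inv[:, p], H_inv[p, :]) / H_inv[p, p]
    # Drop row p and column p to obtain the reduced inverse.
    return np.delete(np.delete(Hi, p, axis=0), p, axis=1)
```

Compared with re-inverting the reduced Hessian at each pruning step, the update costs O(d²) per removal instead of O(d³), which is what makes the iterative removal loop tractable.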
For more comprehensive details, please refer to the work of Frantar and Alistarh (2022).

B Appendix-B B.1 Efficiency Analysis of Hessian Matrix
We recognize the importance of addressing the efficiency concern related to the Hessian Matrix calculation. However, grasping the intricate balance between computational complexities and their broader implications is crucial. To provide clarity, we offer an in-depth analysis of computational complexities from both micro and macro viewpoints, contrasting our method with approaches that necessitate model retraining.

B.2 Micro Perspective
When considering models like BERT base and BERT large, the computational requirements for the Hessian Matrix of one layer do not exceed those of model retraining in most cases. To clarify this, we analyze the complexity of our method step by step based on Algorithm 2.
Algorithm 2 prunes a linear layer l of BERT to a target sparsity s with calibration data X. It iterates the pruning step with the Hessian Matrix, applies the adaptive update, and finally updates the input X of the next layer with postprocess(Y).

Notations: To facilitate understanding, we first introduce the notations essential for the complexity analysis. The sparsity ratio, a value lying between 0 and 1, is denoted by s. The input dimension of the linear layer is represented by d_in and the output dimension by d_out. The per-layer complexity analysis follows Frantar and Alistarh (2022).
Key observations: A pivotal observation is that this complexity is uninfluenced by the batch size n, because the calibration data keeps n restricted to a constant in [128, 1024]. The cubic relationship with d_in is the primary driver of the complexity, and for larger d_in, this can escalate substantially.

B.3 Comparison with Re-Training Method
In contrast, when training a single layer using SGD, the complexity is approximately O(n × seq × d_in × d_out). This complexity scales linearly with the batch size n, which can grow markedly with large datasets and many training epochs. Although the complexity of the pruning operation remains constant regardless of n, the training complexity escalates, posing computational challenges for extensive datasets, long sequences, and many training epochs. We now dive deeper into the comparative insights.
Batch Size: Our pruning method relies on calibration data, constraining n to moderate values, typically between 128 and 1024. This sharply diverges from the conventional training paradigm, where n can inflate significantly with extensive datasets and many training epochs, thereby magnifying the computational requirements.
Dimensionality Dependency: The intrinsic complexity of our pruning algorithm has a cubic dependency on d_in. This can render it computationally onerous, especially for layers with a large d_in. Conversely, traditional training exhibits a linear dependency on both d_in and d_out.
In summary, the computational demands of our pruning method, particularly for layers with a large d_in, are unquestionably stringent. However, it is important to recognize the significant computational burden introduced by traditional training, mainly due to its sensitivity to dataset size. Understanding this balance and trade-off is crucial when comparing the effectiveness and suitability of our pruning approach against traditional retraining.
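The two asymptotic costs above can be compared with a back-of-envelope sketch. All sizes are illustrative assumptions, not measurements from our experiments:

```python
def pruning_flops(d_in, d_out, s):
    # O(d_in^3 * s * d_out): outer loop over d_out rows, inner loop of
    # int(d_in * s) steps, each with O(d_in^2) work
    return s * d_in**3 * d_out

def retraining_flops(d_in, d_out, n, seq, epochs):
    # O(n * seq * d_in * d_out) per epoch for one linear layer
    return epochs * n * seq * d_in * d_out

# A BERT-base-like layer (d_in = d_out = 768) at 50% sparsity, versus
# retraining on a 100k-example dataset (seq = 128) for 3 epochs.
prune_cost = pruning_flops(768, 768, s=0.5)
retrain_cost = retraining_flops(768, 768, n=100_000, seq=128, epochs=3)
```

With these illustrative numbers, the one-shot pruning cost is fixed by the layer dimensions and sparsity, while the retraining cost keeps growing with dataset size and number of epochs.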

B.4 Macro Perspective
Predictable Processing Time and Promised Output: Notably, from a broader view, while our approach introduces a dependency between layers and potentially increases processing time, the number of layers in common language models is limited. This means we can accurately predict the time needed to complete the pruning process and expect positive results in return.
Layer-by-Layer Computation for Resource Efficiency: While the sum of Hessian Matrix computations across the entire language model is time-intensive, our approach addresses this by employing a layer-by-layer strategy. As a result, there is no need to load the entire model into memory simultaneously. From a memory-allocation standpoint, pruning with the Hessian Matrix can therefore be viewed as a resource-saving measure.
Efficient Post-training Pruning: A post-training pruning strategy is at the heart of our methodology. Unlike many other approaches that require recurrent training sessions or exhaustive reiterations, ours saves significant resources by strategically avoiding these processes.
Computational Commitment: While pruning with the Hessian Matrix does carry a computational time cost, it is paramount to understand our larger vision. The ultimate objective is not merely to save time but to sculpt a model characterized by three pillars: sparsity, robustness, and high performance. Such a model offers considerable advantages in real-world scenarios. Thus, the computational expenses encountered in this phase can be viewed less as costs and more as strategic investments.

C Appendix-C

C.1 More Adversarial Attacks
To demonstrate the superiority of our method, we have incorporated further experiments targeting two additional well-recognized adversarial attacks: BERT-Attack and TextBugger (Li et al., 2020b, 2018). BERT-Attack, powered by BERT, guarantees fluency and retains semantics in its adversarial outputs. Conversely, TextBugger integrates both character- and word-level perturbations to yield adversarial instances, introducing a new set of challenges for our defense mechanism. We use state-of-the-art methods (RobustT and EarlyRobust) as baselines and report the results in Table 6 (Zheng et al., 2022; Xi et al., 2022). Our approach consistently demonstrates superior robustness of sparse language models across various sparsity levels and datasets.

C.2 More Pruning Baseline
As recommended by the reviewer, we have included Movement Pruning (Sanh et al., 2020) as an additional baseline in our experiments. Our original selection of baselines was grounded in their capacity to simultaneously address accuracy, sparsity, robustness, and pruning cost. It should be noted that Movement Pruning predominantly emphasizes accuracy and sparsity. Nevertheless, to offer a complete perspective, we have included it in our experimental evaluation. The comparative results are presented in Table 7. Evidently, while our method may trail slightly in clean accuracy, it significantly outperforms Movement Pruning under adversarial conditions, highlighting the robustness of our approach.

D Appendix-D

D.1 More Implementation Details
We utilize various hyperparameters and settings to fine-tune multiple downstream models for each dataset; they are presented in Table 8. Subsequently, we apply weight averaging in a greedy manner to derive robust and dense models; the detailed procedure is outlined step by step in Algorithm 3. We adopt TextAttack (Morris et al., 2020) to implement the adversarial attacks. Moreover, Aua% and Suc% are evaluated on all 872 test examples for SST-2 and on 500 randomly selected test samples each for IMDB and AG NEWS.
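A simplified sketch of the greedy weight-averaging step is given below, written in the spirit of Algorithm 3 but not reproducing it exactly. Here `evaluate` is an assumed validation callback, and models are plain name-to-array state dicts, assumed sorted by individual validation score (best first):

```python
def average(models):
    """Element-wise average of a list of state dicts."""
    keys = models[0].keys()
    return {k: sum(m[k] for m in models) / len(models) for k in keys}

def greedy_weight_average(models, evaluate):
    """Greedy weight averaging: keep a candidate model only if adding it
    to the running average preserves or improves the validation score.

    models:   state dicts sorted by individual validation score.
    evaluate: scores an averaged state dict (higher is better).
    """
    soup = [models[0]]
    best = evaluate(average(soup))
    for m in models[1:]:
        score = evaluate(average(soup + [m]))
        if score >= best:       # candidate helps: keep it in the average
            soup.append(m)
            best = score
    return average(soup)
```

Models that would hurt the averaged validation score are simply skipped, so the final dense model only aggregates weights that contribute to performance.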
The number of calibration sentences in our main experiments ranges from 256 to 1024. During pruning, we conduct our experiments on a server with a single NVIDIA 3090 GPU. Due to the layer-wise setting, we do not need to occupy substantial GPU memory, and our adaptive rule enables us to obtain an end-to-end rectification effect similar to SGD optimization.

D.2 Impact of Structured Pruning
Drawing inspiration from the work of Xi et al. (2022), we also investigate the impact of structured pruning in our strategy. In particular, we evaluate our method's performance under N:M structured patterns and summarize the results in Table 5. We make several key observations from these experiments: 1) our method consistently produces better robust pruning results than other robust pruning methods in the context of structured pruning; 2) as shown by Xi et al. (2022), structured pruning enhances the robustness of subnetworks compared to unstructured pruning. Our experiments confirm the positive impact of structured patterns in pruning, solidifying the effectiveness of our robust pruning method.
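For reference, the "Stru32:64" pattern can be sketched with a simple magnitude-based variant. Our method uses a Hessian-based criterion rather than magnitude; this sketch only illustrates the bank structure itself:

```python
import numpy as np

def nm_prune(weight, n_keep=32, bank=64):
    """Zero all but the `n_keep` largest-magnitude weights in every
    consecutive bank of `bank` weights (N:M structured pattern).

    Assumes weight.size is divisible by `bank`.
    """
    flat = weight.reshape(-1, bank)                    # one row per bank
    # indices of the smallest (bank - n_keep) entries in each bank
    drop = np.argsort(np.abs(flat), axis=1)[:, : bank - n_keep]
    mask = np.ones_like(flat, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=1)       # zero-out positions
    return (flat * mask).reshape(weight.shape)
```

With the default 32:64 setting, every bank retains exactly half of its weights, yielding a hardware-friendly 50% sparsity.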

E Appendix-E

E.1 Model Pruning
Pruning aims to eliminate redundant elements in neural networks, traditionally targeting the least important elements as measured by weight magnitude, output sensitivity, gradients, or Hessian matrices of the training loss, among others. Pruning pre-trained language models like BERT has been an active field of research. Prasanna et al. (2020) demonstrated that unstructured pruning yields sparser and more accurate models. Pruning at the pre-training stage has been favored by researchers like Gordon et al. (2020) and Chen et al. (2021) due to its efficiency and effective knowledge transfer to downstream tasks. Sanh et al. (2020) add penalty terms to the loss function to eliminate redundant weights. Frantar and Alistarh (2022) introduce an effective post-training pruning method, the first approach to prune a language model in a one-shot manner without significant degradation in accuracy. However, these studies neglect robustness, focusing mainly on the accuracy-sparsity trade-off. Recent work has begun to note the issue of robustness for sparse language models, but the challenge of enhancing robustness with increased sparsity persists (Zheng et al., 2022; Du et al., 2023; Xu et al., 2021; Liang et al., 2021; Xi et al., 2022), and the underlying causes of low robustness in language models remain elusive.

E.2 Post-Training Pruning
Pruning methods can be categorized into Post-Training Pruning and In-Training Pruning according to whether they require extra model retraining. In the former, we are given a trained but uncompressed model together with a small amount of calibration data, and must produce an accurate compressed model in one shot, i.e., a single compression step, without retraining and with limited computational cost. This is motivated by practical scenarios such as large language models, which are hard to train or even fine-tune because of the complicated training process. The method in this paper is a post-training pruning method.

E.3 Layer-wise Pruning
Layer-wise pruning is an important approach to optimizing language models, offering a distinct methodology compared to end-to-end pruning, which evaluates and prunes the entire model as a whole. Layer-wise pruning instead tackles each layer of the neural network individually: pruning decisions are based on a layer-specific analysis, often using a metric such as the magnitude of the weights to determine which parameters within that layer are least significant and can be removed without substantially impacting the layer's output. By selectively reducing the number of parameters in each layer, layer-wise pruning can effectively decrease the computational requirements and memory footprint of language models while maintaining their accuracy. It also provides a more granular level of control over the pruning process, which can be beneficial in preserving model performance while achieving efficiency gains.
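A minimal sketch of the generic layer-wise magnitude criterion described above (this is the common baseline metric, not the Hessian-based criterion our method uses):

```python
import numpy as np

def prune_layer_by_magnitude(weight, sparsity):
    """Zero the fraction `sparsity` of weights with the smallest
    magnitude, using a threshold computed from this layer alone."""
    k = int(weight.size * sparsity)
    if k == 0:
        return weight.copy()
    # k-th smallest absolute value in this layer sets the cutoff
    threshold = np.sort(np.abs(weight), axis=None)[k - 1]
    return np.where(np.abs(weight) <= threshold, 0.0, weight)
```

Because each layer computes its own threshold, per-layer sparsity can be tuned independently, which is exactly the granular control discussed above.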

Figure 1: Architecture of the Main Strategy. A: First, we generate a robust and dense language model in two steps: (1) we fine-tune the pre-trained weights with various hyperparameters and settings, resulting in multiple models with different knowledge; (2) we then employ a greedy algorithm to average only the weights of models that contribute to the final performance. B: Second, (3) we apply our adaptive pruning method to generate robust and sparse language models in a layer-wise setting. Specifically, we optimize the (1) original independent pruning process of each layer into (2) an adaptive one. This requires subsequent layers to update the Hessian Matrix and the optimal dense weights according to the sparse outputs of preceding layers, thereby inheriting and correcting the accumulated error together.

Figure 2: Attention Score Visualization in BERT base. We select an adversarial sample ("it's a bewitching and often repercussions journey.") from SST2 and visualize the attention scores in the robust and dense model (2b, 2e), the sparse language model generated with IMP+FreeLB (2a, 2d), and the sparse language model created using our method (2c, 2f). Figures 2a, 2b, and 2c depict the attention scores from the first transformer block of BERT base, while Figures 2d, 2e, and 2f show scores from the last transformer block. Evidently, the attention scores produced by our method align more closely with those of the robust and dense model.

Figure 3: Impact of the Number of Calibration Data from SST2.

Figure 4: Impact of Sparsity Levels on SST2.

Table 2: Ablation Study with Pruning Method Replacement. We replace our pruning method with other widely used methods (IMP and LTH), supplemented with adversarial training (FreeLB). Similarly, the orange entry is used for model initialization. Once again, our method outperforms the others in preserving or even enhancing robustness.

Table 3: Summary of Adversarial Robustness Assessment on BERT large. Similarly, the entry highlighted with an orange background is used for model initialization. Once again, our method consistently outperforms all baselines in terms of the Aua% and Suc% metrics.

Table 4: Quantitative Analysis of Distance between Sentence Embeddings. We compare the distances between sentence embeddings derived from various layers of dense and sparse language models. Our findings reveal that our method aligns better with the dense model, regardless of whether we use the original or the adversarial sentence. Refer to Figure 5 for a visualization of these sentence embeddings.

Table 5: Summary of Adversarial Robustness Assessment on BERT base in Structured Pruning. "Stru32:64" refers to a pruning strategy where, for every 64 consecutive weights (a bank) in a weight matrix, 32 of them are retained.

The output dimension, aligning with the weight matrix's other dimension, is symbolized by d_out. We use d = d_in × d_out to denote the overall size of the weight matrix. The batch size and the sequence length are given by n and seq, respectively.
Hessian Matrix Computation: In this phase, the matrix multiplication XX^T plays a pivotal role. Given X of shape d_in × (n × seq) and X^T of shape (n × seq) × d_in, the resulting matrix has shape d_in × d_in. This multiplication alone has a complexity of O(n × seq × d_in^2). Additionally, matrix inversion is another vital step, with a complexity of O(d_in^3). The computation of H^{-1} X^T Y further contributes a complexity of O(n × seq × d_in × d_out).
Pruning with the Hessian Matrix: In this context, the outer loop spans d_out iterations. Within each row of W, an inner loop of k = int(d_in × s) steps is executed, each comprising operations of O(d_in^2). Summing up, the inner loop complexity is O(k × d_in^2). Consequently, the combined complexity of the pruning phase is O(d_in × s × d_in^2 × d_out), simplifying to O(d_in^3 × s × d_out).
Adaptive Update: The matrix multiplication Y = WX dominates, with a complexity of O(n × seq × d_in × d_out).
Summing the complexities for a single layer yields O(2n × seq × d_in × d_out + n × seq × d_in^2 + d_in^3 + d_in^3 × s × d_out).

Table 6: Evaluation of various methods and datasets against different adversarial attacks.

Table 7: Comparison between our method and Movement Pruning under various attacks and sparsity levels.

Table 8: A Range of Hyperparameters and Settings for Weight Averaging.