The Cost of Compression: Investigating the Impact of Compression on Parametric Knowledge in Language Models

Compressing large language models (LLMs), often consisting of billions of parameters, provides faster inference, smaller memory footprints, and enables local deployment. Two standard compression techniques are pruning and quantization, with the former eliminating redundant connections in model layers and the latter representing model parameters with fewer bits. The key tradeoff is between the degree of compression and the impact on the quality of the compressed model. Existing research on LLM compression primarily focuses on performance in terms of general metrics like perplexity or downstream task accuracy. More fine-grained metrics, such as those measuring parametric knowledge, remain significantly underexplored. To help bridge this gap, we present a comprehensive analysis across multiple model families (ENCODER, ENCODER-DECODER, and DECODER) using the LAMA and LM-HARNESS benchmarks in order to systematically quantify the effect of commonly employed compression techniques on model performance. A particular focus is on tradeoffs involving parametric knowledge, with the goal of providing practitioners with practical insights to help make informed decisions on compression. We release our codebase to enable further research.


Introduction
Large language models (LLMs) have demonstrated exceptional performance across diverse tasks. However, their deployment in real-world applications is hindered by their substantial size and the associated costs, even for inference (Schwartz et al., 2020; Strubell et al., 2019). For instance, the LLaMA-65B model (Touvron et al., 2023), a pioneering open-sourced LLM, uses approximately 130GB of RAM for 16-bit inference. To address this challenge, recent research has focused on developing novel compression techniques that enable efficient local deployment and inference. Notable examples of such techniques include SparseGPT (Frantar and Alistarh, 2023) and LLM.int8() (Dettmers et al., 2022).
The tradeoff between model compression and quality is typically studied either through general metrics like perplexity (See et al., 2016; Michel et al., 2019) or standardized benchmark task accuracy (Liang et al., 2021; Du et al., 2021) on, e.g., GLUE (Wang et al., 2018). Furthermore, much of the literature studies such tradeoffs for one model or a particular class of models. Unfortunately, as a result, practitioners do not have access to reliable insights or rules-of-thumb to ensure they can make an informed decision for compression in their own models. This is because:
• Metrics like perplexity are too general, while benchmark prediction metrics are too easy to fool. For example, recent findings suggest that distilled versions of foundational LLMs, known as imitation models, may exhibit stylistic similarities but potentially lack knowledge when compared to the models they seek to imitate (Gudibande et al., 2023).
• Most recent research on compression techniques has primarily focused on DECODER models. The applicability and effectiveness of such techniques for large ENCODER and ENCODER-DECODER models (Chung et al., 2022) has yet to be extensively studied.
These difficulties suggest that there is a need for a more fine-grained understanding of the effects of compression schemes, comparing a variety of model families, compression techniques, and specialized measurements.
We address these challenges, specifically focusing on the preservation of parametric knowledge, i.e., knowledge acquired during pretraining that is stored in model weights. This is particularly crucial for tasks involving reasoning and for specialized applications. Concretely, we examine the impact of different compression schemes on parametric knowledge across multiple model families (ENCODER, ENCODER-DECODER, and DECODER), where we apply pruning and quantization approaches and analyze the performance of such techniques on downstream reasoning tasks. To the best of our knowledge, this work represents one of the first large-scale investigations in this direction. Crucial observations resulting from this study include:
• Pruning all modules together has the most significant impact on parametric knowledge, compared to pruning specific modules.
• At pruning levels of >50%, the parametric knowledge of all the models declines rapidly.
• Quantizing attention modules has less impact on performance compared to quantizing feed-forward networks for all the models.
• Across all models, structured pruning at the final layer has detrimental effects compared to unstructured pruning.

Background
In this section, we briefly discuss the various compression techniques we use in our study.

Pruning
Pruning involves reducing the model size by eliminating unnecessary or redundant connections between neurons, or entire neurons altogether. Broadly speaking, pruning approaches can be classified into two types (Fig. 1):
Unstructured Pruning: Each connection is treated as an individual entity, and sparsity is attained by eliminating connections with lower saliency. Although this approach enables the removal of less important connections without compromising performance, it leads to sparse matrix operations, which may not be optimal for certain hardware accelerators (Buluc and Gilbert, 2008; Gale et al., 2019).
Structured Pruning: This involves removing a group of connections, such as channels or entire neurons, instead of individual connections. Unlike unstructured pruning, this approach avoids introducing sparse matrix operations. However, aggressive structured pruning may disproportionately impact the model's performance (Yao et al., 2019).
Choosing Saliency of Weights: When choosing the criterion to determine saliency, various factors can be taken into account, such as weight magnitude, importance to the overall network functionality, or contribution to specific tasks. Typically, the saliency of weights is determined based on their magnitudes when selecting which ones to remove during pruning. A sparsity of k% means that the least salient k% of connections are removed.
The most commonly used pruning types are:
1. L1-Unstructured: Connections between neurons are eliminated individually, and their saliency is determined by their L1-norm, i.e., the smallest weights are removed.
2. Lp-Structured: Connections are eliminated in a structured way, i.e., an entire layer/channel is removed, and saliency is determined by its Lp-norm, where p is a hyperparameter.
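As a concrete illustration, the two saliency-based schemes above can be sketched in a few lines of NumPy. This is a didactic sketch, not the code used in our experiments; in PyTorch the corresponding routines are `torch.nn.utils.prune.l1_unstructured` and `torch.nn.utils.prune.ln_structured`.

```python
import numpy as np

def l1_unstructured_prune(W, sparsity):
    """Zero out the individual weights with the smallest |w| (L1 saliency)."""
    k = int(np.floor(sparsity * W.size))
    if k == 0:
        return W.copy()
    threshold = np.sort(np.abs(W), axis=None)[k - 1]
    return W * (np.abs(W) > threshold)

def l1_structured_prune(W, sparsity, axis=0):
    """Zero out entire rows (axis=0) or columns (axis=1) with the smallest L1 norm."""
    norms = np.abs(W).sum(axis=1 - axis)
    k = int(np.floor(sparsity * norms.size))
    pruned = W.copy()
    if k == 0:
        return pruned
    drop = np.argsort(norms)[:k]  # least-salient groups
    if axis == 0:
        pruned[drop, :] = 0.0
    else:
        pruned[:, drop] = 0.0
    return pruned
```

Note that the unstructured variant leaves scattered zeros (sparse matrices), while the structured variant removes whole rows or columns, which dense hardware kernels can exploit directly.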

Quantization
Model parameters can be categorized into weights and activations, which are typically represented using 32 bits. Quantization aims to reduce the number of bits used for representing these parameters.
A popular choice for this mapping is

Q(r) = Int(r/S) − Z,

where Q is the quantization operator, r is a real-valued input (weight or activation), S is a real-valued scaling factor, and Z is an integer zero-point. An important factor in mapping r to an integer is the scaling factor S. This is usually given by

S = (β − α) / (2^b − 1),

where [α, β] denotes the clipping range and b is the quantization bit width. The process of determining the clipping range is known as calibration. Extensive research has been conducted to determine the optimal range to reduce the bit representation while balancing accuracy, computational efficiency, and inference speed (Gholami et al., 2021). In most cases, statistics for weights are precomputed, as they remain constant during inference. Often, it may be necessary to fine-tune the quantized model parameters to enhance performance on task-specific datasets. Taking these factors into account, various methods have been proposed (Nagel et al., 2021):
Post Training Static Quantization (PTSQ): The clipping range for activations is pre-calculated using a representative dataset, which is a small subset derived from the task-specific dataset. Using this clipping range, the activations are quantized in advance and thus remain static during inference.
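For concreteness, the calibration and quantization steps can be sketched as follows. This is a minimal NumPy sketch of the standard asymmetric uniform scheme, Q(r) = round(r/S) − Z with S = (β − α)/(2^b − 1), not the implementation used in our experiments.

```python
import numpy as np

def calibrate(alpha, beta, b):
    """Turn a clipping range [alpha, beta] into a scale S and zero-point Z."""
    S = (beta - alpha) / (2 ** b - 1)
    Z = int(round(alpha / S))
    return S, Z

def quantize(r, S, Z, b):
    """Q(r) = clip(round(r/S) - Z, 0, 2^b - 1)."""
    q = np.round(np.asarray(r) / S) - Z
    return np.clip(q, 0, 2 ** b - 1).astype(np.int64)

def dequantize(q, S, Z):
    """Approximate recovery of r: r_tilde = S * (q + Z)."""
    return S * (q + Z)
```

For any input inside the clipping range, the quantize-dequantize round trip introduces an error of at most S/2, which makes explicit why a tight calibration range (small S) matters.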
Post Training Dynamic Quantization (PTDQ): The clipping range is dynamically calculated for each activation during inference. Although this introduces additional computational overhead, it yields improved performance compared to PTSQ, as the signal range is calculated exactly for each input.
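A minimal sketch of the dynamic variant, where the clipping range [α, β] is read off each individual input at inference time (illustrative only, not our experimental code):

```python
import numpy as np

def dynamic_quantize(x, b=8):
    """PTDQ sketch: calibrate the clipping range from this input's own
    min/max, quantize, and return the dequantized activations."""
    alpha, beta = float(x.min()), float(x.max())
    if beta == alpha:                 # constant input: nothing to quantize
        return x.copy()
    S = (beta - alpha) / (2 ** b - 1)
    q = np.clip(np.round((x - alpha) / S), 0, 2 ** b - 1)
    return S * q + alpha              # dequantized values
```

The per-input min/max computation is exactly the extra inference-time overhead mentioned above; in exchange, no representative calibration dataset is needed.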
Quantization Aware Training (QAT): The model undergoes a process known as fake-quantization, i.e., during training all calculations in the forward and backward passes are performed in full precision. Subsequently, after updating the weight parameters through gradient descent, the weights are quantized to a lower bit width. While this approach achieves the highest performance, it requires finetuning the model.
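The sequence described above, full-precision forward/backward and update followed by weight quantization, can be sketched on a toy least-squares problem. This is a didactic sketch under our own simplifications; production QAT pipelines typically also use straight-through gradient estimators and per-layer scales.

```python
import numpy as np

def quantize_weights(w, b=4):
    """Symmetric round-to-nearest quantization, dequantized back to float."""
    S = np.abs(w).max() / (2 ** (b - 1) - 1)
    return S * np.round(w / S) if S > 0 else w.copy()

def qat_step(w, x, y, lr=0.1, b=4):
    """One QAT step: full-precision gradient computation and update,
    then quantize the updated weights to b bits."""
    grad = 2.0 * x.T @ (x @ w - y) / len(x)   # least-squares gradient
    return quantize_weights(w - lr * grad, b)
```

On a small regression problem this loop steadily reduces the loss while keeping the weights on a coarse 4-bit grid after every update.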
We note that while a huge diversity of often sophisticated and specialized compression methods has been proposed, we focus on a subset of standard approaches. This enables us to seek more general insights on compression tradeoffs.

Experimental Setup
In this section, we present a comprehensive overview of our experimental setup, including the rationale behind our design choices, along with the selection of models and datasets.

Settings Under Consideration
The general transformer block consists of an attention module followed by a feed-forward network. As a result, we consider three choices for compression: compress the attention module alone (§3.2), compress the feed-forward network alone (§3.3), or compress both together (§3.4). Figure 2 contains a visual representation of these modules.
Our chosen compression techniques include pruning, quantization, and a combination of pruning and quantization. Following the methodology proposed in Han et al. (2015), we adhere to the sequential order of pruning the selected group of modules first and then applying quantization. In addition, we also investigate the impact on distilled models and explore the effects of employing various combined compression techniques.
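The prune-then-quantize order from Han et al. (2015) can be sketched as follows (illustrative NumPy, not our experimental pipeline). One reason the order matters: zeroed weights map back to exactly zero under uniform quantization, so the sparsity pattern from pruning is preserved.

```python
import numpy as np

def prune_then_quantize(W, sparsity=0.2, b=8):
    """Sequential compression: L1-unstructured pruning first, then
    symmetric uniform quantization (dequantized back to float)."""
    k = int(np.floor(sparsity * W.size))
    if k:  # 1) zero out the k smallest-magnitude weights
        thresh = np.sort(np.abs(W), axis=None)[k - 1]
        W = np.where(np.abs(W) > thresh, W, 0.0)
    # 2) quantize the surviving weights; zeros round back to zero
    S = np.abs(W).max() / (2 ** (b - 1) - 1)
    return S * np.round(W / S) if S > 0 else W
```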

Attention-only Global Compression
We include all the linear layers within all the attention modules of the model. For encoder-decoder models, we also consider the cross-attention blocks.
Attention-only Global Pruning (Att GP): We apply pruning to all the linear layers within the attention modules.
Attention-only Global Quantization (Att GQ): We quantize all the linear layers within the attention modules.
Attention-only Global Pruning + Quantization (Att GPQ): We prune the linear layers in the attention modules and subsequently quantize them.

Feed-forward-only Global Compression
We include all the linear layers within all the feed-forward networks of the model.
Feed-forward-only Global Pruning (FF GP): We apply pruning to all the linear layers within the feed-forward networks.
Feed-forward-only Global Quantization (FF GQ): We quantize all the linear layers within the feed-forward networks.
Feed-forward-only Global Pruning + Quantization (FF GPQ): We prune all the linear layers in the feed-forward networks and subsequently quantize them.

Overall Global Compression
We specifically target the linear layers within the attention modules and feed-forward networks. Under this compression, the different setups are:
Overall Global Pruning (Overall GP): We apply pruning to all the linear layers (except the final dense layer).
Overall Global Quantization (Overall GQ): We apply quantization to all the linear layers (including the final dense layer).
Overall Global Pruning + Quantization (Overall GPQ): We first apply pruning to all the linear layers (except the final dense layer), and subsequently quantize all the linear layers.

Final Dense Layer Pruning (FL P)
Recent studies (Mitchell et al., 2021, 2022; Meng et al., 2022) provide evidence suggesting that the final layers of a language model play a significant role in storing information. Given its importance, we focus on understanding how knowledge is encoded in the final layer. Therefore, we treat the final layer as an individual module in our experimental setup and prune it. We consider L1-structured and L1-unstructured pruning as outlined in §2.1.
We note that the number of parameters compressed differs across settings. We record all of the values required for normalizing measurements. However, our focus is predominantly on understanding the effects of compressing modules and their combinations rather than presenting normalized results, and our insights reflect this framing. We provide full parameter counts that permit normalized quantities for practitioners who seek to directly apply our work, and refer the reader to §A.2 for more details.
• Because quantization during inference is dynamic in nature, the order of inputs within a batch has a minor impact on the final accuracy (<1%). Therefore, we seed the experiments to ensure consistent and reproducible results (§A.1).
• Previous studies (Gordon et al., 2020; Michel et al., 2019) suggest that pruning levels of 30%-40% do not affect the model on downstream tasks. Such rules-of-thumb may or may not hold for parametric knowledge. In our experimental settings (GPQ, FL P), we select 20% and 40% as the levels to understand when a similar result holds.

Model Zoo
We consider the following models for our study.
Where available, we choose both the base and large versions of the model to understand if larger models exhibit different behavior.

Datasets
We use the following datasets for our empirical analysis:
LAMA: To examine the effects of compression on encoder-only models, we use the LAMA (LAnguage Model Analysis) benchmark (Petroni et al., 2019). LAMA assesses the factual and commonsense knowledge of language models. Each example in LAMA is formulated as a cloze-style question, where either the subject or object is masked. By predicting the masked word, we can evaluate the model's ability to recover real-world facts. Specifically, we probe the encoder-only models with LAMA to investigate the impact of compression on various knowledge tasks. This benchmark consists of four datasets, namely T-REx, Google-RE, ConceptNet, and SQuAD, each designed to assess specific types of relational knowledge. These datasets provide valuable insights into the model's performance and its understanding of different types of information.
Language model evaluation harness: To examine the effects of compression on encoder-decoder and decoder-only models, we use a subset of evaluation-harness tasks (Gao et al., 2021): the BoolQ dataset (Clark et al., 2019), the PIQA dataset (Bisk et al., 2020), and the Winogrande dataset (Sakaguchi et al., 2021). These datasets provide a range of challenging prompts for each model type. We refer the reader to Table 2 for examples of samples from each dataset.

Experimental Results and Insights
To facilitate our discussion, we categorize pruning levels as follows:
• p low: sparsity levels of 10-30%
• p medium: sparsity levels of 30-50%
• p high: sparsity levels of >50%
For encoder-only models, we report the % drop in top-1 accuracy, averaged across all the probes in LAMA. For the decoder-only and encoder-decoder models, we report the % drop in accuracy, averaged across BoolQ, PIQA, and Winogrande. In the decoder-only and encoder-decoder plots, the majority-baseline indicates the accuracy when all the predictions are assigned to the majority class.
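As a concrete reading of this metric (under our own assumption that the drop is measured relative to the uncompressed baseline rather than in absolute percentage points):

```python
def pct_drop(baseline, compressed):
    """Relative % drop in accuracy vs. the uncompressed baseline."""
    return 100.0 * (baseline - compressed) / baseline

def averaged_drop(baselines, compressed):
    """Mean of the per-dataset drops, e.g. over BoolQ, PIQA, Winogrande."""
    drops = [pct_drop(b, c) for b, c in zip(baselines, compressed)]
    return sum(drops) / len(drops)
```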

Global Pruning
We observe that for encoder-only models (Fig. 3, 19), there is a minimal decline in performance at p low. At p medium, the drop in performance is more significant for pruning feed-forward networks (FF GP) than for pruning attention modules (Att GP).

Finding:
At p medium, for encoder-only models, pruning attention modules (Att GP) has a smaller impact compared to pruning feed-forward networks (FF GP).
We observe that for encoder-decoder (Fig. 5, 13) and decoder-only models (Fig. 4), there is a minimal decline in performance at p low. However, at p medium, the drop in performance is more significant for pruning attention modules (Att GP) compared to feed-forward networks (FF GP).

Finding:
At p medium, for encoder-decoder and decoder-only models, pruning the attention modules (Att GP) has more impact on performance compared to pruning feed-forward networks (FF GP). We note that the number of parameters in the feed-forward networks is significantly higher than the number of parameters in the attention modules for all these models (Table 3). This observation provides a likely explanation for the pattern observed in encoder-only models, where pruning more parameters results in a higher loss of parametric knowledge. However, this finding is counterintuitive for encoder-decoder and decoder-only models, as we would expect that pruning the larger feed-forward networks would have a more significant impact on the parametric knowledge. We suspect that the feed-forward networks are overparameterized and thus can be pruned without a significant drop in performance.
Finding: For all the models, pruning all the modules together (Overall GP) has the most significant impact on performance.
This finding suggests that when compressing models, pruning all modules simultaneously leads to a greater loss of parametric knowledge compared to pruning specific modules or components individually. Therefore, it is crucial to carefully consider the implications of employing global pruning techniques. We additionally note that at p high, the performance goes to zero, as expected.

Global Quantization
We observe that across all the models (Fig. 6, 15, 16), the performance drop is less significant when quantizing attention modules (Att GQ) compared to quantizing feed-forward networks alone (FF GQ). This contrasts with the results from global pruning (§4.1), where pruning attention-only modules had a more detrimental effect on encoder-decoder and decoder-only models.
Finding: For all the models, quantizing attention modules (Att GQ) has less impact compared to quantizing feed-forward networks (FF GQ).
We hypothesize that in the case of quantization, where all connections are preserved, the parametric knowledge in cross-attention modules may remain relatively intact. In pruning, however, as connections are eliminated, there may be a greater impact on the parametric knowledge in cross-attention modules, thereby affecting the overall capabilities of the model. It is also interesting to observe that the performance drop during quantization is almost similar to that at p medium.
Finding: For all the models, quantizing all the modules together (Overall GQ) hurts the most.
It is intuitive that quantizing all the modules together (Overall GQ) has the most significant negative impact. Additional results are shown in Tables 4, 5, 6.

Global Pruning + Quantization
For all the models (Fig. 7, 17, 18), at 20% sparsity, compressing attention modules (Att GPQ) results in a smaller performance drop compared to compressing feed-forward networks (FF GPQ). At 40% sparsity, the same trend is observed for encoder-only and decoder-only models. However, we notice the reverse for ENCODER-DECODER models, i.e., compressing feed-forward networks affects performance less than compressing attention modules at 40% sparsity.
Finding: For all the models, at the 20% sparsity level, Att GPQ hurts less compared to FF GPQ.
We hypothesize that the sequential effects of pruning and quantization on the cross-attention modules could be responsible for this change in the order of impact. To test this hypothesis, we selectively prune and quantize the self-attention and cross-attention modules separately and find that this is indeed the case (Fig. 8), which aligns with the claim made in Michel et al. (2019). Additional results for compressing attention-only modules are shown in Fig. 12, 24. For fine-grained analysis on individual datasets, we refer the interested reader to Tables 4, 5, 6.

Final Dense layer Pruning
For encoder-only models (Fig. 9), L1-unstructured pruning has a smaller impact compared to L1-structured pruning. We hypothesize that the final layer of the encoder-only models might encode knowledge in a structured or modular manner, and any form of structured compression would disrupt this encoding, resulting in a larger performance drop. Such a result would be consistent with existing approaches that enable editing knowledge in language models and rely on structure (Mitchell et al., 2021).
Finding: For encoder-only models, L1-unstructured pruning leads to a smaller decrease in performance than L1-structured pruning.
For decoder-only (Fig. 10) and encoder-decoder (Fig. 14) models, even at a sparsity level of 20%, the predicted accuracy is very close to the majority baseline. This finding aligns with the claims made in Mitchell et al. (2022) that the final layers encode a significant amount of information. The drastic performance drop observed suggests that the final layers play a crucial role in encoding knowledge. Additional results for pruning the final layer are shown in Fig. 26, 27, 28.

Related Work
Early works seeking to understand large language model behavior focused on contextual representations and how such models gain linguistic capabilities (Goldberg, 2019; Ettinger et al., 2018; Jawahar et al., 2019). More recently, some lines of work have steered towards understanding how these models acquire factual and commonsense knowledge. Techniques such as probing evolved as a way to understand the knowledge capabilities of these models (Petroni et al., 2019; Kassner and Schütze, 2020; Talmor et al., 2020; Weir et al., 2020; Wallat et al., 2021).
Previous works, including Gordon et al. (2020) and Michel et al. (2019), pruned BERT and showed that it is resilient to a medium level of pruning. For example, Michel et al. (2019) showed that after finetuning for a particular downstream task, it is possible to prune about 40% of the attention weights without any loss in performance. A particular focus has been to understand the importance of the attention mechanism (Voita et al., 2019; Michel et al., 2019). Subsequent work has pushed the limits of quantization on language models (e.g., Dettmers et al., 2022). Most of these works have focused on one model class or a particular metric.
In another line of work, a variety of approaches (Li and Liang, 2021; Hu et al., 2021; Liu et al., 2021; Lester et al., 2021) focus on alternatives to traditional finetuning of models at this scale. In contrast to these works, our paper primarily focuses on the in-built parametric knowledge present in the model. This means we do not finetune, and instead seek to understand whether some of the previously described phenomena apply to other models as well.
Also connected to this work are techniques that edit factual knowledge in models. The goal of such works is to avoid retraining or even finetuning models, instead seeking to directly change parameters connected to certain facts (Mitchell et al., 2021, 2022; Meng et al., 2022). However, given our focus on compression, the main theme of our work differs. Nevertheless, it would be interesting to understand the impact of relying on compressed models when using such editing techniques.

Conclusion
Compression is crucial in deploying and using large language models. Despite its importance, existing empirical studies predominantly rely on generic measurements such as perplexity or standardized benchmark metrics when investigating the effects of compression. These coarse measurements are challenging to interpret. As a result, it is difficult to use them to develop meaningful heuristics for practitioners.
To address these limitations, we provided a large-scale study focused on fine-grained effects on quantities like parametric knowledge. We studied a variety of compression choices across multiple model families, providing usable insights into which types of compression schemes have the least and most significant impact on models. We hope this work serves as a useful step towards developing users' intuition for rules-of-thumb when selecting appropriate compression techniques for large language models. For future work, we hope to add additional, more specialized techniques for large language model compression.

Limitations
Our research has tackled a diverse combination of models, compression schemes, and compression targets within the vast large language model research area. We note that sophisticated and specialized compression techniques tailored to specific objectives for a particular class of models may exhibit distinct behavior compared to the findings presented in this study. Hence, our work does not aim to present an exhaustive set of findings that universally characterize the impact on parametric knowledge across all conceivable models and compression approaches. We believe that our study serves as a valuable starting point, offering a nuanced examination of prevalent methodologies.
We note, additionally, that we do not directly address the tradeoff between wall-clock inference time versus compression.While this is also an important tradeoff, the impact of compression on inference time contains many intricacies that are best treated with a separate large-scale study.

A Appendix
The appendix contains all of the results we could not include in the body of the paper. We first discuss the statistical approach of the experiments and the performance drop against compression ratio. Then, we show individual plots for a set of experiments that track the decrease in accuracy for several types of compression and models. Next, we provide a table with information on the datasets used in our experiments. We then provide tables with model details, including parameter counts, and explicit results for compression across model families, followed by a large-scale comparison across datasets for encoder-decoder models under various attention-module compression approaches. Finally, we provide LAMA probe results and change-in-accuracy plots for a variety of datasets for different model classes.

A.1 Experimental Approach
Our experiments fall into two categories: deterministic and stochastic. Our pruning experiments are deterministic, as we use L1-unstructured pruning. Our quantization experiments, on the other hand, have an element of randomness due to our use of PTDQ, which computes a dynamic clipping range. We deliberately struck a balance between the number of trials per setting and the overall number of settings studied. Consequently, we ran experiments with multiple seeds and recorded confidence intervals, as demonstrated in Table 1.
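A minimal sketch of the seeding described above (the function name is our own; when running with PyTorch one would also call `torch.manual_seed`):

```python
import random
import numpy as np

def seed_everything(seed=42):
    """Seed the RNGs that the stochastic (PTDQ) runs depend on,
    so repeated runs produce identical results."""
    random.seed(seed)
    np.random.seed(seed)
```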

A.2 Performance drop against compression ratio
Normalizing the x-axis to account for the parameter ratio results in the same plots, but with a skewed x-axis (Fig. 11). Given a specific performance-drop percentage, it is highly likely that we can achieve greater parameter compression by targeting feed-forward modules rather than attention modules. It is worth noting that across all the models studied, feed-forward modules have more parameters than attention modules.

Figure 2: Block diagram of a simplified Transformer describing the modules compressed in our experiments.

Figure 3: Averaged drop in Top-1 accuracy for encoder-only models for global pruning.

Figure 4: Averaged drop in accuracy for decoder-only models for global pruning.

Figure 5: Averaged drop in accuracy for encoder-decoder models for global pruning.
Additional results for global pruning on individual datasets for encoder-only models are shown in Fig. 20, 21, 22; for decoder-only models in Fig. 23; and for encoder-decoder models in Fig. 13, 25.

Figure 6: Averaged drop in Top-1 accuracy for encoder-only models for global quantization.

Figure 9: Averaged drop in Top-1 accuracy for encoder-only models for final layer pruning.

Figure 24: Drop in accuracy across various datasets for encoder-decoder models under various attention-module compression approaches. Top to bottom: BoolQ, PIQA, Winogrande (§4.3). Cross Att GPQ: compressing only cross-attention modules; Encoder Att GPQ: compressing attention modules in the encoder only; Decoder Att GPQ: compressing attention modules in the decoder only.

Table 1: Top-1 accuracy from quantizing BERT on SQuAD. Left: BERT-Base, Right: BERT-Large. The baseline for BERT-Base is 12.987 and for BERT-Large is 15.909.

Table 2: Datasets in our experiments (we use dev sets for BoolQ, PIQA, and Winogrande). Example Winogrande item: "The trophy doesn't fit into the brown suitcase because it's too small." → suitcase; "The trophy doesn't fit into the brown suitcase because it's too large." → trophy.

Table 3: Number of parameters (in millions) across all the models.

Table 4: Results from compressing different modules for encoder-only models (numbers represent top-1 accuracy).