Are Compressed Language Models Less Subgroup Robust?

To reduce the inference cost of large language models, model compression is increasingly used to create smaller, scalable models. However, little is known about their robustness to the minority subgroups defined by the labels and attributes of a dataset. In this paper, we investigate the effects of 18 different compression methods and settings on the subgroup robustness of BERT language models. We show that worst-group performance does not depend on model size alone, but also on the compression method used. Additionally, we find that model compression does not always worsen performance on minority subgroups. Altogether, our analysis serves to further research into the subgroup robustness of model compression.


Introduction
In recent years, the field of Natural Language Processing (NLP) has seen a surge of interest in the application of Large Language Models (LLMs) (Brown et al., 2020; Thoppilan et al., 2022; Touvron et al., 2023). These applications range from simple document classification to complex conversational chatbots. However, the uptake of LLMs has not been evenly distributed across society. Due to their large inference cost, only a few well-funded companies can afford to run LLMs at scale. To address this, many have turned to model compression to create smaller language models (LMs) with near-comparable performance to their larger counterparts.
The goal of model compression is to reduce a model's size and latency while retaining overall performance. Existing approaches such as knowledge distillation (Hinton et al., 2015) have produced scalable task-agnostic models (Turc et al., 2019; Sanh et al., 2020; Jiao et al., 2020). Meanwhile, other approaches have shown that not all transformer heads (Michel et al., 2019) or embeddings (Gee et al., 2022) are essential. Although model compression has been proven to work well in practice, little is known about its influence on subgroup robustness.
In any given dataset, subgroups exist as combinations of labels (e.g., hired or not hired) and attributes (e.g., male or female) (Sagawa et al., 2020; Bartlett et al., 2022). A model is said to be subgroup robust if it maximizes the lowest performance across subgroups (Gardner et al., 2023). Due to the unbalanced sample size of each subgroup, the conventional approach of training via Empirical Risk Minimization (ERM) (Vapnik, 1999) produces models with higher performance on majority subgroups (e.g., hired male), but lower performance on minority subgroups (e.g., hired female).
Given the increasing role of LLMs in everyday life, our work seeks to address a gap in the existing literature regarding the subgroup robustness of model compression in NLP. To that end, we explore a wide range of compression methods (Knowledge Distillation, Pruning, Quantization, and Vocabulary Transfer) and settings on 3 textual datasets: MultiNLI (Williams et al., 2018), CivilComments (Koh et al., 2021), and SCOTUS (Chalkidis et al., 2022). The code for our paper is publicly available¹.
The remainder of the paper is organized as follows. First, we review related works in Section 2. Then, we describe the experiments and results in Sections 3 and 4 respectively. Finally, we draw our conclusions in Section 5.

Related Works
Most compression methods belong to one of the following categories: Knowledge Distillation (Hinton et al., 2015), Pruning (Han et al., 2015), or Quantization (Jacob et al., 2017). Additionally, there exist orthogonal approaches specific to LMs such as Vocabulary Transfer (Gee et al., 2022). Previous works on the effects of model compression have focused on the classes or attributes of images. Hooker et al. (2021) analyzed the performance of compressed models on the imbalanced classes of CIFAR-10, ImageNet, and CelebA. Magnitude pruning and post-training quantization were considered with varying levels of sparsity and precision respectively. Model compression was found to cannibalize the performance on a small subset of classes to maintain overall performance. Hooker et al. (2020) followed up by analyzing how model compression affects the performance on sensitive attributes of CelebA. Unitary attributes of gender and age as well as their intersections (e.g., Young Male) were considered. The authors found that overall performance was preserved by sacrificing performance on low-frequency attributes.
Stoychev and Gunes (2022) expanded the previous analysis on attributes to the fairness of facial expression recognition. The authors found that compression does not always negatively impact fairness in terms of gender, race, or age. The impact of compression was also shown to be non-uniform across the different compression methods considered.
To the best of our knowledge, we are the first to investigate the effects of model compression on subgroups in an NLP setting. Additionally, our analysis encompasses a much wider range of compression methods than were considered in the aforementioned works.

Experiments
The goal of learning is to find a function f that maps inputs x ∈ X to labels y ∈ Y. Additionally, there exist attributes a ∈ A that are only provided as annotations for evaluating the worst-group performance at test time. The subgroups are then defined as g ∈ Y × A.
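For concreteness, the evaluation of worst-group accuracy over subgroups g ∈ Y × A can be sketched as below; the function and toy arrays are our own illustration, not part of any benchmark library.

```python
import numpy as np

def worst_group_accuracy(preds, labels, attrs):
    """Accuracy of the worst subgroup, where subgroups are all
    observed (label, attribute) combinations g in Y x A."""
    accs = {}
    for y in np.unique(labels):
        for a in np.unique(attrs):
            mask = (labels == y) & (attrs == a)
            if mask.any():  # skip empty subgroups
                accs[(int(y), int(a))] = float((preds[mask] == labels[mask]).mean())
    return min(accs.values()), accs

# Toy setup: 2 labels x 2 attributes = 4 subgroups.
preds  = np.array([0, 0, 1, 1, 1, 0])
labels = np.array([0, 0, 1, 1, 0, 0])
attrs  = np.array([0, 1, 0, 1, 0, 1])
wga, per_group = worst_group_accuracy(preds, labels, attrs)  # wga = 0.5
```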
Pruning. We analyze structured pruning (Michel et al., 2019) on BERT following a three-step training pipeline (Han et al., 2015). Four different levels of sparsity (BERT PR20, BERT PR40, BERT PR60, BERT PR80) are applied by sorting all transformer heads using the L1-norm of weights from the query, key, and value projection matrices. Structured pruning is implemented using the NNI library³.
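The head-ranking step can be sketched as follows, with illustrative shapes for a single BERT layer; this is a simplification of, not the actual, NNI implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
num_heads, head_dim, hidden = 12, 64, 768

# Hypothetical query/key/value projection matrices of one layer,
# with the rows of each head concatenated along the first axis.
W_q = rng.normal(size=(num_heads * head_dim, hidden))
W_k = rng.normal(size=(num_heads * head_dim, hidden))
W_v = rng.normal(size=(num_heads * head_dim, hidden))

# Score each head by the L1-norm of its slice of the Q, K, V weights.
scores = np.zeros(num_heads)
for h in range(num_heads):
    rows = slice(h * head_dim, (h + 1) * head_dim)
    scores[h] = sum(np.abs(W[rows]).sum() for W in (W_q, W_k, W_v))

# Remove the 20% of heads with the smallest scores (cf. BERT PR20).
sparsity = 0.2
pruned_heads = np.argsort(scores)[: int(num_heads * sparsity)]
```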
Quantization. We analyze 3 quantization methods supported natively by PyTorch: Dynamic Quantization (BERT DQ), Static Quantization (BERT SQ), and Quantization-Aware Training (BERT QAT). Quantization is applied to the linear layers of BERT to map representations from FP32 to INT8. The calibration required for BERT SQ and BERT QAT is done using the training set.
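The FP32-to-INT8 mapping can be illustrated with a minimal affine quantization scheme; for simplicity we use an unsigned 8-bit range and min/max calibration, whereas PyTorch's backends use signed qint8 representations and more elaborate observers.

```python
import numpy as np

def calibrate(x):
    """Derive an affine scale and zero-point from calibration data."""
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / 255.0
    zero_point = int(round(-lo / scale))
    return scale, zero_point

def quantize(x, scale, zero_point):
    return np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=1000).astype(np.float32)  # stand-in FP32 weights
scale, zp = calibrate(w)                      # "calibration" pass
err = np.abs(dequantize(quantize(w, scale, zp), scale, zp) - w).max()
# err is bounded by roughly one quantization step (the scale).
```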

Vocabulary Transfer (VT). We analyze vocabulary transfer using 4 different vocabulary sizes (BERT VT100, BERT VT75, BERT VT50, BERT VT25) as done by Gee et al. (2022). Note that BERT VT100 does not compress the LM, but adapts its vocabulary fully to the in-domain dataset, thus making tokenization more efficient.
Further details regarding the subgroups in each dataset are shown in Appendix A.1.
MultiNLI. Given a hypothesis and premise, the task is to predict whether the hypothesis is contradicted by, entailed by, or neutral with the premise (Williams et al., 2018). Following Sagawa et al. (2020), the attribute indicates whether any negation words (nobody, no, never, or nothing) appear in the hypothesis. We use the same dataset splits as Liu et al. (2021).
CivilComments. Given an online comment, the task is to predict whether it is neutral or toxic (Koh et al., 2021). Following Koh et al. (2021), the attribute indicates whether any demographic identities (male, female, LGBTQ, Christian, Muslim, other religion, Black, or White) appear in the comment. We use the same dataset splits as Liu et al. (2021).
SCOTUS. Given a court opinion from the US Supreme Court, the task is to predict its thematic issue area (Chalkidis et al., 2022). Following Chalkidis et al. (2022), the attribute indicates the direction of the decision (liberal or conservative) as provided by the Supreme Court Database (SCDB). We use the same dataset splits as Chalkidis et al. (2022).

Implementation Details
We train each compressed model via ERM with 5 different random initializations. The average accuracy, worst-group accuracy (WGA), and model size are measured as metrics. The final value of each metric is the average over all 5 initializations.
Following Liu et al. (2021) and Chalkidis et al. (2022), we fine-tune the models for 5 epochs on MultiNLI and CivilComments and for 20 epochs on SCOTUS. A batch size of 32 is used for MultiNLI and 16 for CivilComments and SCOTUS. Each model is trained with an AdamW optimizer (Loshchilov and Hutter, 2019) and early stopping. A learning rate of 2 × 10^-5 with no weight decay is used for MultiNLI, while a learning rate of 10^-5 with a weight decay of 0.01 is used for CivilComments and SCOTUS. Sequence lengths are set to 128, 300, and 512 for MultiNLI, CivilComments, and SCOTUS respectively. As done by Gee et al. (2022), one epoch of masked-language modelling is applied before fine-tuning for VT. The hyperparameters are the same as those for fine-tuning except for a batch size of 8.

Results
Model Size and Subgroup Robustness. We plot the overall results in Figure 1 and note a few interesting findings. First, in MultiNLI and SCOTUS, we observe a trend of decreasing average and worst-group accuracies as model size decreases. Notably, TinyBERT 6 is an outlier in MultiNLI, outperforming every model including BERT Base. However, this trend does not hold in CivilComments. Instead, most compressed models show an improvement in WGA despite slight drops in average accuracy. Even extremely compressed models like BERT Tiny achieve a higher WGA than BERT Base. We hypothesize that this is because CivilComments is a dataset that BERT Base easily overfits on. As such, a reduction in model size serves as a form of regularization that generalizes better across subgroups. Additionally, we note that a minimum model size is required for fitting the minority subgroups. Specifically, the WGA of distilled models with fewer than 6 layers (BERT Mini, BERT Tiny, and TinyBERT 4) collapses to 0 on SCOTUS.
Second, we further analyze compressed models of similar sizes by pairing DistilBERT with TinyBERT 6, and post-training quantization (BERT DQ and BERT SQ) with BERT QAT, according to their number of parameters in Table 1. We find that although two models may have an equal number of parameters (approximation error), their difference in weight initialization after compression (estimation error), as determined by the compression method used, leads to varying performance. In particular, DistilBERT displays a lower WGA on MultiNLI and CivilComments, but a higher WGA on SCOTUS, than TinyBERT 6. Additionally, post-training quantization (BERT DQ and BERT SQ), which includes neither an additional fine-tuning step after compression nor compression-aware training like BERT QAT, is shown to be generally less subgroup robust. These methods can neither recover model performance after compression nor prepare for compression by learning compression-robust weights.

Task Complexity and Subgroup Robustness. To understand the effects of task complexity on subgroup robustness, we construct 3 additional datasets by converting MultiNLI into a binary task. From Figure 2, model performance is shown to improve across the binary datasets for most models. WGA improves the least when Y = [0, 2], i.e., when sentences contradict or are neutral with one another. Additionally, although there is an overall improvement in model performance, the trend in WGA remains relatively unchanged from Figure 1. A decreasing model size is accompanied by a reduction in WGA for most models. We hypothesize that subgroup robustness is less dependent on task complexity as defined by the number of subgroups that must be fitted.
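The conversion can be sketched as below, where a hypothetical `keep` pair selects the two MultiNLI labels retained (e.g. Y = [0, 2]) and remaps them to {0, 1}; the helper is our own illustration, not part of the released code.

```python
import numpy as np

def binarize(labels, keep=(0, 2)):
    """Drop examples outside `keep` and remap the two retained
    labels of a 3-way task (e.g. MultiNLI) to a binary {0, 1}."""
    labels = np.asarray(labels)
    mask = np.isin(labels, keep)            # selects surviving examples
    mapping = {y: i for i, y in enumerate(keep)}
    new_labels = np.array([mapping[int(y)] for y in labels[mask]])
    return new_labels, mask

# Toy 3-way labels: 0 = contradiction, 1 = entailment, 2 = neutral.
y3 = [0, 1, 2, 2, 0, 1]
y2, mask = binarize(y3, keep=(0, 2))  # keeps indices 0, 2, 3, 4
```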
Distribution of Subgroup Performance. We plot the accuracies distributed across subgroups in Figure 3. We limit our analysis to MultiNLI and CivilComments with KD for visual clarity. From Figure 3, we observe that model compression does not always maintain overall performance by sacrificing the minority subgroups. In MultiNLI, a decreasing model size reduces the accuracy on minority subgroups (3 and 5), with the exception of TinyBERT 6. Conversely, in CivilComments, most compressed models improve in accuracy on the minority subgroups (2 and 3). This shows that model compression does not necessarily cannibalize the performance on minority subgroups to maintain overall performance, but may instead improve performance across all subgroups.

Conclusion
In this work, we presented an analysis of existing compression methods on the subgroup robustness of LMs. We found that compression does not always harm the performance on minority subgroups. Instead, on datasets that a model easily overfits on, compression can aid in the learning of features that generalize better across subgroups. Lastly, compressed LMs with the same number of parameters can have varying performance due to differences in weight initialization after compression.

Limitations
Our work is limited to the analysis of English-language datasets. The analysis can be extended to the multi-lingual datasets of the recent FairLex benchmark (Chalkidis et al., 2022). Additionally, we considered each compression method in isolation and not in combination with one another.

A.1 Datasets
We tabulate the labels and attributes that define each subgroup in Table 2. Additionally, we show the sample size of each subgroup in the training, validation, and test sets.

A.2 Results
We tabulate the main results of the paper in Table 3. The performance of each model is averaged across 5 seeds.

B Additional Experiments
B.1 Sparsity and Subgroup Robustness.
Besides structured pruning, we investigate the effects of unstructured pruning using 4 similar levels of sparsity. Connections are pruned via PyTorch by sorting the weights of every layer using the L1-norm. We tabulate the results separately in Table 4 as PyTorch does not currently support sparse neural networks; hence, no reduction in model size is seen in practice. From Table 4, we observe similar trends to those in Figure 1. Specifically, as sparsity increases, the WGA generally worsens in MultiNLI and SCOTUS, but improves in CivilComments across most models. At a sparsity of 80%, WGA drops significantly for MultiNLI and SCOTUS, but not for CivilComments.
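The masking performed by unstructured L1 pruning can be sketched as follows; this mirrors the behaviour of PyTorch's pruning utilities, but is a standalone illustration rather than the `torch.nn.utils.prune` API itself.

```python
import numpy as np

def l1_unstructured_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest
    magnitudes. The tensor stays dense, which is why no reduction in
    model size is seen in practice."""
    flat = np.abs(weights).ravel()
    k = int(flat.size * sparsity)
    if k == 0:
        return weights.copy()
    threshold = np.partition(flat, k - 1)[k - 1]  # k-th smallest magnitude
    return np.where(np.abs(weights) > threshold, weights, 0.0)

rng = np.random.default_rng(0)
w = rng.normal(size=(100, 100))       # stand-in layer weights
pruned = l1_unstructured_prune(w, 0.8)
frac_zero = float((pruned == 0.0).mean())  # ~0.8
```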

B.2 Ablation of TinyBERT 6 .
To better understand the particular subgroup robustness of TinyBERT 6, we conduct an ablation of its general distillation procedure. Specifically, we ablate the attention matrices, hidden states, and embeddings as sources of knowledge when distilling on the Wikipedia dataset⁴. The same hyperparameters as Jiao et al. (2020) are used except for a batch size of 256 and a gradient accumulation of 2 due to memory constraints.
From Table 5, we find that we are unable to achieve a WGA on MultiNLI and its binary variants similar to the released model, as shown by the performance gap between TinyBERT 6 and TinyBERT AHE. On SCOTUS, however, the WGA of TinyBERT AHE is much higher than that of TinyBERT 6. We hypothesize that the pre-trained weights uploaded to HuggingFace⁵ may have included a further in-domain distillation on MultiNLI. Additionally, model performance is shown to benefit the least when knowledge from the embeddings is included during distillation. This can be seen from the lower WGA of TinyBERT AHE compared to TinyBERT AH across most datasets.

Figure 2: Model performance is shown to improve across the binary datasets of MultiNLI. However, the overall trend in WGA remains relatively unchanged, with a decreasing model size leading to drops in WGA.

Figure 3: Distribution of accuracies by subgroup for KD. Sample sizes in the training set are shown beside each subgroup. In CivilComments, performance improves on minority subgroups (2 and 3) across most models as model size decreases, contrary to the minority subgroups (3 and 5) of MultiNLI.

Table 1: Model size and number of parameters. BERT Base is shown as the baseline, with subsequent models from knowledge distillation, structured pruning, quantization, and vocabulary transfer respectively.

Figure 1: Plot of WGA against average accuracy. Compression method is represented by marker type, while model size is represented by marker size. In MultiNLI and SCOTUS, compression worsens WGA for most models. Conversely, WGA improves for most compressed models in CivilComments.

An overview of each model's size and parameters is shown in Table 1.

Knowledge Distillation (KD). We analyze seven models (BERT Medium, BERT Small,

Table 3: Model performance averaged across 5 seeds. WGA decreases as model size is reduced in MultiNLI and SCOTUS, but increases instead in CivilComments. This trend is also seen in the binary variants of MultiNLI despite the reduction in task complexity.
(b) MultiNLI with different binary labels.

Table 4: Average and worst-group accuracies for unstructured pruning. BERT Base is shown with a sparsity of 0%. MultiNLI and SCOTUS generally see a worsening WGA as sparsity increases, contrary to the improvements in CivilComments.
(b) MultiNLI with different binary labels.

Table 5: Ablation of TinyBERT 6. The subscripts A, H, and E represent the attention matrices, hidden states, and embeddings respectively that are transferred as knowledge during distillation. A noticeable performance gap is seen between TinyBERT 6 and TinyBERT AHE on MultiNLI and SCOTUS.