Robustness Challenges in Model Distillation and Pruning for Natural Language Understanding

Recent work has focused on compressing pre-trained language models (PLMs) like BERT, where the major focus has been to improve the in-distribution performance for downstream tasks. However, very few of these studies have analyzed the impact of compression on the generalizability and robustness of compressed models for out-of-distribution (OOD) data. Towards this end, we study two popular model compression techniques, knowledge distillation and pruning, and show that compressed models are significantly less robust than their PLM counterparts on OOD test sets, although they obtain similar performance on in-distribution development sets for a task. Further analysis indicates that the compressed models overfit the shortcut samples and generalize poorly on the hard ones. We leverage this observation to develop a regularization strategy for robust model compression based on sample uncertainty.


Introduction
Large pre-trained language models (PLMs) (e.g., BERT (Devlin et al., 2019), RoBERTa, GPT-3 (Brown et al., 2020)) have obtained state-of-the-art performance on several Natural Language Understanding (NLU) tasks. However, recent studies (Niven and Kao, 2019; Du et al., 2021; Mudrakarta et al., 2018) indicate that, in several NLU tasks, PLMs heavily rely on shortcut learning and spurious correlations rather than acquiring higher-level language understanding and semantic reasoning. Specifically, these models often exploit dataset biases and artifacts, e.g., lexical bias and overlap bias, as shortcuts for prediction. Due to the independent and identically distributed (IID) split of training, development, and test sets, models that learn spurious decision rules from the training data can still perform well on in-distribution data (Du et al., 2022). Nevertheless, this shortcut learning behavior results in models with poor generalization performance on out-of-distribution (OOD) data, raising concerns about their robustness.

* Most of the work was completed while the first author was an intern at Microsoft Research during summer 2021.
On the other hand, it is difficult to use these large PLMs in real-world applications with latency and capacity constraints, e.g., on edge devices and mobile phones. Thus, model compression has emerged as a technique to reduce model size, speed up inference, and save energy without a significant performance drop on downstream tasks. State-of-the-art model compression techniques such as knowledge distillation (Sun et al., 2019) and pruning (Sanh et al., 2020) primarily evaluate compressed models on in-distribution test data. However, in-distribution testing is insufficient to capture the generalizability of PLMs (D'Amour et al., 2020). In contrast to existing work that is geared towards general-purpose PLMs (Niven and Kao, 2019; Du et al., 2021; Mudrakarta et al., 2018), this work aims to study the impact of compression on the shortcut learning and OOD generalization ability of compressed models.
Towards this end, we conduct comprehensive experiments to evaluate the OOD robustness of compressed models, with BERT as the base encoder. We focus primarily on two popular families of model compression techniques: pruning and knowledge distillation. For pruning, we consider two popular techniques, iterative magnitude pruning (Sanh et al., 2020) and structured pruning (Prasanna et al., 2020). Specifically, we explore the following research questions: Are distilled and pruned models as robust as their PLM counterparts for downstream NLU tasks? What is the impact of varying the level of compression on the OOD generalization and bias of compressed models? We evaluate several compressed models obtained using the above techniques on both standard in-distribution development sets and OOD test sets for downstream NLU tasks. Experimental analysis indicates that distilled and pruned models are consistently less robust than their PLM counterparts. Further analysis of the poor generalization performance of compressed models reveals some interesting observations. For instance, we observe that the compressed models overfit the easy/shortcut samples and generalize poorly on the hard ones.

This motivates our second research question: How can we regularize model compression techniques to generalize across samples of varying difficulty? This brings interesting challenges, since we do not know a priori which samples are easy or hard. Based on the above observations, we propose a bias mitigation framework to improve the OOD robustness of compressed models, termed RMC (Robust Model Compression). First, we leverage the uncertainty of the deep neural network to quantify the difficulty of a training sample. This is given by the variance in the prediction of a sample across multiple sub-networks of the original large network obtained by model pruning.
Second, we leverage this sample-specific measure for smoothing and regularizing different families of compression techniques. The major contributions of this work can be summarized as follows:
• We perform a comprehensive analysis to evaluate the OOD generalization ability and robustness of compressed models for NLU tasks.
• We further analyze plausible reasons for the low generalizability of compressed models and demonstrate connections to shortcut learning.
• We propose a mitigation framework for regularizing model compression, termed RMC, which smoothes the knowledge distillation training based on estimated sample difficulties.
• We perform experiments to demonstrate that our RMC framework improves OOD generalization while not sacrificing standard in-distribution task performance on multiple NLU tasks.

Related Work
Shortcut Learning and Mitigation. Recent studies indicate that PLMs tend to exploit biases and artifacts in the dataset as shortcuts for prediction, rather than acquiring higher-level semantic understanding and reasoning for NLU tasks (Niven and Kao, 2019; Du et al., 2021; McCoy et al., 2019a). There is some preliminary work on mitigating the bias of general PLMs, including product-of-experts (He et al., 2019; Sanh et al., 2021), re-weighting (Schuster et al., 2019; Yaghoobzadeh et al., 2019; Utama et al., 2020), adversarial training (Stacey et al., 2020), posterior regularization, etc.

Robustness in Model Compression. Current practice for evaluating model compression focuses mainly on standard benchmark performance. In the computer vision domain, previous work shows that compressed models perform poorly on Compression Identified Exemplars (CIE) (Hooker et al., 2019), and that compression amplifies algorithmic bias towards certain demographics (Hooker et al., 2020). The most similar works to ours are two concurrent studies (Xu et al., 2021a) that investigate the performance of compressed models beyond standard benchmarks for natural language understanding tasks. However, both works mainly focus on evaluating the robustness of compressed models under adversarial attacks, i.e., TextFooler (Jin et al., 2020) and a unified adversarial framework. In contrast, we comprehensively characterize the robustness of BERT compression on OOD test sets to probe the OOD generalizability of the compression techniques. Besides, we use insights from this robustness analysis to design a generalizable and robust model compression framework.

Are Compressed Models Robust?
We perform a comprehensive analysis to evaluate the robustness of compressed language models.

Compression Techniques
We consider two popular families of compression, namely, knowledge distillation and model pruning.
Knowledge Distillation: The objective here is to train a small model that mimics the behavior of a larger teacher model (Hinton et al., 2015). In this work, we focus on task-agnostic distillation. In particular, we consider DistilBERT and MiniLM, both distilled from BERT-base. For a fair comparison, we select compressed models with similar capacities (66M parameters in this work). To evaluate the impact of compression techniques on model robustness, we also consider smaller models of similar capacity built without knowledge distillation. These are obtained via simple truncation, where we retain the first 6 layers of the large model, and via pre-training a smaller 6-layer model from scratch.
Iterative Magnitude Pruning: This is a task-specific unstructured pruning method (Sanh et al., 2020). During the fine-tuning process for each downstream task, the weights with the lowest magnitude are removed until the pruned model reaches the target sparsity. Note that we utilize the standard pruning technique rather than LTH-based pruning (lottery ticket hypothesis), which uses rewinding. We also consider different pruning ratios to obtain pruned models at different levels of sparsity.
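For concreteness, a single magnitude-pruning step can be sketched as follows (an illustrative, stdlib-only Python sketch, not the actual implementation; in practice the mask is applied to PyTorch weight tensors during fine-tuning and the step is repeated iteratively):

```python
def magnitude_prune(weights, sparsity):
    """One magnitude-pruning step on a flat list of weights.

    Zeroes out the `sparsity` fraction of weights with the lowest
    absolute magnitude and keeps the rest unchanged.
    """
    k = int(sparsity * len(weights))
    if k == 0:
        return list(weights)
    # Threshold = k-th smallest absolute value; everything at or below it is pruned.
    threshold = sorted(abs(w) for w in weights)[k - 1]
    return [w if abs(w) > threshold else 0.0 for w in weights]
```

Iterating this step during fine-tuning, with a gradually increasing target sparsity, yields the iterative variant described above.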
Structured Pruning: This method family is based on the hypothesis that there is redundancy in the attention heads (Prasanna et al., 2020; Voita et al., 2019; Bian et al., 2021; Chen et al., 2021). We again consider task-specific pruning. During the fine-tuning process for each task, whole attention heads are pruned based on their importance to the model predictions. Please refer to Sec. A in the Appendix for more details. We prune around 20% of the attention heads in total (i.e., 28 attention heads); further pruning increases the sparsity with significant degradation of the model's performance on in-distribution development sets.

Evaluation Datasets
To evaluate the robustness of the compressed models introduced in the last section, we use three NLU tasks, including MNLI, FEVER, and QQP 1 . Please refer to Sec. B in Appendix for more details.
• MNLI (Williams et al., 2018): This is a natural language inference task. In this work, we report the accuracy metric on the matched subset. We use HANS (McCoy et al., 2019b) as the adversarial test set, which contains 30,000 synthetic samples. Models that exploit shortcut features have been shown to perform poorly on the HANS test set.
• FEVER (Thorne et al., 2018): This is a fact verification dataset. Recent studies indicate that there are strong shortcuts in the claims (Utama et al., 2020). To facilitate the robustness and generalization evaluation of fact verification models, two symmetric test sets (i.e., Sym v1 and Sym v2) were created, where bias exists in the symmetric pairs (Schuster et al., 2019). Both OOD test sets have 712 samples.
• QQP: The task is to predict whether a pair of questions is semantically equivalent. We consider the OOD test set PAWS-qqp, which contains 677 test samples generated from the QQP corpus (Yang et al., 2019). Besides, we also consider the PAWS-wiki OOD test set, which consists of 8,000 test samples generated from Wikipedia pages.
For all three tasks, we employ accuracy as the evaluation metric and evaluate the performance of the compressed models on both the in-distribution development set and the OOD test set.

Evaluation Setup
In this work, we use the uncased BERT-base as the teacher network and study the robustness of its compressed variants. The final model consists of the BERT-base encoder (or a compressed variant) with a classification head (a linear layer on top of the pooled output). Recent studies indicate that factors such as learning rate and training epochs can have a substantial influence on robustness (Tu et al., 2020); in particular, increasing the number of training epochs can help improve generalization on the OOD test set. In this work, we focus on the relative robustness of compressed models compared to the uncompressed teacher, rather than their absolute accuracies. For a fair comparison, we unify the experimental setup for all models. We use the Adam optimizer with weight decay (Loshchilov and Hutter, 2017), with the learning rate fixed at 2e-5, and we train all models for 5 epochs on all datasets. We perform the experiments using PyTorch and use the pre-trained models from the Huggingface model pool. We report the average results over three runs for all experiments.

Relative Robustness Metric
As we later demonstrate, with increasing compression ratio or model sparsity, the performance of the smaller models degrades on both in-distribution and OOD test sets. To compare the gap between in-distribution task performance and OOD generalizability, we define a new metric that measures this performance gap of the compressed models with respect to the uncompressed BERT-base (teacher). First, we calculate the accuracy gap between the in-distribution development set and the OOD test set for BERT-base (denoted by ∆_BERT-base) and for its compressed variant (denoted by ∆_compressed). Second, we compute the relative bias as the ratio between the accuracy gap of the compressed model and that of BERT-base:

F_bias = ∆_compressed / ∆_BERT-base

Here F_bias > 1 indicates that the compressed model is more biased than BERT-base, with the degree of bias captured by a larger value of F_bias. Since FEVER has two OOD test sets, we use the overall accuracy on Sym v1 and Sym v2 to calculate F_bias. Similarly, the OOD accuracy for QQP is the overall accuracy on PAWS-wiki and PAWS-qqp.

Table 3: Accuracy comparison (in percent) and relative bias F_bias (the smaller the better) of compressed models with structured pruning. Pruned models have relatively higher degradation on the OOD test set compared to the development set. All compressed models have 28 attention heads pruned.
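As a concrete illustration, the relative bias metric is a simple ratio of accuracy gaps (a minimal sketch; accuracies are in percent, and the function name is ours):

```python
def relative_bias(dev_acc_base, ood_acc_base, dev_acc_comp, ood_acc_comp):
    """F_bias: ratio of the (dev - OOD) accuracy gap of the compressed model
    to the same gap of the uncompressed BERT-base teacher."""
    delta_base = dev_acc_base - ood_acc_base  # gap of BERT-base
    delta_comp = dev_acc_comp - ood_acc_comp  # gap of the compressed model
    return delta_comp / delta_base
```

A value above 1 means the compressed model loses proportionally more accuracy out of distribution than the teacher does.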

Experimental Observations
We report accuracy and the relative bias measure F_bias for iterative magnitude pruning in Table 1, knowledge distillation in Table 2, and structured pruning in Table 3. We have the following key observations.

Iterative Magnitude Pruning: First, for slight and mid-level sparsity, the pruned models have comparable, and sometimes even better, performance on the in-distribution development set. Consider FEVER as an example, where the compressed model preserves the accuracy on the in-distribution set even at 60% sparsity 2 . However, the generalization accuracy on the OOD test set has a substantial drop. This indicates that the development set fails to capture the generalizability of the pruned models. Second, as the sparsity increases, the generalization accuracy on the OOD test set substantially decreases, dropping to random guessing for tasks such as MNLI. Third, at high levels of sparsity (e.g., 70%), both development and OOD test set performances are significantly affected. In general, we observe F_bias > 1 for all levels of sparsity in Table 1. Note that we limit the maximum sparsity to 70%, after which training is unstable with a significant performance drop even on the development set. As in the previous cases, there is a substantial accuracy drop on the OOD test set compared to the development set (e.g., 7.6% vs. 1.9% degradation, respectively, for the MNLI task).

Knowledge Distillation: Similar to pruning, we observe a higher accuracy drop on the OOD test set compared to the in-distribution development set for distilled models. Consider DistilBERT performance on MNLI as an example, with a 1.9% accuracy drop on the development set compared to an 8.6% drop on the OOD test set. This can also be validated in Table 2, where all F_bias values are larger than 1, showing that all the distilled models are less robust than BERT-base.
Another interesting observation is that the distilled models, i.e., DistilBERT and MiniLM, have higher bias F_bias compared to the pre-trained models, i.e., Pretrained-l6 and Truncated-l6, as we compare their average F_bias values in Table 2. This indicates that the compression process plays a significant role in the low generalizability and robustness of the distilled models.

Structured Pruning: Recent studies have reported the super ticket phenomenon: when the BERT-base model is slightly pruned, the accuracy of the pruned models improves on the in-distribution development set. However, we observe that this finding does not hold for OOD test sets. From Table 3, we observe that all pruned models are less robust than BERT-base, with F_bias much larger than 1.

Attribution of Low Robustness
In this section, we explore the factors that lead to the low robustness of compressed models. Previous work has demonstrated that the performance of different models on the GLUE benchmark (Wang et al., 2018) tends to correlate with the performance on MNLI, making it a good representative of natural language understanding tasks in general (Phang et al., 2018). For this reason, we choose the MNLI task for this study and consider the dataset splits from (Gururangan et al., 2018), where the authors partition the development set into easy/shortcut and hard subsets. In this experiment, we use pruned models with varying sparsity to investigate the reasons for the low robustness of the compressed models. We have the following key observations.
Observation 1: The compressed models tend to overfit the easy/shortcut samples and generalize poorly on the hard ones. (We use 'easy' and 'shortcut' interchangeably in this work.) The performance of pruned models at five sparsity levels (ranging from 0.2 to 0.85) on the easy and hard samples for the MNLI task is illustrated in Figure 1. It demonstrates that the accuracy on the hard samples is much lower than the accuracy on the easy ones. As the sparsity increases, we observe a larger accuracy drop on the hard samples compared to the easy ones. In particular, the accuracy gap between the two subsets is 22.7% at a sparsity of 0.85, much higher than the 16.1% gap at a sparsity of 0.4. These findings demonstrate that the compressed models overfit the easy samples while generalizing poorly on the hard ones, and that this phenomenon is amplified at higher levels of sparsity for the pruned models.

Observation 2: Compressed models tend to assign overconfident predictions to easy samples. One potential reason is that compressed models are more prone to capture spurious correlations between shortcut features in training samples and certain class labels (Geirhos et al., 2020; Du et al., 2021).

Variance-based Difficulty Estimation
Based on the above observations, we propose a variance-based metric to quantify the difficulty degree of each sample. For each sample in the development set, we calculate its loss at five different levels of pruning sparsity, as shown in Figure 1. We then compute the variance of these losses for each sample and rank the samples by variance. Finally, we assign the samples with low variance to the "easy" subset and the rest to the "hard" one. Comparing our variance-based proxy annotation with the ground-truth annotation of (Gururangan et al., 2018) gives an accuracy of 82.8%. This indicates that the variance-based estimation leveraging pruning sparsity is a good indicator of sample difficulty, and it motivates the design of the mitigation technique introduced in the next section.

Figure 2: RMC framework for bias mitigation with two-stage training. In the first stage, we feed the training samples to pruned models at different levels of sparsity (ranging from 0.2 to 0.85, as introduced in Section 4.1), compute the corresponding losses, and use their variance to estimate the difficulty degree of each training sample. In the second stage, we use the difficulty degree to regularize the teacher network for robust model compression.

Mitigation Framework
In this section, we propose a general bias mitigation framework (see Figure 2), termed RMC (Robust Model Compression), to improve the robustness of compressed models on downstream tasks. Our RMC framework follows the philosophy of task-specific knowledge distillation (Sanh et al., 2020; Jiao et al., 2020), but with explicit regularization of the teacher network leveraging sample uncertainty. This prevents the compressed model from overfitting on the easy samples that contain shortcut features and helps improve its robustness. This regularized training is implemented in two stages.

Quantifying Sample Difficulty
In the first stage, our objective is to quantify the difficulty degree of each training sample.

Variance Computation: Following the observations in Section 4.1, we first use iterative magnitude pruning to obtain a series of pruned models from BERT-base with different levels of sparsity, and then use the losses of these pruned models to compute a variance v_i for each training sample x_i. We choose five sparsity levels, i.e., n = 5, that are diverse enough to reflect the difficulty degree of each training sample. Here, samples with high variance correspond to hard ones.

Difficulty Degree Estimation: Based on the variance v_i for each training sample x_i, we estimate its difficulty degree as:

d_i = α + (1 − α) · (v_i − V_min) / (V_max − V_min)     (1)

where V_min and V_max denote the minimum and maximum values of the variances, respectively. Equation 1 normalizes the variance of the training samples to the range [α, 1], where d_i = 1 denotes the most difficult training sample according to our criterion of loss variance. Samples with d_i closer to α are treated as shortcut/biased samples. Prior work (Niven and Kao, 2019) shows that the bias behavior of a downstream training set can be attributed to data collection and annotation biases. Since the bias level differs across datasets, we assign a different α in Equation 1 to each training set to reflect its bias level.
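The two steps above can be sketched as follows (a minimal Python sketch under our reading of the normalization: loss variance across sparsity levels, rescaled to [α, 1] so the highest-variance sample gets d_i = 1; the function name is ours):

```python
from statistics import pvariance

def difficulty_degrees(losses_per_sample, alpha):
    """Estimate per-sample difficulty degrees d_i in [alpha, 1].

    losses_per_sample: one list of losses per training sample, with one
    loss per pruning sparsity level (n = 5 in the paper's setup).
    """
    # Variance of each sample's losses across the pruned models.
    variances = [pvariance(losses) for losses in losses_per_sample]
    v_min, v_max = min(variances), max(variances)
    # Min-max rescale the variance into [alpha, 1]; d_i = 1 is hardest.
    return [alpha + (1 - alpha) * (v - v_min) / (v_max - v_min)
            for v in variances]
```

Samples whose loss barely moves as the network is pruned land near α (easy/shortcut); samples whose loss fluctuates strongly land near 1 (hard).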

Robust Knowledge Distillation
In the second stage, we fine-tune BERT-base on the downstream tasks to obtain the softmax probability for each training sample. We then use the difficulty degree of the training samples (discussed in the previous section) to smooth the teacher predictions. The instance-level smoothed softmax probability is used to guide the training of compressed models through regularized knowledge distillation.

Smoothing Teacher Predictions:
We smooth the softmax probability ŷ^T_i from the teacher network according to the difficulty degree d_i of each training sample x_i. The smoothed probability is given as:

s_{i,k} = (ŷ^T_{i,k})^{d_i} / Σ_{j=1}^{K} (ŷ^T_{i,j})^{d_i}     (2)

where K denotes the total number of class labels. We perform instance-level smoothing for each training sample x_i. If the difficulty degree of a training sample is d_i = 1, the softmax probability s_i for the corresponding sample from the teacher is unchanged. In contrast, at the other extreme as d_i → α, we increase the regularization to encourage the compressed model to assign less overconfident predictions to the sample. The difficulty degree range is [α, 1] rather than [0, 1] to avoid over-smoothing of the teacher predictions.

Table 4: Generalization accuracy comparison (in percent) and the corresponding F_bias values for iterative magnitude pruning at 40% sparsity with different mitigation methods. The last column indicates average F_bias over three tasks.
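One way to realize such instance-level, difficulty-dependent smoothing — raising each teacher probability to the power d_i and renormalizing, so that d_i = 1 leaves the distribution unchanged and smaller d_i flattens it — can be sketched as (an illustrative sketch under that assumption, not necessarily the exact form used):

```python
def smooth_teacher(probs, d):
    """Smooth one teacher distribution by exponent d in (0, 1].

    d = 1 returns the distribution unchanged; d < 1 flattens it toward
    uniform, discouraging overconfident targets for easy samples.
    """
    powered = [p ** d for p in probs]
    total = sum(powered)
    # Renormalize so the smoothed values still form a distribution.
    return [p / total for p in powered]
```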

Smoothness-Induced Model Compression:
We employ the smoothed softmax probability s_i from BERT-base to supervise the training of the compressed models, with the overall loss function:

L = (1 − λ) · L_1(y_i, ŷ^S_i) + λ · L_2(s_i, ŷ^S_i)     (3)

where y_i is the ground truth and ŷ^S_i is the probability output of the compressed model. L_1 denotes the cross-entropy loss, and L_2 represents the knowledge distillation loss with KL divergence. The hyperparameter λ manages the trade-off between learning from the hard label y_i and the softened softmax probability s_i. Among the different families of compression techniques introduced in Section 3.1, we directly fine-tune the distilled models using Equation 3. For iterative magnitude pruning, we use Equation 3 to guide the pruning during the fine-tuning process.
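A per-sample sketch of this combined objective (a minimal Python sketch assuming λ weights the distillation term, consistent with setting λ = 0.9 to emphasize the teacher signal; the function name is ours):

```python
import math

def rmc_loss(y_true, student_probs, smoothed_teacher_probs, lam=0.9):
    """Per-sample loss: (1 - lam) * cross-entropy with the hard label
    plus lam * KL(smoothed teacher || student)."""
    # L_1: cross-entropy of the student against the ground-truth class index.
    ce = -math.log(student_probs[y_true])
    # L_2: KL divergence from the smoothed teacher to the student distribution.
    kl = sum(t * math.log(t / s)
             for t, s in zip(smoothed_teacher_probs, student_probs) if t > 0)
    return (1 - lam) * ce + lam * kl
```

When the student matches a confident teacher exactly, both terms vanish; a large λ makes the smoothed teacher distribution the dominant training signal.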

Mitigation Performance Evaluation
In this section, we conduct experiments to evaluate the robustness of our RMC mitigation framework.

Experimental Setup
For all experiments, we follow the same setting as in Section 3.3 and the same evaluation datasets as in Section 3.2. We use the OOD test sets exclusively for evaluation. We compute the variance of samples (outlined in Section 4.1) in the in-distribution development set to split it into shortcut and hard subsets. The relative robustness between the hard and easy subsets is used to tune the hyperparameter α in Equation 1, where we set α to 0.5, 0.3, and 0.2 for MNLI, FEVER, and QQP, respectively. The weight λ in Equation 3 is fixed at 0.9 for all experiments.

Baseline Methods
We consider the following five baselines; please refer to Sec. C in Appendix for more details. In contrast to the global smoothing baseline, RMC uses instance-level smoothing.
• Focal (Focal Loss) (Lin et al., 2017): Compared to cross-entropy loss, focal loss has an additional regularizer that reduces the weight of easy samples and assigns a higher weight to hard samples bearing less-confident predictions.
• JTT (Just Train Twice): This is a re-weighting method, which first trains the BERT-base model using standard cross-entropy loss for several epochs, and then trains the compressed model while up-weighting the training examples that are misclassified by the first model, i.e., hard samples.

Mitigation Performance Analysis
We compare our RMC framework with the above baselines and have the following key observations.

Iterative Magnitude Pruning: Table 4 shows the mitigation results in terms of accuracy and relative bias F_bias. All mitigation methods are performed with pruned models at 40% sparsity. We observe that task-specific knowledge distillation only slightly improves accuracy on the OOD test set compared to Vanilla tuning, since the teacher model itself is not robust for downstream tasks (Niven and Kao, 2019). Global smoothing further improves generalization accuracy compared to prior methods. Our RMC framework obtains the best OOD test accuracy across all tasks in aggregate. RMC further reduces the average relative bias F_bias by 10% over Vanilla tuning, as shown in Table 4, indicating the benefits of uncertainty-based sample-wise smoothing for improving model robustness. For the MNLI task, we also illustrate the mitigation performance of our RMC framework at different levels of sparsity in Figure 3. We observe that RMC consistently improves accuracy on OOD HANS while reducing the relative bias F_bias at all levels of sparsity over the Vanilla method.
Knowledge Distillation: Table 5 shows the mitigation results of accuracy and relative bias F bias . We observe that RMC significantly improves over MiniLM for OOD generalization leveraging smoothed predictions from BERT-base teacher.
With instance-level smoothing in RMC, the generalization accuracy of the compressed model on the OOD test set is significantly closer to the BERT-base teacher compared to the other methods. We also decrease the relative bias F_bias in Table 5.

Table 6: Our RMC framework improves accuracy of the compressed models on the hard samples and reduces overfitting on the shortcut/easy samples, leading to a reduced performance gap between the two subsets.

Further Analysis on Robust Mitigation
In this section, we further investigate the reasons for the improved generalization performance of RMC, with an analysis on the MNLI task. Table 6 shows the accuracy of RMC for model pruning and distillation on the shortcut/easy and hard samples. We observe that RMC improves model performance on the under-represented hard samples, reducing the generalization gap between the hard and shortcut/easy subsets by 10.6% at the 0.4 level of sparsity and by 11.3% for knowledge distillation. This analysis demonstrates that RMC reduces the overfitting of the compressed models on the easy samples and encourages them to learn more from the hard ones, thus improving generalization on the OOD test sets.

Conclusions
In this work, we conduct a comprehensive study of the robustness challenges in compressing large PLMs fine-tuned on downstream NLU datasets. Furthermore, we propose a general mitigation framework with instance-level smoothing for robust model compression. Experimental analysis demonstrates that our framework improves the generalization and OOD robustness of compressed models for different compression techniques, while not sacrificing in-distribution performance.

Limitations
First, we study the shortcut learning/bias problem and OOD generalization of model compression techniques, focusing exclusively on the two most widely used families of compression techniques: knowledge distillation and pruning. Our empirical analysis indicates that these two families suffer from low generalization. However, other types of compression techniques, such as matrix decomposition and quantization, are not discussed in this work. Studying the full spectrum of compression techniques is a challenging topic and will be investigated in our future research. Second, our RMC framework needs to calculate the variance of losses for each training sample, thus requiring additional training time.
Training efficiency can be further improved by implementing parallel training or more efficient ways of calculating sample difficulty, which will also be studied in our future research.

A More Details of Pruning Methods
In this section, we introduce more details about the compression techniques studied.

Knowledge Distillation: For a fair comparison, we do not compare with TinyBERT (Jiao et al., 2020) and MobileBERT, since TinyBERT is fine-tuned with data augmentation on NLU tasks, and MobileBERT is distilled from BERT-large rather than BERT-base.

Magnitude Pruning: It is based on the overparameterization assumption of pre-trained language models (Xu et al., 2021b; Huang et al., 2021). For iterative magnitude pruning, we freeze all the embedding modules and only prune the parameters in the encoder (i.e., 12 layers of Transformer blocks). After pruning, the pruned weight values are set to 0 to reduce the amount of information to store. Unlike the LTH version, we consider standard magnitude pruning without rewinding.
Structured Pruning: To calculate the importance of the attention heads, we follow (Michel et al., 2019; Prasanna et al., 2020) and calculate the expected sensitivity of each head to its mask variable ξ^(h,l):

I^(h,l) = E_{x∼X} | ∂L(x) / ∂ξ^(h,l) |

where I^(h,l) denotes the contribution score of attention head h in layer l, L(x) represents the loss value for the sample x, and ξ^(h,l) is the mask of attention head h in layer l. After obtaining the contribution scores, the attention heads with the lowest scores I^(h,l) are pruned.
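The head-selection step can be sketched as follows (a minimal sketch assuming the per-sample sensitivities |∂L/∂ξ| have already been computed, e.g., via autograd; both function names are ours):

```python
def head_importance(sensitivities):
    """Expected absolute sensitivity of one attention head:
    the mean of |dL/d(xi)| over the evaluated samples."""
    return sum(sensitivities) / len(sensitivities)

def heads_to_prune(importance_by_head, n_prune):
    """Return the n_prune head identifiers with the lowest contribution scores,
    i.e., the heads selected for structured pruning."""
    return sorted(importance_by_head, key=importance_by_head.get)[:n_prune]
```

In the paper's setup, this selection is applied during task-specific fine-tuning until roughly 20% of the heads (28 in total) are removed.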

B More on Evaluation Datasets
In this section, we introduce more details about the three benchmark datasets.

MNLI: This task aims to predict whether the relationship between the premise and the hypothesis is contradiction, entailment, or neutral. It is divided into a training set and a development set with 392,702 and 9,815 samples, respectively.

FEVER: The task is to predict whether a claim is supported, refuted, or lacks enough information given the evidence. Recent studies indicate that there are strong shortcuts in the claims (Utama et al., 2020). It is divided into a training set and a development set with 242,911 and 16,664 samples, respectively.
QQP: It is divided into a training set and a development set with 363,846 and 40,430 samples, respectively.

C More on Comparing Baselines
In this section, we introduce more details on the comparing baselines.

Distil and Smooth: For both baseline methods, we use a loss function similar to Equation 3. We fix the weight λ to 0.9 for all experiments, to encourage the compressed model to learn more from the probability output of the teacher network. A major difference between the two baselines is that Smooth has an additional smoothing step during the fine-tuning process.

Focal Loss: The original focal loss function is FL(p_i) = −(1 − p_i)^γ log(p_i). Our implementation normalizes the focusing weights within a batch:

FL(p_i) = − [ N (1 − p_i)^γ / Σ_{k=1}^{N} (1 − p_k)^γ ] · log(p_i)

The hyperparameter γ controls the weight difference between hard and easy samples, and is fixed at 2.0 for all tasks. We use the denominator to normalize the weights within a batch, where N is the batch size, guaranteeing that the average weight over a batch of training samples is 1.0. As such, the weights of easy samples are down-weighted below 1.0, and the weights of hard samples are up-weighted above 1.0.

JTT: This is also a re-weighting baseline that encourages the model to learn more from hard samples. The hyperparameter λ_up is set to 2.0. We also normalize the weights so that the average weight for each training sample is 1.0.
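The batch-normalized focal loss described above can be sketched as follows (a minimal sketch of our reading of the normalization; the function name is ours, and p_true holds each sample's probability for its true class within one batch):

```python
import math

def normalized_focal_loss(p_true, gamma=2.0):
    """Per-sample focal losses with focusing weights (1 - p)^gamma
    rescaled so they average to 1.0 over the batch."""
    n = len(p_true)
    weights = [(1 - p) ** gamma for p in p_true]
    mean_weight = sum(weights) / n  # divide by this so weights average 1.0
    return [-(w / mean_weight) * math.log(p)
            for w, p in zip(weights, p_true)]
```

Confidently predicted (easy) samples receive effective weights below 1.0, while hard, low-confidence samples receive weights above 1.0, matching the behavior described in the text.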

D Running Environment
For a fair evaluation of the robustness of compressed models, we run all experiments on a server with 4 NVIDIA GeForce 3090 GPUs. All experiments are implemented with the PyTorch version of the Hugging Face Transformers library.

E The Capacity Issue
One natural speculation is that the low robustness of compressed models is due to their low capacity (i.e., smaller size). To disentangle the two factors that influence model performance, i.e., low capacity and compression, we compare distilled models with Uncased-l6, which is trained using pre-training only. The results, given in Table 2, indicate that Uncased-l6 has better generalization ability on both the MNLI and FEVER tasks. Take structured pruning as another example: although the three pruned models in Table 3 have the same model size, their generalization accuracies differ. These results indicate that the low robustness of compressed models is not entirely due to their low capacity, and that compression plays a significant role.

F MNLI Easy and Hard Subsets
The authors train a hypothesis-only model and use it to generate predictions for the whole development set (Gururangan et al., 2018). Samples that are correctly predicted by the hypothesis-only model are regarded as easy samples, and the remaining ones as hard. The easy subset contains 5,488 samples, and the hard subset contains 4,302 samples.