Beyond Preserved Accuracy: Evaluating Loyalty and Robustness of BERT Compression

Recent studies on the compression of pretrained language models (e.g., BERT) usually use preserved accuracy as the metric for evaluation. In this paper, we propose two new metrics, label loyalty and probability loyalty, that measure how closely a compressed model (i.e., student) mimics the original model (i.e., teacher). We also explore the effect of compression on robustness under adversarial attacks. We benchmark quantization, pruning, knowledge distillation and progressive module replacing with loyalty and robustness. By combining multiple compression techniques, we provide a practical strategy to achieve better accuracy, loyalty and robustness.

BERT (Devlin et al., 2019) is a representative PLM. Many works compressing BERT use preserved accuracy together with computational complexity (e.g., speed-up ratio, FLOPs, number of parameters) as metrics to evaluate compression. This evaluation scheme is far from perfect: (1) Preserved accuracy cannot reflect how alike the teacher and student models behave. This can be problematic when applying compression techniques in production (to be detailed in Section 3). (2) Using preserved accuracy to evaluate models compressed with more data or data augmentation (Jiao et al., 2020) can be misleading, since one cannot tell whether the improvement should be attributed to the innovation of the compression technique or to the addition of data. (3) Model robustness, which is critical for production, is often missing from evaluation, leaving a possible safety risk.

[Figure 1: Three metrics to evaluate the compressed models beyond preserved accuracy. For each input, label and probability loyalty measure the shift of label and predicted probability distribution, respectively. Robustness measures the performance of the compressed model under adversarial attacks.]
As illustrated in Figure 1, to measure the resemblance between the student and teacher models, we propose label loyalty and probability loyalty to target different but important aspects. We also explore the robustness of the compressed models by conducting black-box adversarial attacks. We apply representative BERT compression methods of different types to the same teacher model and benchmark their performance in terms of accuracy, speed, loyalty and robustness. We find that methods with a knowledge distillation loss perform well on loyalty and that post-training quantization can drastically improve robustness against adversarial attacks. We use the conclusions drawn from these experiments to combine multiple techniques together and achieve significant improvement in terms of accuracy, loyalty and robustness.

BERT Compression
Compressing and accelerating pretrained language models like BERT has been an active field of research. Some initial work employs conventional methods for neural network compression to compress BERT. For example, Q8-BERT (Zafrir et al., 2019) and Q-BERT (Shen et al., 2020) employ weight quantization to reduce the number of bits used to represent a parameter in a BERT model. Pruning methods like Head Prune (Michel et al., 2019) and Movement Pruning (Sanh et al., 2020) remove weights based on their importance to reduce the memory footprint of pretrained models. Another line of research focuses on exploiting the knowledge encoded in a large pretrained model to improve the training of more compact models. For instance, DistilBERT (Sanh et al., 2019) and BERT-PKD (Sun et al., 2019) employ knowledge distillation (Hinton et al., 2015) to train compact BERT models in a task-agnostic and task-specific fashion, respectively, by mimicking the behavior of large teacher models. Recently, Xu et al. (2020) proposed progressive module replacing, which trains a compact student model by progressively replacing the teacher layers with their more compact substitutes.
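To make the weight quantization idea above concrete, here is a minimal sketch of uniform affine quantization applied to a list of weights and then de-quantized. The function name and the plain-Python representation are illustrative assumptions, not the actual Q8-BERT or Q-BERT implementation; it only shows the rounding error that storing weights in fewer bits introduces.

```python
def quantize_dequantize(weights, num_bits=8):
    """Uniform affine quantization of a weight list, then de-quantization.

    Hypothetical sketch: maps each float weight to an integer code in
    [0, 2^num_bits - 1] and back, mimicking the information loss of
    post-training weight quantization.
    """
    qmin, qmax = 0, 2 ** num_bits - 1
    w_min, w_max = min(weights), max(weights)
    # Scale maps the float range onto the integer grid; guard against
    # a degenerate all-equal weight list.
    scale = (w_max - w_min) / (qmax - qmin) or 1.0
    zero_point = round(qmin - w_min / scale)
    # Quantize: round each weight to its nearest integer code, clamped.
    codes = [min(qmax, max(qmin, round(w / scale) + zero_point))
             for w in weights]
    # De-quantize: recover approximate float weights from the codes.
    return [(c - zero_point) * scale for c in codes]
```

With 8 bits, the reconstruction error per weight is at most about half a quantization step, which is why accuracy usually degrades only mildly.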

Label Loyalty
Model compression is a common practice to optimize the efficiency of a model for deployment (Cheng et al., 2017). In real-world settings, training and deployment are often separate (Paleyes et al., 2020). As such, it is desirable to have a metric that measures to what extent the "production model" differs from the "development model". Moreover, when discussing ethical concerns, previous studies often ignore the risk that model compression could introduce additional biases; a recent work (Hooker et al., 2020) shows that this risk is real. In a nutshell, we would like the student to behave as closely as possible to the teacher, to make it more predictable and to minimize the risk of introducing extra bias. Label loyalty directly reflects the resemblance between the labels predicted by the teacher and student models. It is calculated in the same way as accuracy, but between the student's predictions and the teacher's predictions instead of the ground-truth labels:

$$L_l = \mathrm{Acc}(\mathrm{pred}_t, \mathrm{pred}_s) \tag{1}$$

where pred_t and pred_s are the predictions of the teacher and student, respectively.
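The metric above is straightforward to compute; the following is a minimal sketch (the function name is our own, not from any released codebase):

```python
def label_loyalty(teacher_preds, student_preds):
    """Fraction of inputs where the student's predicted label matches
    the teacher's prediction (not the ground truth)."""
    assert len(teacher_preds) == len(student_preds)
    agree = sum(t == s for t, s in zip(teacher_preds, student_preds))
    return agree / len(teacher_preds)
```

Note that a student can have high accuracy but imperfect loyalty: it may be right on examples the teacher gets wrong, and vice versa.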

Probability Loyalty
Beyond label correspondence, we argue that the predicted probability distribution matters as well. In industrial applications, calibration (Guo et al., 2017), which focuses on the meaningfulness of confidence, is an important issue for deployment. Many dynamic inference acceleration methods (Xin et al., 2020a,b; Schwartz et al., 2020b) use the entropy or the maximum value of the predicted probability distribution as the signal for early exiting. Thus, a shift of the predicted probability distribution in a compressed model could break the calibration and invalidate calibrated early-exiting pipelines. Kullback-Leibler (KL) divergence is often used to measure how one probability distribution differs from a reference distribution:
$$D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{x \in X} P(x) \log \frac{P(x)}{Q(x)} \tag{2}$$

where X is the probability space; P and Q are the predicted probability distributions of the teacher and student, respectively. Here, we use its variant, the Jensen-Shannon (JS) divergence, since it is symmetric and always has a finite value, which is desirable for a distance-like metric:

$$D_{\mathrm{JS}}(P \,\|\, Q) = \frac{1}{2} D_{\mathrm{KL}}(P \,\|\, M) + \frac{1}{2} D_{\mathrm{KL}}(Q \,\|\, M) \tag{3}$$

where $M = \frac{1}{2}(P + Q)$; with a base-2 logarithm, $D_{\mathrm{JS}} \in [0, 1]$. Finally, the probability loyalty between P and Q is defined as:

$$L_p(P \,\|\, Q) = 1 - \sqrt{D_{\mathrm{JS}}(P \,\|\, Q)} \tag{4}$$

where $L_p \in [0, 1]$; a higher $L_p$ represents a higher resemblance. Note that Equation 2 is also known as the KD loss (Hinton et al., 2015); thus KD-based methods naturally have an advantage in terms of probability loyalty.
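The probability loyalty computation can be sketched in a few lines of pure Python (function names are ours; base-2 logarithms are used so that the JS divergence, and hence the loyalty, stays in [0, 1]):

```python
import math

def kl(p, q):
    # D_KL(P || Q) with base-2 log; terms with p_i = 0 contribute 0.
    return sum(pi * math.log(pi / qi, 2)
               for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    # Symmetric Jensen-Shannon divergence against the midpoint M.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def probability_loyalty(p, q):
    # 1 means identical teacher/student distributions; 0 means disjoint.
    return 1 - math.sqrt(js(p, q))
```

For identical distributions the loyalty is exactly 1; for distributions with disjoint support it drops to 0.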

Robustness
Deep learning models have been shown to be vulnerable to adversarial examples that are slightly altered with perturbations often indistinguishable to humans (Kurakin et al., 2017). Previous work (Su et al., 2018) found that small convolutional neural networks (CNNs) are more vulnerable to adversarial attacks than bigger ones. Likewise, we investigate how BERT models perform under attack and how different types of compression affect robustness. We use an off-the-shelf adversarial attack method, TextFooler (Jin et al., 2020), which demonstrates state-of-the-art performance in attacking BERT. TextFooler conducts black-box attacks by querying the BERT model with adversarial inputs where words are perturbed based on their part-of-speech role. We select two metrics from Jin et al. (2020), after-attack accuracy and query number, to evaluate a model's robustness. After-attack accuracy represents the remaining accuracy after the adversarial attack. Query number represents how many queries with perturbed input were made to complete the attack.
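Given per-example attack outcomes, both robustness metrics reduce to simple averages. The sketch below assumes a hypothetical record format of (still_correct, num_queries) per test example; it is not part of the TextFooler API:

```python
def robustness_metrics(attack_results):
    """Summarize black-box attack outcomes.

    attack_results: list of (still_correct, num_queries) tuples, one per
    test example -- whether the model's prediction survived the attack,
    and how many perturbed queries the attacker issued.
    """
    n = len(attack_results)
    # After-attack accuracy: fraction of predictions that survived.
    after_attack_acc = sum(ok for ok, _ in attack_results) / n
    # Average query number: higher means the model is harder to attack.
    avg_queries = sum(q for _, q in attack_results) / n
    return after_attack_acc, avg_queries
```

Both numbers move together in practice: a robust model keeps more predictions correct and forces the attacker to issue more queries per example.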

Dataset and Baselines
We conduct experiments on MNLI (Williams et al., 2018). Following Xu et al. (2020), we truncate the first (bottom) 6 layers of the finetuned teacher and then finetune it as a baseline for 6-layer models. Additionally, we directly optimize the KL divergence (i.e., a pure KD loss) to set an upper bound for probability loyalty.

Training Details
Our implementation is based on Hugging Face Transformers (Wolf et al., 2020). We first finetune a BERT-base model to be the teacher for KD and the source model for quantization and pruning. The learning rate is set to 3 × 10^-5 and the batch size is 64 with 1,000 warm-up steps. For quantization and pruning, the source model is the same finetuned teacher. For downstream KD and BERT-of-Theseus, we initialize the model by truncating the first (bottom) 6 layers of the finetuned teacher, following the original papers (Sun et al., 2019; Xu et al., 2020). QAT uses pretrained BERT-base for initialization. For pretraining distillation, we directly finetune the compressed 6-layer DistilBERT and TinyBERT checkpoints to report results. The pruning percentage for Head Prune is 45%. The hyperparameters of BERT-PKD are from the original implementation. The detailed hyperparameters for each method can be found in Appendix A.

Experimental Results
We show experimental results in Table 1. First, we find that post-training quantization can drastically improve model robustness. A possible explanation is that the regularization effect of post-training quantization (Paupamah et al., 2020; Wu and Flierl, 2020) helps improve the robustness of the model (Werpachowski et al., 2019). A similar but smaller effect can be found for pruning. However, as shown in Table 2, if we finetune the low-precision or pruned model again, the model re-overfits the data and yields even lower robustness than the original model. Second, KD-based models maintain good label loyalty and probability loyalty due to their optimization objectives. Interestingly, compared to Pure KD, where we directly optimize the KL divergence, DistilBERT, TinyBERT and BERT-PKD trade some loyalty in exchange for accuracy. Compared to DistilBERT, TinyBERT achieves higher accuracy by introducing layer-to-layer distillation, while their loyalty remains nearly identical. Also, we do not observe a significant difference between pretraining KD and downstream KD in terms of either loyalty or robustness (p > 0.1). Notably, BERT-of-Theseus has a significantly lower loyalty, suggesting the mechanism behind it differs from KD. We also provide some results on SST-2 (Socher et al., 2013) in Appendix B.

Combining the Bag of Tricks
As we described in Section 4.3, we discover that post-training quantization (PTQ) can improve the robustness of a model while knowledge distillation (KD) loss benefits the loyalty of a compressed model. Thus, by combining multiple compression techniques, we expect to achieve a higher speed-up ratio with improved accuracy, loyalty and robustness.
To combine KD with other methods, we replace the original cross-entropy loss in quantization-aware training and module replacing with the knowledge distillation loss (Hinton et al., 2015) as in Equation 2. For pruning, we perform knowledge distillation on the pruned model. We also apply the temperature re-scaling trick from Hinton et al. (2015). As shown in Table 2, the knowledge distillation loss effectively improves the accuracy and loyalty of pruning, quantization and module replacing. Furthermore, we post-quantize the KD-enhanced models after they are trained. As shown in Table 2, adding post-training quantization boosts both speed and robustness. Notably, the order in which PTQ and KD are applied matters: PTQ→KD has high accuracy and loyalty but poor robustness, while KD→PTQ retains good robustness at the cost of some accuracy. To summarize, we recommend the following compression strategy: (1) conduct pruning or module replacing with a KD loss; (2) for speed-sensitive and robustness-sensitive applications, apply post-training quantization afterwards.
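The KD loss that replaces the cross-entropy objective can be sketched as follows. This is a minimal pure-Python version of the temperature-softened distillation loss of Hinton et al. (2015); the function names and the exact temperature value are illustrative, not the hyperparameters used in our experiments:

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax, computed stably by subtracting the max.
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened distributions,
    scaled by T^2 so gradient magnitudes stay comparable across
    temperatures (the re-scaling trick of Hinton et al., 2015)."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl
```

Because this objective directly matches the student's softened distribution to the teacher's, minimizing it pushes probability loyalty up as a side effect, which is consistent with the results in Table 2.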

Conclusion
In this paper, we propose label and probability loyalty to measure the correspondence of label and predicted probability distribution between compressed and original models. In addition to loyalty, we investigate the robustness of different compression techniques under adversarial attacks. These metrics reveal that post-training quantization and knowledge distillation can drastically improve robustness and loyalty, respectively. By combining multiple compression methods, we can further improve speed, accuracy, loyalty and robustness for various applications. Our metrics help mitigate the gap between model training and deployment, shed light upon comprehensive evaluation for compression of pretrained language models, and call for the invention of new compression techniques.

Ethical Concerns
We include a discussion about the possible ethical risks of a compressed model in Section 3.1. Although our paper is an attempt to mitigate the risk of introducing extra biases to compression, we would like to point out that our metrics do not directly indicate the bias level in the compressed model. That is to say, additional measures should still be taken to evaluate and debias both the teacher and student models.