Sparse Progressive Distillation: Resolving Overfitting under Pretrain-and-Finetune Paradigm

Conventional wisdom in pruning Transformer-based language models is that pruning reduces model expressiveness and thus is more likely to underfit than overfit. However, under the trending pretrain-and-finetune paradigm, we postulate a counter-traditional hypothesis: pruning increases the risk of overfitting when performed at the fine-tuning phase. In this paper, we aim to address the overfitting problem and improve pruning performance via progressive knowledge distillation with error-bound properties. We show for the first time that reducing the risk of overfitting can improve the effectiveness of pruning under the pretrain-and-finetune paradigm. Ablation studies and experiments on the GLUE benchmark show that our method outperforms the leading competitors across different tasks.


Introduction
Recently, the emergence of Transformer-based language models built on the pretrain-and-finetune paradigm, such as BERT (Devlin et al., 2019) and GPT-3 (Brown et al., 2020), has revolutionized natural language processing (NLP) and established state-of-the-art (SOTA) records (beyond human-level) on various tasks. These models are first pre-trained in a self-supervised fashion on a large corpus and then fine-tuned for specific downstream tasks (Wang et al., 2018). While effective and prevalent, they suffer from redundant computation due to their heavy model size, which hinders their deployment on resource-constrained devices, e.g., mobile phones, smart cameras, and autonomous driving systems (Chen et al., 2021; Qi et al., 2021; Yin et al., 2021a,b; Choi and Baek, 2020).
Various weight pruning approaches (zeroing out certain weights and then optimizing the rest) have been proposed to reduce the footprint requirements of Transformers (Zhu and Gupta, 2018; Blalock et al., 2020; Gordon et al., 2020; Xu et al., 2021). Conventional wisdom in pruning states that pruning reduces the overfitting risk, since the compressed model structures are less complex, have fewer parameters, and are thus believed to be less prone to overfitting (Ying, 2019; Tian et al., 2020; Gerum et al., 2020). However, under the pretrain-and-finetune paradigm, most pruning methods understate the overfitting problem.

* These authors contributed equally.

Figure 1: Pruning under pretrain-and-finetune. In the subfigures, the cylinders on the left describe the pruning process, and the circles on the right represent the knowledge analysis of the sparse model.
In this paper, we postulate a counter-traditional hypothesis, that is: model pruning increases the risk of overfitting if pruning is performed at the fine-tuning phase. As shown in Figure 1b, the pretrain-and-finetune paradigm involves two types of knowledge: the general-purpose language knowledge learned during pre-training (L) and the task-specific knowledge from the downstream task data (D). Compared to conventional pruning, which only discards task-specific knowledge (Figure 1a), pruning under pretrain-and-finetune (Figure 1b) discards extra knowledge (red area) learned in the pre-training phase. Thus, to recover both the extra discarded general-purpose knowledge and the discarded task-specific knowledge, pruning under pretrain-and-finetune increases the amount of information a model needs, which results in relative data deficiency and leads to a higher risk of overfitting. To empirically verify the overfitting problem, we visualize the training and evaluation performance on the real-world MRPC task data (Devlin et al., 2019) in Figure 2. From Figure 2 (b), we observe that the accuracy on the training dataset keeps improving while the accuracy on the validation set stays flat throughout training. From Figure 2 (c), the difference in performance becomes more significant at a higher pruning rate, and the performance on the validation set even worsens after 2,000 training steps. All these observations verify our hypothesis.
The main question this paper attempts to answer is: how do we reduce the risk of overfitting of pre-trained language models caused by pruning? Answering this question is challenging. First, under the pretrain-and-finetune paradigm, both the general-purpose language knowledge and the task-specific knowledge are learned, and it is non-trivial to keep the model parameters related to both kinds of knowledge when pruning. Second, the amount of data for downstream tasks can be small (e.g., privacy-sensitive data), so the overfitting problem can easily arise, especially under high pruning rate requirements. Some recent progress has been made on addressing overfitting associated with model compression, but the results are not remarkable and most of the work focuses on the vision domain (Bai et al., 2020). To address these challenges, we propose SPD, a sparse progressive distillation method, for pruning pre-trained language models. We prune and optimize weight duplicates of the backbone of the teacher model (a.k.a. student modules). Each student module shares the same architecture (e.g., the number of weights and the dimension of each weight) as its duplicate. We replace the corresponding layer(s) of the duplicated teacher model with the pruned sparse student module(s) in a progressive way, and name the new model the grafted model. We validate our proposed method through ablation studies and on the GLUE benchmark. Experimental results show that our method outperforms existing approaches.
We summarize our contributions as follows:
• We postulate, analyze, and empirically verify a counter-traditional hypothesis: pruning increases the risk of overfitting under the pretrain-and-finetune paradigm.
• We propose a sparse progressive pruning method and show for the first time that reducing the risk of overfitting can improve the effectiveness of pruning.
• Moreover, we theoretically show that our pruning method can obtain a sub-network from the student model with accuracy similar to the teacher's.
• Last but not least, we study and minimize the interference between different hyperparameter strategies, including pruning rate, learning rate, and grafting probability, to further improve performance.

Related Work
To summarize, our contribution is identifying the overfitting problem of pruning under the pretrain-and-finetune paradigm and proposing the sparse progressive distillation method to address it. We demonstrate the benefits of the proposed framework through ablation studies and validate our method on eight datasets from the GLUE benchmark. To test whether our method is applicable across tasks, we include both single-sentence and sentence-pair classification tasks. Experimental results show that our method outperforms the leading competitors by a large margin.

Network Pruning. Common wisdom has shown that the weight parameters of deep learning models can be reduced without sacrificing accuracy, e.g., via magnitude-based pruning and the lottery ticket hypothesis (Frankle and Carbin, 2019). Zhu and Gupta (2018) compared small-dense and large-sparse models with the same number of parameters and showed that the latter outperform the former, indicating that large-sparse models have better expressive power than their small-dense counterparts. However, under the pretrain-and-finetune paradigm, pruning leads to overfitting, as discussed above.

Knowledge Distillation (KD). As a common method for reducing the number of parameters, the main idea of KD is that a small student model mimics the behaviour of a large teacher model and achieves comparable performance (Hinton et al., 2015; Mirzadeh et al., 2020). Sanh et al. (2019), Jiao et al. (2020), and Sun et al. (2020) utilized KD to learn universal language representations from a large corpus. However, current SOTA knowledge distillation methods cannot achieve a high model compression rate (less than 10% remaining weights) with only an insignificant performance decrease.

Progressive Learning. The key idea of progressive learning is that the student learns to update module by module with the teacher.
One recent method utilized a dual-stage distillation scheme where student modules are progressively grafted onto the teacher network; it targets the few-shot scenario and uses only a few unlabeled samples to achieve comparable results on CIFAR-10 and CIFAR-100. Xu et al. (2020) gradually increased the probability of replacing each teacher module with its corresponding student module and trained the student to reproduce the behavior of the teacher. However, the performance of the former method on Transformer-based models is unknown, while the latter suffers an obvious performance drop even at a low sparsity (50%).

Problem Formulation
The teacher model and the grafted model (shown in Figure 3) are denoted as f^T and f^G, respectively. Both models have N + 1 layers (i.e., the first N layers are encoder layers, and the (N + 1)-th layer is the output layer). Denote f_i^T(·) and f_i^G(·) as the behaviour functions induced from the i-th encoder of the teacher model and the grafted model, respectively. As shown in Figure 4, we utilize layerwise knowledge distillation (KD), where we aim to bridge the gap between f_i^T(·) and f_i^G(·). The grafted model is trained to mimic the behavior of the teacher model. During training, we minimize the summation loss

    L = Σ_{x ∈ X} Σ_{i=1}^{N+1} λ_i · L_D( f_i^T(x_i), f_i^G(x_i) ),

where X denotes the training dataset, λ_i is the coefficient of the i-th layer loss, L_D is the distillation loss of the layer pair, and x_i is the input of the i-th layer.
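The summation loss above can be sketched as follows; this is a minimal numpy illustration that assumes MSE as the per-layer distillation loss L_D (the hidden-state shapes and λ_i values are illustrative, not the paper's exact choices):

```python
import numpy as np

def layerwise_distill_loss(teacher_states, grafted_states, layer_coeffs):
    """L = sum_i lambda_i * L_D(f_i^T(x_i), f_i^G(x_i)) for one input,
    with L_D taken to be mean squared error between hidden states."""
    total = 0.0
    for t_h, g_h, lam in zip(teacher_states, grafted_states, layer_coeffs):
        total += lam * np.mean((t_h - g_h) ** 2)  # per-layer-pair MSE
    return total
```

Identical teacher and grafted hidden states give zero loss, so a grafted model that perfectly mimics the teacher is a minimizer of this objective.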
During KD, each student module mimics the behavior of the corresponding teacher layer. Similar to (Jiao et al., 2020), we adopt an attention loss L_attn = MSE(A_i^T, A_i^S), which indicates the difference between attention matrices, where MSE(·) is the mean squared error loss function, i is the index of the Transformer layer, and A_i^T, A_i^S are the attention matrices of the teacher and student, respectively. L_pred = −softmax(z^T) · log_softmax(z^S / temp) indicates the soft cross-entropy loss, where z^T and z^S are the soft logits of the teacher and student model, respectively, and temp is the temperature hyper-parameter.
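A hedged sketch of the two distillation terms, assuming standard softmax/log-softmax definitions (the function names are ours):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))  # shift for stability
    return e / e.sum(axis=-1, keepdims=True)

def log_softmax(z):
    shifted = z - z.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def attention_loss(attn_teacher, attn_student):
    """L_attn: MSE between teacher and student attention matrices of a layer."""
    return np.mean((attn_teacher - attn_student) ** 2)

def prediction_loss(z_teacher, z_student, temp=1.0):
    """L_pred = -softmax(z^T) . log_softmax(z^S / temp): soft cross-entropy."""
    return float(-np.sum(softmax(z_teacher) * log_softmax(z_student / temp)))
```

When the student logits match the teacher's, L_pred reduces to the entropy of the teacher's softened distribution, its minimum over student logits.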
We further reduce the number of non-zero parameters in the weight matrices while maintaining accuracy. We denote {W_j}_{j=1}^{i} as the collection of weights in the first i layers, and θ_j as the target sparsity of the j-th layer. The loss function of sparse knowledge distillation then becomes the minimization of L subject to sparsity(W_j) ≤ θ_j for every layer j. After training, we find the sparse weight matrices via W_j ← Π_{S_j}(W_j), where Π_{S_j}(·) denotes the Euclidean projection onto the set S_j = {W_j | sparsity(W_j) ≤ θ_j}.
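For a sparsity constraint, the Euclidean projection Π_{S_j} has a closed form: keep the largest-magnitude entries and zero out the rest. A small numpy sketch (magnitude tie-breaking is arbitrary here, and the function name is ours):

```python
import numpy as np

def project_to_sparsity(W, theta):
    """Euclidean projection of W onto S = {W : sparsity(W) <= theta}.

    The closest matrix in Frobenius norm keeps the largest-magnitude
    entries and zeros the theta-fraction of smallest-magnitude ones.
    """
    k_zero = int(round(theta * W.size))   # number of entries forced to zero
    order = np.argsort(np.abs(W).flatten())  # ascending magnitude
    mask = np.ones(W.size, dtype=bool)
    mask[order[:k_zero]] = False          # drop the smallest magnitudes
    return (W.flatten() * mask).reshape(W.shape)
```

The boolean `mask` plays the role of the pruning mask M used later in Algorithm 1.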

Error-bound Analysis
Our pruning method is similar to finding matching subnetworks using the lottery ticket hypothesis methodology (Frankle and Carbin, 2019; Pensia et al., 2020). We analyze the self-attention (excluding activation); some non-linear activation functions have been analyzed in (Pensia et al., 2020). Lueker (1998) and Pensia et al. (2020) show that there exists a subset of w_i such that the corresponding value of g(x) is very close to f(x).
Analysis on self-attention. The self-attention can be presented as

    Attention(Q, K, V) = softmax(QK^T / √d) V,    (6)

where Q, K, V are linear projections of the input x. Consider a model f(x) with only one self-attention. When the token size of the input x is 1 and the pruning sparsity is θ, based on the Corollary, when d ≥ C log(4/ε), there exists a pattern of w_i^G such that, with probability 1 − ε, the output of the pruned self-attention deviates from f(x) by at most ε, where I(θ_i) is the indicator that determines whether w_i^G is retained. In general, let the token size of x be n, so x = (x_1, x_2, ..., x_n). Consider a teacher model f^T(x) with one self-attention; its i-th output is a combination of the value projections of all tokens weighted by c_ij, where c_ij is the (i, j)-th element of the attention matrix softmax(QK^T / √d).
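For reference, Eq. (6) in its standard scaled dot-product form can be sketched as follows (single head, no activation; the projection matrices W_q, W_k, W_v and the helper name are illustrative):

```python
import numpy as np

def self_attention(x, W_q, W_k, W_v):
    """Scaled dot-product self-attention softmax(Q K^T / sqrt(d)) V,
    with Q = x W_q, K = x W_k, V = x W_v."""
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # rows hold the c_ij
    return weights @ V
```

With a single token (n = 1), the attention weight is 1 and the output equals the value projection, which matches the simplest case analyzed above.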
Based on the Corollary, when d ≥ C log(4/ε), there exists a pattern of w_i^G such that, with probability 1 − ε, each output of the grafted self-attention stays within ε of the teacher's. In summary, the grafted model contains a sub-network whose self-attention output is ε-close to that of the teacher with high probability.

Progressive Module Grafting
To avoid overfitting in the training process of the sparse Transformer model, we further graft student modules (scion) onto the teacher model duplicates (rootstock). For the i-th student module, we use an independent Bernoulli random variable I(θ_i) to indicate whether it will be grafted onto the rootstock. To be more specific, I(θ_i) has probability p (the grafting probability) of being set to 1 (i.e., the student module substitutes the corresponding teacher layer); otherwise, the teacher layer keeps its weight matrices unchanged. Once the target pruning rate is achieved, we apply a linearly increasing probability to graft student modules, which enables the student modules to orchestrate with each other. Different from model compression methods that update all model parameters at once, such as TinyBERT (Jiao et al., 2020) and DistilBERT (Sanh et al., 2019), SPD only updates the student modules on the grafted model. This reduces the complexity of network optimization, which mitigates the overfitting problem and enables the student modules to learn deeper knowledge from the teacher model. The overview is described in Algorithm 1. We further demonstrate the effectiveness of progressive student module grafting in Section 4.2.
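The Bernoulli grafting step can be sketched as below; `graft_layers` is an illustrative helper, not the paper's implementation:

```python
import numpy as np

def graft_layers(teacher_layers, student_layers, p, rng):
    """Build the grafted model: each student module replaces its teacher
    layer with probability p (an independent Bernoulli indicator per layer)."""
    grafted, indicators = [], []
    for t_layer, s_layer in zip(teacher_layers, student_layers):
        use_student = rng.random() < p   # I(theta_i) ~ Bernoulli(p)
        grafted.append(s_layer if use_student else t_layer)
        indicators.append(use_student)
    return grafted, indicators
```

At p = 0 the grafted model is exactly the teacher; at p = 1 it is the full student, so the schedule interpolates between the two during training.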

Algorithm 1 Sparse Progressive Distillation
Input: teacher model f^T (fine-tuned BERT-BASE); grafted model f^G: duplicate of the teacher model; t1, t2, t3: the final training step of pruning, progressive module grafting, and finetuning, respectively; p: the grafting probability.
Output: student model
p ← p0
for t = 0 to t3 do
    if 0 ≤ t < t1 then
        Prune student modules and generate mask M
        Graft student modules with probability p0
    end if
    if t1 ≤ t < t2 then
        Graft student modules with p ← k(t − t1) + p0
    end if
    Calculate the distillation loss L in Eqn. (3)
    For f^G, update sparse weights w ← w · M
    Duplicate sparse weight(s) of f^G to the corresponding student module(s)
end for
return f^G
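The grafting probability schedule in Algorithm 1 (constant p0 during pruning, then linearly increasing with slope k) can be sketched as follows; the cap at 1.0 is our assumption:

```python
def grafting_probability(t, t1, p0, k):
    """Schedule from Algorithm 1: p = p0 while pruning (t < t1),
    then p = k * (t - t1) + p0, capped at 1.0 (full student)."""
    if t < t1:
        return p0
    return min(1.0, k * (t - t1) + p0)
```

Once the schedule reaches 1.0, every teacher layer has been replaced by its sparse student module, i.e., the grafted model has become the student model returned at the end of the algorithm.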
On all listed tasks, SPD even outperforms the teacher model except for RTE, on which SPD exactly retains the full accuracy of the teacher. On average, the proposed SPD achieves a 1.1% higher accuracy/score than the teacher model. We attribute the outstanding performance to three aspects: 1) there is redundancy in the original dense BERT model, so pruning at a low pruning rate (e.g., 50%) does not lead to a significant performance drop; 2) SPD decreases the overfitting risk, which helps the student model learn better; 3) the interference between different hyperparameter strategies is mitigated, which enables SPD to obtain a better student model.
We also compare SPD with other baselines (i.e., 4-layer TinyBERT (Jiao et al., 2020), RPP (Guo et al., 2019), and SparseBERT (Xu et al., 2021)) under higher pruning rates. Results are summarized in Table 2. For fairness of comparison, we remove data augmentation from all of the above methods. We mainly compare the aforementioned baselines with SPD at very high sparsity (e.g., 90%, 95%). In the comparison with TinyBERT4, both SPD (90% sparsity) and SPD (95% sparsity) win: SPD (90% sparsity) has 63.4% and 9% higher evaluation scores than TinyBERT4 on CoLA and MRPC, respectively, and at 95% sparsity SPD outperforms TinyBERT4 by 41.3% and 7.6% on the same two tasks. Compared to RPP, both SPD (90% sparsity) and SPD (95% sparsity) show higher performance on MRPC, with 9.8% and 8.3% higher F1 scores, respectively. SPD also exceeds SparseBERT on all tasks in Table 2; especially on CoLA, SPD (90% sparsity) and SPD (95% sparsity) achieve 2.69× and 2.33× higher Mcc scores, respectively. SparseBERT has competitive performance with SOTA when using data augmentation; its performance drop here may be due to its deficiency in mitigating the overfitting problem.
Overfitting Mitigation. We explore the effectiveness of SPD in mitigating the overfitting problem. Depending on whether progressive grafting or KD is used, we compare four strategies: (a) no progressive, no KD; (b) progressive, no KD; (c) no progressive, KD; (d) progressive, KD (ours). We evaluate these strategies on both the training and validation sets of MRPC. The results are summarized in Figure 5. From (a) to (d), the gap between the evaluation results on the training set and the dev set shrinks, which strongly suggests that the strategy adopted by SPD, i.e., progressive + KD, outperforms the other strategies in mitigating the overfitting problem. Figure 5 (a), (b), and (c) indicate that KD has a bigger impact on mitigating overfitting than progressive grafting alone, as the performance gap between the training set and the dev set decreases more from (a) to (c) than from (a) to (b). From Figure 5 (a), (b), and (c), we also observe that, compared to using neither, either progressive grafting (Figure 5 (b)) or KD (Figure 5 (c)) clearly helps mitigate the overfitting problem. Figures 5 (b), (c), and (d) indicate that combining progressive grafting and KD brings more benefit than using either alone, as Figure 5 (d) has the smallest performance gap between the training set and the dev set. Combined with Table 1 and Table 2, Figure 5 shows that SPD mitigates overfitting and leads to higher performance.

Ablation Studies
In this section, we justify the three schedulers used in our method (i.e., grafting probability, pruning rate, and learning rate), and study the sensitivity of our method with respect to each of them. Study on Components of SPD. The proposed SPD consists of three components (i.e., sparse pruning, knowledge distillation, and progressive module grafting). We conduct experiments to study the importance of each component on GLUE benchmark tasks at 50% sparsity; the results are shown in Table 3. Compared to both sparse + KD and sparse + progressive, SPD achieves performance gains on all tasks. Effects of Grafting Probability Strategy. In our method, we set the grafting probability greater than 0 during pruning, to allow student modules to learn deeper knowledge from the teacher model. To verify the benefit of this design, we set the grafting probability to zero and compare it with our method. The result on RTE is shown in Figure 6. Pruning with grafting (the red curve) shows better performance than pruning without grafting, which justifies grafting during pruning. In addition, we study the sensitivity of our method to the grafting probability (Figure 7). It is observed that p0 = 0.6 achieves the best performance, and the progressive design is better than the non-progressive one. Effects of Pruning Rate Strategy. For the pruning rate scheduler, we compare strategies with different pruning ending steps. The results are shown in Figure 8. It is observed that finishing pruning while the grafting probability stays at p = p0 yields a higher F1 score than the other strategies on MRPC.
Effects of Optimizer Strategy. We also compare our strategy with one that uses only a single learning rate scheduler. The results (Figure 9) indicate that our strategy (i.e., two independent optimizers) is better. We also evaluate different learning rates with a pruning rate of 0.9 and a grafting probability of 0.8.
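A minimal sketch of the two-independent-optimizer strategy; plain SGD and a phase switch at step t1 are our assumptions, not the paper's exact setup:

```python
class SGD:
    """Minimal SGD optimizer so the two-optimizer strategy can be sketched."""
    def __init__(self, lr):
        self.lr = lr

    def step(self, w, grad):
        return w - self.lr * grad

def train_step(t, t1, w, grad, opt_prune, opt_graft):
    """Use one optimizer (and learning rate schedule) during pruning,
    and an independent one afterwards, so the phases do not interfere."""
    opt = opt_prune if t < t1 else opt_graft
    return opt.step(w, grad)
```

Keeping the two optimizers independent means the pruning phase's learning rate decay cannot leak into the grafting/finetuning phase, which is one way to read the paper's claim of reduced interference between hyperparameter strategies.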

Conclusion
In this paper, we postulate a counter-traditional hypothesis that pruning increases the risk of overfitting under the pretrain-and-finetune paradigm. We analyze and empirically verify this hypothesis, and propose a sparse progressive pruning method to address the overfitting problem. We theoretically show that our pruning method can obtain a sub-network from the student model with accuracy similar to the teacher's. We study and minimize the interference between different hyperparameter strategies, including pruning rate, learning rate, and grafting probability. A number of ablation studies and experimental results on eight tasks from the GLUE benchmark demonstrate the superiority of our method over the leading competitors.

Sensitivity Analysis of Learning Rate. The analysis results on RTE and STS-B are shown in Figure 10 and Figure 11, respectively. Results vary across the eight learning rate settings listed in the legend of Figure 10. We also provide evaluation curves (through Figure 15) to further demonstrate the advantages of our proposed method SPD. In each figure, the x-axis is the training steps and the y-axis is the evaluation metric. To obtain the curves, we use the same settings as Table 2.
Moreover, we describe the hyper-parameter settings in detail. For CoLA, we set the max sequence length to 128, the learning rate to 5.0e-4, the grafting probability during pruning to 0.8, the number of training epochs to 60, and the number of pruning epochs to 30. For STS-B, we use the same settings as CoLA. For MRPC, we set the max sequence length to 128, the learning rate to 6.4e-4, the grafting probability during pruning to 0.8, the number of training epochs to 60, and the number of pruning epochs to 30. For RTE, we set the max sequence length to 128, the learning rate to 3.0e-5, the grafting probability during pruning to 0.6, the number of training epochs to 60, and the number of pruning epochs to 30.
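For convenience, the per-task settings reported above can be collected in a single configuration table; this is only a sketch, and the key names are ours:

```python
# Per-task hyper-parameters as reported in the text.
SPD_CONFIGS = {
    "CoLA":  {"max_seq_len": 128, "lr": 5.0e-4, "graft_p": 0.8,
              "train_epochs": 60, "prune_epochs": 30},
    "STS-B": {"max_seq_len": 128, "lr": 5.0e-4, "graft_p": 0.8,
              "train_epochs": 60, "prune_epochs": 30},
    "MRPC":  {"max_seq_len": 128, "lr": 6.4e-4, "graft_p": 0.8,
              "train_epochs": 60, "prune_epochs": 30},
    "RTE":   {"max_seq_len": 128, "lr": 3.0e-5, "graft_p": 0.6,
              "train_epochs": 60, "prune_epochs": 30},
}
```

Laying the settings out this way makes the per-task differences visible at a glance: only the learning rate and the grafting probability vary across tasks.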