Continuation KD: Improved Knowledge Distillation through the Lens of Continuation Optimization

Knowledge Distillation (KD) has been extensively used for natural language understanding (NLU) tasks to improve the generalization of a small model (the student) by transferring knowledge from a larger model (the teacher). Although KD methods achieve state-of-the-art performance in numerous settings, they suffer from several problems that limit their effectiveness. It has been shown in the literature that the capacity gap between the teacher and student networks can make KD ineffective. Additionally, existing KD techniques do not mitigate the noise in the teacher's output: modeling the noisy behaviour of the teacher can distract the student from learning more useful features. We propose a new KD method that addresses these problems and facilitates training compared to previous techniques. Inspired by continuation optimization, we design a training procedure that optimizes the highly non-convex KD objective by starting with a smoothed version of this objective and making it more complex as training proceeds. Our method (Continuation-KD) achieves state-of-the-art performance across various compact architectures on NLU (the GLUE benchmark) and computer vision tasks (CIFAR-10 and CIFAR-100).


Introduction
Deep neural networks have achieved great success in many challenging tasks, including natural language processing (Vaswani et al., 2017; Brown et al., 2020a) and computer vision (Dosovitskiy et al., 2020). The most successful neural networks are usually large and overparameterized (Liu et al., 2019; Devlin et al., 2018). The size of these models prevents them from being deployed on machines with low computational power, such as edge devices. Compressing these models would enable a variety of machine-learning-based services to run offline on low-resource machines. This issue is even more severe for NLP models: although the performance of models in this field improved dramatically after the introduction of Transformers (Vaswani et al., 2017), their size has grown exponentially. Nowadays, some of these models have more than 100 billion parameters (Brown et al., 2020b), and they keep growing.
One way to address the expensive computational cost of deep networks and their overparameterization is neural model compression (Jacob et al., 2018; Tjandra et al., 2018). Among compression methods, Knowledge Distillation (KD) (Hinton et al., 2015) is one of the most prominent techniques and has been used to compress a variety of models in deep learning applications such as natural language processing (Clark et al., 2019; Sun et al.; Jiao et al., 2019), speech processing (Yun et al., 2020; Chebotar and Waters, 2016), and computer vision (Mirzadeh et al., 2019; Guo et al., 2020). In KD, we have a large, accurate network, called the teacher, and a small network, called the student, that we wish to train. The innate capacity gap between the student and the teacher was speculated (Lopez-Paz et al., 2015) to impede training and has been addressed by multiple works (Mirzadeh et al., 2019; Jafari et al., 2021). However, the existing solutions have several complications: for example, Mirzadeh et al. (2019) require an extra intermediate network to be trained, and Jafari et al. (2021) has a rigid two-stage structure that requires careful design for successful training. Moreover, none of these techniques is robust against noisy data or noise in the teacher's outputs, which can distract a student with limited capacity from learning more useful features and make it overfit to the noise.
We propose a new solution, Continuation-KD, inspired by the continuation method from optimization and nonlinear equations. During training, we gradually move from a smoothed objective function, which is robust to overfitting to noise, to the original highly non-convex function. We conduct extensive experiments on the GLUE benchmark with DistilRoBERTa (6 layers) (Sanh et al., 2019) and BERT-small (4 layers) (Turc et al., 2019) student models and show a significant improvement over previous baselines. Beyond that, we demonstrate that Continuation-KD also outperforms its competitors in the computer vision setting. Overall, our contributions are the following: 1. We propose a novel KD technique based on the continuation method, which gradually increases the complexity of the loss function and provides better optimization for knowledge distillation scenarios in both computer vision and NLP.
2. We introduce a hinge-like loss function which makes a student trained with our technique robust against noise in the teacher's output.
3. Our proposed method is simple and, unlike its competitors, does not require multiple training stages. This makes our method stable and efficient.
Related Work

Knowledge Distillation (KD)
Knowledge Distillation (Hinton et al., 2015) is a well-known neural model compression method. Despite the success of the original method, it has been shown in the literature (Lopez-Paz et al., 2015; Mirzadeh et al., 2019) that a large gap between the sizes of the student and teacher networks makes KD ineffective. To address this capacity gap problem, Mirzadeh et al. (2019) proposed the teacher assistant knowledge distillation (TAKD) method. However, this technique is computationally expensive, as it requires training multiple TA networks per task. Moreover, the errors of the TAs can accumulate and transfer to the student. To alleviate these problems, Jafari et al. (2021) proposed Annealing-KD, which achieved state-of-the-art performance on NLU and computer vision tasks. Although this method handles the capacity gap problem, it is still not robust against noisy data or noise in the teacher's outputs. Also, its two training phases require an a priori decision on when to switch from the first phase to the second.

Continuation Optimization
The continuation method was first proposed as a numerical method for solving nonlinear equations (Davidenko, 1953) and was later adopted as a heuristic for non-convex optimization (Watson, 2000). The main intuition of the continuation method is to embed the problem we are trying to solve in a continuous family of problems, such that one member of this family is easy to solve and its solution can be pulled over to give an approximate solution of the original problem. The machine learning community has applied the continuation idea to training neural networks. Mobahi and Fisher III (2015) gave the first theoretical analysis bounding the approximate solution given by continuation optimization. Gulcehre et al. (2017) suggested optimizing highly non-convex neural networks by starting with a smoothed objective function and making it more complex over the course of training. We adapt this general method to Knowledge Distillation.
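As a toy illustration of this idea (ours, not from the cited works), consider gradient descent on a one-dimensional non-convex function: from a poor initialization, plain descent gets trapped in a spurious local minimum, while descending a sequence of Gaussian-smoothed versions of the same function, with the smoothing annealed away, tracks the solution toward the global minimizer. The objective f, the closed-form smoothing, and the annealing schedule are all illustrative choices.

```python
import numpy as np

def f(x):
    # Toy non-convex objective: a wide bowl with high-frequency ripples.
    return x**2 + 2.0 * np.sin(8.0 * x)

def grad_smoothed(x, sigma):
    # Gradient of the Gaussian-smoothed objective E[f(x + sigma*eps)], which
    # has the closed form x^2 + sigma^2 + 2*exp(-32*sigma^2)*sin(8x).
    return 2.0 * x + 16.0 * np.exp(-32.0 * sigma**2) * np.cos(8.0 * x)

def descend(x, sigma, lr=0.01, steps=300):
    # Plain gradient descent on the smoothed problem at a fixed sigma.
    for _ in range(steps):
        x -= lr * grad_smoothed(x, sigma)
    return x

x_plain = descend(2.5, sigma=0.0)  # no smoothing: stuck in a spurious minimum
x_cont = 2.5
for sigma in [2.0, 1.0, 0.5, 0.25, 0.1, 0.0]:  # continuation: anneal the smoothing
    x_cont = descend(x_cont, sigma)  # warm-start from the previous solution
```

The continuation run ends near the global minimizer of f (around x ≈ -0.19), while the plain run stays trapped near its initialization.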

Background
Vanilla-KD The original knowledge distillation (Hinton et al., 2015) trains a small network (the student) using two guiding signals: the hard labels coming from the training dataset, and the predictions of a large network pre-trained on the same task (the teacher), known as soft labels. To this end, Vanilla-KD uses a loss function that is a linear combination of two losses. The first is the cross-entropy between the softmax output of the student and the hard labels, and the second is the KL-divergence between the softened softmax outputs of the student and teacher networks. Equation 1 defines this loss function:

L = (1 − λ) CE(σ(z_S(x)), y) + λ KL(σ(z_T(x)/τ), σ(z_S(x)/τ))    (1)

Here CE(·) is the cross-entropy function, KL(·) is the KL-divergence function, z_T(·) and z_S(·) are the teacher and student logits, σ(·) is the softmax function, and τ is the softening parameter. Also, λ ∈ [0, 1] is a hyper-parameter that controls the contribution of each loss term.
Minimizing this loss decreases the student's distance to both the underlying target function and the teacher model. In Vanilla-KD we usually assume that the teacher is a good approximation of the underlying function.
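For reference, the Vanilla-KD objective of Eq. 1 can be sketched in NumPy as follows (our illustration; we omit the optional τ²-rescaling of the KL term that some implementations apply):

```python
import numpy as np

def softmax(z, tau=1.0):
    z = z / tau
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def vanilla_kd_loss(z_s, z_t, y, lam=0.5, tau=2.0):
    # Cross-entropy between the student's softmax output and the hard labels.
    p_s = softmax(z_s)
    ce = -np.log(p_s[np.arange(len(y)), y]).mean()
    # KL-divergence between the softened teacher and student outputs.
    p_t_soft = softmax(z_t, tau)
    p_s_soft = softmax(z_s, tau)
    kl = (p_t_soft * (np.log(p_t_soft) - np.log(p_s_soft))).sum(axis=-1).mean()
    # Linear combination controlled by lambda (Eq. 1).
    return (1.0 - lam) * ce + lam * kl
```

When the student's logits equal the teacher's, the KL term vanishes and only the cross-entropy term remains.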
TAKD In the TAKD method (Mirzadeh et al., 2019), the teacher model first trains an intermediate model with a slightly smaller capacity, called the teacher assistant (TA), using Vanilla-KD.
Then the TA model trains the small-capacity student model, again using Vanilla-KD. TAKD tries to fill the capacity gap between the teacher and the student by introducing the TA model, but the remaining gap can still be large. As mentioned in (Mirzadeh et al., 2019), a better idea is to use hierarchical TAs for a smoother knowledge transfer from the teacher to the student.
Annealing-KD To control the complexity of the teacher model, instead of using multiple TA networks, Annealing-KD (Jafari et al., 2021) applies an annealed dynamic temperature factor to the output of the teacher. Using this factor, Annealing-KD reduces the sharpness of the teacher at the beginning of the training process and then increases it gradually during training. Since the complexity of the teacher grows gradually over training time, the teacher's knowledge transfers to the student much more smoothly than in TAKD.
Annealing-KD has two stages of training. In the first stage, the student learns from the teacher for k epochs: Annealing-KD matches the student's logits to the teacher's logits using a mean squared error loss. At the beginning of training, the temperature factor is set to a high value to apply maximum smoothing to the output of the teacher, and it then decreases gradually until no smoothing effect remains. Formally, one first defines a monotonically increasing function ϕ : N → [0, 1], going from zero at the beginning of training to one at the end. The student loss for stage 1 is then:

L₁ = ∥z_S(x) − ϕ(i) z_T(x)∥₂²    (2)

where i is the training epoch, and z_S(x) and z_T(x) are the logit outputs of the student and the teacher, respectively. In the second stage, Annealing-KD takes the best checkpoint of the first stage and fine-tunes it on the hard labels from the dataset using a cross-entropy loss for m epochs. The hyperparameters k and m must be chosen before training.
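Stage 1 of Annealing-KD can be sketched as follows (our illustration; the linear schedule for ϕ(i) is one concrete choice of a monotonically increasing function, not necessarily the authors' exact schedule):

```python
import numpy as np

def annealing_phi(i, k):
    # A monotonically increasing schedule reaching 1 at the end of stage 1 (k epochs).
    return min(i / k, 1.0)

def annealing_stage1_loss(z_s, z_t, i, k):
    # Stage-1 loss: MSE between the student logits and the annealed
    # (scaled-down) teacher logits.
    return ((z_s - annealing_phi(i, k) * z_t) ** 2).mean()
```

Early in stage 1 the student only needs to match a scaled-down (smoothed) teacher; by epoch k it must match the raw teacher logits.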

Methodology
In this section, we describe our Continuation-KD technique, which addresses both the capacity gap problem and the lack of robustness against noise in the teacher's output. Continuation-KD uses a loss function with two objectives (Eq. 3). The first objective, L_CE, is a cross-entropy loss term that trains the student on the given hard labels.
The second objective, L_CNTKD, is our proposed annealed hinge loss function that gradually trains the student to mimic the behaviour of the teacher. Inspired by the continuation method (Gulcehre et al., 2017), Continuation-KD starts with an easy objective at the beginning of training. As training proceeds, the whole objective function becomes more and more complex. Formally, the loss function of Continuation-KD is defined as follows:

L = ψ(i) L_CE + (1 − ψ(i)) L_CNTKD    (3)

where 1 ≤ i ≤ n indicates the epoch index, with n the maximum number of epochs, and 0 ≤ ψ(i) ≤ 1 is an increasing function with ψ(1) = 0 at the beginning that grows during training. L_CNTKD is defined as:

L_CNTKD = max{0, ∥z_S − ϕ(T_i) z_T∥₂² − ϕ(T_i) m}    (4)

where z_S and z_T are the output logits of the student and teacher networks, respectively; m is the margin factor; 1 ≤ T_i ≤ T_max is the temperature factor, T_max is the maximum temperature, and 0 ≤ ϕ(T_i) ≤ 1 is an increasing function of the training time. We define this function as:

ϕ(T_i) = 1 / T_i    (5)

Note that in Eq. 4, ∥z_S − ϕ(T_i) z_T∥₂² is a mean squared error between the student's logits and the annealed version of the teacher's logits. Also, max{0, ∥z_S − ϕ(T_i) z_T∥₂² − ϕ(T_i) m} is a hinge loss with an annealed margin ϕ(T_i) m. This loss function avoids penalizing negligible differences (those smaller than ϕ(T_i) m) between the outputs of the student and the teacher. This helps the student learn the meaningful behaviour of the teacher rather than focusing on high-frequency fluctuations.
At the beginning of training, we set T_1 = T_max, which yields the most softened version of the teacher's output (ϕ(T_1) = 1/T_max). Since ψ(1) = 0, the student initially learns only the behaviour of the teacher's smoothest version, which is an easy target. Then, during training, we decrease the temperature. In this phase, the functions ψ(i) and ϕ(T_i) are both increasing, which increases the sharpness of the teacher and smoothly shifts the objective from the hinge loss to the cross-entropy loss. Both of these operations increase the complexity of the whole loss function. Note that ϕ(T_i) also anneals the margin m. The reason is that smoothing the teacher with ϕ(T_i) damps its noise as well; therefore, we damp the margin m with ϕ(T_i) to apply a margin proportional to the amount of noise in the smoothed version of the teacher. Figure 1 visualizes the different components of Continuation-KD.
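A minimal sketch of the combined objective of Eqs. 3 and 4 (our illustration: the linear ψ schedule is an assumption, and ϕ(T_i) = 1/T_i is our reading of Eq. 5, consistent with ϕ(T_1) = 1/T_max):

```python
import numpy as np

def phi(T_i):
    # Our reading of Eq. 5: phi(T_i) = 1/T_i, so that phi(T_max) = 1/T_max
    # (maximal smoothing) and phi(1) = 1 (no smoothing).
    return 1.0 / T_i

def continuation_kd_loss(z_s, z_t, y, i, n, T_i, margin):
    # Eq. 3: anneal from the pure KD term (psi = 0) to pure cross-entropy (psi = 1).
    psi = (i - 1) / max(n - 1, 1)  # a linear increasing schedule (illustrative)
    # Eq. 4: hinge on the squared error, with an annealed margin phi(T_i)*m.
    sq = ((z_s - phi(T_i) * z_t) ** 2).sum()
    l_cnt = max(0.0, sq - phi(T_i) * margin)
    # Cross-entropy of the student's softmax output against the hard label y.
    p = np.exp(z_s - z_s.max())
    p /= p.sum()
    l_ce = -np.log(p[y])
    return psi * l_ce + (1.0 - psi) * l_cnt
```

Note how, early in training, small deviations from the smoothed teacher fall inside the annealed margin and incur no penalty, which is exactly the noise-tolerance property discussed above.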
Also note that if we set m = 0 and choose ψ(i) to be the step function in Eq. 6,

ψ(i) = 0 if i ≤ k, and ψ(i) = 1 if i > k,    (6)

then Continuation-KD becomes identical to Annealing-KD, where k is the number of epochs in the first stage and n − k is the number of epochs in the second stage. This shows that Annealing-KD is in fact a special case of Continuation-KD.
Algorithm 1 gives the details of Continuation-KD. It takes a student S, a teacher T, a dataset D, a maximum temperature T_max, the number of epochs n, an increasing function ψ, and a margin m as inputs, and returns the trained student. At the beginning, it sets T = T_max to obtain the maximum smoothing of the teacher. The variable k indicates the number of epochs between updates of T during training. Φ and Ψ are the output values of ϕ(T) and ψ(i). The function GET-MINI-BATCH(D) retrieves a mini-batch (X, Y) from the dataset D. These data samples are then fed into the loss functions in the following lines to compute L_CE and L_CNTKD. The linear combination in line 14 combines these two losses into the continuation loss L. Finally, L is passed to the OPTIMIZATION-BACK-PROPAGATION(·) function, which back-propagates the gradient of this loss and updates the weights of the student; this part of the training is identical to regular neural network training. The SAVE-BEST-CHECKPOINT(·) function evaluates the current student on a validation dataset and saves a checkpoint if it outperforms the previous ones. In the end, we load the best checkpoint and return it.
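The temperature and mixing-weight schedules driving Algorithm 1 can be sketched as follows (our illustration; decrementing T every k = n/T_max epochs and a linear ψ are assumptions consistent with the description, not necessarily the authors' exact choices):

```python
def continuation_schedules(n_epochs, T_max):
    # Returns one (T_i, phi(T_i), psi(i)) tuple per epoch.
    k = max(1, n_epochs // T_max)  # epochs between temperature decrements (assumption)
    T, out = T_max, []
    for i in range(1, n_epochs + 1):
        psi = (i - 1) / max(n_epochs - 1, 1)  # linear increase: psi(1)=0, psi(n)=1
        out.append((T, 1.0 / T, psi))
        if i % k == 0 and T > 1:
            T -= 1  # sharpen the teacher every k epochs
    return out
```

With n = 200 and T_max = 20 (the CV settings reported below), the schedule starts at maximum smoothing with a pure KD objective and ends at an unsmoothed teacher with a pure cross-entropy objective.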
In the next section, we report the experimental results of the Continuation-KD method.

Experiments
This section presents our evaluation results comparing Continuation-KD with other baselines on natural language processing and computer vision tasks. We compare our method with state-of-the-art techniques such as Annealing-KD (Jafari et al., 2021) and TAKD (Mirzadeh et al., 2019), as well as baselines such as Vanilla-KD (Hinton et al., 2015) and training the student only on hard labels. In the following sub-sections, we discuss each in more detail.

Hardware Details
We trained all our baselines using a single NVIDIA V100 GPU. All experiments were run using the PyTorch framework (https://pytorch.org), and for NLP experiments we used the HuggingFace API (https://huggingface.co).

Image Classification
For the image classification tasks, we used the CIFAR-10 and CIFAR-100 datasets, with 10 and 100 classes respectively. Both datasets consist of 60,000 samples, each a 32 × 32 pixel color image, split into 50,000 training and 10,000 test samples.
In all of our computer vision (CV) experiments, ResNet-8 and ResNet-110 are used as the student and the teacher, respectively. For the TAKD baseline, ResNet-20 is used as the TA model. Continuation-KD is trained for 200 epochs with a maximum temperature of 20, a learning rate of 0.2, and a batch size of 32. The following ψ(·) function was used in these experiments: The results of the CIFAR-10 and CIFAR-100 experiments are reported in Tables 1 and 2. We can see that Continuation-KD clearly outperforms the other baselines, with Annealing-KD achieving the second-best results; the remaining baselines performed similarly.

Natural Language Understanding

For our NLU experiments, we evaluated Continuation-KD on the GLUE benchmark. In the first experiment, we used DistilRoBERTa (6 layers) as our student and RoBERTa-large (24 layers) as our teacher. For the TAKD baseline, we used RoBERTa-base (12 layers) as the teacher assistant model. In this experiment, the maximum temperature was 10, the learning rate was 2e-5, and the batch size was 64. The student was initialized from the pre-trained DistilRoBERTa checkpoint and fine-tuned for 30 epochs. The ψ function for this experiment is defined as follows: Tables 3 and 4 show the results of this experiment on the dev set and the test set. As shown in these tables, Continuation-KD achieves superior results on most of the tasks and improves over the previous baselines with a better overall average score.
For the second experiment, we use BERT-large (24 layers) as the teacher, BERT-small (4 layers) as the student, and BERT-base (12 layers) as the teacher assistant for TAKD. We train the teacher for 30 epochs and use the same hyperparameters for the student as in the first experiment. We report the results in Tables 5 and 6. Here, Continuation-KD also outperforms the other baselines and achieves better overall performance than its competitors.
All the results in our experiments show that Continuation-KD performs better than TAKD and Annealing-KD, which indicates the effectiveness of this method. The experimental results support the claim that Continuation-KD provides better generalization than the other methods.

Analysis
To investigate how Continuation-KD works, we conducted two ablation studies examining different aspects of the technique: the effects of the dynamically changing factors, and noise mitigation.
Effects of Dynamic Factors ϕ and ψ. In the first ablation, we scrutinize the effect of each dynamically changing component of the Continuation-KD loss function: ψ(i) from Eq. 3, ϕ(T_i) where it anneals the outputs of the teacher in Eq. 4, and ϕ(T_i) where it anneals the margin of the hinge loss in Eq. 4.
To investigate the effects of these components, we repeated the NLU experiments with the DistilRoBERTa model described in Section 5.3 on the MRPC and RTE datasets three times. In each trial, we fixed two of the components and let only one of them change dynamically during training. Table 7 reports the performance of each of these experiments. The first three columns of the table show the performance of DistilRoBERTa on each dataset when only one of the three dynamic components changes and the other two are fixed. The last column shows the performance of the model under regular training, when all components change dynamically. In the first column, ϕ = 1 is fixed for both the teacher and the margin. In the second column, the margin's coefficient is fixed to 1 and ψ = 0.5. In the third column, the teacher's coefficient is fixed to 1 and ψ = 0.5.
As shown in Table 7, for both datasets the performance in the first three columns is roughly similar, which indicates an equal contribution of each dynamic component to the improvement of the results. The last column shows a dramatic improvement in performance. Hence, we conclude from this experiment that all three dynamic components are necessary for achieving the best performance. Noise Robustness In the second ablation study, we visualize the effect of Continuation-KD with a noisy teacher. For this purpose, we take a low-frequency sinusoidal function and add to it a high-frequency sinusoidal noise (blue curves in Figure 2). Then, we take a fully connected network with one-dimensional input and output and two hidden layers of 128 neurons each as the student model. We sample 3,000 points from the graph of the noisy sinusoidal function and train the student model once with Vanilla-KD and once with Continuation-KD (orange curves in Figure 2). As Figure 2 shows, the model trained with Continuation-KD has a much smoother curve than the model trained with Vanilla-KD and learns the main behaviour of the teacher function rather than the noise.

Conclusion
In this work, we present Continuation-KD, a novel KD method inspired by Continuation optimization.
Our Continuation-KD technique starts by optimizing a smoothed version of the objective function and gradually increases the complexity of the loss towards the original, highly non-convex one. We demonstrated that our method alleviates the capacity gap problem: an innate problem of KD, resulting from the different capacities of the student and teacher networks, which harms performance. Beyond that, we showed that the method can lessen the student's overfitting to noise in the teacher's output. Our technique is stable and efficient because it does not require two training stages. It outperforms competing KD methods for different backbone architectures in both computer vision and NLP.
In this work, we implemented the continuation method by smoothing the objective with a hinge loss; however, it can potentially be realized differently. Investigating other realizations of continuation optimization for improving small models' performance is an interesting next step.

Limitations
One advantage of our proposed method is that it mitigates the noise in the teacher's output and prevents the student from overfitting to it. We demonstrate this claim empirically, but we do not have a rigorous theoretical proof of how the continuation method achieves this robustness.

Figure 1 :
Figure 1: Principle diagram illustrating the different components of Continuation-KD. The main loss function (purple box) is a composition of two losses: the cross-entropy loss L_CE and the continuation loss L_CNT (green boxes). Because of the smoothing in L_CNT by the dynamic factor ϕ, it is an easier objective to optimize than L_CE. During training, Continuation-KD gradually moves from the easier objective to the more complex one with the aid of the dynamic factor ψ.
Figure 2: (a) Behaviour of the student model after training with a noisy teacher without Continuation-KD. (b) Behaviour of the student model after training with a noisy teacher using Continuation-KD. Blue points are samples from the noisy teacher; orange points are samples from the trained student in each scenario.

Table 3 :
DistilRoBERTa results for Continuation-KD on the dev set. F1 scores are reported for MRPC, Pearson correlations for STS-B, and accuracy for all other tasks.

Table 4 :
Performance of DistilRoBERTa trained by Continuation-KD on the GLUE leaderboard, compared with Vanilla-KD, TAKD, and Annealing-KD.

Table 5 :
BERT-small results for Continuation-KD on the dev set. F1 scores are reported for MRPC, Pearson correlations for STS-B, and accuracy for all other tasks.

Table 6 :
BERT-small results for Continuation-KD on the test set. F1 scores are reported for MRPC, Pearson correlations for STS-B, and accuracy for all other tasks.

Table 7 :
Performance of DistilRoBERTa on MRPC and RTE in the ablation study of the dynamic factors.