Tailoring Instructions to Student’s Learning Levels Boosts Knowledge Distillation

It has been commonly observed that a teacher model with superior performance does not necessarily result in a stronger student, highlighting a discrepancy between current teacher training practices and effective knowledge transfer. To better guide the teacher training process, we introduce the concept of distillation influence, which determines the impact of distillation from each training sample on the student's generalization ability. In this paper, we propose Learning Good Teacher Matters (LGTM), an efficient training technique for incorporating distillation influence into the teacher's learning process. By prioritizing samples that are likely to enhance the student's generalization ability, our LGTM outperforms 10 common knowledge distillation baselines on 6 text classification tasks in the GLUE benchmark.


Introduction
The recent success of natural language processing (NLP) is driven by the adoption of large-scale pretrained language models (Devlin et al., 2019; Liu et al., 2019; Dai et al., 2019; Yang et al., 2019). As these models scale up in depth and width, they become increasingly compute- and storage-intensive, making deployment difficult. To address this issue, different methods have been proposed for crafting efficient models with minimal loss in performance, such as weight pruning (Fan et al., 2019; Li et al., 2021a), network quantization (Kim et al., 2021; Zhang et al., 2020), and knowledge distillation (KD) (Sun et al., 2019; Tang et al., 2019; Sun et al., 2020). Among these methods, KD has proven effective in various NLP applications (Jiao et al., 2020) and is widely adopted. The idea of KD is to ask a lightweight student model to mimic the output of a large teacher model so as to transfer the knowledge.

* Equal contribution. † Work done while at Amazon. 1 Our code is publicly available at https://github.com/twinkle0331/LGTM
Ideally, a teacher with better performance should be able to transfer more knowledge to the student. Therefore, in most knowledge distillation algorithms, the teacher network is trained to maximize its own performance. However, multiple studies (Wang et al., 2022a; Cho and Hariharan, 2019) have observed that a teacher with higher performance does not necessarily lead to a better-performing student, and may even cause a performance degradation. Stanton et al. (2021) attribute this inefficiency in knowledge distillation to challenges during optimization. As the model capacity gap between the student and the teacher increases, the optimization process becomes more likely to be trapped in local optima (Cho and Hariharan, 2019; Mirzadeh et al., 2020).
One way to address the performance degradation in KD is to update the teacher via feedback on the student's performance, also known as learning to teach (L2T) (Fan et al., 2018; Zhou et al., 2022). L2T allows the teacher model to adjust its "teaching agenda" by interacting with the student. Among L2T algorithms, online distillation (Zhang et al., 2018; Zhu et al., 2018; Shi et al., 2020) trains the student and teacher concurrently and enforces similarity between their outputs on the training set. However, online distillation focuses on transferring the teacher's knowledge to the student on the training set without explicitly considering how well the student will perform on the validation set. On the other hand, meta distillation (Zhou et al., 2022; Pham et al., 2021) takes the generalization ability of the student on a held-out validation set into account and guides the teacher's learning process to maximize it. However, the optimization objective of meta distillation may result in a degraded teacher model, as the teacher only receives supervision from the student model.
arXiv:2305.09651v1 [cs.CL] 16 May 2023

It is well known that humans are more efficient learners when their teachers provide guidance on the level of attention they should devote to certain problems based on their current knowledge. Similarly, a student model could likely be trained more effectively if it receives such guidance from a teacher. To accomplish this goal, the teacher should prioritize samples that are likely to enhance the student's generalization ability during training, thus allowing the student to perform better on the held-out validation set.
In this work, inspired by the concept of influence functions (Pruthi et al., 2020; Koh and Liang, 2017), we propose distillation influence to estimate how distilling on each training sample impacts the student's performance on the validation set. In addition, we interpret existing L2T methods from the perspective of influence functions, so as to gain a deeper understanding of their limitations. The optimization process of existing L2T methods is often impacted by outliers, because they assign all training samples in a mini-batch the same weight. Hence, we propose our L2T framework, Learning Good Teacher Matters (LGTM), which assigns loss weights to training samples based on their distillation influence.
Extensive experiments have shown that LGTM enables more effective knowledge transfer.
In summary, our contributions are as follows: (1) we introduce distillation influence to estimate how distilling on each training sample impacts the student's generalization ability; (2) we propose LGTM, an efficient learning-to-teach framework that incorporates distillation influence into the teacher's learning process; and (3) LGTM outperforms 10 common knowledge distillation baselines on 6 text classification tasks in the GLUE benchmark.

Notation. We introduce the notation of the Jacobian matrix in the context of the chain rule and gradients. Let $f : \mathbb{R}^k \to \mathbb{R}^n$ be a differentiable function, and let $v \in \mathbb{R}^k$ be a vector. We use $\frac{\partial f}{\partial v} \in \mathbb{R}^{k \times n}$ to denote the Jacobian matrix of $f$, which has dimensions $k \times n$. For simplicity, we abbreviate $\frac{\partial f}{\partial v}$ as $\nabla_v$. We use $X^{\top}$ to denote the transpose of the matrix $X$.

Revisiting Learning to Teach
In this paper, we focus on task-specific distillation given pre-trained language models. Under this setting, the teacher model is already pre-trained in an unsupervised manner, and the student model is either derived from part of the teacher model or pre-trained in an unsupervised manner as well.

Vanilla distillation
The typical approach to knowledge distillation is a two-stage process. It involves first fine-tuning a pre-trained teacher model to maximize its performance on a specific task. Once the teacher model has converged, a student model is trained to closely imitate the output of the teacher model on the training data. The optimization objective for the student model at each mini-batch $z^r = (x^r, y^r)$ is:

$$L_s(\theta_s, \theta_t, z^r) = L_{ce}(y^r, S(x^r; \theta_s)) + L_{ce}(T(x^r; \theta_t), S(x^r; \theta_s)), \quad (1)$$

where $S$ and $T$ denote the student and teacher models, parameterized by $\theta_s$ and $\theta_t$. The update of the student follows:

$$\theta_s^{m+1} = \theta_s^m - \eta_s \nabla_{\theta_s} L_s(\theta_s^m, \theta_t, z^r), \quad (2)$$

where $\eta_s$ is the student's learning rate. The limitation of vanilla distillation is that it does not allow the teacher to adjust its behavior according to the student's feedback, as the teacher's parameters are fixed during the distillation process.
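The student objective above can be sketched in a few lines of code. This is a minimal illustrative version, not the authors' implementation: the loss is the cross-entropy against the hard label plus the cross-entropy against the teacher's softened distribution, computed here in pure Python for a single sample.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(target_probs, logits):
    """H(target, softmax(logits)) for one sample."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target_probs, probs))

def student_kd_loss(student_logits, teacher_logits, label, num_classes):
    """Sketch of eq. (1): hard-label CE plus CE against the teacher's output."""
    one_hot = [1.0 if c == label else 0.0 for c in range(num_classes)]
    hard = cross_entropy(one_hot, student_logits)                  # L_ce(y, S(x))
    soft = cross_entropy(softmax(teacher_logits), student_logits)  # L_ce(T(x), S(x))
    return hard + soft
```

In practice the two terms are often balanced with a weighting coefficient and a temperature; both are omitted here for brevity.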
Online distillation To achieve student-aware distillation, online distillation (Zhang et al., 2018; Zhu et al., 2018; Shi et al., 2020) is proposed, which fine-tunes both the student and teacher models simultaneously in one stage.
In addition to minimizing the cross-entropy loss with respect to the ground-truth labels, the target distribution of the teacher model is constrained to be close to that of the student model through the minimization of the cross-entropy between the outputs of the teacher and student models:

$$L_t(\theta_t, \theta_s, z^r) = L_{ce}(y^r, T(x^r; \theta_t)) + L_{ce}(S(x^r; \theta_s), T(x^r; \theta_t)). \quad (3)$$

The training process involves iteratively updating the parameters of both models:

$$\theta_t^{m+1} = \theta_t^m - \eta_t \nabla_{\theta_t} L_t(\theta_t^m, \theta_s^m, z^r), \qquad \theta_s^{m+1} = \theta_s^m - \eta_s \nabla_{\theta_s} L_s(\theta_s^m, \theta_t^{m+1}, z^r). \quad (4)$$

Through this iterative update, the student model is able to learn from the learning curve of the teacher model (Shi et al., 2020), which improves its performance on the given task. However, online distillation focuses on transferring the knowledge of the teacher to the student on the training set without explicitly considering how well the student model will perform on unseen test data. This might lead to the student model merely memorizing the training examples without generalizing well to new ones (Zhou et al., 2022).
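The alternating updates of eqs. (3)–(4) can be illustrated with a toy scalar example. Here the teacher and student are single parameters fitting a target $y$, and a quadratic coupling term stands in for the cross-entropy between their distributions; all values and the function name are illustrative, not part of the paper.

```python
def online_distillation(y=1.0, steps=200, lr=0.1, theta_t=5.0, theta_s=-5.0):
    """Toy sketch of the online-distillation loop on scalar 'models'."""
    for _ in range(steps):
        # teacher step: fit the ground truth and stay close to the student
        grad_t = 2 * (theta_t - y) + 2 * (theta_t - theta_s)
        theta_t -= lr * grad_t
        # student step: fit the ground truth and mimic the updated teacher
        grad_s = 2 * (theta_s - y) + 2 * (theta_s - theta_t)
        theta_s -= lr * grad_s
    return theta_t, theta_s
```

With both losses pulling toward the same target, the two parameters converge together, mirroring the similarity constraint enforced on the training set.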
Meta distillation Meta distillation (Zhou et al., 2022; Pham et al., 2021) takes into account the feedback from the student model and guides the optimization of the teacher model to maximize the student's generalization ability. The generalization error of the student model is measured by the cross-entropy loss between the ground-truth labels and the student's predictions on a validation batch $z^e = (x^e, y^e)$:

$$L_{val}(\theta_s, z^e) = L_{ce}(y^e, S(x^e; \theta_s)). \quad (5)$$

Meta distillation decomposes the models' learning process into two stages. The first stage fine-tunes a good teacher on task-specific data, similar to vanilla distillation, while the second stage involves iterative updates of the teacher and student models. Note that, compared to online distillation, meta distillation obtains the student's feedback from validation data, not training data.
During the second stage, the student model is first updated through the standard distillation process by minimizing the distillation loss in eq. (1). Then the teacher model is optimized to minimize the updated student's loss on the held-out validation set, which ensures it is able to guide the student towards better generalization. During this process, the teacher is trained only for the purpose of knowledge transfer. Formally, the student model is updated as follows:

$$\theta_s^{m+1} = \theta_s^m - \eta_s \nabla_{\theta_s} L_s(\theta_s^m, \theta_t^m, z^r). \quad (6)$$

The teacher model is then updated as follows:

$$\theta_t^{m+1} = \theta_t^m - \eta_t \nabla_{\theta_t} L_{val}(\theta_s^{m+1}, z^e). \quad (7)$$

However, the optimization objective of meta distillation can result in a degraded teacher model, because the teacher only receives supervision from the student. This prevents the teacher model from continuing to learn and improve in the second stage, impeding its ability to adapt to new data.
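The bilevel structure of eqs. (6)–(7) can also be sketched on scalars: the student takes a distillation step toward the teacher, then the teacher is updated by the gradient of the *updated* student's validation loss, back-propagated through the student step. The closed-form gradients and all values below are illustrative, not the paper's implementation.

```python
def meta_distillation(y=1.0, steps=300, lr_s=0.3, lr_t=0.3,
                      theta_t=4.0, theta_s=0.0):
    """Toy scalar sketch of meta distillation (eqs. 6-7)."""
    for _ in range(steps):
        # student step (eq. 6): minimize (theta_s - theta_t)^2
        theta_s_new = theta_s - lr_s * 2 * (theta_s - theta_t)
        # teacher step (eq. 7): minimize (theta_s_new - y)^2, where the
        # chain rule gives d theta_s_new / d theta_t = 2 * lr_s
        grad_t = 2 * (theta_s_new - y) * (2 * lr_s)
        theta_t -= lr_t * grad_t
        theta_s = theta_s_new
    return theta_t, theta_s
```

Note that the teacher here receives no supervision of its own: it is driven entirely toward whatever makes the student's validation loss small, which is exactly the degradation risk discussed above.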

Methods
To overcome the aforementioned limitations, we introduce our L2T framework, Learning Good Teacher Matters (LGTM), to enable more effective knowledge distillation. We first introduce distillation influence, which estimates how much the student's performance on validation data will change if we include a given training sample in the knowledge distillation process.
Afterwards, we introduce an efficient training method based on finite difference approximation for incorporating distillation influence into the teacher's update. Finally, we interpret current L2T methods from the perspective of influence functions.
Distillation influence The influence function (Pruthi et al., 2020; Koh and Liang, 2017) is a way of measuring the influence of training samples on a model's predictions. It can be used to identify instances that have a disproportionate effect on the model's behavior, whether because they are outliers or because they are incorrectly labeled (Jia et al., 2019; Ghorbani and Zou, 2019; Hara et al., 2019). By calculating the influence function for a particular example, one can estimate the extent to which the model's prediction would change as a result of operations on that sample.
In vanilla distillation, for the student model, we derive the distillation influence of a training sample $z_i^r$ as the gradient similarity between $z_i^r$ and the validation batch $z^e$:

$$I_{distill}(z_i^r, z^e) = \nabla_{\theta_s} L_{ce}(T(x_i^r; \theta_t^m), S(x_i^r; \theta_s^m))^{\top} \, \nabla_{\theta_s} L_{ce}(y^e, S(x^e; \theta_s^{m+1})). \quad (8)$$

The detailed derivation can be found in appendix A.
The influence reflects how well the knowledge gained from a particular sample generalizes.It follows that the teacher should focus on teaching the student to capture training samples that have the highest distillation influences.
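In code, the gradient-similarity view of distillation influence reduces to a dot product between flattened gradient vectors. The sketch below is purely illustrative: the gradients are plain Python lists standing in for the per-sample KD gradient and the validation-batch gradient.

```python
def distillation_influence(grad_kd_sample, grad_val_batch):
    """Dot product between the per-sample KD gradient and the
    validation-batch gradient (cf. eq. 8): positive when the sample's
    update direction agrees with what helps on validation."""
    return sum(g1 * g2 for g1, g2 in zip(grad_kd_sample, grad_val_batch))
```

A positive value means distilling on that sample moves the student in a direction that also reduces validation loss; a negative value flags a sample whose imitation hurts generalization.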
In order to incorporate the per-sample influence into knowledge distillation, we adjust the loss weight of each sample based on its distillation influence. This allows us to determine the relative importance of each sample and to control how much each sample contributes to the teacher's learning process. Samples deemed more beneficial for the student's generalization are assigned higher weights. We then propose training the teacher using the following objective:

$$L_{influence} = \sum_i w_i \, L_{ce}(T(x_i^r; \theta_t), S(x_i^r; \theta_s)), \quad (9)$$

where $w_i = I_{distill}(z_i^r, z^e)$. By including the influence in the knowledge distillation loss function, we can tailor the training process to better suit the characteristics of the target task.

Algorithm 1 LGTM
1: while not converged do
2:   Sample a training batch $z^r = (x^r, y^r) \sim D_{train}$
3:   Copy the student parameters $\theta_s$ to a student copy $\theta_s'$
4:   Update $\theta_s'$: $\theta_s' \leftarrow \theta_s - \eta_s \nabla_{\theta_s'} L_s(\theta_s', \theta_t, z^r)$
5:   Sample a batch of the validation set $z^e = (x^e, y^e) \sim D_{val}$
6:   Calculate $\theta_s^{\pm}$: $\theta_s^{\pm} = \theta_s \pm \epsilon \nabla_{\theta_s} L_{ce}(y^e, S(x^e; \theta_s'))$
7:   Calculate the distillation influence $L_{influence}$ with $z^r$, $\theta_t$, $\theta_s^{\pm}$ and $\epsilon$ (eq. (10))
8:   Update $\theta_t$: $\theta_t \leftarrow \theta_t - \eta_t \nabla_{\theta_t} L_t(\theta_t, \theta_s, z^r)$ (eq. (11))
9:   Update the original $\theta_s$: $\theta_s \leftarrow \theta_s - \eta_s \nabla_{\theta_s} L_s(\theta_s, \theta_t, z^r)$
10:  step ← step + 1
11: end while

Finite difference approximation For standard neural network training, we often compute a consolidated gradient for a mini-batch of $B_r$ training samples to enhance computational efficiency. However, in the context of determining the distillation influence for each sample, computing the per-sample gradient $\nabla_{\theta_s} L_{ce}(T(x_i^r; \theta_t^m), S(x_i^r; \theta_s^m))$ slows down training by a factor of $B_r$. In addition, a naive implementation is memory intensive, because it requires keeping a copy of $\nabla_{\theta_s} L_{ce}(y^e, S(x^e; \theta_s^{m+1}))$.
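The influence-weighted objective of eq. (9) is a simple weighted sum of per-sample KD losses. The sketch below uses illustrative stand-in values for the per-sample losses and influence weights; it is not the authors' implementation.

```python
def weighted_teacher_loss(kd_losses, influences):
    """Sketch of eq. (9): per-sample KD losses weighted by their
    distillation influence w_i before summing."""
    return sum(w * l for w, l in zip(influences, kd_losses))
```

Because the weights can be negative, samples judged harmful to the student's generalization actively push the teacher away from reinforcing them, rather than merely being down-weighted.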
To address this, we propose an efficient method for updating the teacher with the distillation influence by utilizing finite differences (Gleich, 2005), a technique commonly used in numerical analysis for approximating the derivative of a function at a given point. Similar to Pham et al. (2021) and Liu et al. (2018), we approximate $L_{influence}$ by

$$L_{influence} \approx \frac{\sum_i L_{ce}(T(x_i^r; \theta_t), S(x_i^r; \theta_s^{+})) - \sum_i L_{ce}(T(x_i^r; \theta_t), S(x_i^r; \theta_s^{-}))}{2\epsilon}, \quad (10)$$

where $\theta_s^{\pm} = \theta_s \pm \epsilon \nabla_{\theta_s} L_{ce}(y^e, S(x^e; \theta_s^{m+1}))$ and $\epsilon$ is a small scalar. Our proposed method for evaluating the finite difference is computationally efficient, as it only requires two forward passes for $\theta_s$ and one backward pass for $\theta_t$ for a single batch, as opposed to a naive implementation, which requires $B_r$ forward and backward passes for $\theta_s$ and one backward pass for $\theta_t$. We provide more details of the derivation in appendix B.
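The finite-difference trick can be demonstrated on a simple quadratic "loss": perturbing the parameters along the validation gradient and taking a central difference recovers the gradient dot product without any per-sample backward passes. All names and values below are illustrative; `eps` plays the role of the small scalar in eq. (10).

```python
def finite_diff_influence(loss_fn, theta, val_grad, eps=1e-3):
    """Central difference of loss_fn along val_grad, approximating
    grad(loss_fn)(theta) . val_grad."""
    theta_plus = [t + eps * g for t, g in zip(theta, val_grad)]
    theta_minus = [t - eps * g for t, g in zip(theta, val_grad)]
    return (loss_fn(theta_plus) - loss_fn(theta_minus)) / (2 * eps)

def quad_loss(theta):
    """L(theta) = sum(theta_i^2), so its gradient is 2 * theta."""
    return sum(t * t for t in theta)
```

For `theta = [1.0, 2.0]` and `val_grad = [0.5, -1.0]`, the exact dot product is `2*1*0.5 + 2*2*(-1) = -3`, which the central difference matches: two loss evaluations replace an explicit gradient computation.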
Teacher's auxiliary loss Inspired by Pham et al. (2021), in order to balance the trade-off between self-evolution and transferability of the teacher model, we incorporate the loss with respect to the ground truth as $L_{aux}$ into the final objective:

$$L_t(\theta_t, \theta_s, z^r) = \alpha L_{aux} + (1 - \alpha) L_{influence}, \quad (11)$$

where $\alpha$ is the loss ratio.

Figure 2: Comparison of Meta Distill (Zhou et al., 2022) and LGTM on the MNLI validation set. We observe that for LGTM, the student model does not suffer from overfitting (thanks to distillation influence), and the teacher can balance its own evolution and effective knowledge transfer (thanks to the auxiliary loss).
Overall, our method allows the teacher to adapt to the student's abilities and provide more personalized guidance while improving the student's generalization capability. We present the algorithm of LGTM in algorithm 1.
Relationship with other L2T methods Here we interpret current learning to teach methods from the perspective of influence function.
In the case of online distillation, it is assumed that all training samples possess an equivalent distillation influence and that the teacher model is responsible for reducing the transfer difficulty of all training samples.
In contrast, the key differentiating factor between meta distillation and online distillation is the use of a dynamic loss weight. We interpret this weight as a measure of the distillation influence of the current training batch $z^r$ on the generalization ability of the student model. Specifically, it reflects the similarity between the gradients of the training and validation batches, indicating the effect of the current training batch $z^r$ on the validation batch $z^e$ (as detailed in appendix C). However, it should be noted that this weight functions primarily as an adaptive learning rate, adjusting the gradient step proportionally to the degree of gradient similarity. We illustrate the general workflow of vanilla distillation, online distillation, meta distillation, and LGTM in fig. 1.

Experiments
In this section, we first describe our experimental setup, including datasets and baselines, in Sec. 5.1. Then we compare our proposed LGTM to meta distillation to gain a basic understanding of how to incorporate the student's feedback in Sec. 5.2. To further verify the effectiveness of our method, in Sec. 5.3 we compare against 10 widely adopted knowledge distillation baselines and show consistently better results. We then demonstrate how distillation influence works in Sec. 5.4, followed by ablation studies of LGTM in Sec. 5.5.
Training setup Following previous works (Sun et al., 2019; Zhou et al., 2022), we distill BERT-Base (Devlin et al., 2019) into a 6-layer BERT model. For all two-stage baselines, we fine-tune the models on each task. For a fair comparison, both Meta Distill and LGTM utilize feedback from the validation set in the calculation of the distillation loss.

Table 1: Experimental results on the test set of GLUE (from the official test server). We bold the best results for each dataset, as well as the final average accuracy. Following Zhou et al. (2022), the student is initialized with a 6-layer pre-trained BERT (Turc et al., 2019). We can see that LGTM outperforms all 10 baselines.
Detailed training hyperparameters can be found in appendix D.

Comparison with Meta Distillation
Given that our proposed LGTM is closely related to the meta distillation line of work, we first conduct a comparison between LGTM and a specific meta distillation method, Meta Distill (Zhou et al., 2022), to demonstrate the benefit of adopting distillation influence.
We observe that for Meta Distill (blue curve) in fig. 2 (a) and (b), the validation loss of the student model gradually increases in later iterations while the validation accuracy keeps improving until a stable plateau. This clearly indicates that the student model is experiencing overfitting. One possible explanation is that excessive emphasis is placed on certain training samples that generate high loss, e.g., hard samples or outliers. This negatively impacts the generalization ability of the student model, which leads to overfitting.
The key difference between Meta Distill and our LGTM (orange curve) is that LGTM accounts for the per-sample distillation influence while Meta Distill treats all training samples in a batch equally. This enables the filtering of samples that have a detrimental effect on the generalization performance of the student model, leading to a steady decrease of validation loss (fig. 2 (a)) and an improved validation accuracy (fig. 2 (b)).
As for the teacher model, it should not only impart its current knowledge to the student, but also actively seek out new information and perspectives to improve its own understanding. As can be seen in fig. 2 (c), LGTM allows for effective knowledge transfer from the teacher model by incorporating the teacher's auxiliary loss. The validation accuracy of the teacher model keeps improving for LGTM, but drops quickly for Meta Distill.

Main Results
Here we show the results of our proposed method on the test sets of the text classification tasks in the GLUE benchmark. As can be seen in table 1, LGTM outperforms all 10 baselines, including recent strong KD methods (Guo et al., 2022; Huang et al., 2022; Rao et al., 2022; Zhou et al., 2022), which highlights the effectiveness of our method.
More specifically, our proposed method achieves state-of-the-art performance compared to methods that rely on carefully designed training pipelines or loss functions, e.g., PKD (Sun et al., 2019), SKD (Guo et al., 2022), and DIST (Huang et al., 2022). PKD proposes two distillation schemes that enable the student to learn from multiple intermediate layers of the teacher model for incremental knowledge extraction. SKD and DIST both modify the form of the KL-divergence loss to narrow the gap between the teacher and student models. LGTM also does not require a series of teacher assistant models, unlike TAKD (Mirzadeh et al., 2020) and RCO (Jin et al., 2019).
Compared to online distillation methods, LGTM performs better than DML (Zhang et al., 2018), ProKT (Shi et al., 2020), and PESF-KD (Rao et al., 2022). This highlights the importance of incorporating the student's feedback during the training process. An overemphasis on knowledge transfer from the training set may lead to the student overfitting the teacher's outputs, resulting in a reduction in its generalization abilities. Furthermore, unlike meta distillation methods, e.g., Meta Distill (Zhou et al., 2022), our method allows for computing the distillation influence of individual training samples, which enables filtering out samples that may hurt the student's generalization. Therefore, LGTM is able to help the student develop a general understanding of the overall task while alleviating the overfitting issue.

Figure 3: We visualize the relationship between the distillation influence and the predictions from the student and the teacher. Left: our method assigns negative weight to a potentially difficult sample, which helps avoid overfitting. Right: our method assigns positive weight to a potentially easy sample, which encourages model learning.

Analysis of Distillation Influence
We further explore how the distillation influence of samples evolves during actual training. Here, we conduct experiments on the MRPC dataset. The task is to predict whether the sentences in a sentence pair are semantically equivalent (Wang et al., 2018).
First, we select two representative samples, presented in fig. 3, to visualize the trend of the distillation influence and its relationship to the teacher's and the student's predictions.
On the left side of fig. 3, we can see that during the initial stages of training, both the teacher (green) and the student (orange) make wrong predictions. This suggests that the sample poses a significant challenge for both models. In this case, we do not want the student model to mimic the teacher's output too closely, because the teacher is also wrong on this sample. Our method gradually adjusts the loss weight to a negative value, indicating that we temporarily filter out this misleading training sample so that both models can learn faster. As a result, the student model first escapes this predicament. Then, through the student's feedback on the validation set, the teacher model also learns to make the correct prediction. Finally, as training progresses, both the student and the teacher are able to correctly classify this sample, and the distillation influence stabilizes at a near-zero value.
We present another example on the right of fig. 3, where both the student and the teacher are able to accurately predict a given sample. This suggests the sample is easy for both the teacher and the student. In this case, we want to give this sample a high positive weight to form a student-friendly decision boundary. This is similar to designing a curriculum that learns from easy samples to hard ones, as in curriculum learning (Soviany et al., 2022).
We also visualize the average trend of the distillation influence in fig. 4, based on 64 samples randomly chosen from MRPC. We observe that the distillation influence is usually insignificant at the beginning and end of training, but fluctuates in the middle, regardless of whether positive or negative weights are assigned. We hypothesize this is because our method assigns varying weights to each sample during training, with the goal of filtering difficult samples and focusing on samples that are better for generalization.

Ablation Study
Given limited space, we present three studies in this section and show more ablation studies in appendix E.

Table 2: LGTM consistently beats the original methods, which validates the compatibility of LGTM with these losses.

Finite difference approximation We compare LGTM with and without the finite difference approximation (FDA) on the validation set of the MRPC dataset and observe that training with FDA results in an F1 score of 90.4, while training without FDA results in a score of 90.7. There is only a slight drop in performance with the approximation.

Table 3: Comparison of LGTM with and without the FDA method. While their performance is similar, LGTM with FDA is 10× faster than without it.

                Training time   F1
LGTM w/o FDA    117 min         90.7
LGTM w/ FDA     11 min          90.4
Distillation loss There are other distillation losses in the context of knowledge distillation.
Here we evaluate whether LGTM can adapt to those objectives. In particular, we consider the modified loss used in DIST (Huang et al., 2022) and the common mean squared error (MSE). As can be seen in table 2, our LGTM consistently beats the original methods that utilize these distillation objectives, which validates the compatibility of LGTM with different distillation objectives.
Student model size Here we conduct experiments to evaluate the performance of our proposed method when there is a larger capacity gap between the teacher and student models. Specifically, we perform knowledge distillation from BERT-Base (Devlin et al., 2019) to a 4-layer BERT model. As can be seen in table 4, LGTM consistently outperforms the other baselines on most tasks, with competitive results on SST-2. This indicates the robustness of our method and suggests its applicability in various knowledge distillation settings.
Table 4: LGTM again outperforms strong baselines when the capacity gap between teacher and student is larger.
Related Work

In knowledge distillation, three key aspects are typically considered: the teacher model from which knowledge is transferred (learning target), the data on which the model is trained (learning material), and the objective function that defines the learning objective. Efforts have been made to make knowledge distillation more student-friendly by reducing the difficulties in these aspects (Li et al., 2021b). Some works propose updating the teacher and student jointly to make the teacher aware of the student's state. Rao et al. (2022) train for more timesteps to smooth the teacher's distribution for an easier transfer.
In terms of learning material, TinyBERT (Jiao et al., 2020) suggests augmenting the training data to make it more diverse. Kim et al. (2022) propose training the student with samples that are easy for the teacher but difficult for the student. With respect to the learning objective, the most common approach is to match the probabilistic prediction scores of the teacher and student models using the KL-divergence. However, this can cause problems during training, leading to poor performance. Guo et al. (2022) and Huang et al. (2022) propose to soften the constraint with a more tolerant loss. Pham et al. (2021) and Zhou et al. (2022) propose using the student's performance as the optimization objective for the teacher model, allowing the teacher to optimize its knowledge transfer based on feedback from the student. Wang et al. (2022b) propose selecting the appropriate knowledge to guide the optimization of the student.

Conclusion
In this paper, we first revisit several learning to teach paradigms in knowledge distillation. Then we propose distillation influence to determine how distilling from each training sample impacts the student's generalization ability. By visualizing how the distillation influence of each sample changes during training, we can see that a simple re-weighting using distillation influence is able to help student training, e.g., reduce overfitting. Built on top of distillation influence, we propose our learning to teach framework, LGTM, which consistently outperforms existing knowledge distillation methods on text classification tasks in the GLUE benchmark.

Limitations
Although LGTM has demonstrated superior performance in task-specific knowledge distillation, it is worth investigating the potential benefits of combining LGTM with pre-training knowledge distillation (Jiao et al., 2020;Wang et al., 2020).Additionally, while our experiments have been limited to text classification tasks, which are relatively simple for current pre-trained language models, future work should explore the application of LGTM to more complex text generation tasks.

Ethics Statement
During the training process, the teacher and student models are initialized from pre-trained models. However, pre-trained language models are vulnerable to potential ethical and social risks, as discussed by Bommasani et al. (2021) and Weidinger et al. (2021). Therefore, the teacher and student models may be exposed to similar social risks as large language models.

A The Derivation of Distillation Influence
As described by Pruthi et al. (2020), the influence of a training sample $z = (x, y)$ on a test sample $z' = (x', y')$ can be traced by examining the change in the loss of the model $w$ on the test sample. The influence function is defined as the total reduction in loss on the test sample $z'$ induced by the training process whenever the training sample $z$ is utilized:

$$I(z, z') = \sum_t \left[ L(w_t, z') - L(w_{t+1}, z') \right], \quad (12)$$

where $w_{t+1} = w_t - \eta_w \nabla_w L(w_t, z)$, $\eta_w$ is the learning rate, and the model is parameterized by $w_t$ and $w_{t+1}$.
In this context, we focus on the influence of the current training batch on the student model's performance on the validation data. To improve computational efficiency, a batch of samples is drawn from the validation set to evaluate the model's generalization performance. As a result, the influence on a single validation sample, as described in eq. (12), is extended to a batch of validation samples $z^e$. The influence of the current training batch $z^r$ on the validation batch $z^e$ is defined as follows:

$$I(z^r, z^e) = L_{val}(\theta_s^m, z^e) - L_{val}(\theta_s^{m+1}, z^e). \quad (13)$$

By a first-order Taylor expansion, we approximate $I(z^r, z^e)$ as follows:

$$I(z^r, z^e) \approx \eta_s \nabla_{\theta_s} L_s(\theta_s^m, \theta_t^m, z^r)^{\top} \, \nabla_{\theta_s} L_{ce}(y^e, S(x^e; \theta_s^{m+1})). \quad (14)$$

The contribution of a single sample $z_i^r = (x_i^r, y_i^r)$ in the training batch $z^r$ is then (omitting the constant $\eta_s$):

$$I(z_i^r, z^e) = \nabla_{\theta_s} \left[ L_{ce}(y_i^r, S(x_i^r; \theta_s^m)) + L_{ce}(T(x_i^r; \theta_t^m), S(x_i^r; \theta_s^m)) \right]^{\top} \nabla_{\theta_s} L_{ce}(y^e, S(x^e; \theta_s^{m+1})). \quad (16)$$

By excluding the loss terms irrelevant to the teacher in eq. (16), we define the distillation influence of $z_i^r$ to be:

$$I_{distill}(z_i^r, z^e) = \nabla_{\theta_s} L_{ce}(T(x_i^r; \theta_t^m), S(x_i^r; \theta_s^m))^{\top} \, \nabla_{\theta_s} L_{ce}(y^e, S(x^e; \theta_s^{m+1})). \quad (17)$$
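The first-order approximation in eqs. (12)–(14) can be checked numerically on scalars: after one SGD step on training sample z, the loss drop on z' is approximately eta times the product of the two gradients. The quadratic per-sample loss below and all values are illustrative, not from the paper.

```python
def loss(w, z):
    """Toy per-sample loss L(w, z) = (w - z)^2 on scalars."""
    return (w - z) ** 2

def grad(w, z):
    """Gradient of the toy loss with respect to w."""
    return 2 * (w - z)

def influence_exact(w, z, z_prime, eta):
    """Exact loss reduction on z' after one SGD step on z (cf. eq. 12)."""
    w_next = w - eta * grad(w, z)
    return loss(w, z_prime) - loss(w_next, z_prime)

def influence_first_order(w, z, z_prime, eta):
    """First-order TracIn-style approximation: eta * g(z) * g(z')."""
    return eta * grad(w, z) * grad(w, z_prime)
```

For a small learning rate the two quantities agree closely, which is exactly the gradient-similarity form used to define the distillation influence above.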

C A Closer Look at Meta Distillation
In meta distillation, the gradient of the loss on the validation set with respect to the teacher can be derived as follows:

$$\nabla_{\theta_t} L_{val}(\theta_s^{m+1}, z^e) = -\eta_s \nabla^2_{\theta_s, \theta_t} L_{ce}(T(x^r; \theta_t^m), S(x^r; \theta_s^m)) \, \nabla_{\theta_s} L_{ce}(y^e, S(x^e; \theta_s^{m+1})).$$

E.1 Source of the Validation Set

The data used to measure the generalization of the student, whether it comes from an existing validation set or a newly separated set, remains informative in both cases. As such, it is reasonable to expect that the feedback provided by the student to the teacher would not exhibit significant differences between the two sources.
Our experiments demonstrate that utilizing feedback from a validation set, whether pre-existing or newly separated from the training set, does not lead to significant variations in performance. However, it should be noted that the number of training samples may play a role in the results. When a subset of the training set is selected to form a new validation set, the number of training samples is reduced. This reduction may lead to overfitting on datasets of small or medium size, as there is not enough data to inform the model. Conversely, in large datasets, the number of samples is sufficient to cover a substantial portion of the data distribution, thus having minimal impact on the results.

We use eq. (3) to train the teacher. We follow Shi et al. (2020) to initialize both the teacher's and the student's classifiers as zero in the one-stage setting.

E.2 Ratio of Teacher's Self-evolution
A student-friendly teacher should strike a balance between self-evolution and knowledge transfer.It is believed that an excessive focus on self-evolution may result in neglect of feedback provided by the student, leading to instruction that is not centered on the student's needs.Conversely, inadequate focus on self-evolution may prevent the teacher from improving their own abilities, resulting in suboptimal instruction for the student.In either scenario, the outcome is not conducive to fostering a student-friendly environment.
Therefore, we ablate the ratio of the teacher's self-evolution to see how it contributes to the performance of the student. Here, $\alpha$ is the ratio of the teacher's loss with respect to the ground truth in eq. (11). We vary it over {1.0, 0.8, 0.6, 0.4}. In table 7, the performance of the student exhibits a unimodal distribution, which is in agreement with our assumption. Specifically, the results indicate that when the ratio of the teacher's self-evolution is set to 0.6, the performance of the student is optimal.

F Analysis
We further discuss some design choices of current methods, including the initialization state of the teacher and the updating order of the teacher and student models. Following Guo et al. (2022), we apply the entropy gap to evaluate these design choices.
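The entropy-gap diagnostic used throughout this section can be sketched as follows. This is our own minimal reading of the metric (a batch average of the difference between the entropies of the student's and teacher's predictive distributions), not the code of Guo et al. (2022):

```python
import math

def entropy(probs):
    # Shannon entropy of a predictive distribution (nats).
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropy_gap(student_batch, teacher_batch):
    # Average per-example difference between student and teacher entropy.
    # A confident (converged) teacher has low entropy; an uncertain student
    # has high entropy, so the gap shrinks as the student catches up.
    gaps = [entropy(s) - entropy(t) for s, t in zip(student_batch, teacher_batch)]
    return sum(gaps) / len(gaps)
```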

F.1 Impact of the Teacher's Initial State
While vanilla distillation and meta distillation employ a two-stage training approach, online distillation and LGTM employ a one-stage joint training strategy for the teacher and student models. The key difference is whether the teacher network is first fine-tuned on the target task. In this study, we investigate the impact of the teacher network's initial state on the student network.
A teacher network initialized in the same state as the student network can track the student network's progress at all times, but its capabilities may be relatively weak. In contrast, a converged teacher network has superior performance but also a larger capacity gap, which can prevent the student network from acquiring knowledge effectively.
As shown in fig. 5, a lower initial confidence gap between the teacher model and the student model leads to more efficient knowledge transfer. When the initial ability gap is relatively large, it takes more iterations for the student model to catch up to the fine-tuned teacher model. In contrast, when the initial ability gap is smaller, a teacher model initialized in the same state as the student model is able to transfer knowledge to the student more quickly. Specifically, in the early stages the teacher model focuses more on self-evolution than on knowledge transfer, causing the entropy gap to increase; the teacher model then shifts its focus towards knowledge transfer, so the entropy gap first increases and then decreases.

F.2 Prioritizing the Teacher or Student
Online distillation, meta distillation, and LGTM all use bi-level optimization. However, online distillation and LGTM update the teacher network before the student network, while meta distillation updates the student network before the teacher network. In this section, we study the optimal order for updating the teacher and student networks in knowledge distillation.
As shown in fig. 6, updating the teacher model first leads to a lower entropy gap and faster convergence. We conjecture that in this updating order the teacher can formulate an appropriate 'teaching plan' for the student.
The teacher should strive to guide the student toward the most important samples and information, helping the student develop a deep and general understanding of the task. Furthermore, the teacher should take into account that some samples may be difficult even for the teacher itself to classify or understand; for those samples, a lower criterion should be set for the student, which may form a more student-friendly decision boundary.
Therefore, the teacher's output serves as a dynamic learning target for each sample. By updating based on the student's feedback in advance, the teacher is able to reach a state that is optimal for the student's learning and can thus provide an appropriate learning signal. Leveraging this updated supervision signal, the student can close the ability gap faster. In the other two updating orders, the teacher has not yet been updated, so it cannot trade off between samples that are more beneficial for generalization and those that are more challenging to learn from. This may cause knowledge transfer to lag, resulting in a larger entropy gap between the student and the teacher.
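The two updating orders can be contrasted schematically with toy scalar models (our own illustration with made-up update rules, not the paper's training code). In the teacher-first order, the student distills from a teacher that has already incorporated this step's feedback; in the student-first order, the student distills from a stale teacher.

```python
ETA = 0.1  # shared toy learning rate (hypothetical)

def teacher_step(theta_t, theta_s, y):
    # Toy teacher update: move toward the label (self-evolution) while
    # staying close to the student (a stand-in for student feedback).
    grad = 0.6 * (theta_t - y) + 0.4 * (theta_t - theta_s)
    return theta_t - ETA * grad

def student_step(theta_s, theta_t):
    # Toy student update: move toward the teacher's current output.
    return theta_s - ETA * (theta_s - theta_t)

def step_teacher_first(theta_t, theta_s, y):
    theta_t = teacher_step(theta_t, theta_s, y)  # teacher updates first...
    theta_s = student_step(theta_s, theta_t)     # ...student sees the new teacher
    return theta_t, theta_s

def step_student_first(theta_t, theta_s, y):
    theta_s = student_step(theta_s, theta_t)     # student sees the stale teacher
    theta_t = teacher_step(theta_t, theta_s, y)
    return theta_t, theta_s
```

The two orders produce different trajectories from the same starting point, which is the effect the entropy-gap comparison in fig. 6 measures at scale.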

Figure 1 :
Figure 1: Comparison of vanilla distillation, online distillation, meta distillation, and our proposed LGTM. The dotted orange lines show the direction of the gradient flow for model updates. Note that vanilla distillation and meta distillation employ a two-stage training pipeline by first fine-tuning the teacher on the target task. Online distillation and LGTM employ a one-stage joint training strategy for both teacher and student.
Require: student θs, teacher θt, training set Dtrain, validation set Dval
Require: ηs, ηt: learning rates for the student and the teacher
Require: ε: a small scalar
Require: M: the maximum number of training steps
1: while step < M do
2:   Sample a batch of the training set z^r = (x^r, y^r) ∼ Dtrain
3:

Figure 2 :
Figure 2: Performance comparison between Meta Distill (Zhou et al., 2022) and LGTM on the MNLI validation set. We observe that for LGTM, the student model does not suffer from overfitting (thanks to distillation influence), and the teacher can balance its own evolution and effective knowledge transfer (thanks to the auxiliary loss).

Figure 3 :
Figure 3: We select two samples from the MRPC dataset to visualize the trends of their distillation influence during training. We also visualize the relationship between the distillation influence and the predictions from the student and the teacher. Left: our method assigns a negative weight to a potentially difficult sample, which helps avoid overfitting. Right: our method assigns a positive weight to a potentially easy sample, which encourages model learning.

Figure 4 :
Figure 4: We visualize the trend of the distillation influence for 64 random samples in the MRPC dataset. We find that whether the assigned weight is positive or negative, the trend is similar: distillation influence is usually insignificant at the beginning and end of training, but fluctuates in the middle. We hypothesize that this is because our method assigns varying weights to each sample during training, with the goal of filtering out difficult samples and focusing on samples that are better for generalization.
On the learning target, Jin et al. (2019) and Mirzadeh et al. (2020) introduce teacher assistant models at intermediate training steps or of intermediate size, respectively, to narrow the gap between the teacher and student models. Park et al. (2021); Shi et al. (2020)

L_val(θ_s^m, z^e) − L_val(θ_s^{m+1}, z^e) = L_ce(y^e, S(x^e; θ_s^m)) − L_ce(y^e, S(x^e; θ_s^{m+1})), where θ_s^{m+1} = θ_s^m − η_s ∇_{θ_s} L_s(θ_s^m, θ_t^m, z^r). By applying the Taylor expansion, we can approximate L_val(θ_s^{m+1}, z^e) as follows:

Figure 5 :
Figure 5: The entropy gap between the teacher and the student on the SST-2 training set for two-stage and one-stage training strategies. We only keep the loss with respect to ground-truth labels in eq. (3) to train the teacher. We follow Shi et al. (2020) to initialize both the teacher's and student's classifiers as zero in the one-stage setting.

Table 2 :
Experimental results on the test set of GLUE when training with different KD objectives. Our

Table 4 :
Experimental results on the test set of GLUE.

Figure 6 :
A comparison of the entropy gap on the MNLI training set with different orders of updating the teacher. 'Teacher' denotes updating the teacher model first, followed by the student model; 'student' is the opposite; 'simultaneously' denotes updating the teacher and the student at the same time.

Table 7 :
Experimental results on the test set of four GLUE datasets. α is the ratio of the teacher's self-evolution.