Dynamic Knowledge Distillation for Pre-trained Language Models

Knowledge distillation (KD) has proven effective for compressing large-scale pre-trained language models. However, existing methods conduct KD statically, e.g., the student model aligns its output distribution to that of a selected teacher model on a pre-defined training dataset. In this paper, we explore whether a dynamic knowledge distillation that empowers the student to adjust the learning procedure according to its competency can bring benefits in terms of student performance and learning efficiency. We explore dynamic adjustments on three aspects: teacher model adoption, data selection, and KD objective adaptation. Experimental results show that (1) proper selection of the teacher model can boost the performance of the student model; (2) conducting KD with 10% informative instances achieves comparable performance while greatly accelerating training; (3) the student performance can be boosted by adjusting the supervision contribution of different alignment objectives. We find dynamic knowledge distillation promising and provide discussions on potential future directions towards more efficient KD methods.


Introduction
Knowledge distillation (KD) (Hinton et al., 2015) aims to transfer the knowledge from a large teacher model to a small student model. It has been widely used (Sanh et al., 2019; Jiao et al., 2020; Sun et al., 2019) to compress large-scale pre-trained language models (PLMs) like BERT (Devlin et al., 2019) and RoBERTa in recent years. Through knowledge distillation, we can obtain a much smaller model with comparable performance, while greatly reducing memory usage and accelerating model inference.
Although simple and effective, existing methods usually conduct the KD learning procedure statically, e.g., the student model aligns its output probability distribution to that of a selected teacher model on the entire pre-defined corpus. In other words, the following three aspects of KD are specified in advance and remain unchanged during the learning procedure: (1) the teacher model to learn from (learning target); (2) the data used to query the teacher (learning material); (3) the objective functions and the corresponding weights (learning method). However, as the student competency evolves during the training stage, it is unreasonable to pre-define these learning settings and keep them unchanged. Conducting KD statically may lead to (1) unqualified learning from a too-large teacher, (2) repetitive learning on instances that the student has mastered, and (3) sub-optimal learning on alignments that are unnecessary. This motivates us to explore an interesting problem: can a dynamic KD framework that considers the student competency evolution during training bring benefits, in terms of student performance and learning efficiency?
In this paper, we propose a dynamic knowledge distillation (Dynamic KD) framework, which attempts to empower the student to adjust the learning procedure according to its competency. Specifically, inspired by the success of active learning (Settles, 2009), we take the prediction uncertainty, e.g., the entropy of the predicted classification probability distribution, as a proxy of the student competency. We strive to answer the following research questions: (RQ1) Which teacher is proper to learn from as the student evolves? (RQ2) Which data are actually useful for student models in the whole KD stage? (RQ3) Does the optimal learning objective change in the KD process? In particular, we first explore the impact of the teacher size on dynamic knowledge distillation. Second, we explore whether dynamically choosing instances that the student is uncertain about for KD can lead to a better performance and training efficiency trade-off. Third, we explore whether the dynamic adjustment of the supervision from the alignment of prediction probability distributions and the alignment of intermediate representations in the whole KD stage can improve the performance.

Figure 1: The three aspects of dynamic knowledge distillation explored in this paper. Best viewed in color.
Our experimental results show that: (1) A larger teacher model with more layers may raise a worse student. We show that selecting the proper teacher model according to the competency of the student can improve the performance. (2) We can achieve comparable performance using only 10% informative instances selected according to the student prediction uncertainty. These instances also evolve during the training as the student becomes stronger. (3) We can boost the student performance by dynamically adjusting the supervision from different alignment objectives of the teacher model.
Our observations demonstrate the limitations of the current static KD framework. The proposed uncertainty-based dynamic KD framework makes only a very first attempt, and we hope this paper can motivate more future research towards more efficient and adaptive KD methods.

Background: Knowledge Distillation
Given a student model S and a teacher model T, knowledge distillation aims to train the student model by aligning the outputs of the student model to those of the teacher. For example, Hinton et al. (2015) utilize the teacher model outputs as soft targets for the student to learn. We denote S(x) and T(x) as the output logit vectors of the student and the teacher for input x, respectively. KD can be conducted by minimizing the Kullback-Leibler (KL) divergence between the student and teacher predictions:

L_KL = τ² · KL( σ(T(x)/τ) ‖ σ(S(x)/τ) ),

where σ(·) denotes the softmax function and τ is a temperature hyper-parameter. The student parameters are updated according to the KD loss and the original classification loss, i.e., the cross-entropy over the ground-truth label y:

L = L_CE(σ(S(x)), y) + λ_KL · L_KL,

where λ_KL is the hyper-parameter controlling the weight of the knowledge distillation objective. Recent explorations also find that introducing KD objectives that align the intermediate representations (Romero et al., 2015; Sun et al., 2019) and attention maps (Jiao et al., 2020) is helpful. Note that the conventional KD framework is static, i.e., the teacher model is selected before KD, and the training is conducted on all training instances indiscriminately according to pre-defined objectives and the corresponding weights of different objectives. However, it is unreasonable to conduct the KD learning procedure statically as the student model evolves during the training. We are curious whether adaptively adjusting the settings on teacher adoption, data selection and supervision adjustment can bring benefits regarding student performance and learning efficiency, motivating us to explore a dynamic KD framework.
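As an illustration, the static objective above can be sketched in plain Python over a single logit vector (a minimal sketch; real implementations operate on batched tensors, and λ_KL = 0.5 follows the appendix settings):

```python
import math

def softmax(logits, tau=1.0):
    """Temperature-scaled softmax over a logit vector."""
    scaled = [z / tau for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, label, tau=2.0, lambda_kl=0.5):
    """Static KD objective: cross-entropy on the gold label plus the
    tau^2-scaled KL divergence between the softened teacher and student
    distributions (Hinton et al., 2015)."""
    p_t = softmax(teacher_logits, tau)  # softened teacher targets
    p_s = softmax(student_logits, tau)  # softened student prediction
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s))
    ce = -math.log(softmax(student_logits)[label])
    return ce + lambda_kl * (tau ** 2) * kl
```

When the student already matches the teacher, the KL term vanishes and only the cross-entropy contributes.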

Dynamic Knowledge Distillation
In this section, we introduce dynamic knowledge distillation. The core idea is to empower the student to adjust the learning procedure according to its current state, and we investigate the three aspects illustrated in Figure 1.

Dynamic Teacher Adoption
The teacher model plays a vital role in KD, as it provides soft targets that help the student learn the relation between different classes (Hinton et al., 2015). However, there are few investigations on how to dynamically select a proper teacher for the student during the training in KD for PLMs. In the KD of PLMs, teacher models usually share the same architecture, i.e., a large Transformer (Vaswani et al., 2017) model. Thus, the most informative factor of teacher models is their model size, e.g., the number of layers of the teacher model and the corresponding hidden size. This motivates us to take a first step to explore the impact of model size.

Table 1: We find that a bigger teacher with better performance raises a worse student model. Results are averaged over 3 seeds on the validation set. * denotes statistically significant improvement over the best performing baseline with p < 0.05.

Bigger Teacher Not Always Raises Better Student
Specifically, we are curious about whether learning from a bigger PLM with better performance leads to a better distilled student model. We conduct probing experiments to distill a 6-layer student BERT model from BERT BASE with 12 layers and BERT LARGE with 24 layers, respectively. We conduct the experiments on two datasets, RTE (Bentivogli et al., 2009) and CoLA (Warstadt et al., 2019), where the two teacher models exhibit a clear performance gap, and a sentiment classification benchmark, IMDB (Maas et al., 2011). The detailed experimental setup can be found in Appendix A. As shown in Table 1, we surprisingly find that while the BERT LARGE teacher clearly outperforms the smaller BERT BASE teacher model, the student model distilled from the BERT BASE teacher achieves better performance on all three datasets. This phenomenon is counter-intuitive, as a larger teacher is supposed to provide a better supervision signal for the student model. We think there are two possible factors regarding the size of the teacher model that lead to the deteriorated performance: (1) The predicted logits of the teacher model become less soft as the teacher model becomes larger and more confident about its prediction (Guo et al., 2017; Desai and Durrett, 2020), which decreases the effect of knowledge transfer via the soft targets. We find that a smaller τ, with BERT BASE (12 layers) as the teacher model, also leads to a decreased performance of the student model, indicating that the less-softened teacher prediction will decrease the student performance (Appendix B). (2) The capacity gap between the teacher and student model increases as the teacher becomes larger. The competency of the student model cannot match that of the large teacher model, which weakens the performance of KD.
To explore the combined influence of these factors, we distill student models with different numbers of layers and plot the performance gain compared to directly training the student model without distillation in Figure 2. It can be found that by decreasing the student size, the better supervision from the teacher model boosts the performance, while the two counteracting factors dominate as the gap becomes much larger, decreasing the performance gain. We notice that this phenomenon is also observed by Mirzadeh et al. (2020) in computer vision tasks using convolutional networks, showing that it is a widespread issue that needs more in-depth investigation. Note that BERT BASE and BERT LARGE also differ in hidden size; experiments regarding the hidden size, where the phenomenon also exists, can be found in Appendix C.

Uncertainty-based Teacher Adoption
Our preliminary observations demonstrate that selecting a proper teacher model for KD is significant for the student performance. While the capacity gap is an inherent problem once the teacher and the student are set, we are curious about whether dynamically querying the proper teacher according to the student competency during training can make full use of the teacher models. Without loss of generality, we conduct KD to train a student model from two teacher models with different numbers of Transformer layers. We assume that during the initial training stage, the student can rely more on the small teacher model, while turning to the large teacher for more accurate supervision when it becomes stronger. Specifically, we propose to utilize the student prediction uncertainty as a proxy of the competency, inspired by its successful applications in active learning (Settles, 2009), and design two uncertainty-based teacher adoption strategies:

Hard Selection: The instances in one batch are sorted according to the student prediction uncertainty, i.e., the entropy of the predicted class distribution. The instances are then evenly divided into instances that the student is most uncertain about and instances that the student is most confident about. For the uncertain part, the small teacher is queried for supervision signals, while the large teacher provides the soft labels for the instances that the student is confident about.

Soft Selection: The corresponding KD loss weights from the two teachers are adjusted softly at the instance level. Formally, given two teacher models T_1 (BERT BASE, in our case) and T_2 (BERT LARGE), we can re-write the KD objective in the multiple-teacher setting as:

L_KD = w_1 · L_KL^{T_1} + w_2 · L_KL^{T_2},

where L_KL^{T} denotes the loss matching the output logits of the student model and teacher model T. The weights w_1 and w_2 control the relative contributions of the supervision from the two teachers.
We adaptively down-weight the supervision from the large teacher when the student is uncertain about the training instances. The prediction uncertainty is adopted as a measurement of the student competency for instance x:

u(x) = Entropy( σ(S(x)) ),

where σ is a normalization function, e.g., the softmax function mapping the logit vector to a probability distribution. The weights w_1 and w_2 are adjusted as follows:

w_1 = u(x) / U, w_2 = 1 − u(x) / U,

where U is a normalization factor which re-scales the weight to [0, 1]. In this way, the student will pay more attention to the small teacher when it is uncertain about the current instance, while relying on the large teacher when it is confident about its prediction.
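A sketch of the soft selection weighting (this is our reading of the scheme above; we assume the normalization factor U is the maximum possible entropy log C, and the names `w_small`/`w_large` for w_1/w_2 are ours):

```python
import math

def entropy(probs):
    """Shannon entropy of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def soft_teacher_weights(student_probs, num_classes):
    """Instance-level weights for the two teachers (soft selection).
    U = log(num_classes) is the maximum possible entropy, re-scaling
    the uncertainty into [0, 1]."""
    u = entropy(student_probs) / math.log(num_classes)
    w_small = u          # uncertain -> rely on the small teacher (T_1)
    w_large = 1.0 - u    # confident -> rely on the large teacher (T_2)
    return w_small, w_large
```

A maximally uncertain prediction routes all supervision weight to the small teacher; a near-one-hot prediction routes it to the large teacher.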

Experiments Settings
We conduct experiments to distill a 6-layer student model from BERT BASE and BERT LARGE , on RTE, CoLA and IMDB, following settings of probing analysis in Section 3.1.1.

Results
The results of the proposed selection strategies are shown in Table 1. We observe that the hard selection strategy achieves an overall 65.3 accuracy, which outperforms directly learning from the ensemble of the two teacher models. This demonstrates that the proposed strategy is effective in selecting the proper teacher model to learn from. The soft selection strategy also outperforms the baseline, while falling slightly behind the hard version. We attribute this to the fact that the supervision signals from the two teachers are of different softness, which may confuse the student model.

Dynamic Data Selection
The second research question we want to explore is which data are more beneficial for the performance of the student. As the distillation proceeds, the student becomes stronger, so repetitive learning on instances that the student has already mastered can be eliminated. If there are instances that are vital for the learning of the student model, can we conduct KD only on these instances, to improve the learning efficiency? Besides, do the vital instances remain static, or do they also evolve with the student model? These questions motivate us to explore the effect of dynamically selecting instances.

Uncertainty-based Data Selection
We propose to actively select informative instances according to the student prediction uncertainty, inspired by the successful practice of active learning (Settles, 2009). Formally, given N instances in one batch, for each instance x, the corresponding output class probability distribution over the class label y of the student model is P(y | x) = σ(S(x)). We compute an uncertainty score u_x for x using the following strategies, with negligible computational overhead:

Entropy (Settles, 2009), which measures the uncertainty of the student prediction distribution:

u_x = − Σ_y P(y | x) log P(y | x).

Margin, which is computed as the margin between the first and second most probable classes y*_1 and y*_2:

u_x = − ( P(y*_1 | x) − P(y*_2 | x) ).

Least-Confidence (LC), which indicates how uncertain the model is about the predicted class ŷ = arg max_y P(y | x):

u_x = 1 − P(ŷ | x).

We rank the instances in a batch according to their prediction uncertainty, and only choose the top N × r instances to query the teacher model, where r ∈ (0, 1] is the selection ratio controlling the number of queries. Note that in binary classification tasks like IMDB, the subsets selected using the above strategies are identical. We also design a Random strategy that selects N × r instances randomly, to serve as a baseline for evaluating the effectiveness of the selection strategies.
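The three scoring strategies and the top-N×r selection can be sketched as follows (a minimal sketch over per-instance probability lists; the function names are ours):

```python
import math

def entropy_score(p):
    """Entropy of the predicted distribution: higher means more uncertain."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def margin_score(p):
    """Negative margin between the two most probable classes: a small
    margin (high uncertainty) yields a high score."""
    top = sorted(p, reverse=True)
    return -(top[0] - top[1])

def lc_score(p):
    """Least-confidence: 1 minus the probability of the predicted class."""
    return 1.0 - max(p)

def select_uncertain(batch_probs, r=0.1, score_fn=entropy_score):
    """Return indices of the top N*r most uncertain instances in a batch."""
    n_keep = max(1, int(len(batch_probs) * r))
    ranked = sorted(range(len(batch_probs)),
                    key=lambda i: score_fn(batch_probs[i]),
                    reverse=True)
    return ranked[:n_keep]
```

Only the selected indices are then forwarded through the teacher; the rest of the batch skips both the teacher forward pass and the distillation update.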

Experiments
Settings We conduct the investigation experiments on two sentiment classification datasets, IMDB (Maas et al., 2011) and SST-5 (Socher et al., 2013), and sentence-pair classification tasks including MRPC (Dolan and Brockett, 2005) and MNLI (Williams et al., 2018). The dataset statistics and the implementation details can be found in Table 3 and Appendix D, respectively. We report accuracy as the performance measurement for all the evaluated tasks. Besides, we also report the corresponding computational FLOPs for comparing the learning efficiency. In more detail, we divide the computational cost C of KD into three parts: student forward F_s, teacher forward F_t, and student backward B_s for updating parameters. Note that F_s ≈ B_s ≪ F_t, as the teacher model is usually much larger than the student model. By actively learning only from the N × r instances that the student is most uncertain about, the cost is reduced to:

C = F_s + r · (F_t + B_s).

For example, the number of computational FLOPs of a 6-layer student BERT model is 11.3B, while that of a 12-layer BERT teacher model is 22.5B (Jiao et al., 2020). When r is set to 0.1, the total KD cost is reduced from 45.1B to 14.7B.
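This cost accounting can be reproduced from the reported per-pass FLOPs (the formula is reconstructed from the reported 45.1B → 14.7B numbers, assuming the student forward pass is always paid to compute the uncertainty and B_s ≈ F_s):

```python
def kd_flops(f_student, f_teacher, b_student, r=1.0):
    """Per-instance KD cost: the student always runs a forward pass to
    score its own uncertainty, but only a fraction r of instances query
    the teacher and trigger a backward update."""
    return f_student + r * (f_teacher + b_student)

# Numbers from the paper: 6-layer student forward ~ 11.3B FLOPs,
# 12-layer teacher forward ~ 22.5B, student backward assumed ~ forward.
full = kd_flops(11.3, 22.5, 11.3, r=1.0)     # full KD: ~45.1B
reduced = kd_flops(11.3, 22.5, 11.3, r=0.1)  # r = 0.1: ~14.7B (rounded)
```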

Results with Original Dataset
The results when r is set to 0.5 are listed in Table 2. Overall, it can be found that selecting the instances leads to only negligible degradation of the student performance compared to Vanilla KD, showing the effectiveness of the uncertainty-based strategies. Interestingly, the random strategy performs closely to the uncertainty-based strategies, which we attribute to the fact that the underlying informative data space can also be covered by randomly selected instances. Besides, we notice that the performance drop is smaller on tasks with larger training data sizes. For example, selecting informative instances with prediction entropy leads to a 0.2 accuracy drop on the MNLI dataset consisting of 393k training instances, while causing a 0.8 performance drop on MRPC with 3.7k instances. A possible reason is that for a tiny dataset, the underlying data distribution is not well covered by the training data; therefore, further down-sampling the training data results in a larger performance gap. To verify this, we turn to the setting where the original training dataset is enriched with augmentation techniques.

Results with Augmented Dataset
Following TinyBERT (Jiao et al., 2020), we augment the training dataset 20 times with BERT masked language prediction, as it has been proven effective for distilling a powerful student model. Our assumption is that with the data augmentation technique, the training set can sufficiently cover the possible data space, so selecting the informative instances will not lead to a significant performance drop. Besides, it is of great practical value to accelerate the KD procedure by reducing the queries to the teacher model on the augmented dataset. For example, it costs about $3,709 to query all the instances of the augmented MNLI dataset, as mentioned by Krishna et al. (2019). By only querying a small portion of instances to the teacher model, we can greatly reduce the economic cost and ease the possible environmental side-effects (Strubell et al., 2019; Schwartz et al., 2019; Xu et al., 2021) that may hinder the deployment of PLMs on downstream tasks.
The results with TinyBERT-4L as the backbone model and r = 0.1 are listed in Table 4. We can observe that the uncertainty-based selection strategies maintain superior performance while saving computational cost, e.g., the FLOPs are reduced from 24.9B to 4.65B with a negligible average performance decrease. On tasks like SST-5 and IMDB, selecting the 10% most informative instances according to the student prediction entropy can even outperform the original TinyBERT using the whole dataset. Among these strategies, the least-confidence strategy performs relatively poorly, as it only takes the maximum probability into consideration while neglecting the full output distribution.
Performance under Different Ratios We vary the selection ratio r to check the results of different strategies on the augmented SST-5 dataset. The results are shown in Figure 3. Our observations are: (1) There exists a trade-off between performance and training cost, i.e., increasing the selection ratio generally improves the performance of the student model, while resulting in higher training costs. (2) We can achieve the full performance using about 20% of the training data. It indicates that the training data support can be well covered with about 20% of the data, so learning from these instances can sufficiently train the student model. It validates our motivation to select informative instances to reduce the repetitive learning caused by data redundancy. (3) Selection strategies based on the uncertainty of the student prediction make better use of a limited query budget, performing better than random selection, especially when the query number is low.

Analysis
We further conduct experiments on the augmented SST-5 dataset to gain insights into the properties of the selected instances, and visualize the distribution of the selected instances for an intuitive understanding.

Property of Selected Instances
We plot the teacher prediction entropy and the distance from the selected instances to the corresponding category center. From Figure 4 (left), we observe that for hard instances with high uncertainty selected by the student model, the teacher model also regards them as difficult. It indicates that instance difficulty is an inherent property of the data, and the uncertainty-based criterion can discover these hard instances from the whole dataset. Besides, the teacher's entropy on the selected instances increases as the training proceeds, showing that the selected instances also evolve during the training as the student becomes stronger. The right part of Figure 4 demonstrates that uncertainty-based selection picks instances that are farther away from the category center than randomly picked ones. These instances are more informative for the student model to learn the decision boundary between different classes.

Visualization of Selected Instances
We further visualize the distribution of instances in the feature space, i.e., the representation before the classifier layer, using t-SNE (Maaten and Hinton, 2008), and highlight the selected instances with a cross marker. We compare the best performing strategy, margin, and random selection on SST-5. As shown in Figure 5, the randomly selected instances are distributed uniformly in the feature space. The margin strategy instead picks the instances close to the classification boundary. The results demonstrate that the uncertainty-based selection criterion can help the student model pay more attention to the instances that are vital for making correct predictions, thus achieving comparable performance at a much lower computational cost. Overall, our analysis experiments show that uncertainty-based selection is effective for picking instances close to the classification boundary. Besides, the selected instances also evolve as the student model becomes stronger. By learning from these instances, the computational cost of KD is greatly reduced with a negligible accuracy drop.

Dynamic Supervision Adjustment
We finally explore the question of the optimal learning objective functions. Previous studies have shown that integrating alignments on the intermediate representations (Romero et al., 2015; Sanh et al., 2019; Sun et al., 2019) and attention maps (Jiao et al., 2020) between the student and the teacher model can further boost the performance. We are interested in whether the dynamic adjustment of the supervision from different alignment objectives can bring extra benefits. As a first exploration, we only consider the combination of the KL divergence with the teacher prediction and the hidden representation alignment:

L = L_CE + λ_KL · L_KL + λ_PT · L_PT,

where L_PT is the PaTient loss, which measures the alignment between the normalized internal representations of the teacher and the student model (Sun et al., 2019):

L_PT = Σ_{i=1}^{M} ‖ h_s^i / ‖h_s^i‖_2 − h_t^{I_pt(i)} / ‖h_t^{I_pt(i)}‖_2 ‖²_2,

where M is the number of student layers, I_pt(i) denotes the teacher layer aligned to the i-th student layer, and h_s^i and h_t^i denote the representations of the i-th layer of the student and teacher model, respectively.
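A sketch of the PT loss under these definitions (pure Python over per-layer vectors; whether the sum is additionally averaged over layers is a scaling detail we assume here):

```python
import math

def _l2_normalize(h):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in h))
    return [x / norm for x in h]

def patient_loss(student_hiddens, teacher_hiddens, layer_map):
    """PT loss (Sun et al., 2019): squared distance between the
    L2-normalized hidden state of each student layer and its mapped
    teacher layer. layer_map[i] gives the teacher layer index aligned
    to student layer i."""
    loss = 0.0
    for i, t_idx in enumerate(layer_map):
        hs = _l2_normalize(student_hiddens[i])
        ht = _l2_normalize(teacher_hiddens[t_idx])
        loss += sum((a - b) ** 2 for a, b in zip(hs, ht))
    return loss / len(layer_map)  # averaged over aligned layers (assumed)
```

Because both sides are normalized, the loss is zero whenever the student's hidden state points in the same direction as the teacher's, regardless of magnitude.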

Uncertainty-based Supervision Adjustment
Different from previous studies, which set the corresponding alignment objective weights via hyper-parameter search and keep them unchanged during the training, we propose to adjust the weights according to the student prediction uncertainty for each instance. The motivation is that we assume it is unnecessary to force the student model to align all the outputs of the teacher model during the whole training stage. As the training proceeds, the student becomes stronger and may learn informative features different from the teacher model. Therefore, there is no need to force the student to act exactly like the teacher model, i.e., requiring the intermediate representations of the student to be identical to the teacher's. Formally, we turn the weights of the KD objectives into functions of the student prediction uncertainty u(x) = Entropy(σ(S(x))):

λ_KL(x) = (1 − u(x)/U) · λ*_KL, λ_PT(x) = (u(x)/U) · λ*_PT,

where λ*_KL and λ*_PT are pre-defined weights for each objective obtained by parameter search and U is the normalization factor. In this way, the contributions of these two objectives are adjusted dynamically during the training for each instance. For instances that the student is confident about, the supervision from the internal representation alignment is down-weighted; the student thus focuses on mimicking the final prediction probability distribution of the teacher based on its own understanding of the instance. On the contrary, for instances that the student is confused about, the supervision from the teacher model representations can help it learn the features of the instance better.
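A sketch of this per-instance weight adjustment (our reading of the scheme; U is assumed to be the maximum entropy log C, and the λ* default values below are placeholders, not the searched ones):

```python
import math

def dynamic_weights(student_probs, num_classes,
                    lambda_kl_star=0.5, lambda_pt_star=100.0):
    """Adjust the per-instance objective weights by prediction uncertainty.
    Confident predictions emphasize output matching (KL); uncertain ones
    emphasize intermediate-representation matching (PT). The lambda_*_star
    defaults are illustrative placeholders."""
    u = -sum(p * math.log(p) for p in student_probs if p > 0)
    u_norm = u / math.log(num_classes)  # U = max entropy, re-scales to [0, 1]
    lambda_kl = (1.0 - u_norm) * lambda_kl_star
    lambda_pt = u_norm * lambda_pt_star
    return lambda_kl, lambda_pt
```

This reproduces the trend in Figure 6: as the student's predictions grow more confident during training, the effective KL weight rises.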

Experiments Settings
The student model is set to 6 layers, and BERT BASE is adopted as the teacher model. For intermediate layer representation alignment, we adopt the Skip strategy, i.e., I_pt = {2, 4, 6, 8, 10}, as it performs best, as described in BERT-PKD. We conduct experiments on the sentiment analysis task SST-5, the paraphrase detection task MRPC, and the natural language inference task RTE. For λ*_KL and λ*_PT, we adopt the searched parameters provided by Sun et al. (2019).

Results
The results of adaptively adjusting the supervision weights are listed in Table 5. We observe that the proposed uncertainty-based supervision adjustment outperforms the static version BERT-PKD on all tasks, showing that proper adjustment of the KD objectives is effective for improving the student performance. We also plot the batch average of the KL loss weight in Figure 6. As expected, the corresponding weight of the prediction probability alignment objective increases as the student becomes more confident about its predictions, thus paying more attention to matching the output distribution with the teacher model. Interestingly, we find that at the initial stage of training, the KL weight is decreasing. It indicates that learning by aligning the intermediate representations can help the student quickly gain an understanding of the task, thus improving the confidence of its predictions.

Discussions
After the preliminary explorations on the three aspects of Dynamic KD, we observe that it is promising for improving both the learning efficiency and the distilled student performance. Here we provide potential directions for further investigation.
(1) From uncertainty-based selection criteria to advanced methods. In this paper, we utilize the student prediction uncertainty as a proxy for selecting teachers, training instances and supervision objectives. More advanced methods based on more accurate uncertainty estimation (Gal and Ghahramani, 2016), clues from training dynamics (Toneva et al., 2018), or even a learnable selector can be developed.
(2) From isolation to integration. As a preliminary study, we only investigate the three dimensions independently. Future work can adjust these components simultaneously and investigate the underlying correlation between these three dimensions for a better efficiency-performance trade-off.
(3) More fine-grained investigations regarding different components of the Dynamic KD framework: (i) For teacher adoption, exploring whether dynamically training a student model with more teacher models, or teacher models with different architectures, can bring extra benefits; (ii) For data selection, it will be interesting to investigate whether the informative data are model-agnostic, and whether dynamically selecting data from different domains can improve the generalization performance; (iii) For supervision adjustment, investigations on the effect of combinations of different objectives can be promising.

Related Work
Our work relates to recent explorations on applying KD for compressing the PLMs and active learning.
Knowledge Distillation for PLMs Knowledge distillation (Hinton et al., 2015) aims to transfer the dark knowledge from a large teacher model to a compact student model, and has been proven effective for obtaining compact variants of PLMs. These methods can be divided into general distillation (Sanh et al., 2019; Turc et al., 2019) and task-specific distillation (Sun et al., 2019; Jiao et al., 2020; Li et al., 2020a; Li et al., 2020b; Wu et al., 2021). The former conducts KD on a general text corpus, while the latter trains the student model on task-specific datasets. In this paper, we focus on the latter as it is more widely adopted in practice. Compared to existing static KD work, we are the first to explore the idea of Dynamic KD, making KD more flexible, efficient and effective.
Active Learning (Settles, 2009) allows a learning system to choose the data from which it learns, in order to achieve better performance with fewer labeled data. Traditional selection strategies include uncertainty-based methods (Scheffer et al., 2001; Settles, 2009), which select the instances that the model is most uncertain about; query-by-committee (Freund et al., 1997), which selects instances with the highest disagreement among a set of classifiers; and methods based on decision theory (Roy and McCallum, 2001). In this paper, inspired by the success of active learning, we introduce Dynamic KD, which utilizes strategies like prediction entropy as a proxy of student competency to adaptively adjust different aspects of KD. Our explorations show that the uncertainty-based strategies are effective for improving the efficiency and performance of KD.

Conclusion
In this paper, we introduce dynamic knowledge distillation, and conduct exploratory experiments regarding teacher model adoption, data selection and supervision adjustment. Our experimental results demonstrate that dynamic adjustments on these three aspects according to the student uncertainty are promising for improving the student performance and learning efficiency. We provide discussions on potential directions worth exploring in the future, and hope this work can motivate studies towards more environmentally friendly knowledge distillation methods.

A Teacher Size Exploration Settings
We conduct the knowledge distillation with BERT BASE and BERT LARGE as teacher models. The student model is set to a 6-layer BERT. For training the teachers, each teacher model is fine-tuned using the script provided by the Huggingface Transformers library. The fine-tuning learning rate is 2e-5, with a linear warm-up learning rate schedule for the first 10% of training steps. The batch size is 32, the number of training epochs is set to 3, and the max input sentence length is set to 128. The statistics of the used datasets are listed in Table 3.
For distilling the student model, the student model is initialized with the weights of the first 6 layers of BERT BASE. We adopt the KL divergence as the KD objective. λ_KL is set to 0.5, and we empirically find this setting works well. The same training hyper-parameters as for fine-tuning the teacher model are used for distillation. The performance is evaluated on the validation set and averaged over 3 random seeds.

B Impacts of Prediction Smoothness
To examine the influence of less-softened teacher predictions, we conduct distillation experiments with various temperatures τ, using hyper-parameters identical to the previous experiments, to mimic the sharpening effect introduced by a larger teacher size. In more detail, we set up a student model with 6 layers as before and select BERT BASE as the teacher model. The results are illustrated in Figure 7. We observe that on both datasets, a decreased temperature τ leads to a performance decrease. It indicates that the less-softened probability distribution indeed weakens the performance of knowledge distillation.
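The sharpening effect can be illustrated directly: the same teacher logits produce a more peaked distribution at lower temperature, mimicking an over-confident larger teacher (a toy demonstration, not the paper's exact setup):

```python
import math

def softmax(logits, tau=1.0):
    """Temperature-scaled softmax: small tau sharpens, large tau softens."""
    exps = [math.exp(z / tau) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [3.0, 1.0, 0.0]          # illustrative teacher logits
soft = softmax(logits, tau=4.0)   # high temperature -> softer targets
sharp = softmax(logits, tau=0.5)  # low temperature -> near one-hot targets
# max(sharp) > max(soft): the low-temperature distribution is more peaked,
# carrying less inter-class information for the student to learn from.
```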

C Impacts of Teacher Hidden Size
As mentioned in the main paper, we observe that a larger teacher may not raise a student model with better performance. We further conduct experiments regarding the hidden size of the teacher model. Specifically, we set up a student with 6 layers and 256 hidden units. The small teacher and the large teacher are a 12-layer BERT model with 256 hidden units and a 12-layer BERT with 768 hidden units, respectively. Our experiments show that on the CoLA dataset, the student model distilled with the small teacher achieves a Matthews correlation score of 11.9, while that of the model distilled by the large teacher is 8.8. The result on IMDB is consistent, i.e., 83.2 accuracy for the student model distilled by the large teacher and 83.4 accuracy for the student distilled by the small teacher. These results again verify the phenomenon that a larger teacher may not always raise a better student model.

D Implementation Details
Our implementation is based on PyTorch and the Huggingface Transformers library. The model is optimized with the AdamW optimizer with linear learning rate warm-up. The sentence length is set to 64 for SST-5 and 128 for the remaining datasets. Our teacher model is BERT BASE. The model is trained with learning rate 2e-5 and batch size 32 for 3 epochs. λ_KL is set to 0.5, with the temperature τ set to 1.
For experiments using TinyBERT, we select TinyBERT 4 v2 as our backbone model and conduct general distillation for 10 epochs on the augmented dataset. We further train the model for 3 epochs on the augmented dataset, choosing learning rates from {1e-5, 2e-5, 3e-5} and batch sizes from {16, 32} based on the performance on the validation set. The performance is evaluated on the test sets, and τ is set to 1, following the practice of TinyBERT.