Adaptive Contrastive Knowledge Distillation for BERT Compression

In this paper, we propose a new knowledge distillation approach called adaptive contrastive knowledge distillation (ACKD) for BERT compression. Different from existing knowledge distillation methods for BERT, which implicitly learn discriminative student features by mimicking the teacher features, we first introduce a novel contrastive distillation loss (CDL) based on hidden state features in BERT as explicit supervision for learning discriminative student features. We further observe that sentences with similar features may have completely different meanings, which makes them hard to distinguish. Existing methods do not pay sufficient attention to these hard samples with less discriminative features. Therefore, we propose a new strategy called sample adaptive reweighting (SAR) to adaptively pay more attention to these hard samples and strengthen their discrimination abilities. We incorporate our SAR strategy into our CDL to form the adaptive contrastive distillation loss, based on which we construct our ACKD framework. Comprehensive experiments on multiple natural language processing tasks demonstrate the effectiveness of our ACKD framework.


Introduction
Recently, deep learning (Liu et al., 2023; Guo et al., 2023; Liu et al., 2021; Guo et al., 2022a) has achieved great success in many natural language processing tasks. However, due to limited computation and storage resources, current deep learning approaches are hard to deploy on mobile devices. Knowledge distillation is an effective approach to compress models for mobile deployment: it uses a pretrained teacher network to help the training of a lightweight student network. To achieve this, the student needs to learn discriminative features. Namely, we need to push the features of the samples from different classes (negative pairs) far away from each other and keep the features of the samples from the same classes (positive pairs) close.
Current knowledge distillation methods for BERT learn discriminative student features implicitly. They assume the teacher is well-learned (i.e., the features of negative pairs are far away from each other in the teacher). Then, they minimize the feature distance of each sample between the teacher and student to make the student features discriminative, as shown in Fig. 1(a). In this way, the features of negative pairs in the student can be pulled far away from each other. However, this assumption does not always hold. Commonly used words appear in sentences with different meanings, which can cause the features of negative pairs in the teacher to be close to each other, as shown in Fig. 1(b). In this case, training the student with the current knowledge distillation paradigm will result in the features of negative pairs in the student being close to each other as well.

Table 1: Examples of hard samples from GLUE.

Linguistically acceptable              | Linguistically unacceptable
Harry coughed himself into a fit.      | Harry coughed us into a fit.
This building got taller and taller.   | This building is taller and taller.
Bill cried himself to sleep.           | Bill cried Sue to sleep.
So, it is desirable to introduce explicit supervision (e.g., a well-designed loss) to push the features of negative pairs in the student far away from each other. Another issue in existing knowledge distillation methods is that they do not pay sufficient attention to hard samples in the distillation process. Similar sentences may have completely different meanings. For example, in the linguistic acceptability task, although the sentences "We yelled ourselves hoarse" and "We yelled Harry hoarse" are similar, differing in only one word, the first sentence is linguistically acceptable while the latter is not, so they fall into different categories. This makes these sentences hard to distinguish because their features are similar and thus less discriminative. This phenomenon often occurs in other natural language processing tasks as well, and we provide more examples from the GLUE benchmark (Wang et al., 2019) in Table 1. Therefore, it is also desirable to pay more attention to hard samples to strengthen their discrimination abilities.
To solve the aforementioned problems, we propose a new knowledge distillation framework called adaptive contrastive knowledge distillation (ACKD). Specifically, to tackle the first issue (i.e., the lack of explicit supervision), we introduce the concept of contrastive learning (Gutmann and Hyvärinen, 2010; Oord et al., 2018; Saunshi et al., 2019; Hjelm et al., 2018) to knowledge distillation and design a contrastive distillation loss (CDL) as explicit supervision to maximize the distance between the features of negative pairs. In particular, for each sample s, our CDL aims to maximize the similarity between the feature of s in the student and that in the teacher, and to minimize the similarity between the feature of s in the student and the features of the negative pairs of s in the teacher. As shown in Fig. 1(c), our CDL can effectively push the features of negative pairs far away from each other.
To tackle the second issue (i.e., the learning of hard samples), we propose a new strategy called sample adaptive reweighting (SAR) in our ACKD framework to adaptively pay more attention to hard samples and strengthen their discrimination abilities. Specifically, we utilize a neural network as a predictor to estimate the discrimination ability of each sample based on its learned feature. Then, we reweight the loss of different samples according to the predicted discrimination ability. As all operations in this process are differentiable, the parameters of the predictor can be jointly learned with the student. We seamlessly incorporate our SAR strategy into the newly proposed CDL and construct the adaptive contrastive distillation loss (A-CDL).
We combine our A-CDL with existing knowledge distillation methods to construct our adaptive contrastive knowledge distillation (ACKD) framework. Constructing the ACKD framework is non-trivial because our A-CDL is calculated based on the features, which are only available within one mini-batch due to the property of current deep learning frameworks (i.e., features are released after the computation of the current batch). As a result, the diversity of negative paired samples is limited by the batch size, causing an inaccurate optimization direction. To overcome this issue, inspired by (He et al., 2020), we construct a dynamic feature storage that can store the features of a large number of samples, based on which we calculate our A-CDL to increase the sample diversity.
In summary, the main contributions of this paper are as follows:
• We propose a novel contrastive distillation loss (CDL) that introduces explicit supervision for learning discriminative student features.
• We propose a new sample adaptive reweighting (SAR) strategy to adaptively pay more attention to hard samples and strengthen their discrimination abilities. We seamlessly incorporate our SAR strategy into our CDL to form the adaptive contrastive distillation loss (A-CDL). Based on A-CDL, we construct our new adaptive contrastive knowledge distillation (ACKD) framework for BERT compression, in which dynamic feature storage is used to increase the diversity of samples.
• Comprehensive experiments on multiple natural language processing tasks demonstrate the effectiveness of our ACKD framework.

Related Work
Knowledge distillation. Recently, model compression methods (Guo et al., 2020a,b,c, 2021, 2022b, 2023; Wei et al., 2023; Qin et al., 2022, 2023a,b,c; Liu et al., 2020, 2022a,c; Peng et al., 2019) have attracted much attention, among which knowledge distillation approaches (Liu et al., 2022b) were proposed to accelerate deep neural networks (Ma et al., 2021, 2022; Hu et al., 2021). For example, (Hinton et al., 2015) first proposed to use the so-called dark knowledge as additional supervision for training the student. After this work, many methods (Romero et al., 2015; Zagoruyko and Komodakis, 2017) were proposed to utilize intermediate features as supervision in the distillation process. Another line of work finds that knowledge distillation cannot achieve promising performance when there is a large capacity gap between teacher and student, and therefore uses a sequence of teacher models to better transfer knowledge to the student, including RCO (Jin et al., 2019) and TAKD (Mirzadeh et al., 2020). However, none of these works considers the relationships between different samples (e.g., the correlation between negative pairs), while our ACKD uses the relationships among samples as explicit supervision to learn more discriminative features.
There are also knowledge distillation approaches (Tian et al., 2019) that utilize the relations between different samples when learning the student, which are more related to our ACKD framework. For example, (Tung and Mori, 2019) proposed to use the similarity of features from different samples as the knowledge to train the student. (Park et al., 2019) and (Yim et al., 2017) use the mutual relations of different samples as the knowledge for distillation. However, these methods only use the student to mimic the sample relations in the teacher, which still lacks explicit supervision for the student to learn discriminative features. In contrast, our ACKD framework uses the newly proposed A-CDL to explicitly push the features of negative pairs far away from each other. Moreover, these methods do not consider the learning of hard samples in natural language processing tasks. In our ACKD, we use the SAR strategy to pay more attention to hard samples.
Knowledge distillation for BERT. Many methods have also been proposed for compressing BERT (Devlin et al., 2018; Sanh et al., 2019; Zhou et al., 2022; Haidar et al., 2022; Jafari et al., 2021; Passban et al., 2021). For example, patient knowledge distillation (Sun et al., 2019) proposed to use intermediate features as supervision to train a small student BERT. TinyBERT (Jiao et al., 2019) uses a two-stage distillation strategy for BERT compression. Although these methods can compress BERT for efficient inference, they do not use explicit supervision for learning discriminative student features. While (Fu et al., 2021) also uses a contrastive loss for BERT distillation, it does not use a SAR strategy and ignores sample difficulty. (Sun et al., 2020) proposed the CoDIR method to capture structural knowledge in the intermediate layers. Unlike our ACKD framework, these approaches do not consider paying more attention to hard samples.

Adaptive Contrastive Knowledge Distillation
In this section, we introduce our adaptive contrastive knowledge distillation (ACKD) framework. The goal of our ACKD framework is to use a pre-trained teacher model with a large capacity to help the training of a lightweight student model; its overview is shown in Fig. 2. The loss of our ACKD framework when training the student comes from four parts: the cross-entropy loss (CEL), the knowledge distillation loss (KDL), the patient loss (PTL), and our adaptive contrastive distillation loss (A-CDL).

Preliminary
Patient distillation (Sun et al., 2019) was proposed to compress BERT. Given a training dataset with N samples, D = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)}, the student network can be trained using the following loss function:

L = L_{ce} + \alpha L_{kd} + \beta L_{pt}, \quad (1)

where

L_{ce} = \sum_{i=1}^{N} CE(S(x_i; \theta_S), y_i), \quad
L_{kd} = \sum_{i=1}^{N} ST(T(x_i; \theta_T), S(x_i; \theta_S)), \quad
L_{pt} = \sum_{i=1}^{N} \sum_{m=1}^{M} MSE(z_i^{T,m}, z_i^{S,m}).

L_{ce} is the task-specific loss, and CE(·, ·) is the corresponding loss function, for which cross-entropy is commonly adopted in classification tasks. L_{kd} is the knowledge distillation loss, and ST(·, ·) denotes the corresponding loss function, for which the Kullback-Leibler divergence between the output probability distributions of the teacher and student is commonly adopted. L_{pt} is the patient loss introduced in (Sun et al., 2019), and MSE(·, ·) denotes the mean square error function. T and S are the teacher and student networks, and their parameters are denoted as θ_T and θ_S, respectively. z_i^{T,m} and z_i^{S,m} denote the hidden state features of the i-th sample at the m-th paired layers in the teacher and the student, respectively, when calculating the patient loss. M is the number of layers at which the patient loss is inserted. α and β are hyperparameters that control the trade-off between the different terms. The losses L_{ce}, L_{kd}, and L_{pt} correspond to the CEL, KDL, and PTL in Fig. 2, respectively.

Figure 2: Overview of our ACKD framework for BERT compression, which consists of losses from four parts: the cross-entropy loss (CEL), the knowledge distillation loss (KDL), the patient loss (PTL), and the adaptive contrastive distillation loss (A-CDL). In this figure, the teacher and student have L_T and L_S layers, respectively. Our A-CDL aims to push negative pairs far away from each other; it is calculated based on the hidden state features and the predicted discrimination abilities of different samples. Dynamic feature storage is used to increase the diversity of samples.
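To make Eq. (1) concrete, below is a minimal PyTorch sketch of this loss, assuming the teacher and student forward passes expose their logits and the hidden state features at the M paired layers. All function and argument names here are illustrative rather than taken from our implementation, and normalizing the hidden states inside the patient loss is an assumption following common practice.

```python
import torch.nn.functional as F

def pkd_loss(student_logits, teacher_logits, labels,
             student_hiddens, teacher_hiddens,
             alpha=0.5, beta=40.0, temperature=1.0):
    """Sketch of Eq. (1): L = L_ce + alpha * L_kd + beta * L_pt.

    student_hiddens / teacher_hiddens are lists of M tensors of shape
    (batch, hidden) holding z_i^{S,m} / z_i^{T,m} at the M paired layers.
    """
    # Task-specific loss L_ce (cross-entropy for classification).
    l_ce = F.cross_entropy(student_logits, labels)

    # Knowledge distillation loss L_kd: KL divergence between the
    # teacher and student output distributions.
    l_kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    # Patient loss L_pt: MSE between hidden states at each paired layer
    # (normalizing before the MSE is an assumption, following common practice).
    l_pt = sum(
        F.mse_loss(F.normalize(zs, dim=-1), F.normalize(zt, dim=-1))
        for zs, zt in zip(student_hiddens, teacher_hiddens)
    )
    return l_ce + alpha * l_kd + beta * l_pt
```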

Contrastive Distillation Loss
Although the loss in Eq. (1) can transfer knowledge from the teacher to the student, it lacks explicit supervision for learning discriminative student features. Namely, it only provides supervision to pull the features of the same sample in the teacher and student close to each other, while lacking supervision to push the features of different classes far away from each other for more discriminative feature learning (Harwood et al., 2017; Wu et al., 2017; Suh et al., 2019). To this end, we first design our contrastive distillation loss (CDL) in the ACKD framework.
As our CDL can be introduced at different layers, below we focus on the m-th paired layer and omit the layer index for clearer presentation. For example, we use z_i^T and z_i^S to denote the hidden state features of the i-th sample at this layer in the teacher and student, respectively. The CDL can be written as follows:

L_{cd} = -\sum_{i=1}^{N} \log \frac{POS}{POS + NEG}, \quad (2)

where

POS = \exp(cosine(z_i^S, z_i^T)), \quad NEG = \sum_{z_j^T \in N_i} \exp(cosine(z_i^S, z_j^T)).

Here, cosine(·, ·) denotes the cosine similarity, and N_i denotes the set containing the teacher hidden state features of the samples whose class differs from that of the i-th sample (i.e., its negative pairs).
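For illustration, a minimal PyTorch sketch of Eq. (2) for one paired layer is given below, assuming the hidden state features of a batch are available as matrices and that negative pairs are identified from the ground truth labels; the names are illustrative.

```python
import torch
import torch.nn.functional as F

def contrastive_distillation_loss(z_s, z_t, labels):
    """Sketch of the CDL in Eq. (2) for one paired layer.

    z_s, z_t: (batch, hidden) student / teacher hidden state features.
    labels:   (batch,) class labels used to identify negative pairs N_i.
    """
    z_s = F.normalize(z_s, dim=-1)
    z_t = F.normalize(z_t, dim=-1)
    # sim[i, j] = cosine(z_i^S, z_j^T): dot products of unit vectors.
    sim = z_s @ z_t.t()

    pos = torch.exp(sim.diagonal())                        # POS per sample
    neg_mask = labels.unsqueeze(0) != labels.unsqueeze(1)  # True where j is in N_i
    neg = (torch.exp(sim) * neg_mask).sum(dim=1)           # NEG per sample

    return -torch.log(pos / (pos + neg)).sum()
```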

Sample Adaptive Reweighting
As mentioned in Sec. 1, similar sentences may have completely different meanings, which makes such samples hard to distinguish. To this end, we propose our sample adaptive reweighting (SAR) strategy to adaptively pay more attention to these hard samples. Specifically, we use a predictor network to predict the discrimination ability of each sample based on its learned features, and incorporate the predicted discrimination ability into our CDL to form the adaptive contrastive distillation loss (A-CDL). Formally, the A-CDL can be written as follows:

L_{acd} = -\sum_{i=1}^{N} \frac{1}{w_i} \log \frac{POS}{POS + NEG}, \quad (3)

where

w_i = Sigmoid(P(z_i^S; \theta_p)).

Here, w_i is the predicted discrimination ability of the i-th sample. P(·, ·) is the function of the predictor, which is implemented by a neural network, and θ_p is the learnable parameter of the predictor. Sigmoid(·) is the sigmoid function, which ensures the predicted discrimination abilities are positive. The other notations are the same as before. As all operations in this process are differentiable, we can jointly train the predictor with the student network during distillation. In this way, we adaptively assign a higher weight 1/w_i to the samples with less discriminative features and form the adaptive contrastive distillation loss, which corresponds to the A-CDL in Fig. 2. Note that our predictor is implemented by a simple neural network, so the extra computation it introduces is negligible compared with that required by the gradient calculation.
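Below is a sketch of how the predictor and the reweighting in Eq. (3) could be implemented in PyTorch. The paper only specifies that the predictor is a two-layer neural network followed by a sigmoid, so the hidden width and the layer types here are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiscriminationPredictor(nn.Module):
    """Two-layer predictor P(.; theta_p); the hidden width is an assumption."""

    def __init__(self, feat_dim, hidden_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, z_s):
        # The sigmoid keeps w_i positive, as required for the 1 / w_i weights.
        return torch.sigmoid(self.net(z_s)).squeeze(-1)

def adaptive_cdl(z_s, z_t, labels, predictor, eps=1e-6):
    """Sketch of the A-CDL in Eq. (3): per-sample CDL terms weighted by 1 / w_i."""
    w = predictor(z_s)                                     # (batch,)
    z_s_n = F.normalize(z_s, dim=-1)
    z_t_n = F.normalize(z_t, dim=-1)
    sim = z_s_n @ z_t_n.t()
    pos = torch.exp(sim.diagonal())
    neg_mask = labels.unsqueeze(0) != labels.unsqueeze(1)
    neg = (torch.exp(sim) * neg_mask).sum(dim=1)
    per_sample = -torch.log(pos / (pos + neg))
    # Hard samples (small w_i) receive larger weights 1 / w_i.
    return (per_sample / (w + eps)).sum()
```

Because the predictor's output enters the loss directly, its parameters receive gradients from the same backward pass that trains the student, which is the joint training described above.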

Overall Loss Function
As our A-CDL can be introduced at different paired layers of the teacher and student networks, for clearer presentation we additionally use the superscript m to denote the corresponding symbols at the m-th paired layers where A-CDL is inserted. The loss function for training the student network in our ACKD framework can then be written as:

L = L_{ce} + \alpha L_{kd} + \beta L_{pt} + \gamma \sum_{m=1}^{M} L_{acd}^{m}. \quad (4)
α, β, and γ are the hyperparameters that control the importance of the different terms. L_{ce}, L_{kd}, and L_{pt} are the cross-entropy loss, the knowledge distillation loss, and the patient loss introduced in Eq. (1), respectively. L_{acd} is our newly proposed adaptive contrastive distillation loss introduced in Eq. (3). Other notations are the same as before. By using the loss in Eq. (4), we apply explicit supervision to push the features of negative pairs in the student far away from each other while taking the discrimination abilities of the samples into consideration. In this way, we construct our ACKD framework for BERT compression.
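Combining the terms in Eq. (4) is then a simple weighted sum, as in the following sketch (each individual term is assumed to be computed as in the earlier sketches):

```python
def ackd_loss(l_ce, l_kd, l_pt, l_acd_per_layer, alpha, beta, gamma):
    # Eq. (4): l_acd_per_layer holds the A-CDL value at each of the
    # M paired layers where it is inserted.
    return l_ce + alpha * l_kd + beta * l_pt + gamma * sum(l_acd_per_layer)
```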

Dynamic Feature Storage
When introducing the A-CDL into existing knowledge distillation methods to construct our ACKD framework, another issue arises: A-CDL requires large sample diversity, which existing knowledge distillation approaches do not need, making the construction of our ACKD framework non-trivial. Specifically, the term NEG is calculated based on the features of different samples. Due to the property of current deep learning frameworks, features are released after the computation of each mini-batch. Therefore, we can only calculate NEG based on the samples within one mini-batch, so the feature of the i-th sample can only be pushed away from those of a small portion of its negative pairs, which causes an inaccurate optimization direction. Inspired by (He et al., 2020), we construct a dynamic feature storage to increase the sample diversity. Specifically, after the computation of each batch, we store the features of this batch in the storage for the NEG calculation. The labels of these samples are also stored to identify the samples in N_i. As the BERT model processes a sequence of tokens in parallel, the feature dimension is relatively large, which increases the memory burden on the GPU. Therefore, to save memory, we only store the features of the layer at which A-CDL is inserted. Once the storage is full, we update it with a first-in-first-out strategy. In our implementation, we set the storage size to 1000. In this way, we increase the sample diversity when calculating NEG.
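A minimal sketch of such a dynamic feature storage is shown below, assuming stored features are detached from the computation graph (they only serve as extra negatives) and using the first-in-first-out update described above; the exact bookkeeping in our implementation may differ.

```python
import torch

class DynamicFeatureStorage:
    """FIFO storage (size 1000 in our implementation) of teacher features
    and labels used to enlarge the negative pool for NEG; a sketch."""

    def __init__(self, size=1000, feat_dim=768):
        self.size = size
        self.feats = torch.zeros(size, feat_dim)
        self.labels = torch.full((size,), -1, dtype=torch.long)  # -1 = empty slot
        self.ptr = 0

    @torch.no_grad()
    def enqueue(self, feats, labels):
        # First in, first out: overwrite the oldest entries once full.
        # Stored features are detached; they act as constant negatives.
        n = feats.size(0)
        idx = (self.ptr + torch.arange(n)) % self.size
        self.feats[idx] = feats.detach().cpu()
        self.labels[idx] = labels.detach().cpu()
        self.ptr = (self.ptr + n) % self.size

    def stored(self):
        # Return all filled slots; the per-sample negative sets N_i are
        # then selected by comparing labels inside the loss.
        valid = self.labels >= 0
        return self.feats[valid], self.labels[valid]
```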

Discussion
The design concept of our A-CDL is as follows.
In the distillation process, the loss L_{acd} is minimized. To achieve this, the value inside the −log(·) function is maximized. So, during training, the numerator POS is increased, which pulls the features of the same sample in the teacher and student close to each other. At the same time, the denominator term NEG is decreased, which pushes the feature of the i-th sample in the student far away from the features of the samples from different classes (the j-th samples) in the teacher. Moreover, by using the discrimination ability 1/w_i, we assign higher weights to the samples with less discriminative features. In this way, we introduce explicit supervision that takes sample discrimination abilities into consideration to learn more discriminative student features.
From another point of view, our A-CDL can also be viewed as a loss that "eliminates" the influence of incorrect predictions from the teacher when learning the student. Specifically, as shown in Fig. 1(b), if the green sample is close to the blue one and is misclassified by the teacher, traditional knowledge distillation methods will not be aware of this misclassification. So the green sample in the student will be "attracted" by that in the teacher (black arrow), causing misclassification in the student as well. In contrast, in Eq. (3), the negative pair set N_i used when calculating NEG is obtained from the ground truth labels. Therefore, as shown in Fig. 1(c), even though the green sample is misclassified by the teacher, the green sample in the student will be "repelled" by the blue sample in the teacher (red arrow). Although the cross-entropy loss for the student is also based on the ground truth labels, its optimization direction is affected by the incorrect teacher prediction. So our A-CDL can "eliminate" the influence of incorrect teacher predictions to some extent.

Experiments
In this section, we perform comprehensive experiments and extensive ablation studies.
Following many works (Sun et al., 2019; Zhou et al., 2022), we report the results on the matched (MNLI-m) and mismatched (MNLI-mm) sections of MNLI. For MRPC and QQP, we report both F1 and accuracy. For STS-B, we report the Pearson and Spearman correlations. For CoLA, we report Matthew's correlation. We use accuracy as the metric for the other datasets.

Implementation Details
We implement our ACKD framework based on PyTorch. We follow previous works (Sun et al., 2019; Zhou et al., 2022) and evaluate our ACKD under the task-specific setting, in which the teacher network is first fine-tuned on the downstream tasks and the student network is then trained on the downstream tasks during distillation. Following (Sun et al., 2019), we use the BERT-Base model as the teacher network, and BERT with 3 and 6 layers as the student models (denoted as BERT_3 and BERT_6, respectively). The number of hidden states is set to 768 in both teacher and student networks. We follow (Sun et al., 2019) in assuming that the lower layers of the teacher network also contain important information that should be passed to the student. Therefore, we adopt the "skip" strategy from (Sun et al., 2019) to insert our A-CDL, which brings stronger supervision.
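For illustration, the "skip" pairing can be sketched as follows; the exact offset convention varies slightly across implementations of (Sun et al., 2019), so treat this as a sketch of the idea rather than the precise mapping we use.

```python
def skip_layer_pairs(num_teacher_layers, num_student_layers):
    """Skip pairing: student layer m mimics every (L_T / L_S)-th teacher
    layer, e.g. a 3-layer student and a 12-layer teacher give
    [(1, 4), (2, 8), (3, 12)]."""
    stride = num_teacher_layers // num_student_layers
    return [(m, m * stride) for m in range(1, num_student_layers + 1)]
```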
We first fine-tune the pre-trained BERT-Base model on the downstream tasks to obtain the corresponding teacher models. The maximum sequence length is set to 128, and the AdamW (Loshchilov and Hutter, 2018) optimizer is adopted. We set the initial learning rate and batch size to 2e-5 and 8, respectively. The number of training epochs ranges from 2 to 4 for the different downstream tasks. Then, we train our student network using our ACKD framework. The discrimination predictor that generates w_i in Eq. (3) is implemented by a two-layer neural network. The size of the dynamic feature storage is set to 1000. Following (Sun et al., 2019; Zhou et al., 2022), we perform a hyperparameter search over the student learning rate in {1e-5, 2e-5, 5e-5}, the batch size in {8, 16, 32}, the hyperparameter α in {0.1, 0.3, 0.5}, β in {20, 40, 60}, and γ in {5e-4, 5e-3, 5e-2}. The other hyperparameters are the same as those used when training the teacher network.
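This hyperparameter search amounts to a small grid sweep, sketched below; train_and_eval is a hypothetical helper standing in for one distillation run followed by evaluation on the dev set.

```python
from itertools import product

# Search space for the student (values from the paper).
grid = {
    "lr":    [1e-5, 2e-5, 5e-5],
    "batch": [8, 16, 32],
    "alpha": [0.1, 0.3, 0.5],
    "beta":  [20, 40, 60],
    "gamma": [5e-4, 5e-3, 5e-2],
}

best = None
for lr, bs, a, b, g in product(*grid.values()):
    # train_and_eval is a hypothetical helper: one distillation run
    # followed by evaluation on the dev set.
    score = train_and_eval(lr=lr, batch_size=bs, alpha=a, beta=b, gamma=g)
    if best is None or score > best[0]:
        best = (score, {"lr": lr, "batch": bs, "alpha": a, "beta": b, "gamma": g})
```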
The results are shown in Table 2, from which we make the following observations: (1) Our ACKD framework outperforms the other baseline methods under most settings when using BERT_3 and BERT_6 as the students, which demonstrates the effectiveness of the proposed ACKD framework. Specifically, when using BERT_3 as the student, our ACKD framework surpasses the other baseline methods by more than 2.9% on CoLA. (2) When using BERT_3 as the student, our ACKD framework achieves a higher performance gain. One possible explanation is that the performance of the distilled BERT_6 is already close to that of the teacher network BERT-Base, which becomes the bottleneck for further improvement. Also, BERT_3 has less knowledge than BERT_6, so our A-CDL, as new knowledge, can bring more information gain for BERT_3 and thus larger performance improvements.

Ablation Study
In this section, we perform extensive ablation studies. We use BERT-Base as the teacher network and BERT_3 as the student network, and conduct the experiments on QNLI (Rajpurkar et al., 2016).
Effectiveness of L_{acd} in Eq. (4). To investigate the effectiveness of the A-CDL, we remove L_{acd} from Eq. (4) and conduct the distillation. The result is denoted as "w/o L_acd" in Fig. 3. Our ACKD method outperforms this alternative by a large margin, demonstrating the effectiveness of our A-CDL as explicit supervision to push the student features of negative pairs far away from each other.

Effectiveness of our sample adaptive reweighting strategy. To investigate the effectiveness of our SAR strategy, we remove the 1/w_i term from Eq. (3) and conduct the distillation; in this case, we use the CDL instead of the A-CDL. The result is denoted as "w/o SAR" in Fig. 3. Our ACKD approach performs better than this alternative, which demonstrates the effectiveness of the SAR strategy in paying more attention to less discriminative samples.
Effectiveness of dynamic feature storage. We investigate the effectiveness of using dynamic feature storage (DFS) in our ACKD framework. We remove the DFS and denote the result as "w/o DFS" in Fig. 3. Our ACKD framework performs better than "w/o DFS", demonstrating the effectiveness of using dynamic feature storage.
Effectiveness of L_{kd} and L_{pt} in Eq. (4). We also report the results when removing L_{kd} and L_{pt} from Eq. (4), denoted as "w/o L_kd" and "w/o L_pt" in Fig. 3, respectively. From the results, we observe: (1) Our ACKD framework performs better than both "w/o L_kd" and "w/o L_pt", which suggests that it is beneficial to use both L_{kd} and L_{pt}. (2) The accuracy of "w/o L_pt" is higher than that of "w/o L_kd", which indicates that L_{kd} is more useful than L_{pt} in our ACKD framework when compressing BERT.

Algorithm Analysis
In this section, we again use BERT-Base as the teacher and BERT_3 as the student, and perform the analysis experiments on QNLI (Rajpurkar et al., 2016).
Analysis on the structure of the teacher network. In Table 3, we report the results when using different teacher networks. We observe that our framework can effectively train the student under different teacher network structures.
Analysis on the structure of the predictor. In our ACKD framework, we use a two-layer neural network as the predictor of the discrimination ability of each sample. We also investigate the performance of our ACKD framework with different predictor structures. When using BERT-Base as the teacher and BERT_3 as the student, the accuracy of our ACKD framework with two-, three-, and four-layer predictors is 86.2%, 86.4%, and 86.2% on QNLI, respectively. The performance of our ACKD framework is thus relatively stable across predictor structures.

Visualization
To demonstrate the effectiveness of the proposed A-CDL, we visualize the learned student features without and with our A-CDL. Specifically, Fig. 4 visualizes the student features trained without and with A-CDL (i.e., L_{acd} in Eq. (3)) on QNLI and MRPC using the t-SNE (Van der Maaten and Hinton, 2008) technique. From Fig. 4, we observe that after introducing our A-CDL, the student features from different classes move far away from each other, which demonstrates the effectiveness of our A-CDL.
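A typical way to produce such plots, assuming the pooled student features and labels have been exported as NumPy arrays, is sketched below; the plotting details are illustrative.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(features, labels, title):
    """Project student features (num_samples, hidden) to 2-D and color by
    class, as in Fig. 4; features and labels are NumPy arrays."""
    emb = TSNE(n_components=2, init="pca", random_state=0).fit_transform(features)
    plt.figure(figsize=(4, 4))
    for c in sorted(set(labels)):
        mask = labels == c
        plt.scatter(emb[mask, 0], emb[mask, 1], s=4, label=f"class {c}")
    plt.title(title)
    plt.legend()
    plt.show()
```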

Conclusion
In this paper, we have proposed a new knowledge distillation approach called adaptive contrastive knowledge distillation (ACKD) for BERT compression. We first introduce a novel contrastive distillation loss (CDL) as explicit supervision to learn more discriminative student features. Then, we propose a new strategy called sample adaptive reweighting (SAR) to adaptively pay more attention to hard samples with weaker discrimination abilities. The SAR strategy can be seamlessly incorporated into the CDL to form the adaptive contrastive distillation loss (A-CDL). Based on A-CDL, we construct our ACKD framework, in which dynamic feature storage is used for better sample diversity. Extensive experiments on multiple natural language processing tasks demonstrate the effectiveness of our ACKD framework for BERT compression.

Limitation
One limitation of our framework is that we need to design a rough range for each hyperparameter to search for the best setting. In future work, we will explore strategies to avoid hyperparameter tuning.

Ethical Consideration
Our adaptive contrastive knowledge distillation framework aims to improve the performance of knowledge distillation methods and does not introduce extra ethical concerns compared with other knowledge distillation approaches. Therefore, there are no ethical problems caused by the proposed method.

Figure 1: (a) and (b): Traditional knowledge distillation methods. The student fails to learn discriminative features when sentences from different classes have similar features in the teacher. (c): Our ACKD framework, which uses explicit supervision to push student features from different classes far away from each other.

Figure 3: Performance of our ACKD framework and the alternative methods on QNLI.
Figure 4: The t-SNE visualization of student features trained without ((a) and (c)) and with ((b) and (d)) using our A-CDL.



Table 2: Performance comparison of different methods on the dev set of the GLUE benchmark. CoDIR uses RoBERTa-Base as the teacher, and we report the median performance of this method as copied from the published paper. Bold indicates the best result; underlined indicates the second-best result.

Table 3: Performance of our ACKD framework with different teacher network structures. BERT_l denotes the BERT model with l layers.