Improved Knowledge Distillation for Pre-trained Language Models via Knowledge Selection

Knowledge distillation addresses the problem of transferring knowledge from a teacher model to a student model. In this process, we typically have multiple types of knowledge extracted from the teacher model. The problem is to make full use of them to train the student model. Our preliminary study shows that: (1) not all of the knowledge is necessary for learning a good student model, and (2) knowledge distillation can benefit from certain knowledge at different training steps. In response to these, we propose an actor-critic approach to selecting appropriate knowledge to transfer during the process of knowledge distillation. In addition, we offer a refinement of the training algorithm to ease the computational burden. Experimental results on the GLUE datasets show that our method outperforms several strong knowledge distillation baselines significantly.


Introduction
Pre-trained language models (PLMs) have significantly advanced the state of the art on various natural language processing tasks, such as sentiment analysis (Bataa and Wu, 2019; Baert et al., 2020), text classification (Sun et al., 2019a; Arslan et al., 2021), and question answering (Yang et al., 2019). Despite the remarkable results, PLMs have a large number of parameters, which makes them expensive to deploy (Yang et al., 2019).
To solve this problem, recent works resort to knowledge distillation (KD) (Hinton et al., 2015) to compress and accelerate PLMs (Sun et al., 2019b; Sanh et al., 2019; Li et al., 2021b; Liang et al., 2021). Its key idea is to transfer the knowledge from a large PLM (i.e., the teacher model) into a lightweight model (i.e., the student model) without a significant performance loss. In this process, we typically have multiple types of knowledge extracted from the teacher model, such as response knowledge, feature knowledge, and relation knowledge (Gou et al., 2021). Recent works mainly focus on how the student model can learn the transferred knowledge more efficiently, such as by designing training schemes (Sun et al., 2019b; Mirzadeh et al., 2020; Jafari et al., 2021; Li et al., 2021a) or enriching task-specific data (Jiao et al., 2020; Liang et al., 2021). However, few works consider how to make full use of the multiple types of knowledge in training the student model.
Our preliminary study (see Section 4.1) shows that not all of the knowledge is necessary for learning a good student model when conducting distillation with diverse knowledge. Furthermore, inspired by dynamic KD (Li et al., 2021b), we assume that learning from certain knowledge at different training steps is beneficial to KD. We conduct probing experiments to verify this assumption in Section 4.2. Specifically, we repeat KD 200 times and randomly select knowledge at each training step. We find that the resulting distilled student models differ markedly in performance. For example, the best student model is nearly 10% higher in accuracy than the worst one on the RTE dataset. In addition, it notably exceeds the student model that learns from fixed knowledge. Based on our preliminary study, we draw the following conclusion: the distilled student model could achieve superior performance if it learns appropriate knowledge at each training step. This motivates us to investigate how to select the appropriate knowledge for the student model during the process of KD.
Based on the above findings, we propose an actor-critic approach (Bhatnagar et al., 2009) to selecting appropriate knowledge to transfer at different training steps. This approach uses an actor-critic algorithm to implement a knowledge selection module via long-term reward optimization, which can account for the influence of knowledge selection on future training steps. Furthermore, we develop a multi-phase training approach that divides the distillation process into multiple phases and provides a particular reward at the end of each phase. Compared to the usual actor-critic algorithm, it eases the burden of computing rewards. After that, we perform KD by employing the trained knowledge selection module to select knowledge at each training step.
We test the proposed approach on six GLUE datasets (Wang et al., 2019), which involve question answering, sentiment analysis, and textual entailment. Experimental results show that our method significantly outperforms the vanilla KD method (Hinton et al., 2015) and other competitive KD methods. Notably, the results also show that our BERT6 (6-layer BERT) student model can achieve more than 98.5% of the performance of the teacher BERTBASE while keeping much fewer parameters (∼61%). In addition, we show that, when armed with data augmentation, our approach yields further improvements and is superior to TinyBERT (Jiao et al., 2020), a strong baseline with data augmentation.

Related Work
Knowledge distillation (KD) (Hinton et al., 2015) is widely used to compress and accelerate pre-trained language models (PLMs) (Jiao et al., 2020; Sun et al., 2020; Wang et al., 2020; Li et al., 2021b). Its core idea is to transfer the knowledge from a large PLM (i.e., the teacher model) into a lightweight model (i.e., the student model). Recent works on KD can be classified into three groups. The first group focused on utilizing various knowledge from the teacher model to distill the student model. For example, multiple teacher models are leveraged to provide more diverse and accurate knowledge for the student model (Liu et al., 2019a; Yuan et al., 2021). Sun et al. (2019b) exploited knowledge from intermediate layers of the teacher model during distillation. Moreover, Weight Distillation (Lin et al., 2021) transfers the knowledge in the parameters of the teacher model to the student model. The second group designed effective strategies to help the student model learn from different types of knowledge, such as enriching the task-specific data (Jiao et al., 2020; Liang et al., 2021) and two-stage learning frameworks (Turc et al., 2019; Jiao et al., 2020). The third group, which has attracted less attention, generally explored appropriate data and teacher models for student models in KD. For example, Li et al. (2021b) proposed a data selection strategy that only selects some vital data for the student model according to its competency. Yuan et al. (2021) attempted to adjust the weights of the multiple teacher models during distillation.
Different from these methods, in this paper, we are concerned with how to select appropriate knowledge for student models during the process of KD.To this end, we train an effective knowledge selection model via an actor-critic algorithm and employ it to select appropriate knowledge to transfer at each training step.

Knowledge Distillation
In KD, the student model learns knowledge from the teacher model by mimicking corresponding behaviors. In the distillation process, this mimicking can be implemented by minimizing the following loss function:

L_KD = Σ_{x∈X} L_diff(f_S(x), f_T(x)),

where X is the training dataset, x is an input sample, and f_S(·) and f_T(·) are functions describing the behaviors of the student model and the teacher model, respectively. L_diff(·) is a loss function that evaluates the difference between their behaviors. In the process of KD, because many behaviors (e.g., extracting features) exist in the student and the teacher model, we typically have multiple types of knowledge. Previous works mainly focus on designing elaborate f_S(·), f_T(·), and L_diff(·) to encourage the student model to learn certain fixed knowledge better (Sun et al., 2019b; Jiao et al., 2020; Liang et al., 2021). Unlike them, in this work, we make full use of the multiple types of knowledge and select appropriate knowledge for the student model during the process of KD.
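As a minimal sketch (not the authors' implementation), the generic objective above can be written as follows, using KL divergence on temperature-softened output distributions as one possible choice of L_diff; the temperature value and function names are illustrative assumptions:

```python
import numpy as np

def softmax(z, tau=1.0):
    # temperature-scaled softmax, computed stably
    z = np.asarray(z, dtype=float) / tau
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def l_diff(f_s_x, f_t_x, tau=2.0):
    # one possible L_diff: KL(teacher || student) on softened distributions
    p_t, p_s = softmax(f_t_x, tau), softmax(f_s_x, tau)
    return float(np.sum(p_t * (np.log(p_t) - np.log(p_s))))

def kd_objective(student_logits, teacher_logits, tau=2.0):
    # L_KD = sum over samples x of L_diff(f_S(x), f_T(x))
    return sum(l_diff(s, t, tau) for s, t in zip(student_logits, teacher_logits))
```

When the student's outputs match the teacher's exactly, the objective is zero; any mismatch yields a positive loss.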

Knowledge Types
The student model can learn from various types of knowledge extracted from the teacher model in KD. According to Gou et al. (2021), the types of knowledge are often Response Knowledge (ResK), Feature Knowledge (FeaK), and Relation Knowledge (RelK). In addition, we consider a Finetune Knowledge (FinK) derived from the training dataset when compressing PLMs. An overview of these knowledge types is as follows:
• ResK: ResK is the knowledge carried by the final outputs of the teacher model, which the student model mimics (Hinton et al., 2015).
• FeaK: FeaK is the knowledge in the features (e.g., the [CLS] embeddings) of the teacher model's intermediate layers (Sun et al., 2019b).
• RelK: RelK is the knowledge in the relationships between the features of different layers, e.g., the FSP matrices (Yim et al., 2017). It focuses more on the relationship between model layers.
• FinK: FinK is the knowledge that the student model learns from ground-truth labels by finetuning on the training dataset (Hinton et al., 2015; Devlin et al., 2019). Here, we also consider it as a knowledge type in KD.
In this work, we explore how to select appropriate knowledge from these types to transfer during the process of KD. To ensure that the above types of knowledge are transferred to the student model, we design the loss functions L_ResK, L_FeaK, L_RelK, and L_FinK, respectively. Appendix A presents the design details of these loss functions.

Preliminary Study
Because each type of knowledge carries its own information and character, we assume that each type has a different impact on KD results. Based on this, we conduct preliminary studies to probe the relationship between knowledge and KD.

Different Knowledge for KD
We conduct an experiment using each type of knowledge to distill the student models individually. The results show that not all of the knowledge is necessary for learning a good student model.

Different Knowledge at Training Steps
Inspired by dynamic KD (Li et al., 2021b), we assume that the appropriate knowledge may change dynamically because the student model and the samples keep changing during training. We conduct probing experiments to verify this assumption. We randomly select knowledge for the student model at each training step. Here, we employ this random strategy for the training steps of all epochs (Random-All) and for the training steps of the first epoch only (Random-One). To make the experiments more general, both Random-All and Random-One are repeated 200 times. We report the best and worst performance of the distilled student models, denoted by S_best and S_worst, in Table 2. From the results of Random-All, we can see a conspicuous gap in accuracy between S_best and S_worst. Notably, S_best performs significantly better than S_worst, by 10% on the RTE dataset. We conjecture that S_best corresponds to a student model trained with the appropriate knowledge at most training steps, while S_worst is the opposite. In addition, we notice that S_best achieves 68.4% accuracy, which significantly surpasses all student models trained with a fixed type of knowledge. This indicates that KD can benefit from certain knowledge at different training steps, and motivates us to investigate a method for selecting appropriate knowledge.
For the results of Random-One, there is also a remarkable performance difference between S_best and S_worst, even though only the training steps of the first epoch involve random selection of knowledge. This shows that knowledge selection at the current training step affects the following training steps. Therefore, we should consider this future influence when selecting appropriate knowledge.

Overview
In this work, our goal is to select appropriate knowledge to transfer during the process of KD. Considering that the current selection may affect future training steps, we propose an actor-critic approach, which addresses the knowledge selection problem via long-term reward optimization. Figure 1 gives an overview of our method. Our method involves two stages: training a knowledge selection module (KSM) and distilling the student model with the trained KSM. In the first stage, we use an actor-critic algorithm consisting of an actor and a critic to implement the KSM. The actor outputs an action that selects knowledge based on a given state. The critic network predicts the action value, i.e., the sum of future rewards. In the second stage, we employ the trained KSM to select appropriate knowledge and transfer it to the student model.

Definitions
In this work, we use an actor-critic algorithm to implement the KSM. Here, we first describe some key concepts of the actor-critic algorithm, including state, action, and reward.

State.
At training step t, we use s_t to denote the corresponding state. Here, s_t should comprise sufficient information for selecting appropriate knowledge. To this end, we use the informative [CLS] embeddings of the last layers of the student and the teacher models (Devlin et al., 2019). Specifically, for the i-th sample at training step t, we use c_i^S and c_i^T to denote the [CLS] embeddings of the last layer of the student and the teacher model, respectively. Then we utilize two trainable feature networks, N_fea^S and N_fea^T, to extract a useful feature vector v_i:

v_i = [N_fea^S(c_i^S); N_fea^T(c_i^T)],

where [·;·] denotes concatenation, and both N_fea^S and N_fea^T consist of a 2-layer multi-layer perceptron (MLP) network. We concatenate all extracted feature vectors in a given batch as s_t.
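The state construction can be sketched as follows. This is an illustrative stand-in, not the released code: the feature networks use random weights, the hidden size (256) is an assumption, and only the 768-dimensional input and 8-dimensional output follow the paper's description:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_mlp2(in_dim, hid_dim, out_dim, rng):
    # a 2-layer MLP standing in for a feature network N_fea (random weights here)
    W1 = rng.normal(scale=0.1, size=(in_dim, hid_dim))
    W2 = rng.normal(scale=0.1, size=(hid_dim, out_dim))
    return lambda x: np.tanh(x @ W1) @ W2

# 768 matches BERT_BASE's hidden size; the 8-dim output follows the appendix
n_fea_s = make_mlp2(768, 256, 8, rng)  # student feature network N^S_fea
n_fea_t = make_mlp2(768, 256, 8, rng)  # teacher feature network N^T_fea

def build_state(cls_s, cls_t):
    # cls_s, cls_t: [batch, 768] last-layer [CLS] embeddings of student / teacher
    v = np.concatenate([n_fea_s(cls_s), n_fea_t(cls_t)], axis=-1)  # per-sample v_i
    return v.reshape(-1)  # concatenate the batch's feature vectors into s_t

s_t = build_state(rng.normal(size=(4, 768)), rng.normal(size=(4, 768)))
print(s_t.shape)  # (64,) for a batch of 4: 4 samples x (8 + 8) features
```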
Action. Given the state s_t, an action a_t can be created, which is used to select the appropriate knowledge to transfer at training step t. Here, we present soft and hard actions to select appropriate knowledge in a fine-grained manner. The soft action determines how much to learn from each knowledge type, while the hard action selects one or more knowledge types for training the student model.
Reward. Here we define the immediate reward r_t as the difference in cross-entropy loss on a development set after the student model is trained with the selected knowledge. See Table 6 in the Appendix for a comparison of different rewards.

Knowledge Selection Module
To improve the performance of the distilled model, we need to select the appropriate knowledge to transfer at different training steps. For this purpose, we implement a knowledge selection module via the actor-critic algorithm, which consists of an actor and a critic. Here, we utilize long-term rewards to optimize it so that it considers the influence of knowledge selection on future training steps.

Actor and Critic
Actor. In this work, the actor network μ_θ is composed of a three-layer MLP with four output neurons. It takes a state s_t and outputs the soft action with a Sigmoid function:

a_t = Sigmoid(μ_θ(s_t)),

where a_t = {a_t^1, a_t^2, a_t^3, a_t^4} contains four real values in [0, 1]. We use a_t^m in a_t to denote the percentage of the selected m-th type of knowledge.* At training step t, the student model's loss is calculated as follows:

L_S = Σ_{m=1}^{4} a_t^m · L_m,

where L_m is the loss function of the m-th type of knowledge. In addition, to directly select one or more types of knowledge, we obtain the hard action via a conditional function g(·):

g(a_t^m) = 1 if a_t^m ≥ λ, and 0 otherwise,

where λ ∈ [0, 1] is a threshold value. The student model's loss with the hard action is as follows:

L_S = Σ_{m=1}^{4} g(a_t^m) · L_m.

Here, if a_t^m ≥ λ, the m-th type of knowledge is appropriate for the student model; otherwise, it is not.

* Here, we treat FinK, ResK, FeaK, and RelK as the first, second, third, and fourth types of knowledge.
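The two ways of combining the four knowledge losses can be sketched as below (a hedged illustration with dummy loss values, not the authors' code):

```python
import numpy as np

def soft_loss(a_t, losses):
    # soft action: weight each knowledge loss L_m by a_t^m and sum
    return float(np.dot(a_t, losses))

def hard_loss(a_t, losses, lam=0.5):
    # hard action: g(a_t^m) = 1 if a_t^m >= lambda else 0, then sum selected losses
    g = (np.asarray(a_t) >= lam).astype(float)
    return float(np.dot(g, losses))

# a_t from the actor: one value in [0, 1] per knowledge type (FinK, ResK, FeaK, RelK)
a_t = np.array([0.9, 0.2, 0.6, 0.4])
losses = np.array([1.0, 1.0, 1.0, 1.0])  # L_FinK, L_ResK, L_FeaK, L_RelK (dummy)
print(soft_loss(a_t, losses))        # 2.1
print(hard_loss(a_t, losses, 0.5))   # 2.0 (FinK and FeaK selected)
```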
Critic. The critic network Q_φ is also composed of a three-layer MLP. It computes the action value Q_φ(s_t, a_t) of a given pair (s_t, a_t), where s_t and a_t are concatenated as the input. The action value is an estimate of the sum of rewards earned after the actor takes the action a_t in the state s_t. Here, we define the action value as the long-term reward, i.e., the performance gain of the student model from training step t to the end of training.

Optimization
Actor Optimization. Following the deep deterministic policy gradient algorithm (Silver et al., 2014), we optimize the long-term reward by updating the actor parameters via the sampled policy gradient:

∇_θ J ≈ E[ ∇_a Q_φ(s_t, a) |_{a=μ_θ(s_t)} ∇_θ μ_θ(s_t) ].

Critic Optimization. At each training step, we can only obtain an immediate reward r_t, whereas the critic should provide a long-term reward rather than an immediate one. Therefore, we employ Temporal-Difference learning (Tesauro, 1991) to approximate the actual long-term reward with r_t:

Q̂(s_t, a_t) = r_t + γ Q_φ(s_{t+1}, a_{t+1}),

where γ is a discount factor (Sutton and Barto, 2018). Then we can give the optimization objective of the critic network through the approximate long-term reward:

L_Q = (1/N) Σ_{t=1}^{N} MSE(Q_φ(s_t, a_t), Q̂(s_t, a_t)),

where MSE(·) is the mean squared error loss function, and N is the number of training steps in KD.
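The Temporal-Difference target and the critic objective can be sketched as follows; this is a minimal illustration of the TD update with made-up numbers, not the training code:

```python
def td_target(r_t, q_next, gamma=0.95):
    # Temporal-Difference approximation of the long-term reward:
    # Q_hat(s_t, a_t) = r_t + gamma * Q_phi(s_{t+1}, a_{t+1})
    return r_t + gamma * q_next

def critic_objective(q_pred, rewards, q_next, gamma=0.95):
    # L_Q: mean squared error between predicted action values and TD targets
    n = len(q_pred)
    return sum((q - td_target(r, qn, gamma)) ** 2
               for q, r, qn in zip(q_pred, rewards, q_next)) / n

# two training steps with illustrative values
loss = critic_objective(q_pred=[1.0, 0.5], rewards=[0.1, 0.2],
                        q_next=[0.5, 0.0], gamma=0.9)
print(loss)  # targets are 0.55 and 0.2 -> (0.45**2 + 0.3**2) / 2 = 0.14625
```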
Feature Networks Optimization. When training the KSM, we update the student feature network together with the actor network and the teacher feature network together with the critic network, in order to reduce the instability of training the KSM.

Multi-Phase Training
Because some datasets require a large number of training steps, computing the reward via the loss on the development set can incur considerable costs during training of the KSM. For example, training with 5 epochs on the QQP dataset requires computing the reward about 50,000 times in an episode. To ease this computational burden, we propose a multi-phase training approach for the KSM. Specifically, we divide the complete KD process into multiple phases, where each phase contains k training steps. After the student model is trained in a phase, we treat the cross-entropy loss difference on a development set as the phase reward r_p, which expresses the sum of all rewards in that phase. We can then use a phase optimization objective L_Q^p (Eq. 10) to train the critic network instead of L_Q, where N_p is the number of phases in KD, b(j) and e(j) are the beginning and end training steps of the j-th phase, and r̂_t is an estimated reward computed via Eq. 8. The training procedure is as follows:

for each episode do
    for j = 1 to N_p do
        for t = b(j) to e(j) do
            compute a state s_t via Eq. 2;
            sample an action a_t = μ_θ(s_t);
            utilize the actions in the previous phases to compute r_t^e via Eq. 12 or Eq. 13;
            compute an action value Q_φ(s_t, a_t);
            train M_S via Eq. 4 or Eq. 6;
        end for
        compute r_j^p via the loss difference;
        train the KSM via Eq. 9 and Eq. 10;
    end for
end for
return KSM

With the help of dividing phases, we can effectively relieve the computational burden. In general, conventional training of the actor-critic algorithm needs to compute rewards N times, while our multi-phase training only calculates rewards ⌈N/k⌉ times.
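The savings from dividing the process into phases can be made concrete with a small sketch (the phase size k = 100 below is an illustrative assumption, not a value from the paper):

```python
import math

def reward_computations(num_steps, phase_size=None):
    # conventional training computes a reward at every step (N times);
    # multi-phase training computes one phase reward per phase (ceil(N/k) times)
    if phase_size is None:
        return num_steps
    return math.ceil(num_steps / phase_size)

print(reward_computations(50_000))                  # 50000 rewards per episode
print(reward_computations(50_000, phase_size=100))  # 500 rewards with k = 100
```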

Exploration Reward
To speed up the training of the KSM, we also design an exploration reward r_t^e at training step t, which encourages the actor to take more diverse actions (Tang et al., 2017). For the soft action, we base r_t^e on the similarity between actions (Eq. 12), where Sim(·) is a cosine similarity function and a_l is the l-th action in the previous phase. In addition, we calculate the exploration reward r_t^e for the hard action based on its repetition (Eq. 13), where count(a_t) is the number of times the action a_t was taken in the previous phase, and α is a scale factor. We provide the critic network with an additional optimization objective L_Q^e via r_t^e, computed by Eq. 9.
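The two exploration rewards can be sketched as below. These are hedged stand-ins for Eq. 12 and Eq. 13, whose exact forms are not reproduced here: we penalize cosine similarity to previous-phase soft actions, and let the hard-action reward shrink with the repetition count, scaled by α:

```python
import numpy as np
from collections import Counter

def soft_exploration_reward(a_t, prev_actions):
    # stand-in for Eq. 12: penalize similarity (Sim = cosine) between the
    # current soft action a_t and the actions a_l taken in the previous phase
    def sim(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
    return -np.mean([sim(a_t, a_l) for a_l in prev_actions])

def hard_exploration_reward(a_t, prev_actions, alpha=0.1):
    # stand-in for Eq. 13: reward shrinks with count(a_t), the number of times
    # this exact hard action appeared in the previous phase; alpha is a scale factor
    counts = Counter(tuple(a) for a in prev_actions)
    return alpha / (1 + counts[tuple(a_t)])
```

For example, a hard action repeated twice in the previous phase receives α/3, while a never-seen action receives the full α, nudging the actor toward unexplored selections.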

Knowledge Distillation
Algorithm 1 presents our overall method. In the first stage, we train a KSM with an actor-critic algorithm. As described in Algorithm 2, we first divide the complete KD process into N_p phases in an episode (line 3). At each training step in a phase, we compute a state, an action, and an exploration reward (lines 6-9) and train the student model with the selected knowledge (line 10). Then, we compute the phase reward when the phase ends. Subsequently, we optimize the KSM via the L_Q^e and L_Q^p loss objectives computed by Eq. 9 and Eq. 10, respectively (line 13). This process iterates over episodes until the performance of the student model converges. In the second stage, we adopt the trained KSM to select appropriate knowledge, which is used to distill the student model.

Experiments
6.1 Datasets and Settings

Datasets. Following Sun et al. (2019b), we conduct experiments on six GLUE datasets (Wang et al., 2019): MNLI (Williams et al., 2018), QQP (Chen et al., 2018), QNLI (Rajpurkar et al., 2016), SST-2 (Socher et al., 2013), MRPC (Dolan and Brockett, 2005), and RTE (Bentivogli et al., 2009).

Settings. We use BERTBASE as the teacher model and BERT6 and BERT3 as the student models. The details of the training settings are shown in Appendix B. All experiments are repeated three times, and we report the average results over three runs with different seeds.
In addition, we also design Random-Soft and Random-Hard baselines to evaluate the effectiveness of our method. Random-Soft and Random-Hard denote that the KSM randomly takes a soft and a hard action, respectively. For MRPC and QQP, we report F1/Accuracy, and the "Avg." column reports the average accuracy over the datasets. The results for Dynamic KD are obtained via the Uncertainty-Entropy strategy (Li et al., 2021b); the results of DistilBERT and PKD are taken from Jiao et al. (2020) and Sun et al. (2019b), respectively.

Main Results
We submit our model predictions to the official GLUE evaluation server, and the results are summarized in Table 3. First, compared with all baselines, our method achieves the best results when distilling the student models BERT6 and BERT3 on all datasets. Second, compared with Random-Hard (or Random-Soft), our method achieves significant improvements in F1 and accuracy scores, demonstrating that the KSM outperforms baselines that randomly select knowledge. Third, the BERT6 student model distilled by our method with the soft action significantly outperforms the popular vanilla KD by a margin of at least 1.4%. We attribute this to the fact that KD benefits from selected knowledge at different training steps. Fourth, our method yields a BERT6 student model with ∼61% of the parameters and performance similar to the teacher model BERTBASE. For example, on the QQP dataset, the BERT6 student model trained by our method with the soft action reaches 89.1% accuracy, only 0.1% below the teacher model. In addition, compared with the teacher model BERTBASE, the BERT3 student model has only 41% of the parameters while maintaining 92% of the performance. Furthermore, we investigate the performance gain of the two different actions. From the results, we find that the soft action performs better than the hard action, though both help our method improve over the baselines on most datasets. One potential explanation is that the soft action space is larger than the hard action space, so the soft action is more likely to explore a better knowledge selection.

Ablation Study
We conduct an ablation study to explore the effects of the proposed multi-phase training and exploration reward on accuracy and efficiency. Figure 2 shows the accuracy of student models distilled with the KSM on the RTE development set after removing multi-phase training or the exploration reward. We can see that multi-phase training makes it more likely to train a good KSM, even though rewards are provided fewer times. We conjecture that the underlying reason is that the phase reward may be more accurate and stable than the immediate reward at each training step. In addition, we observe that without the exploration reward, the KSM fails to explore better knowledge selections as quickly.

Discussion
Performance on Data Augmentation. Although the trained KSM can select the appropriate knowledge, small training datasets fail to provide the student model with enough opportunities to learn it (Jiao et al., 2020; Liang et al., 2021). In such cases, we assume that, if armed with data augmentation, the student model could learn the knowledge assigned by the KSM more sufficiently and thus achieve better performance. To this end, we augment the training datasets (i.e., the RTE, MRPC, SST-2, and QNLI datasets) about 20 times via the data augmentation procedure of Jiao et al. (2020). To make a fair comparison, we initialize our student model with the released BERT6† distilled via General Distillation (Jiao et al., 2020). The results show that our method is consistently better than all baselines, demonstrating that data augmentation can significantly improve the student models distilled by our method. In addition, compared to vanilla KD, our method uses the data more efficiently and transfers more knowledge to the student model.
This performance with data augmentation emphasizes that our method is orthogonal to TinyBERT, as we address the knowledge selection problem during KD. It also suggests that our method can be complementary to other KD methods, such as MixKD (Liang et al., 2021) and MiniLM (Wang et al., 2020).

Comparison of Soft Action and Hard Action
To intuitively present the effect of different actions on performance, we compare the soft and hard actions on the RTE dataset. Figure 3 shows the mean accuracy of the student models distilled with different action strategies on the development set. We find that our approach can integrate seamlessly with different strategies. Furthermore, we notice that the hard action achieves faster KSM learning, while the soft action achieves better results. We draw similar observations from the results on other datasets, e.g., on the relatively larger dataset QQP (see Figure 5 in the Appendix).

Performance on Different Student Models
To explore the performance on different student models, we distill student models with different numbers of layers and plot the performance against the baselines (i.e., finetune and vanilla KD) in Figure 4. The results show that our method consistently outperforms the baselines when distilling student models with various layers, indicating that our method adapts well to the distillation of different student models.
A Regularization Perspective on Knowledge Selection. In practice, we consider that knowledge selection can act as a form of regularization that prevents co-adaptation (Grisogono, 2006; Sabiri et al., 2022) in KD, i.e., the situation where the distilled student model depends heavily on a certain behavior of the teacher. If the student model receives inappropriate knowledge from that behavior, its performance can be significantly altered, similar to what happens with overfitting (Hawkins, 2004; Phaisangittisagul, 2016). Knowledge selection, however, allows the student model to learn properly from multiple different behaviors at each training step during KD. Thus, compared to traditional KD methods that distill a student model via single or fixed knowledge, our method can ease the effect of inappropriate knowledge from a single behavior and thereby prevent co-adaptation.
See more discussion in Appendix C.

Conclusion
In this paper, we focus on making full use of multiple types of knowledge in distilling the student model. We have proposed an actor-critic approach to selecting appropriate knowledge to transfer during the process of knowledge distillation. To ease the burden of computing rewards during training, we propose a multi-phase training approach and an exploration reward. Our experiments on the GLUE datasets show that our method significantly outperforms several strong knowledge distillation baselines.

Limitations
In this section, we discuss some limitations of this work:
• We train a model to select appropriate knowledge to transfer during the process of knowledge distillation. Training this model is time-consuming and resource-intensive. For example, on the RTE dataset, with one TITAN V GPU and the distillation epoch set to 5, it takes 40 minutes to train the model.
• It is difficult to scale the proposed approach to a large dataset. The underlying reasons are: (i) more training steps on an oversized dataset make the knowledge selection problem more difficult; and (ii) the larger the dataset, the more complex the state, which may confuse the knowledge selection module.

A Design Details of Learning Each Knowledge Type
A.1 Response Knowledge

For learning the response knowledge, we use the vanilla KD loss function:

L_ResK = Σ_{x∈X} KL(σ(f_T(x)/τ) ∥ σ(f_S(x)/τ)),

where KL(·) is the Kullback-Leibler divergence function, σ(·) is the softmax function, and τ is a temperature hyper-parameter (Hinton et al., 2015).
For the sample x, f S (x) and f T (x) are the final outputs of the student model and the teacher model, respectively.

A.2 Feature Knowledge
For learning the feature knowledge, we utilize the PKD-Skip method (Sun et al., 2019b), where the student model learns the outputs of the assigned teacher model's layers. The corresponding loss function can be defined by:

L_FeaK = Σ_{x∈X} Σ_{i=1}^{L_S} MSE(h_{x,i}^S, h_{x,I(i)}^T),

where I(i) is the assigned index of the teacher layer corresponding to the i-th layer of the student model, L_S is the number of layers of the student model, and h_{x,i}^S and h_{x,i}^T are the [CLS] embeddings of the i-th layer of the student model and the teacher model for the sample x, respectively.
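A minimal sketch of this layer-matched feature loss is below. The mapping `skip_map` is an illustrative PKD-Skip-style assignment for a 6-layer student of a 12-layer teacher (an assumption for the example, not taken from the paper), and plain MSE is used as in the equation above:

```python
import numpy as np

def feak_loss(h_s, h_t, layer_map):
    # h_s: per-layer [CLS] embeddings of the student (length L_S)
    # h_t: per-layer [CLS] embeddings of the teacher
    # layer_map[i] = I(i), the teacher layer assigned to student layer i
    total = 0.0
    for i, j in enumerate(layer_map):
        diff = np.asarray(h_s[i]) - np.asarray(h_t[j])
        total += float(np.mean(diff ** 2))  # MSE between matched [CLS] embeddings
    return total

# illustrative PKD-Skip-style mapping: every other teacher layer
skip_map = [1, 3, 5, 7, 9, 11]
```

When the student's matched embeddings equal the teacher's, the loss is exactly zero.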

A.3 Relation Knowledge
For learning the relation knowledge, we use the flow of solution procedure (FSP) method (Yim et al., 2017). Specifically, we first define the FSP matrix by:

G(h_1, h_2) = h_1 h_2^T,

where h_1 and h_2 are the [CLS] embeddings of two selected layers. Let G^S(·) and G^T(·) denote the FSP matrices of the student model and the teacher model, respectively. Based on these two FSP matrices, we obtain the loss function for learning the relation knowledge:

L_RelK = Σ_{x∈X} MSE(G^S(h_1^S, h_2^S), G^T(h_1^T, h_2^T)),

where MSE(·) is the mean squared error loss function. Here we use the same layer selection strategy for the teacher model as the PKD-Skip method.
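A sketch of this relation loss follows. The outer product is a hedged reading of the FSP matrix for vector-valued [CLS] features (Yim et al. (2017) originally define it over convolutional feature maps), so treat the `fsp` form as an assumption:

```python
import numpy as np

def fsp(h1, h2):
    # FSP-style matrix from two [CLS] embeddings; outer product is used here
    # as a hedged vector analogue of Yim et al.'s feature-map formulation
    return np.outer(h1, h2)

def relk_loss(student_pairs, teacher_pairs):
    # MSE between student and teacher FSP matrices over matched layer pairs
    total = 0.0
    for (s1, s2), (t1, t2) in zip(student_pairs, teacher_pairs):
        total += float(np.mean((fsp(s1, s2) - fsp(t1, t2)) ** 2))
    return total
```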

A.4 Finetune Knowledge
For learning the finetune knowledge, we use the loss function:

L_FinK = − Σ_{x∈X} Σ_{c} 1{c = ŷ} log P(y = c | x),

where 1{·} is the indicator function, y is the predicted label, and ŷ is the ground-truth label of the sample x.
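This is the standard cross-entropy against ground-truth labels; a self-contained sketch (not the authors' code) is:

```python
import numpy as np

def fink_loss(logits, gold_labels):
    # cross-entropy against ground-truth labels: the indicator 1{c = y_hat}
    # picks out the log-probability of the gold class for each sample
    z = np.asarray(logits, dtype=float)
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    idx = np.arange(len(gold_labels))
    return float(-log_probs[idx, gold_labels].sum())
```

A sample whose logits strongly favor the gold label contributes almost nothing to the loss, while a confidently wrong prediction contributes a large penalty.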

B.2 Training the Teacher Model
In this work, we denote the number of layers (i.e., Transformer blocks) as L, the hidden size as H, and the number of self-attention heads as A for BERT. We use the pre-trained language model BERTBASE (L = 12, H = 768, A = 12, Total Parameters = 110M) (Devlin et al., 2019) as the teacher model. We initialize BERTBASE with bert-base-uncased§. The teacher models are trained with random seeds. In addition, the sentence length is 128, and the learning rate is 5e-5. We set the batch size and the number of training epochs to 32 and 5, respectively. Note that it is possible to plug in any large pre-trained model, such as BERTLARGE (Devlin et al., 2019) or RoBERTa (Liu et al., 2019b), into our method.

B.3 Training the Student Model
We primarily report results on two student models: BERT6 (L = 6, H = 768, A = 12, Total Parameters = 67.0M) and BERT3 (L = 3, H = 768, A = 12, Total Parameters = 45.7M). We initialize BERT6 and BERT3 with the bottom 6 and 3 Transformer layers of BERTBASE, respectively. When training the student model, we set the batch size to 32 and the sentence length to 128. We select a well-trained student model by its score among the learning rate set {2e-5, 3e-5, 5e-5}.

C Discussion
Performance on Different Phase Sizes. In Sec. 5.3.3, we divide the whole KD process into multiple phases. A question may then arise as to whether the phase size has an impact on KSM performance. To probe this question, we run our method to distill the student model BERT6 with different phase sizes. As shown in Figure 6, we observe that an excessive phase size can hurt the performance of the trained KSM, which we attribute to the deficiency of phase rewards. In addition, a too-small phase size is not beneficial to our method, because its phase reward is similar to an immediate reward.
Performance on Different Phase Rewards. We investigate the performance of our method when taking the task metric (i.e., F1 or Accuracy) difference as the phase reward. As shown in Table 6, training the KSM with the loss reward has the best performance overall. To explore the reasons for this observation, we further compare the phase rewards of the different reward types within one episode, as shown in Figure 7. The results show that the loss reward is more stable and smooth than the F1 and Accuracy rewards. Based on this, we conclude that this stability and smoothness enable the loss reward to train a better KSM.
Figure 1: An overview of the proposed method. We use an actor-critic approach to design a knowledge selection module (KSM), which aims to select the appropriate knowledge to transfer during KD. Besides the soft action presented in Figure 1(b), we also design the hard action for the actor module (see Section 5.3.1).

Figure 2: Ablation study. We plot the mean accuracy of student models distilled with three different seeds on the RTE development set.

Figure 3: Comparison of the soft and hard actions. Our method achieves faster KSM learning with the hard action and gains better results with the soft action.

Figure 4: Performance of the distilled student model with various layers on the MRPC development set.

Figure 6: Performance of the student models distilled by our method with different phase sizes.

Figure 7: Comparison of the phase rewards of the different reward types within one episode.

Table 3: Results from the GLUE test server. The best and second-best results for each group of student models are in bold and italics, respectively. The numbers under each dataset indicate the size of the corresponding training set.

Table 4: Results of our method with the augmented datasets from the GLUE test server. The results of TinyBERT are taken from Jiao et al. (2020).

Table 5: Hyper-parameters for training the KSM.

We implement the KSM described in Sec. 5 based on PyTorch‡. Specifically, we use a 2-layer MLP for each of the student and teacher feature networks; the input layer size and output layer size are the [CLS] embedding size and 8, respectively. Both the actor network and the critic network are composed of a 3-layer MLP. The hidden size and output layer size in the actor are 256 and 4; the hidden size and output layer size in the critic are 256 and 1. In addition, there are six hyper-parameters for training the KSM; their candidate values are introduced in Table 5. When training the KSM, we select the optimal hyper-parameter setup by the best development set accuracy.

Table 6: Results of the different rewards for training the KSM.

Figure 5: Comparison of the soft and hard actions on the QQP development set. For the QQP dataset, we can train a well-performing KSM across only two episodes. This is because a large dataset provides more opportunities for updating the KSM in each episode.