Meta-KD: A Meta Knowledge Distillation Framework for Language Model Compression across Domains

Pre-trained language models have been applied to various NLP tasks with considerable performance gains. However, the large model sizes, together with the long inference time, limit the deployment of such models in real-time applications. One line of model compression approaches uses knowledge distillation to distill large teacher models into small student models. Most of these studies focus on a single domain only, ignoring the transferable knowledge from other domains. We notice that training a teacher with transferable knowledge digested across domains can achieve better generalization capability to help knowledge distillation. Hence we propose a Meta-Knowledge Distillation (Meta-KD) framework to build a meta-teacher model that captures transferable knowledge across domains and passes such knowledge to students. Specifically, we explicitly force the meta-teacher to capture transferable knowledge at both the instance level and the feature level from multiple domains, and then propose a meta-distillation algorithm to learn single-domain student models with guidance from the meta-teacher. Experiments on public multi-domain NLP tasks show the effectiveness and superiority of the proposed Meta-KD framework. Further, we also demonstrate the capability of Meta-KD in settings where the training data is scarce.


Introduction
Pre-trained Language Models (PLMs) such as BERT (Devlin et al., 2019) and XLNet have achieved significant success with the two-stage "pre-training and fine-tuning" process. Despite the performance gains achieved in various NLP tasks, the large number of model parameters and the long inference time have become the bottleneck for PLMs to be deployed in real-time applications, especially on mobile devices (Jiao et al., 2019; Sun et al., 2020; Iandola et al., 2020). Thus, there is an increasing need to reduce the model size and the computational overhead of PLMs while keeping the prediction accuracy. Knowledge Distillation (KD) (Hinton et al., 2015) is one of the promising ways to distill the knowledge from a large "teacher" model into a small "student" model. Recent studies show that KD can be applied to compress PLMs with acceptable performance loss (Sanh et al., 2019; Sun et al., 2019b; Jiao et al., 2019; Turc et al., 2019; Chen et al., 2020a). However, these methods mainly focus on single-domain KD. Hence, student models can only learn from their in-domain teachers, paying little attention to acquiring knowledge from other domains. It has been shown to be beneficial to consider cross-domain information for KD, by either training a teacher using cross-domain data or training multiple teachers from multiple domains (You et al., 2017; Peng et al., 2020). Consider the academic scenario in Figure 1 (caption: "A physics student may learn physics equations better with a powerful all-purpose teacher."). A typical way for a physics student to learn physics equations is to directly learn from his/her physics teacher. If a math teacher also teaches him/her basic knowledge of equations, the student can obtain a better understanding of physics equations. This "knowledge transfer" technique in KD has been proved efficient only when the two domains are close to each other.

(* H. Pan and C. Wang contributed equally to this work. † M. Qiu is the corresponding author.)
In reality, however, this is highly risky, as teachers from other domains may pass non-transferable knowledge to the student model, which is irrelevant to the current domain and hence harms the overall performance (Tan et al., 2017). Besides, current studies find that multi-task fine-tuning of BERT does not necessarily yield better performance across all tasks (Sun et al., 2019a).
To address these issues, we leverage the idea of meta-learning to capture transferable knowledge across domains, as recent studies have shown that meta-learning can improve the model generalization ability across domains (Finn et al., 2017;Javed and White, 2019;Yin, 2020;Ye et al., 2020). We further notice that meta-knowledge is also helpful for cross-domain KD. Re-consider the example in Figure 1. If we have an "all-purpose teacher" (i.e., the meta-teacher) who has the knowledge of both physics principles and mathematical equations (i.e., the general knowledge of the two courses), the student may learn physics equations better with the teacher, compared to the other two cases. Hence, it is necessary to train an "all-purpose teacher" model for domain-specific student models to learn.
In this paper, we propose the Meta-Knowledge Distillation (Meta-KD) framework, which facilitates cross-domain KD. Generally speaking, Meta-KD consists of two parts: meta-teacher learning and meta-distillation. Different from the K-way N-shot problems addressed in traditional meta-learning (Vanschoren, 2018), we propose to train a "meta-learner" as the meta-teacher, which learns the transferable knowledge across domains so that it can fit new domains easily. The meta-teacher is jointly trained on multi-domain datasets to acquire instance-level and feature-level meta-knowledge. For each domain, the student model learns to solve the task over a domain-specific dataset with guidance from the meta-teacher. To improve the student's distillation ability, the meta-distillation module minimizes the distillation losses from intermediate layers, output layers, and transferable knowledge, combined with domain-expertise weighting techniques.
To verify the effectiveness of Meta-KD, we conduct extensive experiments on two NLP tasks across multiple domains, namely natural language inference (Williams et al., 2018) and sentiment analysis (Blitzer et al., 2007). Experimental results show the effectiveness and superiority of the proposed Meta-KD framework. Moreover, we find our method performs especially well when i) the in-domain dataset is very small or ii) there is no in-domain dataset during the training of the meta-teacher. In summary, the contributions of this study can be summarized as follows:
• To the best of our knowledge, this work is the first to explore the idea of meta-teacher learning for PLM compression across domains.
• We propose the Meta-KD framework to address the task. In Meta-KD, the meta-teacher digests transferable knowledge across domains, and selectively passes the knowledge to student models with different domain expertise degrees.
• We conduct extensive experiments to demonstrate the superiority of Meta-KD and also explore the capability of this framework in the settings where the training data is scarce.
The rest of this paper is organized as follows. Section 2 describes the related work. The detailed techniques of the Meta-KD framework are presented in Section 3. The experiments are reported in Section 4. Finally, we conclude our work and discuss future work in Section 5.

Related Work
Our study is closely related to the following three lines of research, introduced below.

Knowledge Distillation (KD)
KD was first proposed by Hinton et al. (2015), aiming to transfer knowledge from an ensemble or a large model into a smaller, distilled model. Most KD methods focus on utilizing either the dark knowledge, i.e., predicted outputs (Hinton et al., 2015; Chen et al., 2020b; Furlanello et al., 2018; You et al., 2017), hints, i.e., the intermediate representations (Romero et al., 2015; Yim et al., 2017; You et al., 2017), or the relations between layers (Yim et al., 2017; Tarvainen and Valpola, 2017) of teacher models. You et al. (2017) also find that multiple teacher networks together can provide comprehensive guidance that is beneficial for training the student network. Ruder et al. (2017) show that multiple expert teachers can improve the performance of sentiment analysis on unseen domains. Tan et al. (2019) apply the multiple-teachers framework in KD to build a state-of-the-art multilingual machine translation system. Feng et al. (2021) consider building a model to automatically augment data for KD. Our work is one of the first attempts to learn a meta-teacher model that digests transferable knowledge from multiple domains to benefit KD on the target domain.

PLM Compression
Due to the massive number of parameters in PLMs, it is highly necessary to compress PLMs for application deployment. Previous approaches for compressing PLMs such as BERT (Devlin et al., 2019) include KD (Hinton et al., 2015), parameter sharing (Ullrich et al., 2017), pruning (Han et al., 2015) and quantization (Gong et al., 2014). In this work, we mainly focus on KD for PLMs. In the literature, Tang et al. (2019) distill BERT into BiLSTM networks to achieve comparable results with ELMo (Peters et al., 2018). Chen et al. (2021) study cross-domain KD to facilitate cross-domain knowledge transfer. Another line of work uses dual distillation to reduce the vocabulary size and the embedding size. DistilBERT (Sanh et al., 2019) applies a KD loss in the pre-training stage, while BERT-PKD (Sun et al., 2019b) distills BERT into shallow Transformers in the fine-tuning stage. TinyBERT (Jiao et al., 2019) further distills BERT with a two-stage KD process for hidden attention matrices and embedding matrices. AdaBERT (Chen et al., 2020a) uses neural architecture search to adaptively find small architectures. Our work improves the prediction accuracy of compressed PLMs by leveraging cross-domain knowledge, which is complementary to previous works.

Transfer Learning and Meta-learning
Transfer Learning (TL) has been shown to improve the performance on the target domain by leveraging knowledge from related source domains (Pan and Yang, 2010; Mou et al., 2016; Liu et al., 2017). In most NLP tasks, the "shared-private" architecture is applied to learn domain-specific representations and domain-invariant features (Mou et al., 2016; Liu et al., 2017; Chen et al., 2018). Compared to TL, the goal of meta-learning is to train meta-learners that can adapt to a variety of different tasks with little training data (Vanschoren, 2018). The majority of meta-learning methods include metric-based (Snell et al., 2017; Pan et al., 2019), model-based (Santoro et al., 2016; Bartunov et al., 2020) and model-agnostic approaches (Finn et al., 2017, 2018; Vuorio et al., 2019). Meta-learning can also be applied to KD in some computer vision tasks (Lopes et al., 2017; Jang et al., 2019; Bai et al., 2020; Li et al., 2020). For example, Lopes et al. (2017) record per-layer metadata of the teacher model to reconstruct a training set, and then adopt a standard training procedure to obtain the student model. In our work, we use instance-based and feature-based meta-knowledge across domains for the KD process.

The Meta-KD Framework
In this section, we formally introduce the Meta-KD framework. We begin with a brief overview of Meta-KD. After that, the techniques are elaborated.

An Overview of Meta-KD
Take text classification as an example. Assume there are K training sets D_1, ..., D_K, corresponding to K domains. The k-th dataset D_k consists of labeled samples X_k^(i) with class labels y_k^(i). Let M_k be the large PLM fine-tuned on D_k. Given the K datasets, the goal of Meta-KD is to obtain K student models S_1, ..., S_K that are small in size but have performance similar to that of the K large PLMs, i.e., M_1, ..., M_K.
In general, the Meta-KD framework can be divided into the following two stages:
• Meta-teacher Learning: Learn a meta-teacher M over the union of all domains D_1 ∪ · · · ∪ D_K. The model digests transferable knowledge from each domain and generalizes better while supervising domain-specific students.
• Meta-distillation: Learn K in-domain students S 1 , · · · , S K that perform well in their respective domains, given only in-domain data D k and the meta-teacher M as input.
During the learning process of the meta-teacher, we consider both instance-level and feature-level transferable knowledge. Inspired by prototype-based meta-learning (Snell et al., 2017; Pan et al., 2019), the meta-teacher model should memorize more information about prototypes. Hence, we compute sample-wise prototype scores as the instance-level transferable knowledge. The loss of the meta-teacher is defined as the sum of the classification losses across all K domains with prototype-based, instance-specific weighting. Besides, it also learns feature-level transferable knowledge by adding a domain-adversarial loss as an auxiliary loss. Through these steps, the meta-teacher is more generalized and digests transferable knowledge before supervising student models.
For meta-distillation, each sample is weighted by a domain-expertise score that reflects the meta-teacher's capability on this sample. The students also learn the transferable knowledge from the meta-teacher. The overall meta-distillation loss is a combination of the Mean Squared Error (MSE) losses from intermediate layers of both models (Sun et al., 2019b; Jiao et al., 2019), the soft cross-entropy loss from the output layers (Hinton et al., 2015), and the transferable knowledge distillation loss, with instance-specific domain-expertise weighting applied.

Meta-teacher Learning
We take BERT (Devlin et al., 2019) as our base learner for text classification due to its wide popularity. For each sample X_k^(i), the last hidden outputs of the input sequence are denoted as H(X_k^(i)) = (h_{k,1}^(i), ..., h_{k,N}^(i)), where h_{k,j}^(i) represents the last-layer embedding of the j-th token in X_k^(i), and N is the maximum sequence length. For simplicity, we define h(X_k^(i)) as the average pooling of the token embeddings, i.e., h(X_k^(i)) = (1/N) * sum_{n=1}^{N} h_{k,n}^(i).

Learning Instance-level Transferable Knowledge. To select transferable instances across domains, we compute a prototype score t_k^(i) for each sample X_k^(i). Here, we treat the mean embedding p_k^(m) = (1/|D_k^(m)|) * sum_{X in D_k^(m)} h(X) as the prototype representation for the m-th class of the k-th domain, where D_k^(m) is the subset of the k-th training set with the m-th class label. For a sample X_k^(i) with class label m, the prototype score t_k^(i) is:

t_k^(i) = alpha * cos(h(X_k^(i)), p_k^(m)) + zeta * sum_{k' != k} cos(h(X_k^(i)), p_{k'}^(m)),

where cos is the cosine similarity function, alpha is a pre-defined hyper-parameter and zeta = (1 - alpha) / (K - 1). We can see that the definition of the prototype score here is different from that in previous meta-learning work, as we require that an instance X_k^(i) be close to its class prototype representation in the embedding space (i.e., p_k^(m)), as well as to the corresponding prototype representations in out-of-domain datasets (i.e., p_{k'}^(m) with k' = 1, ..., K, k' != k). This is because the meta-teacher should learn more from instances that are prototypical across domains instead of in-domain only. For the text classification task, the loss of the meta-teacher is defined as the cross-entropy loss with the prototype score as a weight assigned to each instance.

Learning Feature-level Transferable Knowledge. Apart from the cross-entropy loss, we propose a domain-adversarial loss to increase the meta-teacher's ability to learn feature-level transferable knowledge.
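The instance-level scoring above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: it assumes sample embeddings are already computed (e.g., by average-pooling BERT's last hidden states), and the function and argument names (`prototype_score`, `mean_vector`) are our own.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def mean_vector(vectors):
    """Element-wise mean; used both for class prototypes p_k^(m)
    and for average pooling of token embeddings."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def prototype_score(h_x, class_prototypes, k, alpha):
    """Prototype score t_k^(i) for a sample of domain k.

    `class_prototypes` maps each domain index to the prototype vector of the
    sample's class; the in-domain prototype is weighted by alpha, each of the
    K-1 out-of-domain prototypes by zeta = (1 - alpha) / (K - 1).
    """
    K = len(class_prototypes)
    zeta = (1 - alpha) / (K - 1)
    score = alpha * cosine(h_x, class_prototypes[k])
    for kp in range(K):
        if kp != k:
            score += zeta * cosine(h_x, class_prototypes[kp])
    return score
```

For example, with two domains whose class prototypes are orthogonal, a sample lying exactly on its in-domain prototype receives a score of alpha, reflecting that it is prototypical in-domain but not across domains.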
For each sample X_k^(i), a sub-network first maps the pooled representation into a shared space: h_d(X_k^(i)) = tanh(W * h(X_k^(i)) + b), where W and b are the sub-network parameters. The domain-adversarial loss for X_k^(i) is defined as:

L_DA(X_k^(i)) = - sum_{z=1}^{K} 1(z = z_k^(i)) * log sigma_z(h_d(X_k^(i))),

where sigma is the K-way domain classifier, and 1 is the indicator function that returns 1 if z = z_k^(i), and 0 otherwise. Here, z_k^(i) is a false domain label of X_k^(i), so that minimizing this loss encourages representations that confuse the predictions of domain labels. We call h_d(X_k^(i)) the transferable knowledge for X_k^(i), which is insensitive to domain differences.
Let L_CE(X_k^(i)) be the normal cross-entropy loss of the text classification task. The total loss of the meta-teacher, L_MT, is the combination of the prototype-weighted L_CE and the domain-adversarial loss:

L_MT = sum_{k=1}^{K} sum_{i} ( t_k^(i) * L_CE(X_k^(i)) + gamma_1 * L_DA(X_k^(i)) ),

where gamma_1 is the factor that controls how much the domain-adversarial loss contributes to the overall loss.
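The combined meta-teacher objective can be sketched as follows. This is a simplified, batch-level illustration under our own assumptions: each sample is a dict with hypothetical field names (`t` for the prototype score, `cls_logits`/`label` for the task head, `dom_logits`/`false_domain` for the domain-adversarial head), and both terms reduce to cross-entropy losses.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(logits, label):
    """Negative log-probability of the given label."""
    return -math.log(softmax(logits)[label])

def meta_teacher_loss(batch, gamma1):
    """L_MT over a batch: prototype-weighted classification CE
    plus gamma1 times the domain-adversarial CE (toward a false
    domain label, to encourage domain confusion)."""
    total = 0.0
    for s in batch:
        l_ce = cross_entropy(s["cls_logits"], s["label"])
        l_da = cross_entropy(s["dom_logits"], s["false_domain"])
        total += s["t"] * l_ce + gamma1 * l_da
    return total
```

In a real training loop these losses would be computed from differentiable model outputs; here the weighting structure is the point: a sample's gradient contribution to the classification term is scaled by its prototype score t_k^(i).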

Meta-distillation
We take BERT as our meta-teacher and use smaller BERT models as student models. The distillation framework is shown in Figure 2. In our work, we distill the knowledge in the meta-teacher model considering the following five elements: input embeddings, hidden states, attention matrices, output logits, and transferable knowledge. The KD process for input embeddings, hidden states and attention matrices follows the common practice (Sun et al., 2019b; Jiao et al., 2019). Recall that M and S_k are the meta-teacher and the k-th student model. Let L_embd(M, S_k, X_k^(i)), L_hidn(M, S_k, X_k^(i)) and L_attn(M, S_k, X_k^(i)) denote the distillation losses over the input embeddings, hidden states and attention matrices, respectively, and let L_pred(M, S_k, X_k^(i)) be the cross-entropy loss over the "softened" output logits, parameterized by the temperature (Hinton et al., 2015). A naive approach to formulating the total KD loss L_kd is the sum of all the previous loss functions, i.e.,

L_kd = sum_{i} ( L_embd + L_hidn + L_attn + L_pred ).

However, this approach gives no special consideration to the transferable knowledge of the meta-teacher. Let h_d^M(X_k^(i)) and h_d^{S_k}(X_k^(i)) be the transferable knowledge representations produced by the meta-teacher and the student, respectively. We define the transferable knowledge distillation loss as L_trans(M, S_k, X_k^(i)) = MSE(h_d^M(X_k^(i)), h_d^{S_k}(X_k^(i))), where MSE is the MSE loss function w.r.t. a single element. In this way, we encourage student models to learn more transferable knowledge from the meta-teacher.

We further notice that, although the knowledge of the meta-teacher should be highly transferable, there still exists a domain gap between the meta-teacher and domain-specific student models. In this work, for each sample X_k^(i), we define the domain-expertise weight lambda_k^(i) as follows:

lambda_k^(i) = t_k^(i) * 1(y_hat_k^(i) = y_k^(i)),

where y_hat_k^(i) is the meta-teacher's prediction and y_k^(i) is X_k^(i)'s class label. Here, the weight lambda_k^(i) is large when the meta-teacher model i) has a large prototype score t_k^(i) and ii) makes correct predictions on the target input, i.e., y_hat_k^(i) = y_k^(i). We can see that the weight reflects how well the meta-teacher can supervise the student on a specific input. Finally, we derive the complete formulation of the KD loss L_kd as follows:

L_kd = sum_{i} lambda_k^(i) * ( L_embd + L_hidn + L_attn + L_pred + gamma_2 * L_trans ),

where gamma_2 is the transferable KD factor that controls how much the transferable knowledge distillation loss contributes to the overall loss.
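The weighted meta-distillation loss above can be sketched as follows. Again this is an illustrative simplification, not the authors' code: we assume each sample arrives as a dict with hypothetical field names carrying the four precomputed standard distillation terms (`l_embd`, `l_hidn`, `l_attn`, `l_pred`), the teacher/student transferable-knowledge vectors, the prototype score `t`, and the meta-teacher's prediction.

```python
def mse(u, v):
    """Mean squared error between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)

def meta_distillation_loss(samples, gamma2):
    """Weighted KD loss sketch: the standard distillation terms plus
    gamma2 times the MSE between the teacher's and student's
    transferable-knowledge vectors, all scaled by the domain-expertise
    weight lambda = t * 1(teacher prediction is correct)."""
    total = 0.0
    for s in samples:
        lam = s["t"] * (1.0 if s["teacher_pred"] == s["label"] else 0.0)
        l_trans = mse(s["h_d_teacher"], s["h_d_student"])
        base = s["l_embd"] + s["l_hidn"] + s["l_attn"] + s["l_pred"]
        total += lam * (base + gamma2 * l_trans)
    return total
```

Note the effect of the weighting: samples on which the meta-teacher predicts incorrectly contribute nothing to the student's loss, so the student is never pushed to imitate the teacher where the teacher itself is unreliable.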

Experiments
In this section, we conduct extensive experiments to evaluate the Meta-KD framework on two popular text mining tasks across domains.

Tasks and Datasets
We evaluate Meta-KD on natural language inference and sentiment analysis, using the following two datasets: MNLI and Amazon Reviews. The data statistics are in Table 1.
• MNLI (Williams et al., 2018) is a large-scale, multi-domain natural language inference dataset for predicting the entailment relation between two sentences, containing five domains (genres). After filtering samples with no labels available, we use the original development set as our test set and randomly sample 10% of the training data as a development set in our setting.
• Amazon Reviews (Blitzer et al., 2007) is a multi-domain sentiment analysis dataset, widely used in multi-domain text classification tasks. The reviews are annotated as positive or negative. For each domain, there are 2,000 labeled reviews. We randomly split the data into train, development, and test sets.

Baselines
For the teacher side, to evaluate the cross-domain distillation power of the meta-teacher model, we consider the following models as baseline teachers:
• BERT-single: Train the BERT teacher model on the target distillation domain only. If we have K domains, then we have K BERT-single teachers.
• BERT-mix: Train the BERT teacher on a combination of K-domain datasets. Hence, we have one BERT-mix model as the teacher model for all domains.
• BERT-mtl: Similar to the "one-teacher" paradigm as in BERT-mix, but the teacher model is generated by the multi-task fine-tuning approach (Sun et al., 2019a).
• Multi-teachers: It uses K domain-specific BERT-single models to supervise K student models, ignoring the domain difference.
For the student side, we follow TinyBERT (Jiao et al., 2019) and use smaller BERT models as our student models. For the single-teacher baselines (i.e., BERT-single, BERT-mix and BERT-mtl), we use TinyBERT-KD as our baseline KD approach. For the multi-teachers baseline, because TinyBERT-KD does not naturally support distilling from multiple teacher models, we implement a variant of the TinyBERT-KD process based on MTN-KD (You et al., 2017), which uses the averaged softened outputs of the multiple teacher networks at the output layer. In practice, we first learn the representations of the student models by TinyBERT, then apply MTN-KD for output-layer KD.
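The MTN-KD-style incorporation of multiple teachers at the output layer amounts to averaging the temperature-softened distributions of the K teachers. A minimal sketch, under the assumption that each teacher's per-sample logits are already available (function names are our own):

```python
import math

def soften(logits, temperature):
    """Temperature-scaled softmax, as in Hinton et al. (2015)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    s = sum(exps)
    return [e / s for e in exps]

def averaged_soft_targets(teacher_logits, temperature):
    """MTN-KD-style targets for one sample: the element-wise average
    of the K teachers' softened output distributions."""
    dists = [soften(logits, temperature) for logits in teacher_logits]
    K = len(dists)
    return [sum(col) / K for col in zip(*dists)]
```

The student is then trained against these averaged distributions with the usual soft cross-entropy; two teachers with opposing preferences simply cancel toward a uniform target, which is exactly the "ignoring the domain difference" behavior this baseline exhibits.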

Implementation Details
In the implementation, we use the original BERT_B model (L=12, H=768, A=12, Total Parameters=110M) as the initialization of all the teachers, and use the BERT_S model (L=4, H=312, A=12, Total Parameters=14.5M) as the initialization of all the students (available at https://github.com/huawei-noah/Pretrained-Language-Model/tree/master/TinyBERT). The hyper-parameter settings of the meta-teacher model are as follows. We train for 3-4 epochs with a learning rate of 2e-5. The batch size and gamma_1 are chosen from {16, 32, 48} and {0.1, 0.2, 0.5}, respectively. All the hyper-parameters are tuned on the development sets. For meta-distillation, we choose hidden layers {3, 6, 9, 12} of the teacher models in the baselines and of the meta-teacher model in our approach to learn the representations of the student models. Due to domain differences, we train student models for 3-10 epochs, with a learning rate of 5e-5. The batch size and gamma_2 are tuned over {32, 256} and {0.1, 0.2, 0.3, 0.4, 0.5} for intermediate-layer distillation, respectively. Following Jiao et al. (2019), for prediction-layer distillation, we run the method for 3 epochs, with the batch size and learning rate set to 32 and 3e-5. The experiments are implemented in PyTorch and run on 8 Tesla V100 GPUs.

Experimental Results
Table 2 and Table 3 show the general testing performance of the baselines and Meta-KD over MNLI and Amazon Reviews, in terms of accuracy. From the results, we have the following three major insights:
• Compared to all the baseline teacher models, using the meta-teacher for KD consistently achieves the highest accuracy on both datasets. Our method helps to significantly reduce the model size while preserving similar performance; in particular, on Amazon Reviews, we reduce the model size by 7.5x with only a minor performance drop (from 89.9 to 89.4).
• The meta-teacher has performance similar to BERT-mix and BERT-mtl, but proves to be a better teacher for distillation, as students distilled from the meta-teacher outperform those produced by the other methods. This shows the meta-teacher is capable of learning more transferable knowledge to help the student. The fact that Meta-teacher → Meta-distillation outperforms the other distillation methods confirms the effectiveness of the proposed Meta-KD framework.
• Meta-KD gains more improvement on small datasets than on large ones; e.g., it improves accuracy from 86.7 to 89.4 on Amazon Reviews but only from 79.3 to 80.4 on MNLI. This motivates us to explore our model's performance on domains with few or no training samples.

Ablation Study
We further investigate Meta-KD's capability with regard to different portions of training data for both phases, and explore how the transferable knowledge distillation loss contributes to the final results.

No In-domain Data during Meta-teacher Learning
In this set of experiments, we consider a special case where we assume all the "fiction" domain data in MNLI is unavailable. Here, we train a meta-teacher without the "fiction" domain dataset and use the distillation method proposed in Jiao et al. (2019) to produce the student model for the "fiction" domain with in-domain data during distillation. The results are shown in Table 4. We find that KD from the meta-teacher yields a large improvement compared to KD from other out-of-domain teachers. Additionally, learning from the out-of-domain meta-teacher achieves performance similar to KD from the in-domain "fiction" teacher model itself. This shows that the Meta-KD framework can be applied to emerging domains.

Few In-domain Data Available during Meta-distillation
We randomly sample a portion of the MNLI dataset as the training data in this setting. The sample rates that we choose are 0.01, 0.02, 0.05, 0.1 and 0.2. The sampled domain datasets are employed for training student models when learning from the in-domain teacher or the meta-teacher. The experimental results are shown in Figure 3, reported as the improvement rate in averaged accuracy. The results show that when less data is available, the improvement rate is much larger. For example, when we only have 1% of the original MNLI training data, the accuracy is increased by approximately 10% when the student learns from the meta-teacher. This shows Meta-KD is more beneficial when less in-domain data is available.

Influence of the Transferable Knowledge Distillation Loss
Here, we explore how the transferable KD factor gamma_2 affects the distillation performance on the Amazon Reviews dataset. We tune the value of gamma_2 from 0.1 to 1.0, with the results shown in Figure 4. We find that the optimal value of gamma_2 generally lies in the range of 0.2-0.5. The accuracy trend for the "DVD" domain is different from those of the remaining three domains. This means the benefit from the transferable knowledge of the meta-teacher varies across domains.

Conclusion and Future Work
In this work, we propose the Meta-KD framework, which consists of meta-teacher learning and meta-distillation, to distill PLMs across domains. Experiments on two widely-adopted public multi-domain datasets show that Meta-KD can train a meta-teacher that digests knowledge across domains to better teach in-domain students. Quantitative evaluations confirm the effectiveness of Meta-KD and also show its capability in settings where the training data is scarce, i.e., there is little or no in-domain data. In the future, we will examine the generalization capability of Meta-KD in other application scenarios and apply other meta-learning techniques to KD for PLMs.