Distilling Linguistic Context for Language Model Compression

A computationally expensive and memory-intensive neural network lies behind the recent success of language representation learning. Knowledge distillation, a major technique for deploying such large language models in resource-scarce environments, transfers the knowledge of individual word representations learned without restrictions. In this paper, inspired by recent observations that language representations are relatively positioned and carry more semantic knowledge as a whole, we present a new knowledge distillation objective for language representation learning that transfers contextual knowledge via two types of relationships across representations: Word Relation and Layer Transforming Relation. Unlike other recent distillation techniques for language models, our contextual distillation imposes no restrictions on architectural differences between teacher and student. We validate the effectiveness of our method on challenging benchmarks of language understanding tasks, not only with architectures of various sizes but also in combination with DynaBERT, the recently proposed adaptive size pruning method.


Introduction
Since the Transformer, a simple architecture based on the attention mechanism, succeeded in machine translation, Transformer-based models have become the new state of the art, taking over more complex structures based on recurrent or convolutional networks on various language tasks such as language understanding and question answering (Devlin et al., 2018; Lan et al., 2019; Raffel et al., 2019). However, in exchange for high performance, these models suffer from a major drawback: tremendous computational and memory costs. In particular, it is not possible to deploy such large models on platforms with limited resources such as mobile and wearable devices, so matching the performance of the latest models with a small network is an urgent and impactful research topic. As the main method for this purpose, Knowledge Distillation (KD) transfers knowledge from a large and well-performing network (teacher) to a smaller network (student). There have been several efforts to distill Transformer-based models into compact networks (Turc et al., 2019; Sun et al., 2019, 2020; Jiao et al., 2019). However, they all build on the idea that each word representation is independent, ignoring relationships between words that could be more informative than individual representations.
In this paper, we pay attention to the fact that word representations from language models are highly structured and capture certain types of semantic and syntactic relationships. Word2Vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014) demonstrated that trained word embeddings contain linguistic patterns as linear relationships between word vectors. Recently, Reif et al. (2019) found that the distance between words encodes information about the dependency parse tree. Many other studies have also provided evidence that contextual word representations (Belinkov et al., 2017; Tenney et al., 2019a,b) and attention matrices (Vig, 2019; Clark et al., 2019) contain important relations between words. Moreover, Brunner et al. (2019) showed the vertical relations in word representations across Transformer layers through word identifiability. Intuitively, although each word representation carries its own knowledge, the set of word representations as a whole is more semantically meaningful, since words in the embedding space are positioned relative to one another during learning.
Inspired by these observations, we propose a novel distillation objective, termed Contextual Knowledge Distillation (CKD), for language tasks that utilizes the statistics of relationships between word representations. In this paper, we define two types of contextual knowledge: Word Relation (WR) and Layer Transforming Relation (LTR). Specifically, WR is proposed to capture the knowledge of relationships between word representations and LTR defines how each word representation changes as it passes through the network layers.
We validate our method on the General Language Understanding Evaluation (GLUE) benchmark and the Stanford Question Answering Dataset (SQuAD), and show the effectiveness of CKD against current state-of-the-art distillation methods. For a thorough validation, we conduct experiments in both task-agnostic and task-specific distillation settings. We also show that CKD performs effectively on a variety of network architectures. Moreover, exploiting the fact that CKD places no restrictions on the student's architecture, we show that CKD further improves the performance of an adaptive size pruning method (Hou et al., 2020) that involves architectural changes during training.
To summarize, our contribution is threefold: • (1) Inspired by the recent observations that word representations from neural networks are structured, we propose a novel knowledge distillation strategy, Contextual Knowledge Distillation (CKD), that transfers the relationships across word representations.
• (2) We present two types of complementary contextual knowledge: horizontal Word Relation across representations in a single layer and vertical Layer Transforming Relation across representations for a single word.
• (3) We validate CKD on standard language understanding benchmark datasets and show that CKD not only outperforms state-of-the-art distillation methods but also boosts the performance of an adaptive pruning method.

Related Work
Knowledge distillation Since recently popular deep neural networks are computation- and memory-heavy by design, there has been a long line of research on transferring knowledge for the purpose of compression. Hinton et al. (2015) first proposed a teacher-student framework with an objective that minimizes the KL divergence between teacher and student class probabilities. In the field of natural language processing (NLP), knowledge distillation has been actively studied (Kim and Rush, 2016; Hu et al., 2018). In particular, after the emergence of large language models based on pre-training such as BERT (Devlin et al., 2018; Raffel et al., 2019), many studies have emerged that apply knowledge distillation during pre-training and/or fine-tuning for downstream tasks in order to reduce the burden of handling large models. Specifically, Tang et al. (2019) and Chia et al. (2019) proposed to distill BERT into simple recurrent and convolutional networks. Turc et al. (2019) proposed to use the teacher's predictive distribution to train a smaller BERT, and Sun et al. (2019) proposed a method to transfer individual word representations. In addition to matching hidden states, Jiao et al. (2019) and Sun et al. (2020) also utilized the attention matrices derived from the Transformer. Several works, including Hou et al. (2020), improved the performance of other compression methods by integrating knowledge distillation objectives into the training procedure. In particular, DynaBERT (Hou et al., 2020) proposed a method to train an adaptive-size BERT using hidden state matching distillation. Different from previous knowledge distillation methods that transfer the respective knowledge of individual word representations, we design an objective to distill the contextual knowledge contained among word representations.

Contextual knowledge of word representations
Understanding and utilizing the relationships across words is one of the key ingredients in language modeling. Word embeddings (Mikolov et al., 2013; Pennington et al., 2014), which capture the context of a word in a document, have traditionally been used. Unlike these traditional methods that assign a fixed embedding to each word, contextual embedding methods (Devlin et al., 2018; Peters et al., 2018), which assign different embeddings according to the context of surrounding words, have become the new standard in recent years owing to their high performance. Xia and Zong (2010) improved the performance of sentiment classification by using word relations, and Hewitt and Manning (2019) and Reif et al. (2019) found that the distance between contextual representations contains syntactic information about sentences. Recently, Brunner et al. (2019) also experimentally showed that the contextual representation of each token changes over the layers. Our research focuses on knowledge distillation using contextual information between words and between layers, and to the best of our knowledge, we are the first to apply this contextual information to knowledge distillation.

Setup and background
Most recent state-of-the-art language models stack Transformer layers, each consisting of a multi-head attention and a position-wise feed-forward network.
Transformer-based networks. Given an input sentence with n tokens, X = [x_1, x_2, ..., x_n] ∈ ℝ^{d_i × n}, most networks (Devlin et al., 2018; Lan et al., 2019) utilize an embedding layer to map the input sequence of symbol representations X to a sequence of continuous representations E = [e_1, ..., e_n] ∈ ℝ^{d_e × n}. Then, each l-th Transformer layer of identical structure takes the previous representations R_{l−1} and produces the updated representations R_l = [r_{l,1}, r_{l,2}, ..., r_{l,n}] ∈ ℝ^{d_r × n} through two sublayers: Multi-head Attention (MHA) and position-wise Feed-Forward Network (FFN). The input at the first layer (l = 1) is simply E. In the MHA operation, where h separate attention heads operate independently, each input token r_{l−1,i} for each head is projected into a query q_i ∈ ℝ^{d_q}, key k_i ∈ ℝ^{d_q}, and value v_i ∈ ℝ^{d_v}, typically via learned linear projections. The key and value vectors are packed into matrix forms K = [k_1, ..., k_n] and V = [v_1, ..., v_n], respectively, and the attention value a_i and the output of each head o_{h,i} are calculated as follows:

a_i = softmax(q_i^⊤ K / √d_q) ∈ ℝ^n,    o_{h,i} = V a_i.

The outputs of all heads are then concatenated and fed through the FFN, producing the single word representation r_{l,i}. For clarity, we pack the attention values of all words into a matrix form A_{l,h} = [a_1, a_2, ..., a_n] ∈ ℝ^{n × n} for attention head h.
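As a concrete illustration, the attention computation above can be sketched in a few lines of PyTorch. This is a minimal single-head sketch (the function name is ours) using row-major (n, d) tensors rather than the column-packed notation of the text, and assuming the query/key/value projections are plain linear maps.

```python
import torch
import torch.nn.functional as F

def single_head_attention(R_prev, W_q, W_k, W_v):
    """One attention head over previous-layer representations.

    R_prev: (n, d_r) word representations from layer l-1.
    W_q, W_k: (d_r, d_q) and W_v: (d_r, d_v) projection matrices.
    Returns the attention matrix A (n, n) and head outputs O (n, d_v).
    """
    Q = R_prev @ W_q           # queries, (n, d_q)
    K = R_prev @ W_k           # keys,    (n, d_q)
    V = R_prev @ W_v           # values,  (n, d_v)
    d_q = Q.shape[-1]
    # a_i = softmax(q_i^T K / sqrt(d_q)): each row attends over all n keys
    A = F.softmax(Q @ K.T / d_q ** 0.5, dim=-1)   # (n, n), rows sum to 1
    O = A @ V                  # weighted sum of values, (n, d_v)
    return A, O
```

In a full Transformer layer, the per-head outputs are concatenated and passed through the FFN; here only the part relevant to the attention matrices A_{l,h} is shown.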
Knowledge distillation for Transformer. In the general framework of knowledge distillation, a teacher network (T) with large capacity is trained in advance, and then a student network (S) with a pre-defined architecture, smaller than the teacher, is trained with the help of the teacher's knowledge. Specifically, given the teacher parameterized by θ_t, training the student parameterized by θ_s aims to minimize two objectives: i) the cross-entropy loss L_CE between the output of the student network S and the true label y, and ii) the difference L_D of some statistics between the teacher and student models. Overall, the goal is to minimize the following objective function:

min_{θ_s}  L_CE(S(X; θ_s), y) + λ · L_D(K(T), K(S)),

where λ controls the relative importance of the two objectives. Here, K characterizes the knowledge being transferred and can vary across distillation methods, and L_D is a matching loss function such as the l_1, l_2, or Huber loss. Recent studies on knowledge distillation for Transformer-based BERT can also be understood in this general framework; the distillation methods of previous works are summarized in Appendix A.
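In code, the generic framework reads as follows. This is a sketch assuming an l_2 matching loss for L_D and leaving the choice of knowledge statistic K to the caller; the function name is ours.

```python
import torch
import torch.nn.functional as F

def kd_objective(student_logits, labels, K_student, K_teacher, lam=1.0):
    """General KD objective: cross-entropy against the true labels plus a
    matching loss L_D between student/teacher knowledge statistics K.
    Here L_D is l2 (MSE); l1 or Huber losses are drop-in replacements."""
    ce = F.cross_entropy(student_logits, labels)
    l_d = F.mse_loss(K_student, K_teacher)
    return ce + lam * l_d
```

Different distillation methods instantiate K differently, e.g., as logits, hidden states, attention matrices, or, as in our case, relation statistics between word representations.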

Contextual Knowledge Distillation
We now present our distillation objective that transfers the structural or contextual knowledge which is defined based on the distribution of word representations. Unlike previous methods distilling each word separately, our method transfers the information contained in relationships between words or between layers, and provides a more flexible way of constructing embedding space than directly matching representations. The overall structure of our method is illustrated in Figure 1(a). Specifically, we design two key concepts of contextual knowledge from language models: Word Relation-based and Layer Transforming Relation-based contextual knowledge, as shown in Figure 1(b).

Word Relation (WR)-based Contextual Knowledge Distillation
Inspired by previous studies suggesting that neural networks can successfully capture contextual relationships across words (Reif et al., 2019; Pennington et al., 2014; Mikolov et al., 2013), WR-based CKD aims to distill the contextual knowledge contained in the relationships across words at a certain layer. The "relationship" across a set of words can be defined in a variety of ways; our work focuses on defining it as the sum of pair-wise and triple-wise relationships. Specifically, for each input X with n words, let R_l = [r_{l,1}, ..., r_{l,n}] be the word representations at layer l from the language model (teacher or student), as described in Section 3.

Figure 1: Overview of our contextual knowledge distillation. (a) In the teacher-student framework, we define two kinds of contextual knowledge, word relation and layer transforming relation, which are the statistics of relations across words in the same layer (orange) and across layers for the same word (turquoise), respectively. (b) Given the pair-wise and triple-wise relationships of WR and LTR from teacher and student, we define the objective as a matching loss between them.

Then, the objective of WR-based CKD
is to minimize the following loss:

L_WR = Σ_{i,j ∈ χ} w_ij ‖φ(r_i^t, r_j^t) − φ(r_i^s, r_j^s)‖ + λ_WR Σ_{i,j,k ∈ χ} w_ijk ‖ψ(r_i^t, r_j^t, r_k^t) − ψ(r_i^s, r_j^s, r_k^s)‖,    (1)

where χ = {1, ..., n}. The functions φ and ψ define the pair-wise and triple-wise relationships, respectively, and λ_WR adjusts the scales of the two losses.
Here, we suppress the layer index l for clarity, but the distillation loss for the entire network is simply summed for all layers. Since not all terms in Eq.
(1) are equally important in defining contextual knowledge, we introduce the weight values w_ij and w_ijk to control how important each pair-wise and triple-wise term is. Determining these weights is left open as an implementation choice, but they can be determined by the locality of words (i.e., w_ij = 1 if |i − j| ≤ δ and 0 otherwise), or by the attention information A, to focus only on relationships between related words. In this work, we use the locality of words as the weight values.
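The locality-based weights can be sketched as a dense mask; the helper name is ours.

```python
import torch

def locality_weights(n, delta):
    """Pair-wise weights w_ij = 1 if |i - j| <= delta, else 0, as a dense
    (n, n) mask. The analogous triple-wise mask additionally restricts
    |k - j| to the same window."""
    idx = torch.arange(n)
    return ((idx[:, None] - idx[None, :]).abs() <= delta).float()
```

Multiplying this mask element-wise against the matrix of pair-wise relation discrepancies in Eq. (1) zeroes out all terms outside the local window.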
While functions φ and ψ defining pair-wise and triple-wise relationship also have various possibilities, the simplest choices are to use the distance between two words for pair-wise φ and the angle by three words for triple-wise ψ, respectively.
Pair-wise φ via distance. Given a pair of word representations (r_i, r_j) from the same layer, φ(r_i, r_j) can be defined as the cosine similarity:

φ(r_i, r_j) = cos(r_i, r_j) = ⟨r_i, r_j⟩ / (‖r_i‖_2 ‖r_j‖_2).

Triple-wise ψ via angle. The triple-wise relation captures higher-order structure and provides more flexibility in constructing contextual knowledge. One of the simplest forms for ψ is the angle at r_j, calculated as

ψ(r_i, r_j, r_k) = ⟨ (r_i − r_j) / ‖r_i − r_j‖_2 , (r_k − r_j) / ‖r_k − r_j‖_2 ⟩,    (2)

where ⟨·, ·⟩ denotes the dot product between two vectors. Despite its simple form, efficiently computing the angles in Eq. (2) for all possible triples out of n words requires storing all relative representations (r_i − r_j) in an (n, n, d_r) tensor¹. This incurs an additional memory cost of O(n² d_r). In this case, using locality for w_ijk in Eq. (1), as mentioned above, is helpful: by considering only the triples within a distance of δ from r_j, the additional memory required for efficient computation is O(δ n d_r), which is beneficial for δ ≪ n. It also reduces the computational complexity of the triple-wise relation from O(n³ d_r) to O(δ² n d_r).
¹ From the identity ‖r_i − r_j‖_2² = ‖r_i‖_2² + ‖r_j‖_2² − 2⟨r_i, r_j⟩, computing the pair-wise distances via the right-hand side requires no additional memory.
Moreover, we show in the experimental section that measuring angles within a local window also helps performance.
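Putting the pieces together, a minimal PyTorch sketch of the WR loss with the locality trick might look as follows. The function names and the choice of an l_1 matching loss are our assumptions, and zero-padding at the sequence boundaries is a simplification of the windowing described above.

```python
import torch
import torch.nn.functional as F

def wr_pairwise(R):
    """Pair-wise relation phi: cosine similarity between all word pairs.
    R: (n, d) word representations of one layer."""
    Rn = F.normalize(R, dim=-1)
    return Rn @ Rn.T                                   # (n, n)

def wr_triplewise_local(R, delta):
    """Triple-wise relation psi: cosines of angles at r_j formed with words
    within |i - j| <= delta, keeping memory at O(delta * n * d) instead of
    O(n^2 * d). Boundaries are zero-padded for simplicity."""
    n, d = R.shape
    Rp = torch.cat([torch.zeros(delta, d), R, torch.zeros(delta, d)])
    # sliding windows of relative neighbours around each position j
    win = torch.stack([Rp[j:j + 2 * delta + 1] for j in range(n)])  # (n, 2*delta+1, d)
    diffs = F.normalize(win - R[:, None, :], dim=-1)   # unit relative vectors
    return diffs @ diffs.transpose(1, 2)               # (n, 2*delta+1, 2*delta+1)

def wr_loss(R_t, R_s, delta=5, lam_wr=1.0):
    """WR-based CKD loss: match teacher/student pair- and triple-wise relations."""
    pair = F.l1_loss(wr_pairwise(R_s), wr_pairwise(R_t))
    trip = F.l1_loss(wr_triplewise_local(R_s, delta),
                     wr_triplewise_local(R_t, delta))
    return pair + lam_wr * trip
```

When teacher and student have different hidden sizes, both relation tensors still have identical shapes, (n, n) and (n, 2δ+1, 2δ+1), which is exactly the property exploited in Section "Architectural Constraints in Distillation Objectives".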

Layer Transforming Relation (LTR)-based Contextual Knowledge Distillation
The second structural knowledge that we propose to capture concerns "how each word is transformed as it passes through the layers". Transformer-based language models are composed of a stack of identical layers and thus generate a set of representations for each word, one per layer, with more abstract concepts in the higher hierarchy. Hence, LTR-based CKD aims to distill the knowledge of how each word develops into a more abstract concept within the hierarchy. Toward this, given a set of representations for a single word across L layers, [r^s_{1,w}, ..., r^s_{L,w}] for the student and [r^t_{1,w}, ..., r^t_{L,w}] for the teacher (here we abuse notation: {1, ..., L} is not necessarily the set of all layers of the student or teacher, but the index set of layers defined by the alignment strategy below; we also suppress the word index w), the objective of LTR-based CKD is to minimize the following loss:

L_LTR = Σ_{l,m ∈ ρ} w_lm ‖φ(r_l^t, r_m^t) − φ(r_l^s, r_m^s)‖ + λ_LTR Σ_{l,m,o ∈ ρ} w_lmo ‖ψ(r_l^t, r_m^t, r_o^t) − ψ(r_l^s, r_m^s, r_o^s)‖,    (3)

where ρ = {1, ..., L} and λ_LTR again adjusts the scales of the two losses. The composition of Eq. (3) is the same as Eq. (1); only the objects whose relationships are captured change, from word representations within one layer to representations of a single word across layers. That is, the relationships among representations of a word in different layers can be defined by distance or angle as in Eq. (2): φ(r_l, r_m) = cos(r_l, r_m) or ‖r_l − r_m‖_2, and ψ(r_l, r_m, r_o) = ⟨(r_l − r_m)/‖r_l − r_m‖_2, (r_o − r_m)/‖r_o − r_m‖_2⟩.

Alignment strategy. When the numbers of layers of the teacher and student differ, it is important to determine which layer of the student learns information from which layer of the teacher. Previous works (Sun et al., 2019; Jiao et al., 2019) resolved this alignment issue via the uniform (i.e., skip) strategy and demonstrated its effectiveness experimentally. For an L_t-layered teacher and an L_s-layered student, the layer matching function f is defined as f(step_s × t) = step_t × t, for t = 0, ..., g, where g is the greatest common divisor of L_t and L_s, step_t = L_t/g and step_s = L_s/g.
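The uniform (skip) alignment can be written compactly; this helper and its output format are ours.

```python
from math import gcd

def uniform_layer_alignment(L_t, L_s):
    """Uniform (skip) strategy: f(step_s * t) = step_t * t for t = 0..g,
    where g = gcd(L_t, L_s), step_t = L_t / g, step_s = L_s / g.
    Returns (student_layer, teacher_layer) index pairs."""
    g = gcd(L_t, L_s)
    step_t, step_s = L_t // g, L_s // g
    return [(step_s * t, step_t * t) for t in range(g + 1)]
```

For example, a 12-layer teacher and a 6-layer student give g = 6, so every student layer is matched to every second teacher layer.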
Overall training objective. The distillation objective supervises the student network with the help of the teacher's knowledge. Multiple distillation loss functions can be used during training, either alone or together. We combine the proposed CKD with class probability matching (Hinton et al., 2015) as an additional term, in which case our overall distillation objective is as follows:

L = L_CE + L_KD + λ_CKD (L_WR + L_LTR),

where L_KD is the class probability matching loss and λ_CKD is a tunable parameter to balance the loss terms.
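As a sketch, assuming the standard temperature-softened KL divergence for the class probability matching term and the α/T convention from the hyperparameter search in Appendix B (the exact weighting between the cross-entropy and matching terms is our assumption):

```python
import torch
import torch.nn.functional as F

def overall_loss(s_logits, t_logits, labels, l_wr, l_ltr,
                 lam_ckd=1.0, alpha=0.9, T=4.0):
    """Overall objective (sketch): cross-entropy, class probability matching
    (Hinton et al., 2015) at temperature T, and the CKD terms scaled by
    lam_ckd. l_wr and l_ltr are the precomputed losses of Eqs. (1) and (3)."""
    ce = F.cross_entropy(s_logits, labels)
    # KL between temperature-softened teacher and student distributions,
    # rescaled by T^2 to keep gradient magnitudes comparable
    kd = F.kl_div(F.log_softmax(s_logits / T, dim=-1),
                  F.softmax(t_logits / T, dim=-1),
                  reduction="batchmean") * (T * T)
    return (1 - alpha) * ce + alpha * kd + lam_ckd * (l_wr + l_ltr)
```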

Architectural Constraints in Distillation Objectives
Commonly used state-of-the-art knowledge distillation objectives come with constraints on the design of student networks, since they directly match parts of the teacher and student networks such as attention matrices or word representations. For example, DistilBERT and PKD (Sun et al., 2019) match each word representation independently using cosine similarities, Σ_{i=1}^{n} cos(r_{l,i}^t, r_{l,i}^s), so the embedding size of the student network must follow that of the given teacher network. Similarly, TinyBERT (Jiao et al., 2019) and MiniLM match the attention matrices via Σ_{h=1}^{H} KL(A_{l,h}^t ‖ A_{l,h}^s), so the teacher and student networks must have the same number of attention heads (see Appendix A for more details on these distillation objectives).
In addition to distilling contextual information, our CKD method has the advantage that the student network's structure can be chosen more freely, without the restrictions of existing KD methods. This is because CKD matches the pair-wise or triple-wise relationships of words from arbitrary (student and teacher) networks, as shown in Eq. (1), so it is always possible to match information of the same dimension without being directly constrained by the architecture. Thanks to this advantage, in the experimental section we show that CKD can further improve the performance of the recently proposed DynaBERT (Hou et al., 2020), which involves flexible architectural changes in the training phase.

Table 1: Comparisons for task-agnostic distillation. For a fair comparison, we do not use task-specific distillation. The results of TinyBERT and Truncated BERT are those reported in prior work; other results are as reported by their authors. We exclude BERT-of-Theseus since its authors do not consider task-agnostic distillation. Results on development sets are averaged over 4 runs. "-" indicates the result is not reported in the original paper and the trained model is not released. † marks our runs with the officially released model.

Experiments
We conduct task-agnostic and task-specific distillation experiments to elaborately compare our CKD with baseline distillation objectives. We then report on the performance gains achieved by our method for BERT architectures of various sizes and inserting our objective for training DynaBERT which can run at adaptive width and depth through pruning the attention heads or layers. Finally, we analyze the effect of each component in our CKD and the impact of leveraging locality δ for w ijk in Eq. (1).
Dataset. For task-agnostic distillation, which compresses a large pre-trained language model into a small language model at the pre-training stage, we use documents from English Wikipedia. For evaluating the compressed language model at the pre-training stage and for task-specific distillation, we use the GLUE benchmark (Wang et al., 2018), which consists of nine diverse sentence-level classification tasks, and SQuAD (Rajpurkar et al., 2016).
Setup. For task-agnostic distillation, we use the original BERT without fine-tuning as the teacher. Then, we perform distillation on a student whose model size is pre-defined. We perform distillation using our proposed CKD objective with class probability matching of masked language modeling for 3 epochs, following Jiao et al. (2019), and keep other hyperparameters the same as in BERT pre-training (Devlin et al., 2018). For task-specific distillation, we experiment with CKD on top of pre-trained BERT models of various sizes, which are released for research in institutions with fewer computational resources² (Turc et al., 2019). For the importance weight of each pair-wise and triple-wise term, we leverage the locality of words, i.e., w_ij = 1 if |i − j| ≤ δ and 0 otherwise, and select δ from the range 10-21. More details, including hyperparameters, are provided in Appendix B. The code to reproduce the experimental results is available at https://github.com/GeondoPark/CKD.

Main Results
To verify the effectiveness of our CKD objective, we compare its performance with previous distillation methods for BERT compression in both task-agnostic and task-specific settings. Following the standard setup of the baselines, we use BERT_BASE (12/768)³ as the teacher and a 6-layer BERT (6/768) as the student network. Therefore, the student models used in all baselines and in ours have the same number of parameters (67.5M), inference FLOPs (10878M), and inference time.
Task-agnostic Distillation. We compare with three baselines: 1) Truncated BERT, which drops the top 6 layers of BERT_BASE, proposed in PKD (Sun et al., 2019); 2) BERT_SMALL, trained using the Masked LM objective as in PD (Turc et al., 2019); and 3) TinyBERT (Jiao et al., 2019), which proposes individual word representation and attention map matching. Since MobileBERT (Sun et al., 2020) uses specifically designed teacher and student networks with 24 layers and an inverted bottleneck structure, we do not compare with it. DistilBERT and MiniLM use the additional BookCorpus dataset, which is no longer publicly available, so a full comparison including them is deferred to Appendix C. Results are presented in Table 1.

Task-specific Distillation. Here, we compare with four baselines that do not perform distillation during pre-training: 1) PD (Turc et al., 2019), which pre-trains with Masked LM and distills with Logit KD in the task-specific fine-tuning process.
2) PKD (Sun et al., 2019), which uses only the bottom 6 layers of BERT_BASE and performs distillation only during task-specific fine-tuning; the GLUE dev results of PKD are taken from Xu et al. (2020). 3) TinyBERT (Jiao et al., 2019): for a fair comparison, we perform distillation with its objectives only during task-specific fine-tuning, on top of the pre-trained model provided by Turc et al. (2019). 4) BERT-of-Theseus (Xu et al., 2020), which learns a compact student network by replacing teacher layers during fine-tuning. Results of task-specific distillation on the GLUE dev sets and SQuAD are presented in Tables 2 and 3, respectively. In short, CKD outperforms all baselines on all GLUE datasets and on SQuAD, except for RTE, convincingly verifying its effectiveness. These results consistently support that contextual knowledge transfers better than other forms of distilled knowledge.

Effect of CKD on various sizes of models
For knowledge distillation with the purpose of network compression, it is essential to work well in more resource-scarce environments. To this end, we further evaluate our method on architectures of various sizes. For this experiment, we perform distillation in the task-specific training process on top of various-size pre-trained models provided by Turc et al. (2019). We compare CKD with three baselines: 1) the LogitKD objective used by Turc et al. (2019); 2) the TinyBERT (Jiao et al., 2019) objective, which includes individual word representation and attention matrix matching; and 3) the MiniLM objective, which includes attention matrix and value-value relation matching. We implement the baselines and run them for task-specific distillation. We note that the MiniLM and TinyBERT objectives are applicable only to models (*/768) that have the same number of attention heads as the teacher model (12/768). Figure 2 illustrates that CKD consistently exhibits significant performance improvements compared with LogitKD. In addition, for task-specific distillation, CKD works better than all baselines on the (*/768) student models. Results on more datasets are provided in Appendix E.

Incorporating with DynaBERT
DynaBERT (Hou et al., 2020) is a recently proposed adaptive-size pruning method that can run at adaptive width and depth by removing attention heads or layers. In the training phase, DynaBERT uses distillation objectives consisting of LogitKD and individual word representation matching to improve performance. Since the CKD objective has no architectural constraints such as embedding size or number of attention heads, we validate it by replacing DynaBERT's distillation objective with CKD. The DynaBERT algorithm and how CKD is inserted are described in Appendix D. To observe how much the distillation alone improves performance, we do not use data augmentation or an additional fine-tuning process. We note that the objectives proposed in MiniLM and TinyBERT (Jiao et al., 2019) cannot be directly applied due to constraints on the number of attention heads. As illustrated in Figure 3, CKD consistently outperforms the original DynaBERT across dynamic model sizes, supporting the claim that distribution-based knowledge is more helpful than individual word representation knowledge. Results on more datasets are provided in Appendix E.

Ablation Studies
We provide additional ablation studies to analyze the impact of each component of CKD and of the locality window δ introduced for the weights w_ij in Eq. (1), which control how important each pair-wise and triple-wise term is. For these studies, we fix the student network to a 4-layer BERT (4/512) and report results averaged over 4 runs on the development set.
Impact of each component of CKD. The proposed CKD transfers word relation-based and layer transforming relation-based contextual knowledge. To isolate their impact, we experiment with successively removing each piece of our objective. Table 4 summarizes the results: WR and LTR each contribute, and bring a considerable performance gain when applied together, verifying their individual effectiveness.
Locality as the importance of relation terms. We introduced the additional weights (w_ij, w_ijk) in Eq. (1) for CKD-WR (and similar ones for CKD-LTR) to control the importance of each pair-wise and triple-wise term, and suggested using locality for them as one possible choice. Here, we verify the effect of locality by increasing the local window size δ on the SST-2 and QNLI datasets; the result is illustrated in Figure 4. We observe that as the local window size increases, the performance improves, but degrades after some point. Based on this ablation study, we set the window size δ between 10 and 21.

Conclusion
We proposed a novel distillation strategy that efficiently leverages contextual information based on word relations and layer transforming relations. To our knowledge, we are the first to apply this contextual knowledge, previously studied to interpret language models, to distillation. Through various experiments, we showed not only that CKD outperforms state-of-the-art distillation methods but also that it can boost the performance of other compression methods.

Table 5 presents the details of the knowledge distillation objectives of previous methods and their constraints.

A Explanation of previous methods and their constraints
DistilBERT  uses logit distillation loss (Logit KD), masked language modeling loss, and cosine loss between the teacher and student word representations in the learning process. The cosine loss serves to align the directions of the hidden state vectors of the teacher and student. Since the cosine of the two hidden state vectors is calculated in this process, they have the constraint that the embedding size of the teacher and the student model must be the same.
PKD (Sun et al., 2019) transfers teacher knowledge to the student with Logit KD and patient loss. The patient loss is the mean-square loss between the normalized hidden states of the teacher and student. To calculate the mean square error between the hidden states, they have a constraint that the dimensions of hidden states must be the same between teacher and student.
TinyBERT (Jiao et al., 2019) uses an additional loss that matches word representations and attention matrices between the teacher and student. Although it acquires flexibility in the embedding size by using an additional projection parameter, the attention matrices of the teacher and student are matched through a mean-square-error loss, so the number of attention heads of the teacher and student must be the same.
MobileBERT (Sun et al., 2020) utilizes a similar objective with TinyBERT (Jiao et al., 2019) for task-agnostic distillation. However, since they match the hidden states with l2 distance and attention matrices with KL divergence between teacher and student, they have restrictions on the size of hidden states and the number of attention heads.
MiniLM proposes distilling the self-attention module of the last Transformer layer of the teacher. In the self-attention module, it transfers attention matrices, as in TinyBERT and MobileBERT, as well as value-value relation matrices. Since the attention matrices of the teacher and student are matched in one-to-one correspondence, the number of attention heads of the teacher and student must be the same.
The methods introduced in Table 5 are constrained by their respective knowledge distillation objectives. In contrast, our CKD method, which utilizes relation statistics between word representations (hidden states), has the advantage of placing no constraints on the student architecture.

B Details of experiment setting
This section introduces the experimental setting in detail. We implemented our method with the PyTorch framework and huggingface's transformers package.
Task-agnostic distillation We use the pre-trained original BERT_BASE with the masked language modeling objective as the teacher and documents from English Wikipedia as training data. We set the max sequence length to 128 and follow the preprocessing and WordPiece tokenization of Devlin et al. (2018). Then, we perform distillation for 3 epochs. For the pre-training stage, we use the CKD objective with class probability matching of masked language modeling and keep other hyperparameters the same as in BERT pre-training (Devlin et al., 2018).
Task-specific distillation Our contextual knowledge distillation proceeds in the following order. First, starting from the pre-trained BERT-base, task-specific fine-tuning is conducted to obtain the teacher. Then, we prepare a pre-trained small-size architecture to serve as the student; here, we employ the pre-trained models of various sizes provided by Turc et al. (2019). Finally, task-specific distillation with our CKD is performed.
To reduce the hyperparameter search cost, λ_WR in Eq. (1) and λ_LTR in Eq. (3) share the same value. For the importance weights introduced for the pair-wise and triple-wise terms, locality is applied only to the importance weight w of the word relation (WR)-based CKD loss; the importance weight w of the layer transforming relation (LTR)-based CKD loss is set to 1. To find the optimal hyperparameters for each dataset, we report the best result among the following values:
• Alpha (α): 0.7, 0.9
• Temperature (T): 3, 4
• λ_WR, λ_LTR: 1, 10, 100, 1000
• λ_CKD: 1, 10, 100, 1000
Other training configurations such as batch size, learning rate, and warm-up proportion follow BERT (Devlin et al., 2018).

Table 5: Overview of distillation objectives used for language model compression and their constraints on architecture. S_k denotes the scaled softmax function across the k-th dimension.
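The grid above can be enumerated directly; the sketch below (variable names and the scoring stub are ours) shows the resulting search size per dataset:

```python
from itertools import product

# the search grid listed above; lambda_WR and lambda_LTR share one value
grid = {
    "alpha":         [0.7, 0.9],
    "temperature":   [3, 4],
    "lambda_wr_ltr": [1, 10, 100, 1000],
    "lambda_ckd":    [1, 10, 100, 1000],
}
configs = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(configs))  # 2 * 2 * 4 * 4 = 64 settings per dataset

def dev_score(cfg):
    # stub: would train with `cfg` and return the dev-set metric
    return 0.0

best_cfg = max(configs, key=dev_score)
```

Sharing one value for λ_WR and λ_LTR is what keeps the grid at 64 settings rather than 256.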

Table 6: Full comparison of task-agnostic distillation comparing our CKD against the baseline methods. For the task-agnostic distillation comparison, we do not use task-specific distillation, for a fair comparison. The results of TinyBERT are cited as reported in previous work; other results are as reported by their authors. Results on the development set are averaged over 4 runs. "-" means the result is not reported and the trained model is not released. † marks our runs with the officially released model.

C Additional comparison on task-agnostic distillation
We report a fair comparison of our method and the baselines on task-agnostic distillation in Section 5.1 of the main paper. Several works use the additional BookCorpus dataset, which is no longer publicly available. Here, we present the full comparison of CKD and the baselines, including DistilBERT and MiniLM. As shown in Table 6, even though we do not use the BookCorpus dataset, we outperform all baselines on four datasets and obtain comparable performance on the remaining datasets.

D Applying CKD to DynaBERT
In this section, we describe how we apply our CKD objective to DynaBERT (Hou et al., 2020). Training DynaBERT consists of three stages: 1) rewire the model according to importance, and then 2) go through the two stages of adaptive pruning with a distillation objective. Since we suppress some details of DynaBERT for clarity, we refer the reader to the paper (Hou et al., 2020) for more information.
We summarize the training procedure of DynaBERT with CKD in Algorithm 1. To fully exploit the capacity, more important attention heads and neurons must be shared more often across the various sub-networks. Therefore, we follow phase 1 of DynaBERT and rewire the network by calculating the loss and estimating, based on gradients, the importance of each attention head in the Multi-Head Attention (MHA) and each neuron in the Feed-Forward Network (FFN). DynaBERT is then trained by accumulating gradients while varying the width and depth of BERT. In these stages, the original method utilizes a distillation objective that matches hidden states and logits to improve performance. We apply our CKD at these stages by replacing their objective with CKD, as shown in Algorithm 1 (blue). Since CKD imposes no restrictions on the student's architecture, it can be applied without modification.
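The gradient-accumulation loop over sub-networks can be sketched as the following toy schematic. Everything here is a stand-in (tiny linear models, a sliced-output "sub-network", and an MSE placeholder where the CKD loss would go); it only illustrates the control flow of accumulating gradients across widths before a single optimizer step, as described above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

student = nn.Linear(16, 16)   # toy stand-ins for the Transformer models
teacher = nn.Linear(16, 16)
opt = torch.optim.SGD(student.parameters(), lr=1e-2)

def sub_forward(x, width_ratio):
    # stand-in for extracting a width-adaptive sub-network:
    # keep only the first `width_ratio` fraction of output units
    k = int(16 * width_ratio)
    return student(x)[:, :k], teacher(x)[:, :k].detach()

for batch in [torch.randn(4, 16) for _ in range(2)]:
    opt.zero_grad()
    for width in (1.0, 0.75, 0.5):               # vary sub-network width
        out_s, out_t = sub_forward(batch, width)
        loss = F.mse_loss(out_s, out_t)          # a CKD-style loss would go here
        loss.backward()                          # gradients accumulate in `student`
    opt.step()                                   # one update over all sub-networks
```

Because the loss is computed per sub-network but the parameters are shared, each `opt.step()` reflects all widths at once, which is what lets the pruned variants stay trainable.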

E More Results
Due to space limitations in the main paper, we report results on only a subset of the GLUE datasets for the experiments on the effect of model size for CKD and on boosting DynaBERT with CKD. Here, we report all GLUE datasets except CoLA for these two experiments. We exclude the CoLA dataset because the distillation losses do not converge properly for the very small-size models.
Here, we present the results of three experiments