Universal-KD: Attention-based Output-Grounded Intermediate Layer Knowledge Distillation

Intermediate layer matching has been shown to be an effective approach for improving knowledge distillation (KD). However, this technique applies matching in the hidden spaces of two different networks (i.e. student and teacher), which lacks clear interpretability. Moreover, intermediate layer KD cannot easily deal with problems such as layer mapping search and architecture mismatch (i.e. it requires the teacher and student to be of the same model type). To tackle all the aforementioned problems, we propose Universal-KD, which matches intermediate layers of the teacher and the student in the output space (by adding pseudo classifiers on intermediate layers) via an attention-based layer projection. This unified approach has three merits: (i) it can be flexibly combined with current intermediate layer distillation techniques to improve their results; (ii) the pseudo classifiers of the teacher can be deployed instead of extra expensive teacher assistant networks to address the capacity gap problem in KD, a common issue when the gap between the sizes of the teacher and student networks becomes too large; (iii) it can be used in cross-architecture intermediate layer KD. We conducted comprehensive experiments distilling BERT-base into BERT-4, RoBERTa-large into DistilRoBERTa, and BERT-base into CNN- and LSTM-based models. Results on the GLUE tasks show that our approach outperforms other KD techniques.


Introduction
Despite the great success of deep neural networks in different tasks such as computer vision (Huang et al., 2018; Lawrence et al., 1997) and natural language processing (NLP) (Devlin et al., 2019; Radford et al., 2018; Vaswani et al., 2017), their huge over-parameterization can be prohibitive when it comes to deploying these models for end users or on edge devices with limited memory and computational power. Therefore, different neural model compression techniques, such as quantization (Gong et al., 2014; Prato et al., 2019), pruning (Han et al., 2015), layer decomposition (Mondelli and Montanari, 2019) and knowledge distillation (KD) (Buciluǎ et al., 2006; Hinton et al., 2015), aim at reducing the number of parameters of these models and improving their memory footprint or runtime efficiency.
KD is one of the most well-known neural model compression techniques. It provides a way to compress a large model (the so-called teacher model) into a small model (the student model). KD has proven reliable in reducing the number of parameters and computations while achieving competitive results on downstream tasks. Recently, KD has attracted more attention in NLP, especially for large pre-trained language models (PLMs) (Sanh et al., 2019; Sun et al., 2019b; Jiao et al., 2019). However, it is evident that the original KD alone does not maintain the performance of compressed PLMs well and needs to be equipped with other auxiliary training objectives (Jiao et al., 2019; Sun et al., 2019b). Jiao et al. (2019) use a few of these auxiliary techniques, such as Intermediate Layer Distillation (ILD) and data augmentation. MobileBERT (Sun et al., 2020) deploys a progressive layer-wise training. Jafari et al. (2021) propose a two-stage gradual training that uses a dynamic temperature factor during training to address the common capacity gap problem (i.e. training with KD becomes more difficult when the capacity gap between the teacher and student networks becomes too large). Sun et al. (2019b) and Passban et al. (2021) show the importance of ILD and try to improve the search and skip problems in ILD, respectively.
Among all auxiliary techniques, the focus of this paper is on ILD as a promising approach for improving KD. Existing ILD techniques are established on raw intermediate representations. We argue that the physical interpretation of these raw representations is not known. In most current ILD techniques, distilling these representations from one network to another is done arbitrarily and without any particular reasoning. Performing ILD in this form leads to other weaknesses, such as requiring a layer mapping search and requiring the architectures of the two networks to be the same. To address these limitations, we provide a universal solution which maps the raw hidden representations of each layer to the output space using so-called intermediate pseudo classifiers. We demonstrate in this work that our Universal-KD technique leads to the following contributions: 1. Universal-KD provides an interpretable ILD technique which can be flexibly combined with other ILD methods to improve their results.
2. Universal-KD is extended to solve the capacity gap problem without requiring any extra teacher assistant network and outperforms corresponding state-of-the-art models.
3. Universal-KD is appropriate for intermediate layer cross-architecture distillation and, to the best of our knowledge, this is the first time ILD has shown improvements in this setting on NLP tasks.

Background
KD adds a new loss term (the KD loss) to the regular Cross Entropy (CE) classification loss. The KD loss pushes the output of the student model to follow that of the teacher:

L = α · CE(y, S(x; φ)) + (1 − α) · T^2 · KL(σ(z_t(x; θ)/T), σ(z_s(x; φ)/T))    (1)

where α is a hyper-parameter to adjust the contribution of the KD loss vs. the CE loss, σ(·) is the softmax function, z_t and z_s are the teacher and student logits, S(x; φ) = σ(z_s(x; φ)) denotes the student output class probabilities, T indicates the temperature factor which controls the smoothness of the output probabilities of the two networks, θ and φ refer to the parameters of the teacher and student networks respectively, and KL(·, ·) is the KL divergence loss.
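For illustration, Eq. 1 can be written as a short PyTorch sketch; this is a minimal example for a classification batch, and the function name and toy tensors are ours rather than the implementation used in our experiments.

```python
import torch
import torch.nn.functional as F

def vanilla_kd_loss(student_logits, teacher_logits, labels, alpha=0.5, T=2.0):
    """Eq. 1: alpha * CE + (1 - alpha) * T^2 * KL between temperature-softened outputs."""
    ce = F.cross_entropy(student_logits, labels)
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),   # student: log-probabilities
        F.softmax(teacher_logits / T, dim=-1),       # teacher: probabilities
        reduction="batchmean",
    ) * (T ** 2)
    return alpha * ce + (1.0 - alpha) * kl

# Toy batch: 4 examples, 3 classes.
s_logits = torch.randn(4, 3, requires_grad=True)
t_logits = torch.randn(4, 3)
labels = torch.tensor([0, 2, 1, 0])
vanilla_kd_loss(s_logits, t_logits, labels).backward()
```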

Related Work
Recent years have seen a wide array of methods that leverage intermediate layer matching (Ji et al., 2021), data augmentation (Fu et al., 2020; Jiao et al., 2019; Kamalloo et al., 2021), adversarial training (Zaharia et al., 2021; Rashid et al., 2020, 2021), and more recently loss term re-weighting (Clark et al., 2019; Zhou et al., 2021; Jafari et al., 2021) in order to reduce the teacher-student performance gap. In this section, we introduce the work related to ILD, the capacity gap problem, and intermediate layer cross-architecture distillation. A comprehensive comparison of Universal-KD and related works can be found in Table 1.

Intermediate Layer Distillation

Sun et al. (2019b) found that, apart from the teacher's final prediction, the student can also benefit from the internal components of the teacher. Therefore, they proposed PKD to extract such knowledge from the hidden layers at the fine-tuning stage. In Attention-based Layer Projection KD (ALP-KD), Passban et al. (2021) raised two problems in ILD techniques, including PKD, when the number of teacher layers N is larger than the number of student layers M: the skip problem and the search problem. The skip problem refers to the issue that when N > M, multiple layers of the teacher might be ignored in the ILD process, which can lead to loss of information. The search problem indicates that for N > M, finding the best layer mapping requires a tedious search process. To address these problems, ALP-KD uses an attention mechanism between each layer of the student and all layers of the teacher. This way, no teacher layer is skipped during distillation, and hence both the skip and the search problems are solved. Aside from these two problems, ILD is usually performed in the hidden space of intermediate layers, which lacks a clear interpretation. In other words, it is hard to assign physical meanings to these hidden representations, which makes it difficult to justify how their distillation works. We refer to this as the interpretability problem in ILD. Multi-head KD (MHKD) (Wang et al., 2020) proposed a solution to this problem by adding auxiliary classifier heads to intermediate layers, so that the matching between teacher and student intermediate layers can be done in the more interpretable output space. However, MHKD still suffers from the skip and search problems, and it has only been evaluated on computer vision tasks. We explain the benefits of our Universal-KD over MHKD in more detail in Section 4.1.

Capacity Gap Problem in KD
It has been observed that larger teachers do not necessarily lead to better distillation results, especially when the size of the teacher is much larger than that of the student (Mirzadeh et al., 2020). This problem is referred to as the capacity gap in KD (Jafari et al., 2021). Teacher assistants KD (TAKD) (Mirzadeh et al., 2020) alleviated this gap by incorporating intermediate-size TA networks (larger than the student but smaller than the teacher) in the distillation process. Son et al. (2020) pointed out that TAKD suffers from error accumulation when one of the TAs transfers wrong knowledge to the next component. Thus, they proposed Densely Guided KD (DGKD) (Son et al., 2020), which employs all TAs and the teacher to guide the model simultaneously. However, training multiple TA networks in the above techniques can be prohibitive in the real world, especially when dealing with PLMs. Jafari et al. (2021) came up with Annealing KD, which does not need any extra TA for its training: the teacher gradually generates annealed soft targets at different temperatures while the student follows the annealed output of the teacher. Asadian and Salehi-Abari (2021) proposed another TA-free method, Distillation via Intermediate Heads (DIH), to reduce the computational cost. They attach auxiliary classifier heads to some selected teacher layers, and the outputs of these classifiers are used to guide the final student classifier. However, in contrast to our model, the selection of the layers to which classifiers are added is arbitrary, each classifier contributes equally to the distillation, and they only explored computer vision tasks.

Intermediate Layer Cross-Architecture Distillation
The increase in computational resources required by Transformer-based NLP models (Vaswani et al., 2017; Devlin et al., 2019) has motivated researchers to consider more efficient student architectures. This type of knowledge transfer is referred to as cross-architecture KD. While a few solutions for cross-architecture distillation exist in the literature (Tang et al., 2019; Kaliamoorthi et al., 2021), intermediate layer cross-architecture distillation has not yet been explored much due to the interpretability problem in ILD. We could only find Mukherjee and Awadallah (2020) on multilingual Named Entity Recognition KD, which transfers knowledge from a BERT-base (Devlin et al., 2019) teacher to an LSTM student through ILD. However, their ILD is not performed in the output space (and is thus not interpretable), does not consider the search and skip problems, and does not show any improvement from using ILD. Therefore, their work might not be directly comparable to ours.

Methodology
The methodology of our work concerns introducing our Universal-KD solution and showing how this technique addresses the main problems discussed above: the interpretability of ILD, the capacity gap issue, and intermediate layer cross-architecture distillation.

Universal-KD
In regular ILD (Sun et al., 2019b), layer matching is performed in the representation space (e.g. the [CLS] output representation of each layer for BERT-based models, or the average representation of the entire sequence). This representation matching is usually done using the MSE loss over the hidden vectors:

Sim(h^t_i, h^s_j) = MSE(h^t_i, h^s_j)

where Sim(·, ·) refers to the similarity function between two input vectors, and h^t_i and h^s_j indicate the hidden representation vectors of the i-th layer of the teacher and the j-th layer of the student, respectively. However, the physical meaning of these hidden-space representations of different layers (i.e. which meaningful features of the input data are encoded by the hidden representation of each layer) might not be very clear for either network, and consequently finding a proper layer distillation projection (one that distills similar intermediate features) between two networks is quite challenging.
To address this problem, we propose Universal-KD, which performs layer matching in the output space rather than in the hidden space. In other words, we map the intermediate hidden layer representations to the output space by applying pseudo classifiers to the corresponding intermediate layers before performing distillation. After applying the pseudo classifiers, we obtain output distributions over classes and measure the similarity of the hidden-space representations in the output space using the KL divergence:

Sim_Univ(h^t_i, h^s_j) = KL(P^t_i(h^t_i), P^s_j(h^s_j)),  with  P^t_i(h^t_i) = σ(W^t_i h^t_i)  and  P^s_j(h^s_j) = σ(W^s_j h^s_j)

where Sim_Univ(·, ·) is the universal similarity measure, and P^t_i(·) and P^s_j(·) are the pseudo classifiers applied to the i-th layer of the teacher and the j-th layer of the student, respectively. Moreover, σ(·) refers to the softmax function, and W^t_i and W^s_j are the corresponding weight matrices of the pseudo classifiers applied to the i-th layer of the teacher and the j-th layer of the student, respectively. Bear in mind that W^t_i can be pre-trained in a warm-up phase with the ground-truth training data while the weights of the pre-trained teacher model are kept frozen. For the student, the weights of the pseudo classifiers are trained during the Universal-KD training.
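A minimal sketch of a pseudo classifier and the output-space similarity is given below; it assumes a single linear head per layer as described above, and the class and function names, hidden size, and label count are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PseudoClassifier(nn.Module):
    """Maps the hidden vector of one layer (e.g. its [CLS] representation)
    to logits over the task labels; the linear weight plays the role of W."""
    def __init__(self, hidden_size=768, num_labels=2):
        super().__init__()
        self.proj = nn.Linear(hidden_size, num_labels)

    def forward(self, h):          # h: (batch, hidden_size)
        return self.proj(h)        # logits; softmax is applied where needed

def sim_univ(teacher_logits, student_logits):
    """KL divergence between the two pseudo-classifier distributions."""
    return F.kl_div(F.log_softmax(student_logits, dim=-1),
                    F.softmax(teacher_logits, dim=-1),
                    reduction="batchmean")

# Toy usage: one teacher layer and one student layer, batch of 4.
h_t, h_s = torch.randn(4, 768), torch.randn(4, 768)
print(sim_univ(PseudoClassifier()(h_t), PseudoClassifier()(h_s)))
```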
Inspired by ALP-KD, Universal-KD applies an attention-based layer projection on top of the pseudo classifier outputs. For the intermediate-layer KD setup, we apply pseudo classifiers on top of all layers of the student and teacher networks. Then, for each pseudo classifier of the student, we compute a weighted prediction of the teacher's pseudo classifiers, F^t(j), to be used for distillation to the j-th pseudo classifier of the student. The weights of this summation are attention weights calculated from the similarity between the output of each pseudo classifier of the student and the outputs of all pseudo classifiers of the teacher:

λ_ij = exp(P^s_j(h^s_j) · P^t_i(h^t_i)) / Σ_{k=1..N} exp(P^s_j(h^s_j) · P^t_k(h^t_k))

where λ_ij describes the attention weight of the j-th student layer to the i-th teacher layer. Then, the aggregated prediction of the teacher's pseudo classifiers, F^t(j), to be used for distillation to the j-th layer of the student is calculated as:

F^t(j) = Σ_{i=1..N} λ_ij P^t_i(h^t_i)

where N is the total number of layers of the teacher. The Universal-KD loss for each layer of the student is then:

L^j_Univ = KL(F^t(j), P^s_j(h^s_j)),   L^IL_Univ = Σ_{j=1..M} L^j_Univ

where M is the number of student layers. In the following, we explain how our Universal-KD loss can be set up for ILD, the capacity gap problem, and cross-architecture distillation.
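The aggregation above can be sketched in a few lines of PyTorch. The dot-product attention score over pseudo-classifier outputs mirrors the equations above, and the tensor shapes, function name, and toy inputs are illustrative rather than the exact implementation.

```python
import torch
import torch.nn.functional as F

def universal_il_loss(teacher_probs, student_probs):
    """teacher_probs: (N, batch, classes) pseudo-classifier outputs of all N teacher layers.
    student_probs: (M, batch, classes) pseudo-classifier outputs of all M student layers."""
    loss = 0.0
    for j in range(student_probs.shape[0]):
        # Attention of student layer j over all N teacher layers (dot-product scores).
        scores = torch.einsum("bc,nbc->bn", student_probs[j], teacher_probs)
        lam = F.softmax(scores, dim=-1)                       # lambda_{ij}, one row per example
        f_t = torch.einsum("bn,nbc->bc", lam, teacher_probs)  # aggregated teacher prediction F^t(j)
        loss = loss + F.kl_div(student_probs[j].log(), f_t, reduction="batchmean")
    return loss

# Toy usage: 12 teacher layers, 4 student layers, batch of 2, 3 classes.
t = F.softmax(torch.randn(12, 2, 3), dim=-1)
s = F.softmax(torch.randn(4, 2, 3), dim=-1)
print(universal_il_loss(t, s))
```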
Bear in mind that our work is distinguished from MHKD (Wang et al., 2020) in several respects: first, in contrast to MHKD, which applies pseudo classifiers only to some arbitrarily selected layers of the teacher and skips the others, Universal-KD applies pseudo classifiers to all teacher and student layers to avoid loss of information and the burden of selecting the best layer mapping strategy; second, we use an attention mechanism in the distillation of the intermediate pseudo classifiers; third, unlike MHKD, which applies a CE loss to the training of the student's pseudo classifiers, our Universal-KD avoids the CE loss when training the student pseudo classifiers to prevent them from over-fitting; fourth, MHKD only investigates the ILD setting on computer vision tasks, whereas our Universal-KD deploys output-grounded ILD to solve a more comprehensive set of problems in KD, such as the capacity gap problem and intermediate layer cross-architecture distillation in NLP.

Capacity Gap Problem
We can apply Universal-KD to address the capacity gap problem. In this regard, instead of training extra TA networks, which can be prohibitive for large PLMs, we create pseudo TAs by applying pseudo classifier heads to the intermediate layers of the teacher. Each pseudo classifier of the teacher plays the role of a TA (which we refer to as a pseudo TA) to fill the capacity gap. Then, we distill the aggregated pseudo classifiers of the teacher into the last layer of the student:

L^CG_Univ = KL(F^t, S(x; φ))

where F^t is the attention-weighted aggregation of the teacher's pseudo-classifier outputs with respect to the student's final output S(x; φ). This setting is equivalent to distilling from multiple TAs at the same time (see Fig. 2). However, unlike DIH (Asadian and Salehi-Abari, 2021), we do not distill uniformly from the pseudo TAs; instead, the output layer of the student determines how much to attend to each pseudo TA.
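The pseudo-TA variant can be sketched by reusing the same aggregation, driven this time by the student's final output; the dot-product score and all names and shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def universal_cg_loss(teacher_probs, student_out_probs):
    """teacher_probs: (N, batch, classes) outputs of the N pseudo TAs.
    student_out_probs: (batch, classes) output of the student's final classifier."""
    scores = torch.einsum("bc,nbc->bn", student_out_probs, teacher_probs)
    lam = F.softmax(scores, dim=-1)                        # how much to attend to each pseudo TA
    f_t = torch.einsum("bn,nbc->bc", lam, teacher_probs)   # aggregated pseudo-TA prediction F^t
    return F.kl_div(student_out_probs.log(), f_t, reduction="batchmean")

# Toy usage: 24 pseudo TAs (one per teacher layer), batch of 2, 3 classes.
t = F.softmax(torch.randn(24, 2, 3), dim=-1)
s_out = F.softmax(torch.randn(2, 3), dim=-1)
print(universal_cg_loss(t, s_out))
```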
Cross-Architecture Distillation In the cross-architecture setting, the building blocks of the teacher and student networks are different. In this case, since the outputs of the pseudo classifiers do not depend on the architecture, the Universal-KD loss is the same as in the ILD scenario:

L^CA_Univ = Σ_{j=1..M} KL(F^t(j), P^s_j(h^s_j))

We remark that, without the pseudo classifiers, the intermediate hidden representations of two architecturally different networks are not grounded, and hence matching them would not be meaningful.
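This architecture-agnostic property is easy to see in code: in the sketch below the teacher and student heads sit on hidden states of different sizes (sizes chosen only for illustration), yet both land in the same label space, so the same KL matching applies.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

num_labels = 3
teacher_head = nn.Linear(768, num_labels)   # e.g. on a BERT layer's [CLS] vector
student_head = nn.Linear(300, num_labels)   # e.g. on an LSTM layer's pooled state (illustrative size)

h_teacher = torch.randn(2, 768)             # hidden spaces of different dimensionality...
h_student = torch.randn(2, 300)
p_t = F.softmax(teacher_head(h_teacher), dim=-1)
p_s = F.softmax(student_head(h_student), dim=-1)

# ...but the pseudo-classifier outputs are directly comparable.
print(F.kl_div(p_s.log(), p_t, reduction="batchmean"))
```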

Training Process
By combining the Universal-KD loss with the KD loss and CE loss described in Eq. 1, we obtain the final objective function for distilling the teacher into the student model:

L = α L_CE + β L_KD + γ L_Univ

where α, β and γ are hyper-parameters and L_Univ is the Universal-KD loss of the corresponding setting (IL, CG or CA). It is worth mentioning that we apply a two-stage training to avoid extensive hyper-parameter tuning. In the first stage, the model is guided only by the Universal-KD loss and the KD loss: α is set to 0, β is selected from {0.2, 0.5, 0.7}, and γ is set to 1−β. In the second stage, we train the model only on the CE loss, so α is set to 1 and the other two hyper-parameters are set to 0.
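A small sketch of this two-stage schedule under the settings stated above; the function and argument names are ours.

```python
import torch

def combined_loss(ce, kd, univ, stage, beta=0.5):
    """Stage 1: alpha = 0, gamma = 1 - beta (Universal-KD loss + KD loss only).
    Stage 2: alpha = 1, beta = gamma = 0 (CE loss only)."""
    if stage == 1:
        alpha, gamma = 0.0, 1.0 - beta
    else:
        alpha, beta, gamma = 1.0, 0.0, 0.0
    return alpha * ce + beta * kd + gamma * univ

# Toy usage with scalar loss values.
print(combined_loss(torch.tensor(0.7), torch.tensor(0.4), torch.tensor(0.9), stage=1, beta=0.5))
print(combined_loss(torch.tensor(0.7), torch.tensor(0.4), torch.tensor(0.9), stage=2))
```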

Experimental Setup
We first test our method Universal-KD (IL) for intermediate layer knowledge distillation. Our teacher is a standard 12-layer BERT-base model (Devlin et al., 2019) with 12 heads, a hidden dimension of 768 and a feed-forward dimension of 3072. Our students are also BERT models, with 4 layers and all other configurations the same as BERT-base. To make fair comparisons with PKD and ALP, we initialize the students with the first 4 layers of the pre-trained BERT-base teacher. During training, we run a grid search over the batch size ({8, 16, 32}), the learning rate ({2e-5, 5e-5}) and β ({0.2, 0.5, 0.7}) for each task to find the best hyper-parameters. In the capacity gap experiment, we verify whether our proposed solution Universal-KD (CG) is effective, especially when the sizes of the teacher and the student differ substantially. We therefore choose RoBERTa-large as our teacher and DistilRoBERTa as our student model. Our teacher has 24 layers with a hidden dimension of 1024, 16 attention heads and 355M parameters in total. Our student is a 6-layer model with a hidden dimension of 768, 8 attention heads and 82M parameters in total. The hyper-parameter choices and the search strategy are exactly the same as in the first experiment.
For all experiments discussed in Section 5.2, after the teacher is trained on the task, we train all the pseudo classifiers together with the cross-entropy loss while the teacher model is frozen; this is a common technique for training such auxiliary classifiers (Asadian and Salehi-Abari, 2021; Xin et al., 2020). The student is first trained for 20 epochs in stage 1 for all datasets except CoLA, which needs 50 epochs to obtain high-quality results. The model is then further trained for 10 epochs in stage 2.
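The warm-up of the teacher's pseudo classifiers can be sketched as below; the interface of `teacher` (returning one pooled hidden vector per layer) and all names are assumptions made for illustration.

```python
import torch
import torch.nn as nn

def train_pseudo_classifiers(teacher, heads, loader, epochs=1, lr=2e-5):
    """Freeze the fine-tuned teacher and train only the per-layer heads with cross entropy."""
    for p in teacher.parameters():
        p.requires_grad_(False)
    opt = torch.optim.Adam([p for head in heads for p in head.parameters()], lr=lr)
    ce = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            with torch.no_grad():
                layer_states = teacher(x)   # assumed: list of (batch, hidden) vectors, one per layer
            loss = sum(ce(head(h), y) for head, h in zip(heads, layer_states))
            opt.zero_grad()
            loss.backward()
            opt.step()

# Toy usage with a stand-in teacher exposing two "layers".
class ToyTeacher(nn.Module):
    def forward(self, x):
        return [x, 0.5 * x]

heads = [nn.Linear(8, 2) for _ in range(2)]
loader = [(torch.randn(4, 8), torch.randint(0, 2, (4,))) for _ in range(3)]
train_pseudo_classifiers(ToyTeacher(), heads, loader)
```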
We also conduct two more experiments under the cross-architecture setting. Our teacher is a strong BERT-base model (Devlin et al., 2019), while the students are a Bi-LSTM and a Gated CNN, respectively. In these experiments, we want to verify whether our method is still helpful when the student and the teacher are of different types. We use training without KD and vanilla KD as baselines for all experiments. For all dev set results, we report Matthews correlation for CoLA, Pearson correlation for STS-B and accuracy for all other datasets.

Results of Intermediate Layers KD
We compare our Universal-KD (IL) with the three intermediate layer KD methods described in Section 3.1: PKD, MHKD and ALP. All results for the 4-layer BERT-base students on the GLUE dev and test sets are summarized in Tables 2 and 3. We observe that Universal-KD (IL) outperforms all other ILD methods on both the dev and test sets, which shows the benefit of incorporating the attention mechanism with the output-space distillation. Moreover, compared with MHKD and ALP, the attention mechanism and the output-space matching in Universal-KD (IL) give average improvements of 0.7% and 0.5%, respectively.

Capacity Gap
Here, we evaluate Universal-KD (CG) against the three capacity-gap baselines: TAKD, DIH and Annealing KD. Considering the computational cost, the TA used in TAKD is a single 12-layer RoBERTa-base model. We present the DistilRoBERTa student results on the GLUE dev set in Table 4. Surprisingly, we find that TAKD performs similarly to vanilla KD, which means that without multiple well-designed TA networks, TAKD can hardly fill the gap between the teacher and the student. DIH performs better than TAKD by a reasonable margin, which is expected since DIH distills from multiple intermediate classifiers at the same time. However, DIH employs a uniform distillation without considering the contribution of each teacher layer to the final distillation, which explains why DIH has worse results than Universal-KD (CG). Annealing KD is the current state-of-the-art method for solving the capacity gap problem, and it works better than the two previous baselines. It is worth noting that our Universal-KD outperforms Annealing KD by 0.2% without any temperature adjustment.

Table 5: Performances of Bi-LSTM and Gated-CNN students on GLUE dev sets when BERT-base is used as teacher. For each student architecture, we report the scores of models without KD, with vanilla KD, and with Universal-KD.

Cross-Architecture KD
In this section, we test the validity of our approach in the intermediate layer cross-architecture KD setting (Mukherjee and Awadallah, 2020; Kaliamoorthi et al., 2021). We experiment with Bi-LSTM (Hochreiter and Schmidhuber, 1997) and Gated CNN (Dauphin et al., 2017; Ghaddar and Langlais, 2019) students, while the teacher is a BERT-base model. The encoders of the LSTM and CNN models have 3 and 4 layers, respectively, with a hidden size of 768, and both models have a similar capacity of around 25M trainable parameters. At the output layer, an attention-sum mechanism (Bahdanau et al., 2015) is used to reduce the encoded sequence into a single vector, which in turn is projected to the output space (a sketch is given below). Hyper-parameter details are listed in Appendix A. Table 5 reports the performance on the GLUE dev sets for Bi-LSTM and Gated-CNN student models without KD, with vanilla KD, and with Universal-KD (CA). First, we notice that students without KD perform roughly 20% worse than their counterparts in Table 2. This is primarily due to the superiority of the pre-training and fine-tuning approach over the feature-based one (Devlin et al., 2019; Sun et al., 2019a; Ghaddar et al., 2021b,a), and partially due to the use of a single encoder for sentence-pair tasks like STS-B, QNLI and MNLI (for more details, please refer to the Appendix). Second, we observe that the LSTM students outperform the CNN ones by roughly 1.6% on average. It is worth mentioning that in our experiments the CNN models were 3 times faster to train due to their parallelism (Strubell et al., 2017). Expectedly, vanilla KD improves the performance of both the Bi-LSTM and Gated-CNN models, by an average of 0.5% and 0.1%, respectively. Our Universal-KD significantly improves over vanilla KD, by 0.8% and 1.4% for the Bi-LSTM and Gated-CNN, respectively.
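For reference, the attention-sum output head mentioned above can be sketched as follows; the layer sizes and names are illustrative and not the exact configuration used in our experiments.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionSumHead(nn.Module):
    """Reduces a sequence of encoder states to one vector with learned additive
    attention weights, then projects the pooled vector to the label space."""
    def __init__(self, hidden_size=768, num_labels=2):
        super().__init__()
        self.score = nn.Linear(hidden_size, 1)
        self.out = nn.Linear(hidden_size, num_labels)

    def forward(self, states):                                       # (batch, seq_len, hidden)
        weights = F.softmax(self.score(states).squeeze(-1), dim=-1)  # (batch, seq_len)
        pooled = torch.einsum("bs,bsh->bh", weights, states)         # attention-weighted sum
        return self.out(pooled)                                      # logits in the output space

# Toy usage on top of a one-layer Bi-LSTM encoder (2 x 384 = 768-dim states).
encoder = nn.LSTM(input_size=128, hidden_size=384, bidirectional=True, batch_first=True)
states, _ = encoder(torch.randn(2, 10, 128))
print(AttentionSumHead()(states).shape)   # torch.Size([2, 2])
```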

Analysis
Attention on logits vs. attention on losses. Attention weights could either be used to aggregate the teacher classifier outputs (our Universal-KD (CG) setting) or to weigh the loss terms. For the latter, the loss can be expressed as

L^2_Univ = Σ_{i=1..N} λ_i KL(P^t_i(h^t_i), S(x; φ))

i.e. one KL term per teacher layer, scaled by its attention weight. We conduct one experiment on two small datasets (MRPC, RTE), one medium dataset (SST-2) and one large dataset (QQP) to verify which way of utilizing the attention weights is more beneficial. All settings are the same as for Universal-KD (CG) except that we use L^2_Univ rather than L^CG_Univ. Table 6 shows the dev results. To isolate the effect of the attention weights, we also include the results of DIH, which does not weigh its losses. First, we observe that the L^2_Univ variant has better results than DIH, which indicates that different teacher layers contribute differently and that the attention score reflects the importance of each teacher layer. Second, on all datasets, our original Universal-KD (CG) outperforms the loss-weighted variant.

Output space vs. hidden space To verify whether the attention layer projection works better in the output space or the hidden space, we visualize the attention weights of ALP and Universal-KD (IL). Since the concrete importance of each teacher layer is hard to measure, we treat the Euclidean distance between the true labels and the classifier outputs as a reference measure for our student models. We randomly select 10 samples from the RTE dataset and show the importance level of each layer of the BERT-base teacher in the left panel of Fig. 3, where the x and y axes represent the 12 teacher layers and the 10 samples, respectively. It is clearly shown that the last two layers generate predictions closest to the true label, so later layers of the student are expected to focus more on these teacher layers. In the middle and right panels of Fig. 3, we show the attention weights between the third (penultimate) layer of the 4-layer BERT-base student (trained by ALP and Universal-KD (IL), respectively) and all 12 teacher layers. From Fig. 3, we observe that the penultimate student layer in Universal-KD (IL) attends to the more important teacher layers, whereas the same student layer in ALP mostly focuses on the first teacher layers.

Figure 3: Visualizing (left) the Euclidean distance between real labels and classifier outputs of all 12 teacher layers, (middle) attention weights between the third layer of the 4-layer student trained by Universal-KD (IL) and all 12 teacher layers, (right) attention weights between the third layer of the 4-layer student trained by ALP and all 12 teacher layers, for 10 samples from RTE.
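The two uses of the attention weights compared in this analysis can be written side by side; shapes, names and the shared weights in the toy example are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def attend_on_logits(lam, teacher_probs, student_probs):
    """Universal-KD (CG): aggregate the teacher pseudo-classifier outputs first,
    then take a single KL divergence to the student output."""
    f_t = torch.einsum("bn,nbc->bc", lam, teacher_probs)
    return F.kl_div(student_probs.log(), f_t, reduction="batchmean")

def attend_on_losses(lam, teacher_probs, student_probs):
    """Loss-weighted variant: one KL per teacher layer, weighted by the attention."""
    per_layer = torch.stack(
        [F.kl_div(student_probs.log(), t, reduction="none").sum(-1) for t in teacher_probs],
        dim=-1)                                  # (batch, N)
    return (lam * per_layer).sum(-1).mean()

# Toy comparison: 12 teacher layers, batch of 2, 3 classes, shared attention weights.
t = F.softmax(torch.randn(12, 2, 3), dim=-1)
s = F.softmax(torch.randn(2, 3), dim=-1)
lam = F.softmax(torch.randn(2, 12), dim=-1)
print(attend_on_logits(lam, t, s), attend_on_losses(lam, t, s))
```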

Conclusion
In this paper, we introduced the Universal-KD approach, which employs attention-based layer projection to match the intermediate layers of the teacher and the student in the output space via pseudo classifiers. Universal-KD can be flexibly combined with existing ILD methods, addresses the capacity gap problem without extra TA networks, and enables intermediate layer cross-architecture distillation; our experiments on the GLUE tasks showed that it outperforms other KD techniques in all three settings.

A.1 Implementation Details of Universal-KD (IL) and Universal-KD (CG)

In this section, we include more implementation details of Universal-KD (IL) and Universal-KD (CG). For Universal-KD (IL), our teacher is a 12-layer BERT-base model while the student is a 4-layer BERT-base model. The student model is initialized with the first 4 layers of the pre-trained BERT-base teacher. For the Universal-KD (CG) experiments, our teacher is a 24-layer RoBERTa-large model while the student is a 6-layer DistilRoBERTa model. We apply the same training configuration for both the BERT-4 and DistilRoBERTa students. During training, a grid search is performed over the batch size ({8, 16, 32}), the learning rate ({2e-5, 5e-5}) and β ({0.2, 0.5, 0.7}). We set T to 1 for all experiments. The student is first trained for 20 epochs in stage 1 for all datasets except CoLA, which needs 50 epochs to obtain high-quality results. The model is then further trained for 10 epochs in stage 2. The best hyper-parameter values used for the BERT-4 and DistilRoBERTa students are summarized in Table 7 and Table 8, respectively.

A.2 Hyper-parameter Tuning on Bi-LSTM and CNN students
We adopt the Adam (Kingma and Ba, 2014) optimization algorithm and vary the learning rate between 1e-4 and 5e-5, the batch size ({8, 16, 32, 64}), the dropout ([0.1, 0.8]), and β ({0.2, 0.5, 0.7}). We select the best-performing hyper-parameters independently for each model on each task. Table 9 shows the hyper-parameter values used for the Bi-LSTM and Gated-CNN experiments. It is worth mentioning that we do not adopt a siamese architecture (Bromley et al., 1993) for sentence-pair classification tasks, although it performs much better on these tasks (Grégoire and Langlais, 2018; Tang et al., 2019). We prefer to keep our models simple by using a single sequence encoder in all experiments. We do so to avoid a mismatch between the teacher's and student's inputs during distillation, which is an interesting problem we want to explore in future work.