A Comparative Analysis of Task-Agnostic Distillation Methods for Compressing Transformer Language Models

Large language models have become a vital component in modern NLP, achieving state-of-the-art performance on a variety of tasks. However, they are often inefficient for real-world deployment due to their expensive inference costs. Knowledge distillation is a promising technique to improve their efficiency while retaining most of their effectiveness. In this paper, we reproduce, compare, and analyze several representative methods for task-agnostic (general-purpose) distillation of Transformer language models. Our study covers Output Distribution (OD) transfer, Hidden State (HS) transfer with various layer mapping strategies, and Multi-Head Attention (MHA) transfer based on MiniLMv2. Through extensive experiments, we study the effectiveness of each method for various student architectures in both monolingual (English) and multilingual settings. Overall, we show that MHA transfer based on MiniLMv2 is generally the best option for distillation and explain the potential reasons behind its success. Moreover, we show that HS transfer remains a competitive baseline, especially under a sophisticated layer mapping strategy, while OD transfer consistently lags behind the other approaches. Findings from this study helped us deploy efficient yet effective student models for latency-critical applications.


Introduction
Large language models have become a crucial component in modern NLP. They have achieved exceptional performance on various downstream tasks (Devlin et al., 2019; Liu et al., 2019; Lewis et al., 2020), and their capability shows consistent improvement with more compute, data, and model parameters (Kaplan et al., 2020; Brown et al., 2020; Touvron et al., 2023). On the downside, it is becoming increasingly difficult to deploy such models in real-world environments due to their inefficiency, i.e., high computation, memory, latency, and storage costs (Xu and McAuley, 2023).
Knowledge distillation (Hinton et al., 2015) is a promising technique to overcome this challenge by transferring the knowledge of the original model (teacher) to a smaller, more efficient model (student). This can be conducted in either a task-specific (Turc et al., 2019; Jiao et al., 2020) or a task-agnostic manner (Sanh et al., 2019; Wang et al., 2020). The latter only requires distilling a single general-purpose student which can be directly finetuned on any downstream task. Due to its high convenience, we focus on this latter approach in this study.
In recent years, various methods have been proposed for task-agnostic distillation of Transformer language models. The aim of this paper is to reproduce, compare, and analyze the most representative methods in this area. We generally focus on architecture-agnostic distillation, which imposes no or minimal restriction on the student architecture: the representative methods include Output Distribution (OD) transfer (Hinton et al., 2015), Hidden State (HS) transfer based on linear mapping (Jiao et al., 2020; Mukherjee et al., 2021), and Multi-Head Attention (MHA) transfer based on MiniLMv2 (Wang et al., 2021).
For HS transfer, the layer mapping strategy between teacher and student layers plays a significant role in overall performance; however, the optimal strategy remains unknown or controversial (Sun et al., 2019; Wu et al., 2020; Ko et al., 2023). Therefore, we explore a diverse range of strategies to empirically evaluate each technique.
For MHA transfer, the MiniLMv2 approach has been shown to achieve state-of-the-art performance; however, there is relatively little understanding behind its success. Therefore, we develop a novel variant named DirectMiniLM which is useful for understanding the effectiveness behind MiniLMv2 both theoretically and empirically.
In contrast to most previous studies, all methods are reproduced on a single unified codebase for fair and consistent comparison. We also conduct distillation on 4 different student architectures, reducing the model size in various dimensions to fit various parameter and latency budgets. Finally, all experiments are conducted in both monolingual and multilingual settings, distilled from open-source BERT (Devlin et al., 2019) and in-house XLM-RoBERTa (Conneau et al., 2020), respectively.
Through our extensive experiments, we critically analyze the effectiveness of each distillation method and provide practical advice for both researchers and practitioners working in this area. In summary, our key findings are:
• MHA transfer is generally the best option for various student architectures and language settings. By comparison with DirectMiniLM, we provide novel insights underlying its success.
• While the effectiveness of HS transfer depends on the layer mapping strategy, it remains a competitive baseline. More sophisticated layer mapping strategies can provide a boost in performance, especially in the multilingual setting.
• Methods relying on OD transfer consistently lag behind other methods. This shows that classical OD distillation can be less effective when distilling complex language models on a general-purpose objective.

Transformer Language Models
First, we briefly review the standard architecture of Transformer language models (Vaswani et al., 2017; Devlin et al., 2019). A Transformer consists of a stack of L Transformer layers, where each layer comprises two sub-layers: a Multi-Head Attention (MHA) layer followed by a fully connected Feed-Forward (FF) layer (Figure 1, (a)). Formally, let x denote the input sequence, d_h the hidden state size, and H_i ∈ R^{|x|×d_h} the hidden state of the i-th Transformer layer (H_0 denotes the input sequence embeddings). Given H_i, the MHA layer first computes the query, key, and value mappings Q_{i,a}, K_{i,a}, V_{i,a} for each attention head a ∈ [1, A_h], which are combined to obtain the attention head output O_{i,a}:

Q_{i,a} = H_i W_{Q,i,a},  K_{i,a} = H_i W_{K,i,a},  V_{i,a} = H_i W_{V,i,a}

O_{i,a} = softmax(Q_{i,a} K_{i,a}^T / √d_k) V_{i,a}

Here, d_k denotes the attention head size (typically set to d_h / A_h) and W_{Q,i,a}, W_{K,i,a}, W_{V,i,a} ∈ R^{d_h×d_k} are the learnt weight matrices. The output of the MHA layer is the concatenation of the head outputs, namely ⊕_{a=1}^{A_h} O_{i,a}. Next, the MHA layer output is followed by a position-wise FF layer with an intermediate size of d_f and a non-linear activation (we use GELU (Hendrycks and Gimpel, 2016) in all models). The hidden state of the next Transformer layer is computed as H_{i+1} = FF(MHA(H_i)), with residual connections and layer normalization applied around each sub-layer. Finally, to predict the output distribution over the entire vocabulary V, a linear layer is applied on top of the last hidden state to compute the logits z = H_L W_O ∈ R^{|x|×|V|}. The output distribution can be obtained by applying the softmax function over z, denoted as softmax(z).
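As a concrete reference, the computation above can be sketched as a single simplified Transformer layer in NumPy. Residual connections and layer normalization are omitted for brevity, and all function and variable names are illustrative rather than taken from any implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gelu(x):
    # tanh approximation of GELU (Hendrycks and Gimpel, 2016)
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def transformer_layer(H, Wq, Wk, Wv, W1, W2):
    """One simplified Transformer layer: per-head attention, concat, then FF.

    H:  (|x|, d_h) hidden states; Wq/Wk/Wv: lists of (d_h, d_k), one per head;
    W1: (d_h, d_f), W2: (d_f, d_h). Residuals/LayerNorm omitted for brevity.
    """
    d_k = Wq[0].shape[1]
    heads = []
    for q_w, k_w, v_w in zip(Wq, Wk, Wv):
        Q, K, V = H @ q_w, H @ k_w, H @ v_w
        O = softmax(Q @ K.T / np.sqrt(d_k)) @ V   # (|x|, d_k) head output
        heads.append(O)
    mha_out = np.concatenate(heads, axis=-1)      # (|x|, A_h * d_k) = (|x|, d_h)
    return gelu(mha_out @ W1) @ W2                # position-wise FF
```

With A_h heads of size d_k = d_h / A_h, the concatenated MHA output recovers the hidden size d_h, so layers can be stacked.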
Throughout this paper, we assume that both the student and teacher are Transformer language models with L S and L T layers, respectively.

Distillation Methods
Next, we introduce the representative task-agnostic distillation methods illustrated in Figure 1, (b-d). For Multi-Head Attention (MHA) transfer, we consider two approaches: MiniLMv2 and its novel variant DirectMiniLM. For a survey of advanced methods and topics we could not cover in this study, please refer to Appendix A.
Output Distribution (OD) Transfer The output distribution of the teacher contains useful information on the relative probabilities of plausible (even if incorrect) predictions (Hinton et al., 2015). In OD transfer, the student is trained to replicate the teacher's output distribution. This is achieved by optimizing the following loss function, where z_S, z_T denote the student/teacher logits, CE(·) the cross entropy loss, and T the output temperature:

L_OD = CE(softmax(z_T / T), softmax(z_S / T))

Hidden State (HS) Transfer Transformer language models progressively learn useful and generalizable features layer by layer. In HS transfer, the student is trained to predict such useful features represented in the teacher's hidden states. Formally, each student layer is mapped to a set of teacher layers to be predicted. Let φ(i) denote the set mapped from the i-th student layer, where the student's hidden state H_i^S ∈ R^{|x|×d_h^S} is linearly transformed to predict the hidden state of the j-th teacher layer H_j^T ∈ R^{|x|×d_h^T} for each j ∈ φ(i). This is represented by the following loss function, where W_h denotes the linear transformation weight and MSE(·) the mean squared error loss:

L_HS = Σ_i Σ_{j∈φ(i)} MSE(H_i^S W_h, H_j^T)    (6)

One open problem in this approach is the choice of layer mapping strategy φ. We conduct extensive experiments to compare a diverse range of strategies, which will be discussed in §4.
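Both objectives can be sketched in NumPy. This is a minimal illustration, not the paper's implementation: the helper names are ours, the T² gradient-scaling factor sometimes applied in OD distillation is omitted, and a single shared projection W plays the role of the linear transformation W_h:

```python
import numpy as np

def log_softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def od_loss(z_s, z_t, T=1.0):
    """Soft cross entropy between temperature-scaled teacher/student logits."""
    p_t = np.exp(log_softmax(z_t / T))
    return -(p_t * log_softmax(z_s / T)).sum(-1).mean()

def hs_loss(student_states, teacher_states, phi, W):
    """MSE between linearly projected student states and mapped teacher layers.

    student_states: {i: (|x|, d_h_S)}; teacher_states: {j: (|x|, d_h_T)};
    phi: {student layer i: list of teacher layers j}; W: (d_h_S, d_h_T).
    """
    losses = [((student_states[i] @ W - teacher_states[j]) ** 2).mean()
              for i, js in phi.items() for j in js]
    return sum(losses) / len(losses)
```

Note that even a perfectly matched student has non-zero OD loss (equal to the teacher's output entropy), whereas the HS loss reaches zero when the projected states match exactly.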

MiniLMv2
The MHA layer is a key component in Transformer language models which controls the long-range dependencies and interactions within input texts. MiniLMv2 (Wang et al., 2021) is an effective method to deeply transfer this module while allowing different numbers of attention heads A_h^S and A_h^T for the student and teacher. Its main idea is to distil the attention relation matrices (Q-Q, K-K and V-V) obtained by first concatenating the query (Q), key (K), and value (V) mappings from all attention heads and re-splitting them into the same number of attention relation heads A_r.
Formally, let A_{α,i,a} (α ∈ {Q, K, V}) denote the concatenated and re-split queries, keys, and values for the i-th student layer, where a ∈ [1, A_r]: e.g., the original queries from the A_h^S attention heads are simply concatenated and then re-split into A_r matrices. We use the same notation A_{α,j,a} for the j-th teacher layer. Then, the loss function of MiniLMv2 can be defined as follows:

L_MHA = Σ_{α∈{Q,K,V}} Σ_{a=1}^{A_r} CE(R_{α,j,a}^T, R_{α,i,a}^S)    (7)

Here, R_{α,j,a}^T, R_{α,i,a}^S ∈ R^{|x|×|x|} denote the attention relation matrices, which are computed based on the (scaled, softmax-normalized) matrix products of A_{α,j,a}^T and A_{α,i,a}^S, respectively. Intuitively, this aims to transfer the teacher's queries (Q), keys (K), and values (V) in a somewhat indirect way through their matrix products (Q-Q, K-K and V-V).
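A minimal NumPy sketch of the concatenate-and-re-split step and the relation loss for a single α ∈ {Q, K, V}; the function names and the 1/√d_r scaling are illustrative assumptions consistent with the description above:

```python
import numpy as np

def log_softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def resplit(per_head_mats, A_r):
    """Concatenate per-head Q/K/V matrices along features, re-split into A_r heads."""
    cat = np.concatenate(per_head_mats, axis=-1)   # (|x|, A_h * d_k)
    return np.split(cat, A_r, axis=-1)             # A_r matrices of (|x|, d_r)

def relation_loss(A_s, A_t):
    """CE between teacher and student attention relation matrices for one α.

    A_s, A_t: lists of A_r re-split matrices, shapes (|x|, d_r_S) / (|x|, d_r_T).
    The relation matrices are |x|×|x| regardless of d_r, so student and teacher
    can differ in head size and head count.
    """
    total = 0.0
    for S, T in zip(A_s, A_t):
        R_s = log_softmax(S @ S.T / np.sqrt(S.shape[-1]))            # student relations
        R_t = np.exp(log_softmax(T @ T.T / np.sqrt(T.shape[-1])))    # teacher relations
        total += -(R_t * R_s).sum(-1).mean()
    return total / len(A_s)
```

The key property is visible in the shapes: both relation matrices are |x|×|x|, which is what frees the method from matching hidden sizes or head counts between student and teacher.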
However, there is minimal justification for why this method works effectively. It is also difficult to directly compare the method against HS transfer, since the losses are computed differently. To better understand MiniLMv2, we propose its novel variant named DirectMiniLM for our analysis.
DirectMiniLM In DirectMiniLM, we aim to transfer the teacher's Q/K/V mappings more directly through a linear transformation of the student's, just as in HS transfer. Specifically, we use the following loss function with the linear transformation W_{α,a} ∈ R^{d_r^S×d_r^T}:

L_Direct = Σ_{α∈{Q,K,V}} Σ_{a=1}^{A_r} MSE(A_{α,i,a}^S W_{α,a}, A_{α,j,a}^T)    (10)

DirectMiniLM is important in two aspects. First, this approach is directly comparable to HS transfer based on eq. (6), with the only difference being which information is transferred: the hidden states or the Q/K/V mappings. From this comparison, we can quantify the precise advantage of transferring each kind of knowledge in an apples-to-apples manner.
Second, DirectMiniLM is also closely related to MiniLMv2: if we constrain W_{α,a} to be orthogonal (i.e., W_{α,a} W_{α,a}^T = I) and take the matrix product for each term within the MSE loss in eq. (10), we obtain the following loss function:

L = Σ_{α∈{Q,K,V}} Σ_{a=1}^{A_r} MSE(A_{α,i,a}^S (A_{α,i,a}^S)^T, A_{α,j,a}^T (A_{α,j,a}^T)^T)    (11)

This loss closely resembles MiniLMv2 from eq. (7), with the minor difference of using the MSE loss instead of the CE loss with softmax. Therefore, DirectMiniLM with certain constraints naturally corresponds to MiniLMv2. The major difference is in whether A_{α,j,a}^T is transferred directly (with linear mappings) or indirectly (with relation matrices): by comparing these two approaches, we can precisely quantify the advantage of each optimization technique.
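For contrast with the relation-based loss, the DirectMiniLM objective in eq. (10) amounts to a per-head linear map followed by MSE. A minimal sketch with illustrative names:

```python
import numpy as np

def direct_minilm_loss(A_s, A_t, W):
    """MSE between linearly transformed student Q/K/V heads and the teacher's.

    A_s: list of A_r student matrices (|x|, d_r_S);
    A_t: list of A_r teacher matrices (|x|, d_r_T);
    W:   list of A_r learnt transformations (d_r_S, d_r_T), one per relation head.
    """
    losses = [((S @ Wa - T) ** 2).mean() for S, T, Wa in zip(A_s, A_t, W)]
    return sum(losses) / len(losses)
```

Unlike the relation loss, this compares position-wise representations directly, at the cost of introducing the extra parameters W.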

Experimental Setup
We explore the task-agnostic knowledge distillation methods under two settings:
1. Monolingual Distillation: We train English students using the open-source BERT (Devlin et al., 2019) as the teacher. These models are distilled on the same corpus used for pretraining BERT, i.e., English Wikipedia (Devlin et al., 2019) and BookCorpus (Zhu et al., 2015).
2. Multilingual Distillation: We train multilingual students using our in-house XLM-RoBERTa (Conneau et al., 2020) as the teacher, and distill on the CC100 dataset (Conneau et al., 2020), which consists of data in more than 100 languages. We only use a small subset of the corpus to conduct our experiments within a reasonable computation budget while maintaining the language-wise distribution.
In both settings, we use the Base (12 layer) architecture for the teacher, as shown in Table 1. For more details on each distillation setup (e.g., hyperparameters), please refer to Appendix B.
Student Models To conduct a strong comparison of the representative knowledge distillation methods, we train 4 students of varying architectures and latency/parameter budgets. A summary of the student architectures, with their parameters and inference latency, is shown in Table 1.
Our largest student is a 6 layer model that follows the same architecture as DistilBERT (Sanh et al., 2019). We also use the 6 layer model used in Mukherjee et al. (2021), which has a smaller hidden size than the teacher. Our smaller 4 and 3 layer students were obtained as recommendations from a Neural Architecture Search process (Trivedi et al., 2023) to find good student architectures for distillation from the XLM-RoBERTa teacher, conditioned to minimize the latency on CPU. Please refer to Appendix C for more details.

Layer Mapping Strategies
The layer mapping strategy φ is a parameter that needs to be considered for both HS and MHA transfer. For HS transfer, we explore the following three settings:
1. Single Mapping: We only distil the last (L_T-th) teacher layer into the last student layer, which has been shown to be a simple yet competitive baseline (Ko et al., 2023).
2. 1-to-1 Mapping: Prior work shows that mapping not only the last layer but also the intermediate layers improves distillation (Sun et al., 2019). In 1-to-1 mapping, we distil one teacher layer into each student layer by choosing:
• The Last L_S teacher layers, i.e., φ(i) = {L_T − L_S + i}. Empirically, the last teacher layers capture more high-level (e.g., semantic) knowledge in their representations (Tenney et al., 2019; Jawahar et al., 2019).
• A Uniform selection of teacher layers which chooses every k-th teacher layer, i.e., φ(i) = {ki}, where k = L_T / L_S. This method can also transfer the lower teacher layers, which empirically capture local (e.g., syntactic) knowledge (Tenney et al., 2019).
3. 1-to-N Mapping: Some works even show that mapping each student layer to multiple teacher layers can avoid the loss of information and facilitate student learning (Wu et al., 2020; Passban et al., 2021). For 1-to-N mapping, we explore the following choices of teacher layers:
• All teacher layers in consecutive blocks of size k, i.e., φ(i) = {k(i−1)+1, ..., ki}, where k = L_T / L_S. This avoids the loss of information since all teacher layers are mapped to at least one student layer.
• Combining the Uniform and Last strategies from the 1-to-1 mapping (Uniform+Last). This selects 2 teacher layers per student layer based on each 1-to-1 strategy, expecting to take the best out of both approaches.
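The heuristic strategies above can be written down compactly. The following sketch is our reconstruction: the function name is ours, and the consecutive-block formula for the 1-to-N "all layers" variant is an assumption consistent with the requirement that every teacher layer be mapped at least once:

```python
def layer_mapping(L_T, L_S, strategy):
    """Return phi as {student layer i (1-indexed): list of teacher layers}.

    Assumes L_T is divisible by L_S, so k = L_T / L_S is an integer.
    """
    k = L_T // L_S
    if strategy == "single":        # last teacher layer -> last student layer
        return {L_S: [L_T]}
    if strategy == "last":          # 1-to-1: last L_S teacher layers
        return {i: [L_T - L_S + i] for i in range(1, L_S + 1)}
    if strategy == "uniform":       # 1-to-1: every k-th teacher layer
        return {i: [k * i] for i in range(1, L_S + 1)}
    if strategy == "all":           # 1-to-N: consecutive blocks cover all layers
        return {i: list(range(k * (i - 1) + 1, k * i + 1))
                for i in range(1, L_S + 1)}
    if strategy == "uniform+last":  # 1-to-N: union of the two 1-to-1 choices
        return {i: sorted({k * i, L_T - L_S + i}) for i in range(1, L_S + 1)}
    raise ValueError(f"unknown strategy: {strategy}")
```

For a 12-layer teacher and 6-layer student, "uniform" maps student layer i to teacher layer 2i, while "last" maps it to teacher layer 6 + i.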
For MHA transfer, we always take the single mapping strategy and distill a single teacher layer into the last student layer, following Wang et al. (2021). Specifically, we experiment with the last three teacher layers as the choice for distillation for both MiniLMv2 and DirectMiniLM. Table 2 summarizes our layer selection options.
While OD transfer can be conducted from scratch, we found this converges slowly and does not perform competitively: our 6L monolingual student takes 49 hours on 30 V100 GPUs to reach acceptable performance, while the same model achieves better scores in only 10.5 hours when initialized from the HS transferred checkpoint. Therefore, we take the style of multi-stage distillation (Mukherjee et al., 2021) and conduct OD transfer after HS transfer, using the distilled checkpoint from HS transfer. This approach converges much faster with better final performance, hence we take it as the representative OD transfer method.

Evaluation and Results
For both our monolingual and multilingual models, we measure performance on the English GLUE Benchmark (Wang et al., 2019) and report the average score of all tasks (without CoLA). For multilingual models, we provide evaluations on the XNLI dataset (Conneau et al., 2018), a set of inference tasks which evaluates the model's performance on 15 languages after being finetuned on only English training data. We report the average score over all languages for XNLI.
Table 3 summarizes the performance of each distillation method on the 4 student architectures. For detailed evaluations of each method based on the best configuration, please refer to Appendix D. We also provide a comparison against DistilBERT (Sanh et al., 2019), a representative architecture-constrained method, in Appendix E.

HS Transfer
From Table 3, we can verify that the performance of HS transfer varies with different layer mapping strategies, and no strategy dominates the others in all settings. In the monolingual setting, we found that the single mapping strategy performs competitively, which is in line with the findings of Ko et al. (2023). However, in the multilingual setting, more sophisticated 1-to-N strategies generally show superiority over the simpler baselines. This indicates that more supervision from the teacher can be helpful (and at worst harmless); hence we advocate for the adoption of 1-to-N strategies, especially in the challenging multilingual distillation.
OD Transfer As mentioned in §4, we initialize the model from the HS transferred checkpoints with each layer mapping strategy. Interestingly, we see a slight degradation in performance on downstream tasks compared to HS transfer alone, with a significant loss observed for smaller students. This indicates that learning effective representations from the output distribution signals is difficult, especially for students with lower capacity. Moreover, given how computationally expensive OD transfer can be, HS transfer is a cheaper and more effective alternative for knowledge transfer.

MHA Transfer
For both MiniLMv2 and DirectMiniLM, we found that distilling an upper-middle teacher layer, i.e., the (L_T−1)-th or (L_T−2)-th strategy, led to the best performance, in line with the original findings of Wang et al. (2021). Importantly, we found that both MHA transfer methods generally outperform HS transfer, which points to the benefit of transferring the Q/K/V knowledge over the hidden state knowledge. This is consistent with the latest comparative study by Wang et al. (2023), although they only evaluate on the 6L-DistilBERT architecture in the monolingual setting. We also note that MiniLMv2 and DirectMiniLM perform equivalently, with the notable exception on XNLI. We attribute this to two factors:
1. MiniLMv2 transfers relational representations conditioned on the whole input, while DirectMiniLM transfers absolute position-wise representations. The former may be more semantically informative, as contextual representations often exhibit rich relational structures (Park et al., 2021; Liu et al., 2022a).
2. DirectMiniLM requires learning the linear transformation weight W_{α,a}, while MiniLMv2 does not incur any additional parameters.
From these observations, we generally expect MiniLMv2 to be the best distillation method and have adopted it in our latency-critical applications. However, DirectMiniLM performs comparably and provides meaningful insights on the benefit of each optimization technique, which can be useful for debugging and analyzing MiniLMv2. Therefore, we recommend its comparison for both researchers and practitioners in future studies.

Conclusion
This study critically analyzes the representative methods for task-agnostic distillation of language models. Specifically, we compare Output Distribution (OD), Hidden State (HS), and Multi-Head Attention (MHA) transfer for different student architectures, language settings, and layer mapping strategies. Through our extensive experiments, we show that MHA transfer based on MiniLMv2 is the best option across many settings, followed by HS transfer with sophisticated 1-to-N mapping strategies. Meanwhile, we did not find OD transfer to be an effective alternative. Finally, we propose DirectMiniLM to demystify the precise advantage of the indirect (i.e., relation matrix based) optimization technique proposed in MiniLMv2. Overall, we hope this study will be a useful guide for both researchers and practitioners working in this area.

A Related Work
MobileBERT (Sun et al., 2020) is an effective technique to compress BERT into a specially designed student with a bottleneck architecture. In BERT-of-Theseus (Xu et al., 2020), the modules of the teacher are progressively replaced with smaller ones to improve efficiency. However, these approaches constrain the architecture of the students.
In contrast, we focus on the architecture-agnostic distillation methods for better flexibility.
Improvements on distillation objectives have also been made; e.g., transferring the relational, structural, or holistic representations of the language models may provide more useful signals for students (Park et al., 2021; Liu et al., 2022a; Tan et al., 2023). When the transfer set is limited, various methods of data augmentation (Liang et al., 2021; Zhang et al., 2022; Liu et al., 2022b) can be applied successfully. To alleviate the capacity gap between the teacher and student, previous works proposed scheduled annealing in OD transfer (Jafari et al., 2021), multi-stage distillation with intermediate-sized teacher assistants (Mirzadeh et al., 2020; Son et al., 2021), and meta-learning to optimize the teacher for student distillation (Zhou et al., 2022; Ma et al., 2022). We leave the exploration of such advanced techniques as future work.
Layer mapping strategies for HS transfer have also been studied extensively. Jiao et al. (2021) proposed an evolutionary search process to obtain the optimal layer mapping for specific downstream tasks. Li et al. (2020) applied Earth Mover's Distance to prioritize mappings with smaller cost (i.e., distillation loss). The attention mechanism can also be applied to map student layers to similar teacher layers, where the similarity is computed based on the cosine similarity (Passban et al., 2021) or the predictions of internal classifiers (Wu et al., 2021). Finally, random mapping has been shown to work surprisingly well, potentially working as a regularizer to prevent overfitting (Haidar et al., 2022).
In this study, we focus instead on the carefully designed and easily applicable heuristic strategies.

B Distillation Setup
We train our monolingual students on the entire Wikipedia and BookCorpus using the AdamW optimizer (Loshchilov and Hutter, 2019) with β1 = 0.9, β2 = 0.98. For HS and MHA transfer, students are trained for 7 epochs with a peak learning rate (LR) of 5e-4. For OD transfer, we train for 3 epochs with a peak LR of 3e-4 after HS transfer. We use a linear LR warmup over the first 5% of the training steps and then a linear decay. We use a batch size of 32 with the maximum sequence length set to 256 and train on 30 V100 GPUs.
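The LR schedule described above (linear warmup over the first 5% of training steps, then linear decay to zero) can be sketched as follows; the function name and signature are ours:

```python
def lr_at_step(step, total_steps, peak_lr, warmup_frac=0.05):
    """Linear warmup over the first warmup_frac of steps, then linear decay to 0."""
    warmup = max(1, int(total_steps * warmup_frac))
    if step < warmup:
        return peak_lr * step / warmup            # ramp up to peak_lr
    return peak_lr * (total_steps - step) / (total_steps - warmup)  # decay to 0
```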
For multilingual distillation, we use a small subset of CC-100 containing 7M sentences, which we found to be sufficient for developing competitive students. We generally use the same setup as monolingual distillation, except that we use a peak LR of 8e-4 for MHA transfer. Multilingual students are trained on 2 A100-80GB GPUs.
Finally, the method-specific hyperparameters (§3) are as follows. For OD transfer, we set the output temperature T to the default value of 1. For MiniLMv2, we use A_r > A_h to transfer more fine-grained knowledge in the Q/K/V mappings: specifically, we set A_r = 48, which is also used in Wang et al. (2021). For DirectMiniLM, we found that using A_r = A_h without the orthogonal constraints on W_{α,a} led to the best performance and used this setting throughout our experiments.

C Finding Smaller Student Models
Our smallest students, a 4 layer and a 3 layer model, were obtained as recommendations from a Neural Architecture Search process to find good student architectures for task-agnostic distillation from an XLM-RoBERTa teacher, conditioned to minimize the latency of inference on a CPU. Specifically, we follow the KD-NAS method of Trivedi et al. (2023) and modify the reward to jointly reduce the distillation loss L_HS defined in Eq. (6) and the CPU latency of the student (lat(S)) normalized by the teacher's latency (lat(T)).

D Evaluation Results for Best Models
We include detailed results of each distillation method for the best configuration (i.e., layer mapping strategy). Specifically, we show the results of each GLUE task for monolingual and multilingual distillation in Tables 5 and 6. We show language-wise performance on XNLI in Table 7. All downstream tasks are evaluated on 3 random seeds.
For the sake of efficient evaluation, we did not conduct an expensive grid search for finetuning hyperparameters. After some manual tuning, we used the same LR of 2e-5 and batch size of 32 for finetuning all models on all tasks. We used 3 epochs of finetuning for GLUE tasks (except CoLA, where we used 6 and 10 epochs for monolingual and multilingual models, respectively) and 5 epochs for XNLI.

E Architecture Constrained Distillation: DistilBERT
DistilBERT (Sanh et al., 2019) is one of the earliest and most widely used baselines. This method comprises (1) layer initialization from the teacher layers, (2) HS transfer based on a cosine similarity loss, and (3) OD transfer. The first two techniques restrict the architecture of each student layer to be identical to the teacher's, which limits our analysis to the 6L-DistilBERT student architecture.

Figure 1: A high-level illustration of (a) the Transformer architecture and (b-d) representative distillation methods. (b-d) denote Output Distribution (OD), Hidden State (HS), and Multi-Head Attention (MHA) transfer, respectively. Lines between the student and teacher depict which level of information is transferred in each method.

Table 1 :
Model Architectures displayed as [L, A_h, d_h, d_f]. All parameters are in millions, with the difference in the monolingual and multilingual parameters due to the vocabulary sizes (30K for monolingual and 252K for multilingual). All latencies are in milliseconds, measured over 5 runs, with standard deviation in parentheses.

Table 2 :
Layer mapping strategies explored in each distillation method. The same strategies are explored for MiniLMv2 and DirectMiniLM in MHA transfer.

Table 3 :
Performance of the representative distillation methods evaluated on avg. GLUE and XNLI. Results based on the best layer mapping strategy for each method are underlined, and the best overall result is shown in bold.

Table 4 :
DistilBERT performance. Average GLUE scores reported for all tasks w/o CoLA. Average XNLI scores reported for all languages. Averages taken over 3 random seeds with standard deviation in parentheses.