Knowledge Distillation Transfer Sets and their Impact on Downstream NLU Tasks

Teacher-student knowledge distillation is a popular technique for compressing today's prevailing large language models into manageable sizes that fit low-latency downstream applications. Both the teacher and the choice of transfer set used for distillation are crucial ingredients in creating a high-quality student. Yet the generic corpora used to pretrain the teacher and the corpora associated with the downstream target domain are often significantly different, which raises a natural question: should the student be distilled over the generic corpora, so as to learn from high-quality teacher predictions, or over the downstream task corpora to align with finetuning? Our study investigates this trade-off using Domain Classification (DC) and Intent Classification/Named Entity Recognition (ICNER) as downstream tasks. We distill several multilingual students from a larger multilingual LM with varying proportions of generic and task-specific datasets, and report their performance after finetuning on DC and ICNER. We observe significant improvements across tasks and test sets when only task-specific corpora are used. We also report on how the impact of adding task-specific data to the transfer set correlates with the similarity between generic and task-specific data. Our results clearly indicate that, while distillation from a generic LM benefits downstream tasks, students learn better using target-domain data even if it comes at the price of noisier teacher predictions. In other words, target-domain data still trumps teacher knowledge.


Introduction
In the recent past, large language models (LMs; BERT-Large, Devlin et al., 2019; GPT-2, Radford et al., 2019; T5, Raffel et al., 2020) pretrained in a self-supervised manner on massive web corpora have consistently shown state-of-the-art performance on multiple natural language understanding (NLU) tasks. Therefore, it is no surprise that these models are of much interest for virtual assistants such as Amazon Alexa, Apple Siri, and Google Assistant. Some studies have shown that these large models trained on generic corpora seem to be more robust to data distributional shifts, relying less on domain-specific training data to perform well (Brown et al., 2020).
Since large models cannot be directly used for low-latency applications on devices with limited computing capacity, many techniques have been developed to compress them in size. Knowledge distillation (referred to simply as distillation hereafter; Hinton et al., 2015) has shown promising results, especially at the high compression rates typically required in NLU (Jiao et al., 2020; Soltan et al., 2021). In this paradigm, lightweight models, referred to as students, are trained to mimic the teacher's predictions over a transfer set (Hinton et al., 2015). When the pretraining and task-specific corpora have significantly different distributions, as is often the case, the choice of data for the transfer set can be ambiguous. On the one hand, using pretraining corpora in the transfer set ensures the high-quality teacher predictions that are important for effective distillation. On the other, using the downstream corpora, although it might cause noisier teacher predictions, ensures the adaptation of the student to its final use case.
To investigate this trade-off, we present a set of experiments where we distill several multilingual students from a large multilingual teacher LM trained using a masked language modeling (MLM) objective. We perform the distillations using transfer sets that comprise generic and task-specific data in varying proportions. The students are then finetuned and evaluated on two downstream NLU tasks of interest: a Domain Classification (DC) task and a joint Intent Classification/Named Entity Recognition (ICNER) task. For each input utterance, DC predicts the relevant domain (Books, Music, Shopping, etc.), IC identifies the user's intent (find a book, play a song, buy an item, etc.), and NER extracts the entities in the utterance (dates, names, locations, etc.).
Our contributions: (1) We confirm for our setup that model preparation via distillation from a larger LM is more beneficial for downstream task performance than encoder training from scratch. (2) We show that the largest improvements are seen when using only the downstream task's unlabelled data during the distillation process. Even though teacher predictions are expected to be noisy over data that differs from the pretraining corpora, our results clearly indicate that students learn best in this setting. (3) Because our ICNER corpora are divided per domain, we are also able to provide a finer-grained analysis of the impact of corpus similarity on downstream results. (4) Finally, we also confirm that further adaptation of the teacher to the target-domain data results in improved student performance across tasks.

Relevant Work
Building models with inference speeds that are suitable for production systems is of utmost importance in the industrial setting. Therefore, techniques for model compression (quantization, Gong et al., 2014; pruning redundant connections, Han et al., 2015) have been active research topics, with distillation (Romero et al., 2015; Hinton et al., 2015; Jiao et al., 2020) showing much promise for NLU models (Sanh et al., 2019). Distillation processes and their data have evolved over the past few years. In the teacher-student framework proposed by Hinton et al. (2015), the authors recommend using the original pretraining set as the transfer set. Jiao et al. (2020) propose a more complex two-stage process with generic and task-specific distillation phases, each with its own data sets, designed to steer the performance of the final model towards the task at hand.
Our work is focused on exploring how varying proportions of generic and task-specific data within the transfer set of a single distillation process impact downstream NLU performance. Since our scope does not include optimizing the distillation process itself, we use a cheaper alternative to Jiao et al. (2020), via a single-stage distillation setup, to conduct our exploration (see Section A.3 for details). Gururangan et al. (2020) showed, for the pretraining phase, that continued domain-adaptive and task-adaptive pretraining using the downstream task's unlabeled data can improve performance. Our work presents similar results for the distillation phase.

Distillation data
For distillation, we created the transfer sets by mixing two types of data with different distributions:
• Generic data: This data set consisted of Wikipedia and Common Crawl text processed by an in-house tokenizer.
• Task-specific data: This in-house data set comprised de-identified utterances from a voice assistant across the domains of interest. The text data collected here was the output of an Automatic Speech Recognition (ASR) model, which assigned a confidence score per utterance. In order to retain only the highest-quality data, we filtered it by an ASR score threshold.
All data was de-identified prior to use.
Our distilled students were trained as part of a larger program resulting in a collection of nine European and Indic languages being used for distillation.The language list and counts are shown in Table A1.
We built transfer sets with three ratios of generic to task-specific data: (1) generic-only (baseline), (2) 7:3 generic to task-specific, to mimic the commonly encountered low task-specific data setting, and (3) task-specific-only. To have a comparable distribution of data from each language, we created samples of equal size for each language using either generic data only, task-specific data only, or a combination of both at the targeted ratio. Upsampling is used when a source data set contains fewer examples than required. The 7:3 ratio consisted of Wikipedia, Common Crawl, and task-specific data upsampled to counts of 35M, 35M, and 30M respectively, for each language. For two languages, Indian English and Marathi, where some data constituents were unobtainable, available data was used in proportion (see Table A1). Once the data sets were created with the targeted mixing ratio, they were split into train and validation sets with a ratio of 0.995:0.005 and then used as the transfer sets.
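As a minimal sketch, the mix-and-upsample step can be expressed as follows. This is an illustration with toy utterance lists, not the actual pipeline: the function name, the seed handling, and the small counts are ours, and real transfer sets are built per language at the counts given above.

```python
import random

def build_transfer_set(generic, task_specific, n_generic, n_task, seed=0):
    """Mix generic and task-specific utterances at a target ratio,
    upsampling (sampling with replacement) any source pool that is
    smaller than the count required from it."""
    rng = random.Random(seed)

    def take(pool, n):
        if len(pool) >= n:
            return rng.sample(pool, n)
        return [rng.choice(pool) for _ in range(n)]  # upsample

    mixed = take(generic, n_generic) + take(task_specific, n_task)
    rng.shuffle(mixed)
    return mixed

# Toy 7:3 generic:task-specific mix; the tiny task pool gets upsampled.
transfer = build_transfer_set(["g%d" % i for i in range(100)],
                              ["t%d" % i for i in range(2)],
                              n_generic=7, n_task=3)
```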

Data for downstream tasks
We evaluated our multilingual distilled students in the context of two commonly utilized NLU tasks of interest, DC and ICNER. We limit the scope of our evaluation to four languages: German, French, Italian, and Spanish. Our finetuning data consisted of 26 domains (see fractional utterance counts in Table A4) across each language, with each domain comprising a set of intents (similar to Su et al., 2018). As with the task-specific data used in our transfer sets, this data was also de-identified prior to use.
It is important to note that, although collected over non-overlapping time intervals (and thus consisting of different absolute counts), the finetuning data was drawn from the same distribution as the task-specific data described in Section 3.1. We sampled the finetuning data so as to have equal counts across each domain in all four languages (see Appendix A.1 for the evaluation data set sampling strategy). We then combined all languages and split the data into proportions of 80:10:10 for train, validation, and test, respectively.
For the DC task, we classified the input utterances into one of the 26 domains. The DC model is therefore trained using the combined training data from the four languages across all domains and is tested on language-specific test data sets. For the joint ICNER task, we classified each utterance within a domain to its corresponding intent and also recognized its named entities. For this task, we trained a model per domain, using the combined training data from the four languages for that domain. The model was evaluated using language-specific test data sets for that domain. We present results on two types of test sets: test comprises the full test set obtained from the split above, while tail_test is the subset of data points within test that have a frequency of occurrence less than or equal to 3. The relative data proportions used can be found in the Appendix (Table A4).
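The tail_test construction (data points occurring at most three times in test) can be sketched as follows; a minimal illustration that assumes frequency is counted over whole utterances, with toy data of our own.

```python
from collections import Counter

def split_tail(test_utterances, max_freq=3):
    """Return the subset of the test set whose utterances occur at most
    `max_freq` times in the full test set (order preserved)."""
    counts = Counter(test_utterances)
    return [u for u in test_utterances if counts[u] <= max_freq]

test_set = ["play music"] * 5 + ["dim the lights"] * 2 + ["order socks"]
tail_test = split_tail(test_set)  # drops the frequent "play music"
```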

Models
Figure 1 shows a schematic of the models and experimental setup described in this section.

Distilled students and baselines
We use a 170 million parameter teacher (170M-teacher) that was prepared using Wikipedia, Common Crawl, and mC4 (Xue et al., 2021) data. See Appendix A.2 for details on teacher preparation. From this teacher, we distilled a total of five students. We use the three transfer sets described in Section 3.1, i.e., (1) generic-only, (2) 7:3 (generic:task-specific), and (3) task-specific-only, to distill the first three students. We refer to the student distilled using (1) as the generic-distilled baseline. The latter two are referred to as experiment-7:3 and experiment-task-specific-only, the naming aligned with the transfer set used. In addition to these, we create another two students for which the teacher was finetuned using an MLM task before being used for distillation. In each case, the teacher was finetuned for 15625 steps using the same transfer set that was used for the subsequent distillation. We refer to these two students as experiment-7:3-FT and experiment-task-specific-only-FT. The teacher finetuning was run on a p3.16X instance with an average run time of approximately 45 hours. We collectively refer to all distilled students that are not a baseline as experimental students.
The architectures of our teacher and students are as follows. As in Devlin et al. (2019), we denote the number of layers (i.e., Transformer blocks) as L, the hidden size as H, and the number of self-attention heads as A.
• 170M-teacher: L=16, H=1024, A=16, feed-forward/filter size=3072, total parameters=170M
• Students: L=4, H=768, A=16, feed-forward/filter size=1200, total parameters=17M
For a description of the distillation setup, see Appendix A.3. Distillation was run for 1 epoch with each student extracted at 78125 steps, which equates to approximately 80M data points seen. We ran distillation on a single p3.16X instance utilizing 8 GPUs with a batch size of 2 and gradient accumulation at every 64 steps. The average run time was approximately 195 hours. Note that each distillation run used only a sample of the full data set mentioned in Section 3.1, determined by the step count. However, since the data is sampled uniformly, the language ratios and the generic:task-specific data ratio stay consistent during training.
In addition to the distilled baseline, we also created another baseline (without distillation) that was directly pretrained using the generic-only data. The architecture and size of this baseline were identical to those of the distilled students, and it is referred to, from here onward, as the directly-pretrained baseline. We used this baseline to observe performance differences between models that use students distilled from the large teacher and those that use a directly pretrained encoder.

DC and ICNER models
In order to evaluate the impact of the different transfer sets on our targeted downstream NLU tasks, we finetune the experimental students and baselines toward the DC and ICNER tasks. Each DC model consisted of an encoder, embedding, and positional embedding obtained from an experimental student or baseline, combined with a decoder consisting of an MLP classifier for domain prediction (layer size 128, dropout set at 0.1, ReLU activation). Each ICNER model consisted of the same encoder, embedding, and positional embeddings used for the corresponding DC model, with an MLP classifier output layer for the IC task (layer size 128, dropout set at 0.1, ReLU activation) and a CRF sequence-labeler output layer for the NER task (layer size 256, dropout set at 0.1, GeLU activation). We trained each DC model for 1 epoch and each ICNER model for 4 epochs.
Evaluation: The DC performance was evaluated using the F1 score, while the ICNER performance was evaluated using the Semantic Error Rate (SemER; Su et al., 2018; Varada et al., 2020; Peris et al., 2020). SemER is defined as

SemER = (D + I + S) / (C + D + S),

where D, I, S, and C are the counts of deleted, inserted, substituted, and correct slots, respectively. The intent is treated as a slot in this metric, and an intent error is counted as a substitution.
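As a small worked example, SemER can be computed directly from the slot alignment counts (following the definition in Su et al., 2018); the function name is ours.

```python
def semer(deletions, insertions, substitutions, correct):
    """Semantic Error Rate: (D + I + S) / (C + D + S). The intent label is
    scored as one more slot, and an intent error counts as a substitution."""
    return (deletions + insertions + substitutions) / (correct + deletions + substitutions)

# e.g. 1 deleted slot, 0 insertions, 1 substitution (say, a wrong intent),
# and 8 correct slots:
score = semer(1, 0, 1, 8)  # (1 + 0 + 1) / (8 + 1 + 1) = 0.2
```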

Experimental results
In this section, note that model refers to a model that uses an experimental student or baseline encoder and has been finetuned towards a DC or ICNER task. Experimental models comprise experimental student encoders, and baseline models comprise baseline encoders (see the lower panel in Figure 1).
We used data across 26 domains to train and evaluate the DC and ICNER models (see Section 3.2). We compare the performance of each experimental model against the two baseline models (see Section 4.1). The improvements we quote in this section are ∆F1 (↑) (higher is better) and ∆SemER (↓) (lower is better; we use the weighted average of SemER across all domains) for DC and ICNER respectively, measured against the baseline models (Tables 1, 2, A2, A3).
The results in Tables 1 and 2 show that, in general, for both the DC and ICNER tasks, all experimental students distilled with a mix of task-specific data (30% or 100%) perform significantly better than the generic distilled baseline. We further observe that models with encoders distilled with task-specific-only data yield the best overall performance, which means that, in our setup, students learn better using target-domain data even if it comes at the price of noisier teacher predictions.

Table 1: Relative DC ∆F1 (↑), measured against the generic distilled baseline for each experimental student (positive is better). We run three iterations of each experimental student and show the percentage change of their means and its standard deviation.

Table 2: Relative ICNER ∆SemER (↓), measured against the generic distilled baseline for each experimental student (negative is better). As with DC, we run three iterations of the experimental students and show the percentage change of their means and its standard deviation. In calculating these percentage changes, we use the weighted average of the SemER for each domain in a given language as the overall SemER in that language.
For all four languages on DC, and for three out of four languages on ICNER, the best performances are observed with student models that were distilled from the finetuned teacher. This confirms that the additional step of finetuning the teacher, adapting it to the task-specific dataset, results in students that perform better on the intended downstream tasks.
We also note that across all task, language, and test set combinations, the improvements seen against the directly pretrained baseline (see Tables A2 and A3) are larger than the improvements seen against the generic distilled baseline. For our setup, this shows that distilling from a large LM can benefit downstream tasks as opposed to using a similar-sized encoder pretrained from scratch; in other words, our findings suggest that it is better to distill than to directly pretrain. However, we note that additional resources (in our case, approximately 45 p3.16X hours) are required for this.
The tail_test set, comprising low-frequency utterances within test, provides insights into the ability of a model to generalize to rarely seen utterances. For DC, we note that the improvements on tail_test are significantly larger (∼2X) than the improvements seen on the test set. This indicates that prediction on examples that appear infrequently in the task-specific data benefits more from task-specific data being included in the distillation process.

Dataset similarity and its correlation to SemER improvements for ICNER
To further explore our conclusion that students learn better using target-domain data, we explore how ∆SemER for each domain correlates with the similarity of the domain's data to the generic data. Note that, here, negative ∆SemER represents an improvement of the experimental students over the generic distilled baseline, while the opposite is true for positive ∆SemER. SemER results are from the test set.
The hypothesis here is that the more distant a domain is from the generic data, the more value we should see in adding this domain's data to the distillation transfer set, even though teacher predictions might be noisy. We note that we calculate cosine similarity on a very rudimentary corpus-level embedding (i.e., tf-idf) for measuring similarity, as explained below. We leave more sophisticated similarity measurements for later work.
To calculate the similarity between domain-level and generic data, we use the following process. For each domain in each of the four languages, we sample up to 100K utterances. All available data is considered for domains with <100K utterances. We then sample 50K utterances each from the Wikipedia and Common Crawl data sets of the corresponding language. We create a tf-idf vector for each sampled dataset and calculate their cosine similarity as a measure of dataset similarity. In order to account for any variability associated with the sampling, we repeat the process 3 times and obtain the mean similarity and the standard deviation per domain. We plot dataset similarity against ∆SemER (a single point represents one domain and a panel represents a language, as seen in Figure 2). We neglect domains with less data, and thus high variability, and fit a line to show how ∆SemER correlates with dataset similarity.
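A rudimentary sketch of this corpus-level comparison is below. The exact tf-idf weighting used in the paper is not specified, so the smoothed idf here is an assumption of ours, as are the toy token lists; the real computation runs over 100K/50K-utterance samples.

```python
import math
from collections import Counter

def tfidf(tokens, doc_freq, n_docs):
    """Corpus-level tf-idf vector; the smoothed idf (1 + log(N/df)) keeps
    terms shared by all samples from being zeroed out entirely."""
    tf = Counter(tokens)
    return {t: c * (1.0 + math.log(n_docs / doc_freq[t])) for t, c in tf.items()}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = lambda x: math.sqrt(sum(w * w for w in x.values()))
    return dot / (norm(u) * norm(v)) if u and v else 0.0

# Treat a domain sample and a generic (Wikipedia/Common Crawl) sample as
# two documents and compare their tf-idf vectors.
domain = "play the new album by my favourite band".split()
generic = "the history of the roman empire is long".split()
df = Counter()
for doc in (domain, generic):
    df.update(set(doc))
sim = cosine(tfidf(domain, df, 2), tfidf(generic, df, 2))
```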
In Figure 2, we observe that a majority of cases (all except German) show a positive correlation. A positive correlation shows that domains that are less similar to Wikipedia/Common Crawl see relatively larger improvements in SemER than domains that are more similar to Wikipedia/Common Crawl. This suggests that the addition of task-specific data to the distillation transfer sets helps domains that are less similar to the generic data available for distillation, even though teacher predictions on them will be noisier.
It should be noted that the domains of the one exception, German, display low similarity values across the board, unlike the other languages, which show a wider spread (German has 65% of domains < 0.2, whereas French, Italian, and Spanish have 23%, 31%, and 12% < 0.2, respectively). The lack of domains with high similarity might explain why no stable correlation is observed for German.

Conclusions
We have explored how the use of transfer sets that comprise different ratios of generic to task-specific data impacts downstream results. Encoders distilled from a large teacher perform better than ones trained from scratch, showing that it is better to distill than to directly pretrain, when possible. The largest benefits are seen when using the downstream task's unlabelled data to distill a student, despite noisier teacher predictions. We also find that domains with data that are dissimilar to the generic data show greater performance improvements against a generic baseline when using a student distilled using task-specific data. These improvements further confirm that distilling using target-domain data can be helpful for downstream performance. Finally, we show that if costs permit, teacher adaptation to the target-domain data via finetuning can result in improved student performance across downstream tasks.

A Appendix
A.1 Data for finetuning DC and ICNER models
For finetuning our distilled students for DC and ICNER, we use labelled datasets from four languages (German, French, Italian, and Spanish), each consisting of the same 26 domains (Table A4) and each domain supporting a set of intents (similar to Su et al., 2018). In order to have equivalent utterance counts across domains for each language, we used a stratified sampling strategy as follows. First, we ranked each language per domain based on its utterance counts. In order to prevent heavy upsampling or downsampling in any single language when creating equivalently sampled domains, we picked the language that had the second highest utterance counts in most domains (in our case, French). We sampled utterances from the domains of the other languages to match the domain-level utterance frequency distribution of French (i.e., we randomly sampled utterances with replacement from each domain in each language until that number matched the utterance count of the respective domain in French).
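The sample-with-replacement step can be sketched as follows; the dictionary layout, function name, and toy counts are ours, chosen only to illustrate matching each (language, domain) pool to the reference language's count.

```python
import random

def match_to_reference(domain_pools, reference_counts, seed=0):
    """For every (language, domain) pool, sample utterances with replacement
    until the count matches the reference language's (here, French's)
    utterance count for that domain."""
    rng = random.Random(seed)
    return {(lang, dom): [rng.choice(pool) for _ in range(reference_counts[dom])]
            for (lang, dom), pool in domain_pools.items()}

pools = {("de", "Music"): ["de_utt_0", "de_utt_1"],
         ("it", "Music"): ["it_utt_0"]}
french_counts = {"Music": 4}  # the reference language's count per domain
balanced = match_to_reference(pools, french_counts)
```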
We then combined all languages and split the data into proportions of 80:10:10 for train, validation and test, respectively.

A.2 Teacher model
The 170M-teacher used in this work was itself a student that was distilled from a larger model with 2 billion parameters (see the Stage 1 pretraining section in FitzGerald et al. (2022) for details on creation and architecture). The 170M-teacher was distilled using a transfer set that comprised Wikipedia, Common Crawl, and mC4 (Xue et al., 2021) data. Picking this intermediate-sized model helped us avoid potential performance degradation due to having too large a size gap between teacher and student (Mirzadeh et al., 2019).

A.3 Student setup
For our single-stage distillation setup, we skip the generic distillation phase done by Jiao et al. (2020) and use a non-finetuned teacher model to directly distill our students. In addition, as a sanity check, we also explore distillation from a finetuned teacher model to verify improved student performance across tasks. Similar to the hidden-states-based distillation followed in TinyBERT (Jiao et al., 2020), we mapped the student layers [0, 1, 2, 3] to learn from the teacher's hidden layers [3, 7, 11, 15], respectively. We ignored attention-based distillation (Jiao et al., 2020) since we did not observe significant improvements from using it. We also penalized the soft cross-entropy loss between the student network's logits and the teacher's logits, to fit the student's predictions to those of the teacher, as in Hinton et al. (2015). We use an MLM objective for the distillation process. In our loss, we weight the hidden layer matching, the logit matching, and the MLM at a 1:2:1 ratio.

Table A3: Relative ICNER ∆SemER (↓) measured against the directly pretrained baseline for each experimental student (negative is better). In calculating these percentage changes, we use the weighted average of the SemER for each domain in a given language, as the overall SemER in that language.
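A minimal sketch of the loss composition described in Appendix A.3: only the soft cross-entropy on logits and the 1:2:1 weighting come from our setup; the temperature-free softmax and the scalar treatment of the hidden-matching and MLM terms are simplifying assumptions for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def soft_cross_entropy(student_logits, teacher_logits):
    """Cross-entropy of the student distribution against the teacher's
    soft targets (Hinton et al., 2015); temperature omitted for brevity."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

def total_loss(hidden_loss, logit_loss, mlm_loss):
    """Combine hidden-layer matching, logit matching, and MLM at 1:2:1."""
    return 1.0 * hidden_loss + 2.0 * logit_loss + 1.0 * mlm_loss
```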
Table A4: Fractions of finetuning data used per domain. Note that the fraction in each cell represents the utterance count for that partition, for that domain, as a fraction of the total utterance count in that language. As mentioned in Section 3.2, these fractions are not based on the counts in

Figure 1: A schematic of the models that we present in this paper and how they are evaluated.

Figure 2: Cosine similarity of tf-idf vectors vs. change in SemER for each domain, for the languages German, French, Italian, and Spanish. We represent only domains with >1000 test utterances to avoid noise added by smaller domains, which have higher variability.

Table A1: Raw data counts used for transfer set creation. Counts represent the number of sentences for generic data and the number of de-identified utterances for task-specific data. For the task-specific data for Telugu and Tamil, machine-translated (MT) data from Indian English was used. Only task-specific data was used for Indian English because Wikipedia and Common Crawl data were not available. Only generic data was used for Marathi, as the translation system used for this work did not yet support the language.

Table A2: Relative DC ∆F1 (↑) measured against the directly pretrained baseline for each experimental student (positive is better).