Towards Zero-Shot Knowledge Distillation for Natural Language Processing

Knowledge distillation (KD) is a common knowledge transfer algorithm used for model compression across a variety of deep learning based natural language processing (NLP) solutions. In its regular manifestations, KD requires access to the teacher's training data for knowledge transfer to the student network. However, privacy concerns, data regulations and proprietary reasons may prevent access to such data. We present, to the best of our knowledge, the first work on Zero-shot Knowledge Distillation for NLP, where the student learns from the much larger teacher without any task-specific data. Our solution combines out-of-domain data and adversarial training to learn the teacher's output distribution. We investigate six tasks from the GLUE benchmark and demonstrate that we can achieve between 75% and 92% of the teacher's classification score (accuracy or F1) while compressing the model 30 times.


Introduction
Deep learning based natural language processing (NLP) systems have become state-of-the-art on many applications such as machine translation (MT) (Vaswani et al., 2017; Lioutas and Guo, 2020), natural language understanding (NLU) (Devlin et al., 2019) and language generation (Brown et al., 2020), among others. These models are increasingly trained on huge corpora and with billions of trainable parameters (Brown et al., 2020), which makes them prohibitive to deploy on edge devices and costly to maintain on servers. Moreover, training and evaluating them leaves a significant environmental footprint (Strubell et al., 2019), and avoiding this resource-hungry training is very challenging (Ghaddar and Langlais, 2019). Model compression approaches make it feasible to employ current state-of-the-art models on edge devices.

* Work done during an internship at Huawei Noah's Ark Lab.
KD is one of the most commonly used, application- and model-agnostic compression and ensembling algorithms. It is among the most widely researched algorithms for compressing transformer-based language models (Rogers et al., 2020; Jafari et al., 2021; Wu et al., 2020; Kamalloo et al., 2021). In KD, the student needs to be trained on the teacher's training data to prevent a loss of accuracy. However, we cannot assume this access for many practical problems. Concerns preventing access include data privacy, intellectual property, data size and transience (Micaelli and Storkey, 2019). For example, a model trained on patient health records might be available while the data itself is inaccessible due to patient privacy.
In computer vision (CV), Zero-shot KD (ZSKD) has been proposed to train a student without using any data. In this context, Zero-shot refers to training without using data, not to no training at all. Nayak et al. (2019) propose generating "data impressions" by updating noise through backpropagation until it produces 'valid' teacher logits, and then training the student on these data impressions. Chen et al. (2019) use a generator to produce synthetic images and use the teacher as discriminator, observing that for real images the softmax function of the teacher encourages a unimodal distribution. Micaelli and Storkey (2019) use a generator to produce synthetic training samples, employing adversarial training to improve their quality. Yoo et al. (2019) generate synthetic data by conditioning a generator on output samples from the teacher and a low-dimensional representation of the generated samples. These works assume that no data whatsoever is available for training the student, but they do not transfer to text due to the discrete input space (Krishna et al., 2019). We relax this condition and argue that we can still achieve the goals of ZSKD if we use easy-to-access out-of-domain (OOD), task-agnostic data to aid the process. Krishna et al. (2019) put forth a similar argument, albeit for the problem of model extraction, where they use simple heuristic rules to generate training data for a student, of similar or larger size than the teacher, in order to learn the teacher's output distribution. However, they do not put constraints on the size of the student and even propose a student larger than the teacher.
We study the problem of ZSKD for NLP and construct an OOD dataset similar to Krishna et al. (2019). In addition, we train a text generator to generate samples which maximize the divergence between the teacher and student outputs while staying close to the OOD distribution. Our contributions are as follows:

• We present one of the first works in NLP on model compression for NLU models using KD without the teacher's training data or any other task-specific data.
• We present a novel KD algorithm which combines OOD data gathering and adversarial training.
• Our algorithm generalizes to different classification tasks for NLP including sentiment analysis, question answering, entailment etc.
• We present an analysis of our algorithm on Natural Language Inference.
• We demonstrate that our algorithm can be competitive in the general fine-tuning setting.

Related Work
Knowledge Distillation KD (Hinton et al., 2015) is a well-known deep learning technique used to transfer the knowledge from an already trained large teacher model to a smaller student network. KD adds a new loss function to the student's regular training loss over the training labels. This new loss function aims at matching the smoothened output probabilities of the student with those of the teacher. More specifically, the training data is fed into the teacher model and the teacher logits are obtained. These are typically fed into a softmax function whose temperature parameter is adjusted to smoothen the resulting label distribution. The training loss function for the KD algorithm is as follows:

$$\mathcal{L}_{KD} = \alpha\, CE\big(y, \sigma(z_s)\big) + (1-\alpha)\, KL\big(\sigma(z_t/\tau) \,\|\, \sigma(z_s/\tau)\big) \tag{1}$$

where CE is the cross-entropy, KL is the Kullback-Leibler divergence, y are the ground-truth training labels, and z_s and z_t are the student and teacher logits respectively. σ is the softmax function, and the temperature τ and interpolation weight α are hyperparameters.
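As an illustration, a minimal PyTorch sketch of this loss (the τ² rescaling of the KL term follows common KD practice; the hyperparameter values shown are illustrative, not ours):

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, tau=2.0, alpha=0.5):
    # Hard-label cross-entropy on the student's unsmoothed predictions.
    ce = F.cross_entropy(student_logits, labels)
    # KL divergence between temperature-smoothed teacher and student
    # distributions; kl_div expects log-probabilities as input and
    # probabilities as target.
    kl = F.kl_div(
        F.log_softmax(student_logits / tau, dim=-1),
        F.softmax(teacher_logits / tau, dim=-1),
        reduction="batchmean",
    ) * (tau ** 2)  # common rescaling so gradient magnitudes stay comparable
    return alpha * ce + (1.0 - alpha) * kl
```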

Zero-shot Knowledge Distillation
ZSKD deals with scenarios in which either no training data is available (e.g. Nayak et al. (2019)) or at least the teacher's training data is not available (for example due to customer privacy issues). Lopes et al. (2017) introduce a data-free knowledge distillation approach under the assumption that the teacher network and some metadata (i.e. the teacher activation records or statistics on the teacher's training data) are given. This work reconstructs the original training data by tweaking a noise input until it recovers the given metadata. We differ from Lopes et al. (2017) in that our model does not need any metadata for training. Another case in point is Nayak et al. (2019), which introduces a data-free knowledge distillation approach with no knowledge about the target data distribution. Their Zero-shot technique models the softmax output of the teacher using a Dirichlet distribution and then builds the underlying data samples (so-called Data Impressions) corresponding to the modeled teacher distribution. This approach is infeasible for NLP tasks because the input data is discrete and the size of the output softmax can be very large.
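For intuition, a rough sketch of the data-impression idea under simplifying assumptions (a uniform Dirichlet prior and a teacher that maps a tensor directly to logits; the original method uses class-similarity-aware concentration parameters):

```python
import torch
import torch.nn.functional as F

def data_impression(teacher, input_shape, num_classes, steps=200, lr=0.05):
    # Sample a target output distribution from a (here: uniform) Dirichlet.
    target = torch.distributions.Dirichlet(
        torch.ones(num_classes)).sample().unsqueeze(0)
    # Start from noise and optimize it until the frozen teacher
    # reproduces the sampled distribution.
    x = torch.randn(1, *input_shape, requires_grad=True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.kl_div(F.log_softmax(teacher(x), dim=-1), target,
                        reduction="batchmean")
        loss.backward()
        opt.step()
    return x.detach(), target  # the "data impression" and its soft label
```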
One potential practical scenario for NLP can be training students without accessing teacher's training data. In this scenario, we are allowed to use any text corpus in the public domain except the Figure 1: Schematic Diagram of our Zero-shot KD solution. a) We assume access to a pre-trained teacher. b) We gather out-of-domain (OOD) data and train the generator adversarially. c) Finally we use the generated data and the OOD data for KD. data used for training the teacher network. In this case we can borrow ideas from model extraction techniques such as (Pal et al., 2019;Krishna et al., 2019;Yoo et al., 2019) to facilitate ZSKD training by querying the teacher model using unlabeled data. (Krishna et al., 2019) deal with textual input but do not consider smaller students and the KD scenario. The framework of (Pal et al., 2019) does not apply to pairwise classification tasks. (Yoo et al., 2019) designs a conditional data generator to tackle with lack of training data for the student network and focuses on image classification. However, our solution works on text and our text generator is unconditional.

Adversarial Training
Adversarial examples are training samples with small perturbations that are imperceptible to humans but sufficient to fool neural network classifiers. Goodfellow et al. (2014) proposed adding them to the training set to make CV systems robust to adversarial attacks. Miyato et al. (2016) adapt adversarial training to text classification and improve performance on several supervised and semi-supervised text classification tasks.
Although proposed for model robustness (Ebrahimi et al., 2018; Ghaddar et al., 2021b,a), adversarial training has also been shown to improve state-of-the-art model performance in NLP (Zhu et al., 2019). Work on machine translation proposes making the model robust to both source and target perturbations, generated by swapping the word embedding of a word with that of its synonym; small perturbations are modeled by considering word swaps which cause the smallest increase in loss gradient. Zhu et al. (2019) propose a novel adversarial training algorithm, FreeLB, which makes gradient-based adversarial training efficient by updating both the embedding perturbations and the model parameters simultaneously during the backward pass of training. They show improvements for multiple language models on the GLUE benchmark.
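To make the mechanism concrete, a minimal sketch of a single gradient-based embedding perturbation step (FGM-style, a simplification of FreeLB's multi-step scheme; we assume `model` accepts an `inputs_embeds` argument and returns logits):

```python
import torch
import torch.nn.functional as F

def perturb_embeddings(model, embeds, labels, epsilon=1e-2):
    # One FGM-style step: move the embeddings a small distance in the
    # direction that most increases the loss.
    embeds = embeds.detach().requires_grad_(True)
    loss = F.cross_entropy(model(inputs_embeds=embeds), labels)
    grad, = torch.autograd.grad(loss, embeds)
    delta = epsilon * grad / (grad.norm(dim=-1, keepdim=True) + 1e-12)
    return (embeds + delta).detach()  # adversarial embeddings for training
```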
Micaelli and Storkey (2019) adapt adversarial training to ZSKD: an image generator is trained to increase the divergence between student and teacher, while the student is trained to decrease it.

Methodology
We rely on an adversarial text generator as the backbone of our method. However, we still need data to pre-train the generator. Since we assume access to a general-purpose OOD data source, we delineate general principles to extract a training set from it. Finally, we apply KD on a combination of the OOD training data and the adversarial training data. Figure 1 gives a visual illustration of the proposed ZSKD method.

Out-of-Domain Training Data
Our ZSKD method assumes that we have neither the original training data on which the teacher model was trained nor any other task-specific data. Similar to Krishna et al. (2019), we construct an out-of-domain (OOD) dataset: we randomly sample sentences from a general-purpose text corpus and then, depending on the task, apply simple heuristics to make the text suitable for the problem at hand. We summarize the targeted tasks below, all taken from the GLUE benchmark (Wang et al., 2018); a code sketch of these heuristics follows the task descriptions.
Sentiment Classification (SST-2). We do not modify the sampled sentences for this task but simply feed them to the teacher to get the sentiment output distribution, even though most sentences in the sampled text would have neutral sentiment.
Pairwise Sentence Classification. The training sequence typically consists of two input sentences. Depending on the task these can be:

• In Natural Language Inference (NLI), the two input sentences are the hypothesis and the premise. Depending on the task, the goal can be to determine whether the hypothesis is true (entailment), false (contradiction), or undetermined (neutral) given the premise (MNLI), or whether the premise entails the hypothesis as a binary classification (RTE). For these tasks, we generate the OOD data by randomly extracting a sentence from the corpus to serve as the premise and then, by random chance, constructing the hypothesis to be either a slightly changed version of the premise or a completely new random sentence.
• In tasks such as Quora Question Pair (QQP) and Microsoft Research Paraphrase Corpus (MRPC), the goal is to determine if the two input sentences are semantically equivalent or not. We follow a strategy similar to NLI tasks but for the QQP task we post-process the generated sentences by appending a question mark at the end.
Question NLI. The goal of this task is to determine if the given paragraph contains the answer to the input question. We sample a paragraph from our corpus and, randomly, either sample a segment from within the paragraph to form a question or sample an unrelated sentence from the corpus.
Then, we randomly prepend a question word such as Who, Where or What to the segment and append a question mark at the end.
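A sketch of these heuristics in Python (the token-dropping edit is one possible 'slight change'; the exact edit operations are not specified here):

```python
import random

QUESTION_WORDS = ["Who", "Where", "What", "When", "How", "Why"]

def make_nli_pair(corpus):
    # Premise: a random corpus sentence. Hypothesis: either a slightly
    # changed copy (here: one token dropped) or an unrelated sentence.
    premise = random.choice(corpus)
    if random.random() < 0.5:
        tokens = premise.split()
        if len(tokens) > 1:
            tokens.pop(random.randrange(len(tokens)))
        hypothesis = " ".join(tokens)
    else:
        hypothesis = random.choice(corpus)
    return premise, hypothesis

def make_qnli_pair(corpus, paragraph_sentences):
    # Segment: from within the paragraph, or an unrelated corpus sentence.
    segment = (random.choice(paragraph_sentences) if random.random() < 0.5
               else random.choice(corpus))
    # Prepend a question word and append a question mark.
    question = random.choice(QUESTION_WORDS) + " " + segment.rstrip(".") + "?"
    return question, " ".join(paragraph_sentences)
```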

Generator Pre-training
Inspired by (Micaelli and Storkey, 2019) and by the promise of adversarial training for NLP (Zhu et al., 2019), the key ingredient of our proposed method is to learn a generator that produces training samples. Our adversarial generation is close to adversarial training, and we consider the adversarial samples to be perturbations of the OOD training data D. Therefore we pre-train the generator to follow the distribution of D. Specifically, our generator G is a masked language model, such as BERT, which can generate text from noise:

$$x_p = G_\phi(z), \quad z \sim \mathcal{N}(0, \mathrm{std}) \tag{2}$$

where x_p, the output of the generator, is a sequence of tokens and φ is the set of generator parameters.
The generator is pre-trained by minimizing the following loss function:

$$\mathcal{L}_{G} = H_{CE}\big(x_k, G_\phi(z)\big) \tag{3}$$

where H_CE is the cross-entropy loss and x_k is a sample from the OOD training set D. Note that the noise z matches the length and dimension of the embedding of x_k, with the classification token (CLS) added at the beginning and the separator tokens (SEP) inserted at the same locations as in x_k.
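A sketch of one pre-training step under these assumptions (a generator that accepts `inputs_embeds` and returns per-token vocabulary logits; the CLS/SEP handling is noted in a comment but omitted):

```python
import torch
import torch.nn.functional as F

def generator_pretrain_loss(generator, x_k_ids, embed_dim, std=0.01):
    batch, seq_len = x_k_ids.shape
    # Noise matching the length and dimension of x_k's embeddings; in the
    # full method the CLS/SEP positions are fixed to match x_k (omitted).
    z = torch.randn(batch, seq_len, embed_dim) * std
    logits = generator(inputs_embeds=z)  # (batch, seq_len, vocab_size)
    # Cross-entropy between the generated token distribution and x_k.
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           x_k_ids.reshape(-1))
```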

Adversarial Training
Most methods in adversarial training for NLP (Zhang et al., 2020) perturb the word embeddings instead of generating text, due to the discreteness of text. In order to generate text, we need an argmax operation, which breaks end-to-end differentiability. Moreover, since our goal is KD, embedding perturbation introduces the problem of a size mismatch between the student and teacher embeddings. Instead, we generate text and replace the hard argmax with sampling from the Gumbel-Softmax distribution (Kusner and Hernández-Lobato, 2016; Jang et al., 2016), a continuous distribution over the simplex that can approximate one-hot samples from a discrete distribution.
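PyTorch provides this relaxation directly; a small sketch (the vocabulary and embedding sizes are illustrative):

```python
import torch
import torch.nn.functional as F

x_l = torch.randn(2, 16, 30522)  # generator logits (batch, seq, vocab)

# hard=True yields one-hot samples in the forward pass while gradients
# flow through the soft relaxation (straight-through estimator).
x_p_hat = F.gumbel_softmax(x_l, tau=1.0, hard=True, dim=-1)

# The one-hot samples select rows of an embedding matrix, so discrete
# "tokens" can be fed to teacher and student without breaking backprop.
embedding_matrix = torch.randn(30522, 768)
token_embeds = x_p_hat @ embedding_matrix
```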

Adversarial Step
Once pre-trained, the generator is trained with two losses. The first maximizes the KL-divergence between the teacher and student outputs on the generated data, with the teacher and student parameters kept fixed; the goal is to generate training samples on which the teacher and student diverge the most. However, this alone can lead to degenerate samples which are not useful for transferring teacher knowledge. The second loss is the same as Equation 3 and prevents the generator from diverging too far from the OOD training data. The overall loss $\mathcal{L}_T$ for generator training is thus:

$$\mathcal{L}_T = -\,KL\big(T(\hat{x}_p) \,\|\, S(\hat{x}_p)\big) + H_{CE}(x_k, x_p) \tag{4}$$

where T is the teacher, S is the student, x_k is a sample from the OOD training set, x_p is the softmax output of the generator, and $\hat{x}_p$, the one-hot output, is defined as:

$$\hat{x}_p = \sigma_{Gumbel}(x_l) \tag{5}$$

where σ_Gumbel is the Gumbel-Softmax and x_l are the logits of the generator.
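A sketch of the generator objective, assuming the teacher and student logits are computed on the generated samples and the relative weighting of the two terms is left implicit:

```python
import torch.nn.functional as F

def generator_loss(teacher_logits, student_logits, gen_logits, x_k_ids):
    # Negative KL term: minimizing this loss *increases* the divergence
    # between teacher and student on the generated samples.
    kl = F.kl_div(F.log_softmax(student_logits, dim=-1),
                  F.softmax(teacher_logits, dim=-1),
                  reduction="batchmean")
    # Reconstruction term (Equation 3): keep the generator close to the
    # OOD data distribution.
    recon = F.cross_entropy(gen_logits.reshape(-1, gen_logits.size(-1)),
                            x_k_ids.reshape(-1))
    return -kl + recon
```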

Knowledge Distillation
In each training loop we train the generator for n_G steps and the student for n_S steps. Specifically, the student is optimized using a joint KD loss over the samples generated by the generator G and the samples coming from the OOD dataset. Overall, the student is trained on:

$$\mathcal{L}_S = \alpha\, KL\big(T(x_k) \,\|\, S(x_k)\big) + (1-\alpha)\, KL\big(T(\hat{x}_p) \,\|\, S(\hat{x}_p)\big) \tag{6}$$

where x_k and $\hat{x}_p$ are as defined above and α is an interpolation weight. Note that unlike regular KD, where we have a hard loss and a soft loss, here we have two soft losses: one matches the student and teacher outputs on OOD data and the other on adversarially generated data. Algorithm 1 presents all the steps of our procedure.
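A sketch of the student objective, assuming teacher and student are callables returning logits on both token ids and one-hot inputs (in practice the one-hot samples are multiplied into the embedding matrix), and with the assignment of α to the OOD term as an assumption:

```python
import torch.nn.functional as F

def student_loss(teacher, student, x_k, x_p_hat, alpha=0.2):
    def soft_kl(inputs):
        # Soft KD term: match the student's distribution to the teacher's.
        return F.kl_div(F.log_softmax(student(inputs), dim=-1),
                        F.softmax(teacher(inputs), dim=-1),
                        reduction="batchmean")
    # One KL term on OOD samples, one on adversarially generated samples.
    return alpha * soft_kl(x_k) + (1.0 - alpha) * soft_kl(x_p_hat)
```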

Experiments
We evaluated our proposed adversarial ZSKD approach on six classification tasks from the General Language Understanding Evaluation (GLUE) (Wang et al., 2018) benchmark. The tasks include binary sentiment analysis on the SST-2 dataset (Socher et al., 2013), ternary NLI on MNLI (Williams et al., 2018), binary entailment on the RTE dataset (Bentivogli et al., 2009), semantic equivalence on QQP (Chen et al., 2018) and MRPC (Dolan and Brockett, 2005), and finally question answering adapted to binary classification evaluated on QNLI (Wang et al., 2018). Section 3.1 gives more details about these tasks.

Experimental Setup
All models used in this paper, except those in section 4.5, are based on two architecture settings from BERT (Devlin et al., 2019).

Hyper-parameters We fine-tuned the BERT-based student model for 10 epochs and picked the checkpoint that gave the lowest loss during training. We report results for all methods on the given dev set. For each task, we selected the best fine-tuning learning rate among 5e-5, 4e-5, 3e-5 and 2e-5. We used the AdamW optimizer (Loshchilov and Hutter, 2017) with the default values, together with a linear-decay learning rate scheduler and no warmup steps. We set the α value in our algorithm to 0.2 and the std value to 0.01. Additionally, we set n_G and n_S (see Algorithm 1) to 10 and 100 respectively. Finally, we pre-trained the generator for two epochs.
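The corresponding optimizer setup, as a sketch (the number of training steps is a placeholder that depends on the data size):

```python
import torch
from transformers import get_linear_schedule_with_warmup

def build_optimizer(model, lr=2e-5, num_training_steps=10_000):
    # AdamW with default values and a linear-decay schedule, no warmup.
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=0,
        num_training_steps=num_training_steps)
    return optimizer, scheduler
```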
Hardware Details We trained all models using a single NVIDIA V100 GPU. The batch size was set to 64. We used mixed-precision training (Micikevicius et al., 2018) to expedite the training procedure. All experiments were run using the PyTorch framework.

Results
Table 1 presents our results on SST-2. For each task, we report the score of the original large teacher, the score of the smaller student trained on the original training data, the student trained with KD on the OOD data, and two experiments with different training set sizes using our algorithm. Our baseline is KD with OOD data, adapted from Krishna et al. (2019). They presented results where the student was the same size as the teacher or larger, and applied their method only to SST-2 and MNLI from the GLUE benchmark. We implemented their method, applied it to the smaller-student setting and extended the OOD generation process to the four other GLUE datasets.
The data sizes (x1, x2 and x4) denote the OOD data size relative to the task-specific training data size. The adversarially trained student, in addition to the OOD data, uses an equal number of generated adversarial examples. On SST-2, we attain close to the student's accuracy using only the OOD training data. Our method with x2 OOD data does just as well as the baseline, and when we use all the OOD data used by the baseline we increase the accuracy by 1%.
The results of the NLI classification tasks, MNLI and RTE, are shown in Table 2. MNLI is one of the two hardest tasks that we evaluated on. Looking at the accuracy scores, we can see that the student trained on the original training data falls well short of the teacher. On this task we can see the strength of our method, as adversarial training improves the score with x2 OOD data and even further with x4 OOD data. High model capacity is important for MNLI. We see a similar trend for RTE.
On pairwise sentence classification (Table 3), MRPC follows a similar trend: the adversarial training algorithm improves the F1 score both with x2 and with x4 OOD data. The same applies to the QQP task. Similar to MNLI, model capacity and the amount of training data appear to be important for this task. On QNLI, we see improvements using our algorithm both when using half the OOD data of the baseline and when using the same amount. On average, we see an improvement of 1.4 points over all the tasks. Overall, we were able to recover between 98.2% (SST-2) and 86.2% (MNLI and QQP) of the performance of a version of the student model trained on the original dataset. Similarly, we recovered between 92.3% (SST-2) and 75.1% (MNLI) of the teacher's performance.

Analysis
We inspected the per-class results for MNLI to gain insight into the properties of the adversarially generated samples. Table 6 shows that adding adversarial examples consistently improves performance on the neutral and entailment classes.
Our manual inspection shows that adding the generator to the loop makes the student more robust on examples where the premise and hypothesis do not significantly overlap. The gain could be attributable to the diversity of the adversarial examples, although the generator may produce nonsensical sequences of words. We observed that the premise and hypothesis rarely share common words, contrary to the heuristically populated examples. Adversarial examples prevent the student from relying on superficial syntactic properties of the OOD samples.

Table 7: Results when using GPT-2 for OOD generation

Alternative for OOD Generation
We explored the use of a language model for OOD generation. Instead of sampling OOD data from a corpus, we used distilGPT-2 (Wolf et al., 2019), a lighter version of the GPT-2 (Radford et al., 2019) model. We expected this to perform slightly better since GPT-2 is trained on a much larger dataset. However, as seen in Table 7, we do not observe any improvement, and the algorithm is much slower due to the cost of executing a language model in the training loop.
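For reference, a minimal sketch of sampling OOD text from distilGPT-2 with the transformers library (the prompt and sampling parameters are illustrative, not the exact configuration used):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2")

# Sample free-form sentences to serve as OOD data.
inputs = tokenizer("The", return_tensors="pt")
outputs = model.generate(**inputs, do_sample=True, top_k=50,
                         max_length=32, num_return_sequences=4,
                         pad_token_id=tokenizer.eos_token_id)
for seq in outputs:
    print(tokenizer.decode(seq, skip_special_tokens=True))
```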

Fine-tuning Setting
We apply our data generator to fine-tuning a 6-layer transformer model on the GLUE benchmark. In this setting we use the original training data and demonstrate that our algorithm can be used for data augmentation. We use RoBERTa-LARGE as the teacher and distilRoBERTa as the generator; the student is also initialized with the weights of a pre-trained distilRoBERTa model. Our baseline is a fine-tuned distilRoBERTa model. We train the student using all the steps in Algorithm 1; the only difference is that, since we have access to the labels, we apply a cross-entropy loss on the training data and a KL-divergence loss on the augmented data. Table 5 presents the results: we improve the average performance over the baseline by almost 2 points. Note that the baseline is trained with just the cross-entropy loss.
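A sketch of the modified student loss in this supervised setting (which term receives the interpolation weight is an assumption):

```python
import torch.nn.functional as F

def supervised_loss(teacher, student, x, labels, x_aug, alpha=0.2):
    # Cross-entropy on the labeled training data...
    ce = F.cross_entropy(student(x), labels)
    # ...and a soft KL term on the generator-augmented data.
    kl = F.kl_div(F.log_softmax(student(x_aug), dim=-1),
                  F.softmax(teacher(x_aug), dim=-1),
                  reduction="batchmean")
    return alpha * ce + (1.0 - alpha) * kl
```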

Conclusion
We present the first study on Zero-shot Knowledge Distillation (ZSKD) for NLP. We present an algorithm based on OOD data generation and adversarial learning and evaluate it on six tasks from the GLUE benchmark, reaching at least 75% of the teacher's performance on all tasks while attaining 30x compression. The next steps are to a) explore a generic methodology for OOD data creation and b) study sequence generation tasks such as machine translation and abstractive summarization, achieving compression without the original training data while having access to a teacher.