MATE-KD: Masked Adversarial TExt, a Companion to Knowledge Distillation

The advent of large pre-trained language models has given rise to rapid progress in the field of Natural Language Processing (NLP). While the performance of these models on standard benchmarks has scaled with size, compression techniques such as knowledge distillation have been key in making them practical. We present MATE-KD, a novel text-based adversarial training algorithm which improves the performance of knowledge distillation. MATE-KD first trains a masked language model-based generator to perturb text by maximizing the divergence between teacher and student logits. Then, using knowledge distillation, a student is trained on both the original and the perturbed training samples. We evaluate our algorithm, using BERT-based models, on the GLUE benchmark and demonstrate that MATE-KD outperforms competitive adversarial learning and data augmentation baselines. On the GLUE test set our 6-layer RoBERTa-based model outperforms BERT-large.


Introduction
Transformers (Vaswani et al., 2017) and transformer-based Pre-trained Language Models (PLMs) (Devlin et al., 2019) are ubiquitous in applications of NLP. They are highly parallelizable and their performance scales well with increases in model parameters and data. Increasing model parameters depends on the availability of computational resources, and PLMs are typically trained on unlabeled data, which is cheap to obtain.
Recently, the trillion parameter mark has been breached for PLMs (Fedus et al., 2021) amid serious environmental concerns (Strubell et al., 2019). However, without a change in our current training paradigm, training larger models may be unavoidable. In order to deploy these models for practical applications such as virtual personal assistants, recommendation systems, and e-commerce platforms, model compression is necessary.
* Equal Contribution. † Work done during an internship at Huawei Noah's Ark Lab.
Knowledge Distillation (KD) (Buciluǎ et al., 2006; Hinton et al., 2015) is a simple, yet powerful knowledge transfer algorithm which is used for neural model compression (Jiao et al., 2019), ensembling (Hinton et al., 2015) and multi-task learning (Clark et al., 2019). In NLP, KD for compression has received renewed interest in the last few years. It is one of the most widely researched algorithms for the compression of transformer-based PLMs (Rogers et al., 2020).
One key feature which makes KD attractive is that it only requires access to the teacher's output or logits and not the weights themselves. Therefore, if a trillion parameter model resides on the cloud, an API level access to the teacher's output is sufficient for KD. Consequently, the algorithm is architecture agnostic, i.e., it can work for any deep learning model and the student can be a different model from the teacher.
Recent works on KD for transfer learning with PLMs extend the algorithm in two main directions. The first is towards "model" distillation (Wang et al., 2020; Jiao et al., 2019), i.e., distilling the intermediate weights such as the attention weights or the intermediate layer outputs of transformers. The second direction is towards curriculum-based or progressive KD (Sun et al., 2020; Mirzadeh et al., 2019; Jafari et al., 2021), where the student learns one layer at a time or from an intermediary teacher, known as a teacher assistant. While these works have shown accuracy gains over standard KD, they have come at the cost of architectural assumptions, not least a common architecture between student and teacher, and greater access to teacher parameters and intermediate outputs. Another issue is that the decision to distill one teacher layer and to skip another is arbitrary. Still, the teacher typically demonstrates better generalization.
We are interested in KD for model compression and study the use of adversarial training (Goodfellow et al., 2014) to improve student accuracy using just the logits of the teacher, as in standard KD. Specifically, our work makes the following contributions:
• We present a text-based adversarial algorithm, MATE-KD, which increases the accuracy of the student model using KD.
• Our algorithm only requires access to the teacher's logits and thus keeps the teacher and student architecture independent.
• We evaluate our algorithm on the GLUE (Wang et al., 2018) benchmark and demonstrate improvement over competitive baselines.
• On the GLUE test set, we achieve a score of 80.9, which is higher than BERT LARGE.
• We also demonstrate improvement on out-of-domain (OOD) evaluation.
Related Work

Knowledge Distillation
We can summarize the knowledge distillation loss, $\mathcal{L}$, as follows:

$$\mathcal{L} = \lambda\, H_{CE}\big(y, \sigma(z_S(X))\big) + (1 - \lambda)\, T^2\, D_{KL}\big(\sigma(z_T(X)/T) \,\|\, \sigma(z_S(X)/T)\big)$$

where H_CE represents the cross entropy between the true label y and the student network prediction S(X) for a given input X, D_KL is the KL divergence between the teacher and student predictions softened using the temperature parameter T, z(X) is the network output before the softmax layer (the logits), and σ(·) indicates the softmax function. The term λ in the above equation is a hyper-parameter which controls the relative contribution of the cross entropy and KD losses. Patient KD (Sun et al., 2019) introduces an additional loss to KD which distills the intermediate layer information onto the student network. Due to the difference in the number of student and teacher layers, they propose either skipping alternate layers or distilling only the last few layers. TinyBERT (Jiao et al., 2019) applies embedding distillation and intermediate layer distillation, which includes hidden state distillation and attention weight distillation. Although it achieves strong results on the GLUE benchmark, this approach is infeasible for very large teachers. MiniLM (Wang et al., 2020) proposed an interesting alternative whereby they distill the key, query and value matrices of the final layer of the teacher.
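The loss above can be sketched numerically as follows. This is a minimal NumPy illustration on toy logits; the function names and the use of NumPy are ours for illustration, and the actual experiments compute these terms on PyTorch model outputs:

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax, sigma(z / T), stable via max subtraction."""
    z = np.asarray(z, dtype=float) / T
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(y, z_student, z_teacher, lam=0.5, T=2.0):
    """L = lam * H_CE(y, sigma(z_S)) + (1 - lam) * T^2 * D_KL(teacher || student)."""
    p_s = softmax(z_student)                                  # student predictions (T = 1)
    ce = -np.log(p_s[np.arange(len(y)), y]).mean()            # cross entropy with hard labels
    p_t, p_s_soft = softmax(z_teacher, T), softmax(z_student, T)
    kl = (p_t * (np.log(p_t) - np.log(p_s_soft))).sum(-1).mean()
    return lam * ce + (1 - lam) * (T ** 2) * kl
```

With identical teacher and student logits the KL term vanishes and only the cross-entropy term remains, which is a quick sanity check on an implementation.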

Adversarial Training
Adversarial examples are small perturbations of training samples that are indistinguishable to humans but sufficient to produce misclassifications by a trained neural network. Goodfellow et al. (2014) showed that adding these examples to the training set can make a neural network model robust to perturbations. Miyato et al. (2016) adapt adversarial training to text classification and improve performance on a few supervised and semi-supervised text classification tasks.
In NLP, adversarial training has surprisingly been shown to improve generalization as well (Zhu et al., 2019). One line of work studies machine translation and proposes making the model robust to both source and target perturbations, generated by swapping the embedding of a word with that of its synonym. Small perturbations are modeled by considering word swaps which cause the smallest increase in the loss gradient, and this achieves a higher BLEU score on Chinese-English and English-German translation compared to the baseline. Zhu et al. (2019) propose a novel adversarial training algorithm, FreeLB, which makes gradient-based adversarial training efficient by updating both the embedding perturbations and the model parameters simultaneously during the backward pass of training. They show improvements for multiple language models on the GLUE benchmark. Embedding perturbations are attractive because they produce stronger adversaries (Zhu et al., 2019) and keep the system end-to-end differentiable, as the embeddings are continuous. The salient features of adversarial training for NLP are a) a minimax formulation, where adversarial examples are generated to maximize a loss function and the model is trained to minimize that loss function, and b) a way of keeping the perturbations small, such as a norm bound on the gradient (Zhu et al., 2019) or replacing words by their synonyms.
If these algorithms are adapted to KD, one key challenge is the embedding mismatch between the teacher and student. Even if the embedding size is the same, the student embedding needs to be frozen to match the teacher embedding, and freezing embeddings typically leads to lower performance. If we adapt adversarial training to KD, one key advantage is that access to the teacher distribution relaxes the requirement of generating label-preserving perturbations. These considerations have prompted us to design an adversarial algorithm where we perturb the actual text instead of the embeddings. Rashid et al. (2020) also propose a text-based adversarial algorithm, for the problem of zero-shot KD (where the teacher's training data is unavailable), but their generator, instead of perturbing text, generates new samples and requires additional losses and pre-training to work well.

Data Augmentation
One of the first works on BERT compression (Tang et al., 2019) used KD and proposed data augmentation using heuristics such as part-of-speech guided word replacement. They demonstrated improvement on three GLUE tasks. One limitation of this approach is that the heuristics are task specific. Jiao et al. (2019) present an ablation study in their work which demonstrates a strong contribution of data augmentation to the performance of their KD algorithm. They augment the data by randomly selecting a few words of a training sentence and replacing them with the words whose embeddings are closest under cosine distance. Our adversarial learning algorithm can be interpreted as a data augmentation algorithm, but instead of a heuristic approach we propose a principled, end-to-end differentiable augmentation method based on adversarial learning. Khashabi et al. (2020) presented a data augmentation technique for question answering whereby they took seed questions and asked humans to perturb only a few tokens to generate new ones. The human annotators could modify the label if needed. They demonstrated improved generalization and robustness with the augmented data. Our algorithm is built on similar principles but does not require humans in the loop: instead of human annotators, we use the teacher to modify the labels.
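This embedding-nearest-neighbour replacement can be sketched as follows; the tiny `vocab` and `emb` objects are hypothetical stand-ins for a real vocabulary and embedding table, not TinyBERT's actual resources:

```python
import numpy as np

def augment(tokens, emb, vocab, p=0.4, rng=None):
    """Replace each in-vocabulary token, with probability p, by the word whose
    embedding is closest under cosine similarity (a TinyBERT-style heuristic)."""
    rng = rng or np.random.default_rng()
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)   # unit-normalize rows
    inv = {idx: word for word, idx in vocab.items()}          # index -> word
    out = []
    for tok in tokens:
        if tok in vocab and rng.uniform() < p:
            i = vocab[tok]
            sims = unit @ unit[i]        # cosine similarity to every word
            sims[i] = -np.inf            # never pick the word itself
            out.append(inv[int(sims.argmax())])
        else:
            out.append(tok)
    return out
```

Unlike this heuristic, our method lets a trained generator choose the replacements adversarially.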

Methodology
We propose an algorithm that co-trains an adversarial text generator while training a student network using KD. Figure 1 gives an illustration of our architecture.

Generator
The text generator is simply a pre-trained masked language model which is trained to perturb training samples adversarially. We frame our technique in a minimax regime: in the maximization step of each iteration, we feed the generator a training sample with a few of the tokens replaced by masks. We fix the rest of the sentence and replace the masked tokens with the generator output to construct a pseudo training sample X′. This pseudo sample is fed to both the teacher and the student models, and the generator is trained to maximize the divergence between the teacher and the student. We present an example of the masked generation process in Figure 2. The student is trained during the minimization step.

Maximization Step
The generator is trained to generate pseudo samples by maximizing the following loss function:

$$\max_{\phi}\ \mathcal{L}_{G}(\phi) = D_{KL}\big(\sigma(z_T(X'))\,\|\,\sigma(z_S(X'))\big), \qquad X' = G_{\phi}(X^m)$$

where D_KL is the KL divergence, G_φ(·) is the text generator network with parameters φ, T(·) and S_θ(·) are the teacher and student networks respectively, and X^m is a randomly masked version of the input X = [x_1, x_2, ..., x_n] with n tokens.
The masked input is produced as

$$X^m = \text{Mask}(X, \rho), \qquad x_i^m = \begin{cases} \text{[MASK]} & \text{if } u_i < \rho \\ x_i & \text{otherwise} \end{cases}, \qquad u_i \sim \text{unif}(0, 1), \quad X \sim \mathcal{D}$$

where unif(0, 1) represents the uniform distribution, and the Mask(·) function masks the tokens of inputs sampled from the data distribution D with probability ρ. The term ρ is treated as a hyper-parameter in our technique. In summary, for each training sample, we randomly mask some tokens according to the samples drawn from the uniform distribution and the threshold value ρ.
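The masking rule amounts to an independent Bernoulli(ρ) decision per token. A small sketch, with a literal "[MASK]" string standing in for the tokenizer's mask token:

```python
import numpy as np

MASK = "[MASK]"

def mask_tokens(tokens, rho=0.3, rng=None):
    """x_i^m = [MASK] if u_i < rho else x_i, with u_i ~ unif(0, 1)."""
    rng = rng or np.random.default_rng()
    u = rng.uniform(0.0, 1.0, size=len(tokens))   # one uniform draw per token
    return [MASK if u_i < rho else tok for tok, u_i in zip(tokens, u)]
```

On a long input the fraction of masked tokens concentrates around ρ, which is the knob tuned in our sensitivity analysis.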
Then in the forward pass, the masked sample X^m is fed to the generator to obtain the output pseudo text based on the generator's predictions for the masked tokens. The generator needs to output a one-hot representation, but using an argmax inside the generator would lead to non-differentiability. Instead we apply the Gumbel-Softmax (Jang et al., 2016), which is an approximation to sampling from the argmax. Using the straight-through estimator (Bengio et al., 2013) we can still apply argmax in the forward pass and obtain text X′ from the network outputs:

$$X'_{i} = \underset{k}{\arg\max}\ \frac{\exp\big((z_{\phi}(X^m)_{i,k} + g_{i,k})/\tau\big)}{\sum_{j} \exp\big((z_{\phi}(X^m)_{i,j} + g_{i,j})/\tau\big)}$$

where g_i ∼ Gumbel(0, 1), z_φ(·) returns the logits produced by the generator for a given input, and τ is the temperature of the softmax above.
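The forward pass can be sketched as below. Since NumPy has no autograd, the straight-through trick is shown as a comment in framework notation (in PyTorch one would write `y = (hard - soft).detach() + soft`); the function name is ours:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def gumbel_softmax_st(logits, tau=1.0, rng=None):
    """Straight-through Gumbel-Softmax over per-position vocabulary logits.
    The forward pass emits a one-hot argmax sample; gradients would flow
    through the soft relaxation. Returns (hard_one_hot, soft_sample)."""
    rng = rng or np.random.default_rng()
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))   # Gumbel(0, 1) noise
    soft = softmax((np.asarray(logits, dtype=float) + g) / tau)
    hard = np.zeros_like(soft)
    hard[np.arange(len(soft)), soft.argmax(-1)] = 1.0      # argmax in the forward pass
    # Straight-through estimator (in an autograd framework):
    #   y = (hard - soft).detach() + soft   # forward: hard, backward: grad of soft
    return hard, soft
```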
In the backward pass, the generator simply applies the gradients from the Gumbel-Softmax without the argmax:

$$\nabla_{\phi} X'_{i} \approx \nabla_{\phi}\, \frac{\exp\big((z_{\phi}(X^m)_{i,k} + g_{i,k})/\tau\big)}{\sum_{j} \exp\big((z_{\phi}(X^m)_{i,j} + g_{i,j})/\tau\big)}$$

Minimization Step
In the minimization step, the student network is trained to minimize the gap between the teacher and student predictions and to match the hard labels from the training data by minimizing the following loss:

$$\mathcal{L}_{\text{MATE-KD}}(\theta) = \mathcal{L}_{CE}(\theta) + \mathcal{L}_{KD}(\theta) + \mathcal{L}_{ADV}(\theta)$$

where

$$\mathcal{L}_{ADV}(\theta) = T^2\, D_{KL}\big(\sigma(z_T(X')/T)\,\|\,\sigma(z_S(X')/T)\big)$$

Here the terms L_CE(θ) and L_KD(θ) are the same cross-entropy and KL divergence terms as in the KD loss defined earlier. L_KD(θ) and L_ADV(θ) are used to match the student with the teacher, and L_CE(θ) is used for the student to follow the ground-truth labels y.
Bear in mind that our L_MATE-KD(θ) loss differs from the regular KD loss in two respects: first, it has the additional adversarial loss, L_ADV, which minimizes the gap between the predictions of the student and the teacher on the masked adversarial text samples, X′, generated in the maximization step; second, we no longer have the weight term λ from KD in our technique (i.e., we consider equal weights for the three loss terms in L_MATE-KD).
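A minimal NumPy sketch of this combined loss on toy logits (the function names are ours; in training these terms are computed on PyTorch model outputs), with the `*_adv` arguments denoting logits on the adversarial batch X′:

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def kl(p, q):
    """Mean KL divergence D_KL(p || q) over a batch of distributions."""
    return (p * (np.log(p) - np.log(q))).sum(-1).mean()

def mate_kd_loss(y, zs_x, zt_x, zs_adv, zt_adv, T=2.0):
    """L_MATE-KD = L_CE + L_KD + L_ADV, with equal weights:
    cross entropy on the original batch, teacher-student KL on the original
    batch, and teacher-student KL on the adversarial batch X'."""
    ce = -np.log(softmax(zs_x)[np.arange(len(y)), y]).mean()
    l_kd = (T ** 2) * kl(softmax(zt_x, T), softmax(zs_x, T))
    l_adv = (T ** 2) * kl(softmax(zt_adv, T), softmax(zs_adv, T))
    return ce + l_kd + l_adv
```

When the student already matches the teacher on both batches, the two KL terms vanish and the loss reduces to the cross entropy alone.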

Rationale Behind the Masked Adversarial Text Generation for KD
The rationale behind generating partially masked adversarial texts instead of generating adversarial texts from scratch (which is equivalent to masking the input of the text generator entirely) is three-fold:
1. Partial masking generates more realistic sentences than generation from scratch when the generator is trained only to increase the teacher-student divergence. We present a few generated sentences in Section 4.6.
2. Generating text from scratch increases the chance of generating OOD data. Feeding OOD data to the KD algorithm leads to matching the teacher and student functions across input domains that the teacher is not trained on.
3. By masking and changing only a few tokens of the original text, we constrain the amount of perturbation, as is required for adversarial training.
In our MATE-KD technique, we can tune ρ to control the divergence from the data distribution and find the sweet spot which gives rise to the maximum improvement for KD. We also present an ablation on the effect of this parameter on downstream performance in Section 4.5.
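The overall co-training procedure alternates the two steps. The skeleton below shows the control flow only: `gen_step` and `student_step` are placeholders for the real maximization and minimization updates, and we assume n_G and n_S (set to 10 and 100 in our experimental setup) denote consecutive generator and student updates per cycle:

```python
def train_mate_kd(batches, gen_step, student_step, n_g=10, n_s=100):
    """Alternate n_g maximization (generator) updates with n_s minimization
    (student) updates until the batch stream is exhausted. The two callables
    update their respective model on a batch and return its loss value."""
    gen_losses, student_losses = [], []
    it = iter(batches)
    while True:
        try:
            for _ in range(n_g):       # maximization: generator raises divergence
                gen_losses.append(gen_step(next(it)))
            for _ in range(n_s):       # minimization: student matches teacher
                student_losses.append(student_step(next(it)))
        except StopIteration:
            break
    return gen_losses, student_losses
```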

Experiments
We evaluated MATE-KD on all nine datasets of the General Language Understanding Evaluation (GLUE) (Wang et al., 2018) benchmark, which include classification and regression. These datasets can be broadly divided into three families of problems: single-sentence tasks, which include linguistic acceptability (CoLA) and sentiment analysis (SST-2); similarity and paraphrasing tasks, which include paraphrasing (MRPC and QQP) and a regression task (STS-B); and inference tasks, which include Natural Language Inference (MNLI, WNLI, RTE) and Question Answering (QNLI).

Experimental Setup
We evaluate our algorithm in two different setups. In the first, the teacher model is RoBERTa LARGE and the student is initialized with the weights of DistilRoBERTa. RoBERTa LARGE consists of 24 layers with a hidden dimension of 1024, 16 attention heads, and a total of 355 million parameters. We use the pre-trained model from Huggingface. The student consists of 6 layers, a hidden dimension of 768, 8 attention heads, and 82 million parameters. Both models have a vocabulary size of 50,265, extracted using the Byte Pair Encoding (BPE) (Sennrich et al., 2016) tokenization method.
In our second setup, the teacher model is BERT BASE (Devlin et al., 2019) and the student model is initialized with the weights of DistilBERT, which consists of 6 layers with a hidden dimension of 768 and 8 attention heads. The pre-trained models are taken from the authors' release. The teacher and the student have 110M and 66M parameters respectively, with a vocabulary size of 30,522 extracted using BPE.
Hyper-parameters We fine-tuned the RoBERTa student model and picked the best checkpoint that gave the highest score on the dev set of GLUE. These hyper-parameters were fixed for the GLUE test submissions as well as the BERT experiments.
We used the AdamW (Loshchilov and Hutter, 2017) optimizer with the default values. In addition, we used a linear decay learning rate scheduler with no warmup steps. We set the masking probability ρ to 0.3. Additionally, we set the value of n_G to 10 and n_S to 100. The learning rate, number of epochs, and other hyper-parameters are presented in Table 8 of Appendix A.
Hardware Details We trained all models using a single NVIDIA V100 GPU. We used mixed-precision training (Micikevicius et al., 2018) to expedite the training procedure. All experiments were run using the PyTorch framework.

Results
Table 1 presents the results of MATE-KD on the GLUE dev set. Even though the datasets have different evaluation metrics, we also present the average of all scores, which is used to rank submissions to GLUE. Our first baseline is the fine-tuned DistilRoBERTa; we then compare with KD, FreeLB, FreeLB plus KD, and TinyBERT (Jiao et al., 2019) data augmentation plus KD.
We observe that FreeLB (Zhu et al., 2019) significantly improves the fine-tuned student by around 1.2 points on average. However, when we apply FreeLB + KD, we do not see any further improvement, whereas applying KD alone improves the score by about 2 points. This is because FreeLB relies on the model (student) output rather than the teacher output to generate adversarial perturbations, and therefore cannot benefit from KD. As previously discussed, FreeLB relies on embedding perturbation, and in order to generate the teacher output on the student's perturbed embeddings, the two embeddings would need to be tied together, which is infeasible due to the size and training requirements.
We also compared against the data augmentation algorithm of TinyBERT. We ran their code to generate the augmented data offline. Although they augment the data about 20 times, depending on the GLUE task, we observed poor results when using all of this data to fine-tune with KD. We therefore generated only 1x augmented data and saw an average improvement of 0.35 over KD. MATE-KD achieves the best result among the student models on all GLUE tasks and achieves an average improvement of 1.87 over KD alone. For MATE-KD we likewise generated the same number of adversarial samples as the training data.
We present the results on the test set of GLUE in Table 2. We list the number of parameters for each model. The results of BERT BASE, BERT LARGE (Devlin et al., 2019), TinyBERT, and MobileBERT (Sun et al., 2020) are taken from the GLUE leaderboard. The KD models have RoBERTa LARGE, fine-tuned without ensembling, as the teacher.
TinyBERT and MobileBERT are the current state-of-the-art 6-layer transformer models on the GLUE leaderboard. We include them in this comparison although their teacher is BERT BASE as opposed to RoBERTa LARGE. We make the case that one reason we can train with a larger and more powerful teacher is that we only require the logits of the teacher while training. Most of the works in the literature proposing intermediate layer distillation (Jiao et al., 2019; Sun et al., 2020) are trained on 12-layer BERT teachers. As PLMs get bigger, feasible approaches to KD will involve algorithms which rely on only minimal access to teachers.
We apply a standard trick to boost the performance of STS-B and RTE, i.e., we initialize these models with the trained checkpoint of MNLI. This was not done for the dev results. The WNLI score is the same for all the models and, although not displayed in the table, is part of the average score. We make a few observations from this table. Firstly, using KD, a student with a powerful teacher can overcome a significant difference in parameters between competitive models. Secondly, our algorithm significantly improves KD, with an average 2-point increase on the unseen GLUE test set. Our model achieves state-of-the-art results for a 6-layer transformer model on the GLUE leaderboard.
We also evaluate our algorithm using BERT BASE as the teacher and DistilBERT as the student on the GLUE benchmark. WNLI results are the same for all models and are used to calculate the average. We compare against the teacher, the student, and KD plus TinyBERT augmentation. Here, remarkably, MATE-KD beats the teacher's performance on average. On the two largest datasets in GLUE, QQP and MNLI, we beat and match the teacher's performance, respectively.
We observe that MATE-KD outperforms its competitors both when the teacher is twice the size of the student and when it is four times the size. This may be because the algorithm generates adversarial examples based on the teacher's distribution. A well-designed adversarial algorithm can help us probe parts of the teacher's distribution not spanned by the training data, leading to better generalization.

OOD Evaluation
It has been shown that strong NLU models tend to learn spurious surface-level patterns from the dataset (Poliak et al., 2018; Gururangan et al., 2018) and may perform poorly on carefully constructed OOD datasets. We present this evaluation in Table 4. We use the same model checkpoint as the one presented in Table 1 and compare against DistilRoBERTa. We observe that MATE-KD improves the baseline performance on both evaluation datasets. The performance increase on HANS is larger. We can conclude that the algorithm's improvements are not due to learning spurious correlations and biases in the dataset.

Ablation Study
Table 5 presents the contribution of the generator and adversarial learning to MATE-KD. We first present the result of MATE-KD on all the GLUE datasets (except WNLI) and compare against the effect of removing the adversarial training, and then the generator altogether. When we remove the adversarial training, we essentially remove the maximization step and do not train the generator. The generator in this setting is a pre-trained masked language model. In the minimization step, we still generate pseudo samples and apply all losses. The setting where we remove the generator is akin to simple KD.
We observe that the generator improves KD by an average of 1.3 and the adversarial training increases the score further by 0.6.

Sensitivity Analysis
Our algorithm does not require the loss interpolation weight λ of KD but instead relies on one additional parameter, ρ, which is the probability of masking a given token. We present the effect of changing ρ in Table 7 on the MNLI and RTE dev set results, fixing all other hyper-parameters. We selected MNLI and RTE because they are part of Natural Language Inference, which is one of the hardest tasks on GLUE. Moreover, in the RoBERTa experiments we see the largest drop in student scores for these two datasets. We observe that for MNLI the best result is at 30%, followed by 20%, and for RTE the best choice is 40%, followed by 30%. This corresponds to the heuristic-based data augmentation works, which typically modify tokens with a 30% to 40% probability. We set this parameter to 30% for all the experiments and did not tune it for each dataset or architecture.

Table 6: Selected original and generated samples from SST-2.

Original: the new insomnia is a surprisingly faithful remake of its chilly predecessor, and beautifully shot, delicately scored and powered by a set of heartfelt performances
Generated: sinister new insomnia shows a surprisingly terrible remake of its hilarious predecessor, and beautifully sublime, delicately scored, powered by great dozens of heartfelt performances

Original: a perfectly pleasant if slightly pokey comedy that appeals to me
Generated: a 10 pleasant if slightly pokey comedy Federal appeals punished me

Original: good news to anyone who's fallen under the sweet, melancholy spell of this unique director's previous films
Generated: good news for anyone who's fallen under the sweet, melancholy spell of this unique director's previous mistakes

Generated Samples
We present a few selected samples that our generator produced during training on the SST-2 dataset in Table 6. SST-2 is a binary sentiment analysis dataset. The data consist of movie reviews, at both the phrase and the sentence level. We observe that only a few tokens are modified in the generated text. When the generated text is semantically plausible, one of three things happens: the generated sentence keeps the same sentiment, as in Examples 2 and 3; it changes the sentiment, as in Examples 1 and 4; or the text has ambiguous sentiment, as in Example 5. We can use all of these for training since we do not rely on the original label but obtain the teacher's output.

Discussion and Future Work
We have presented MATE-KD, a novel text-based adversarial training algorithm which improves the student model in KD by generating adversarial examples while accessing the logits of the teacher only. This approach is architecture agnostic and can be easily adapted to other applications of KD such as model ensembling and multi-task learning.
We demonstrate the need for an adversarial training algorithm for KD based on text rather than embedding perturbation. Moreover, we demonstrate the importance of masking for our algorithm.
One key theme that we have presented in this work is that as PLMs inevitably increase in size and number of parameters, techniques that rely on access to the various layers and intermediate parameters of the teacher will become more difficult to train. In contrast, algorithms which are well-motivated and require minimal access to the teacher can learn from more powerful teachers and will be more useful. An example of such an algorithm is the KD algorithm itself.
Future work will consider a) using label information and a measure of semantic quality to filter the generated sentences b) exploring the application of our algorithm to continuous data such as speech and images and c) exploring other applications of KD.

A Training Details
We present the details of the learning rate, number of epochs, and the batch size we use for each training set of GLUE for both the BERT and the RoBERTa settings.