Distilling Multilingual Transformers into CNNs for Scalable Intent Classification

We describe an application of Knowledge Distillation used to distill and deploy multilingual Transformer models for voice assistants, enabling text classification for customers globally. Transformers have set new state-of-the-art results for tasks like intent classification, and multilingual models exploit cross-lingual transfer to allow serving requests across 100+ languages. However, their prohibitive inference time makes them impractical to deploy in real-world scenarios with low latency requirements, such as voice assistants. We address the problem of cross-architecture distillation of multilingual Transformers into simpler models, while maintaining multilinguality without performance degradation. Training multilingual student models has received little attention, and is our main focus. We show that a teacher-student framework, where the teacher's unscaled activations (logits) on unlabelled data are used to supervise student model training, enables distillation of Transformers into efficient multilingual CNN models. Our student model achieves performance equivalent to the teacher's, and outperforms a similar model trained on the labelled data used to train the teacher. This approach has enabled us to accurately serve global customer requests at speed (an 18x improvement), scale, and low cost.


Introduction
For nearly all natural language understanding tasks, e.g. SuperGLUE (Wang et al., 2019), state-of-the-art results are obtained using pre-trained Transformer models. Their performance depends on their size and the amount of pre-training data, typically billions of tokens (Xue et al., 2021).
Intent Classification (IC), the task of understanding a user's intent from an utterance, is a core component of voice assistants such as Siri or Alexa. IC is challenging due to the hundreds of intents and contexts that such systems must support, and IC performance has benefited greatly from Transformers (Chen et al., 2019). As voice systems have expanded support to new languages, the benefits of Transformers have multiplied with the advent of multilingual versions such as XLM-RoBERTa (Conneau et al., 2020).
Despite these advantages, deploying Transformers at scale is not always feasible, mainly due to: (i) a large memory footprint (hundreds of GB), and (ii) long inference times that are prohibitive for applications processing millions of inputs per minute.
While approaches to reducing memory footprint, such as quantization (Vargaftik et al., 2021) or pruning (Gordon et al., 2020), have been proposed, minimizing inference time is more challenging. Pruning can speed up inference, but there are limits to how many self-attention layers can be pruned without loss of performance. Knowledge Distillation (KD) (Hinton et al., 2015) is another approach for transferring knowledge across model architectures, e.g. from Transformers to LSTMs (Wasserblat et al., 2020), while preserving performance.
However, cross-architecture distillation of multilingual Transformers into multilingual non-Transformer architectures has received almost no attention in the community. In this work we present the first exposition of this task. Specifically, we describe an approach used to deploy multilingual IC models for voice assistants, allowing accurate inference at scale, speed, and low cost.
We face two key challenges: (i) meeting low inference latency requirements, allowing us to serve customers globally in real time (millions of requests per minute), and (ii) supporting multilinguality; here we support 11 locales covering 7 languages. Example utterances, representing e-commerce questions issued in different languages, are shown below.
• how many calories are in a banana? (EN)
• wie viel fett enthält hühnchen? (DE)
• come si conservano le vongole in frigo (IT)
• cómo se hace un queque de yogur (ES)
• combien de temps peut-on réfrigérer une banane (FR)
• é possível congelar pastéis de nata (PT)

We use the teacher-student distillation paradigm, and show that the optimal KD strategy for multilingual IC can leverage teacher logits alone (Mukherjee and Awadallah, 2020). Utterances for IC are typically 10-40 tokens long, allowing us to exploit an efficient ConvNet architecture, and we assess how such models can obtain multilingual and pretrained knowledge from models like XLM-R via distillation.
While there have been previous attempts at distilling Transformer models into ConvNets (Chia et al., 2019), our work is the first to explore cross-architecture multilingual KD on real-world applications with strict requirements for latency and accuracy. We make the following contributions:
• Knowledge distillation from Transformers into a multilingual ConvNet student for intent classification, based on the teacher-student paradigm;
• Multilingual student models with minimal inference latency (18x speed-up relative to the teacher) without any loss in classification accuracy;
• An evaluation framework outlining the amount of distillation data required, and an assessment of the student model's generalization on unseen data.

Related Work
We now review some of the popular approaches for distilling and compressing Transformer models.
Model Finetuning. Eisenschlos et al. (2019) propose an efficient way to fine-tune monolingual models on multilingual tasks by simply using the output of cross-lingual Transformer models as pseudo-labels. Their approach is based on the ULMFiT model (Howard and Ruder, 2018), where instead of stacked LSTM networks (Hochreiter and Schmidhuber, 1997), they rely on quasi-recurrent neural networks (QRNNs) (Bradbury et al.). QRNNs are similar to CNNs, with the difference that the convolutional operators are applied at each timestep; due to parallelization, however, they can be computed much more efficiently than LSTMs.
QRNNs are reported to be up to 16x faster than LSTMs; however, for our case, we find that ConvNets are more efficient than QRNNs, as they do not perform the stepwise computations that QRNNs do. We compare the inference time of QRNNs and our proposed student model, and conclude that simple ConvNets have significantly lower inference time.
Model Compression. Ganesh et al. (2021) systematically review approaches for compressing Transformers. To reduce memory usage, quantization is often applied (Vargaftik et al., 2021). Quantization reduces the number of bits required to store network parameters. For example, parameters represented using float32 can instead be stored using 16 or fewer bits, reducing memory usage significantly. This allows deploying larger models on compute infrastructure with limited resources.
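As a minimal illustration of the idea (not the specific method of Vargaftik et al., 2021), the sketch below applies symmetric linear quantization of a float32 weight tensor to int8 and dequantizes it back; the tensor shape and scaling scheme are assumptions for illustration only:

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric linear quantization of a float32 tensor to int8 with a single per-tensor scale."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Map the int8 codes back to approximate float32 values."""
    return q.astype(np.float32) * scale

w = np.random.randn(768, 768).astype(np.float32)   # stand-in for one Transformer projection matrix
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print(q.nbytes / w.nbytes)                          # 0.25: 4x smaller storage
print(np.abs(w - w_hat).max())                      # small reconstruction error
```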
Model pruning is a widely explored research direction for compression, mainly consisting of two techniques. First, in unstructured pruning, individual weights are zeroed out using different strategies (Gordon et al., 2020). Second, in structured pruning, either the self-attention heads (Fan et al., 2019) or the encoder layers (Hou et al., 2020) are pruned.
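For concreteness, a toy sketch of unstructured magnitude pruning (one common strategy, not the specific method of Gordon et al., 2020), which zeroes the smallest-magnitude weights and therefore reduces storage of non-zero parameters but not necessarily latency:

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float = 0.5) -> np.ndarray:
    """Zero out the `sparsity` fraction of weights with the smallest magnitude."""
    threshold = np.quantile(np.abs(weights), sparsity)
    return np.where(np.abs(weights) < threshold, 0.0, weights)

w = np.random.randn(768, 3072).astype(np.float32)   # stand-in for a feed-forward layer
w_pruned = magnitude_prune(w, sparsity=0.5)
print((w_pruned == 0).mean())                        # roughly half of the weights are zero
```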
Quantization and pruning facilitate the use of large Transformers without requiring machines with very high memory capacity (GPU or CPU). Quantization and unstructured model pruning mainly reduce memory usage. Structured pruning, where encoder and self-attention layers are dropped, can also improve efficiency. Yet, for many real-world applications, latency requirements (a few milliseconds, as in our case) cannot be met. For instance, pruning more than 50% of attention heads can lead to performance loss (Fan et al., 2019).
Knowledge Distillation (KD). Hinton et al. (2015) discuss the trade-offs between model size and performance. Training a larger model and distilling its knowledge into a smaller model, using either the same training data or unsupervised data, yields identical performance. The same cannot be said when training a small model directly, where performance is significantly worse than that of its bigger counterpart. KD works under the teacher-student paradigm, where the teacher's output is used to train the student model such that it mimics the teacher's output.
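As a reference point, a minimal sketch of the classic soft-target formulation of Hinton et al. (2015), where a temperature softens the teacher and student distributions before a KL-divergence loss; our own objective, described later, matches the raw logits instead:

```python
import torch
import torch.nn.functional as F

def hinton_kd_loss(student_logits, teacher_logits, temperature: float = 2.0):
    """KL divergence between temperature-softened teacher and student distributions."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # T^2 keeps gradient magnitudes comparable across temperatures (Hinton et al., 2015).
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2

# toy example: a batch of 4 utterances with 2 intent classes
loss = hinton_kd_loss(torch.randn(4, 2), torch.randn(4, 2))
```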
There have been several efforts to distill Transformers into recurrent (Wasserblat et al., 2020) and convolutional architectures (Chia et al., 2019). While recurrent models like LSTMs can significantly reduce memory footprint and latency relative to Transformers, their step-wise sequential computation still induces a latency overhead that cannot be overcome. Conversely, ConvNets are highly efficient for text classification, both in terms of performance and latency.
Our approach is similar to that of Chia et al. (2019) in that we use CNNs as the main building block of the student model. However, we differ in several fundamental aspects and make contributions that further push the application of knowledge distillation. First, we deal with a multilingual task, which increases the complexity of the knowledge transfer from the teacher to the student model. Second, our ConvNet architecture is different, to account for the multilingual requirement. Third, we rely on unsupervised data for distillation, and show how much data is necessary across different languages to obtain identical performance between the teacher and student models.

Multilingual Distillation Method
We now describe the KD approach: the IC task, the teacher/student models, and the learning objective.

IC Task
Our intent classification task requires categorizing utterances into two intents: Commerce Question (CQ), i.e. questions to the voice assistant about consumer products, and Non-Commerce Question (NCQ), i.e. all other questions.

Teacher and Student Models
Teacher Model: As our classifier is deployed globally in many languages, we use the multilingual XLM-RoBERTa (XLMR) transformer (Conneau et al., 2020) as our teacher model.
Given an utterance $w = (w_1, \ldots, w_n)$ consisting of $n$ tokens, the teacher model encodes the input as $T(w) = h_T(w)$, where $h_T(w) \in \mathbb{R}^m$ is the [CLS] pooled representation from the last XLMR layer. This is fed to a softmax classification head, consisting of a dense projection that yields the raw activations of the network (i.e. unscaled log probabilities, or logits), which are then normalized to probabilities via softmax:

$$\mathrm{logits}_T(w) = W^\top h_T(w), \qquad p_T(\cdot \mid w) = \mathrm{softmax}(\mathrm{logits}_T(w)),$$

where $W \in \mathbb{R}^{m \times C}$ and $C$ is the number of intent classes. $\mathrm{logits}_T(w)$ captures the intent of the utterance and is used to supervise student training.

Tokenization and Word Representations: Utterances are tokenized using a byte-pair encoding tokenizer (Sennrich et al., 2016). To create a multilingual ConvNet, we leverage pretrained multilingual subword embeddings (Heinzerling and Strube, 2018). This approach allows representing all languages with a small vocabulary.
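For illustration, the teacher logits used as distillation targets can be precomputed offline in a few lines. The sketch below uses the public xlm-roberta-base checkpoint from the Hugging Face transformers library as a stand-in for our fine-tuned teacher; the checkpoint name and the randomly initialized two-class head are assumptions for illustration only:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Public base checkpoint as a stand-in; the fine-tuned IC teacher described here is internal.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
teacher = AutoModelForSequenceClassification.from_pretrained("xlm-roberta-base", num_labels=2)
teacher.eval()

utterances = ["how many calories are in a banana?", "wie viel fett enthält hühnchen?"]
batch = tokenizer(utterances, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    teacher_logits = teacher(**batch).logits   # unscaled activations, shape (batch, 2)
```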
Encoder: Five 1D convolutional kernels of sizes 2-6 tokens, each with 500 filters, are aggregated with max-pooling. The pooled outputs are concatenated to form the final text representation.
Next, the student model computes the utterance representation (cf. Figure 1 (e)), $S(w; \theta) = h_S \in \mathbb{R}^m$, which is used to predict the intent probability:

$$\mathrm{logits}_S(w) = W_S^\top h_S, \qquad p_S(\cdot \mid w) = \mathrm{softmax}(\mathrm{logits}_S(w)),$$

where $W_S \in \mathbb{R}^{m \times C}$ and $\theta$ are the student model parameters to be optimized.
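A minimal PyTorch sketch of this student architecture is shown below. Only the structure follows the description above (five parallel 1D convolution branches with kernel sizes 2-6 and 500 filters each, max-pooling, concatenation, and a dense head producing logits); the vocabulary size, embedding dimension, and the ReLU nonlinearity are assumptions, and the embedding layer would be initialized with the pretrained multilingual subword embeddings in practice:

```python
import torch
import torch.nn as nn

class CNNStudent(nn.Module):
    """ConvNet student: subword embeddings -> parallel 1D convolutions -> max-pool -> logits."""

    def __init__(self, vocab_size=320_000, emb_dim=300, num_filters=500,
                 kernel_sizes=(2, 3, 4, 5, 6), num_classes=2, dropout=0.1):
        super().__init__()
        # In practice this layer is loaded with pretrained multilingual subword embeddings.
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, num_filters, kernel_size=k) for k in kernel_sizes]
        )
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(num_filters * len(kernel_sizes), num_classes)

    def forward(self, token_ids):                     # token_ids: (batch, seq_len)
        x = self.dropout(self.embedding(token_ids))   # (batch, seq_len, emb_dim)
        x = x.transpose(1, 2)                         # Conv1d expects (batch, channels, seq_len)
        pooled = [conv(x).relu().max(dim=-1).values for conv in self.convs]
        h = self.dropout(torch.cat(pooled, dim=-1))   # (batch, num_filters * len(kernel_sizes))
        return self.classifier(h)                     # unnormalized logits, shape (batch, num_classes)
```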

Distillation Learning Objective
We use soft targets from the teacher, i.e. the unscaled log probabilities prior to softmax normalization (the logits), to train the student. We directly supervise the training of the student model $S(w; \theta)$ such that $\mathrm{logits}_S(w) \approx \mathrm{logits}_T(w)$.
To this end, our learning objective is to minimize the Mean Squared Error (MSE) loss over the logits (Mukherjee and Awadallah, 2020), computed over the $N$ unlabelled instances:

$$\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} \left\lVert \mathrm{logits}_S(w_i) - \mathrm{logits}_T(w_i) \right\rVert_2^2 . \qquad (5)$$

This logit loss encourages the student to output the same unnormalized activations as the teacher, which yield the same probabilities once normalized, and is more numerically stable to train. By minimizing $\mathcal{L}$ on a large sample of unlabelled data, the distillation process can successfully transfer the intent classification knowledge from the teacher to the student. In this respect, it is important to use a large and representative sample, since $\mathcal{L}$ can be small on a specific set of utterances, i.e. $|\mathrm{logits}_S(w) - \mathrm{logits}_T(w)| < \epsilon$, while for unseen utterances $|\mathrm{logits}_S(w) - \mathrm{logits}_T(w)| \gg \epsilon$, for some value of $\epsilon$ that induces a change in the utterance's label.
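A hedged sketch of a single distillation step is shown below. It assumes teacher logits have been precomputed offline (e.g. as in the teacher sketch above) and reuses the CNNStudent module from the previous section; the Adam learning rate and the toy batch are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

student = CNNStudent()                                        # from the earlier sketch
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)   # learning rate is an assumption

def distillation_step(token_ids, teacher_logits):
    """One optimization step minimizing the MSE between student and teacher logits (Eq. 5)."""
    student_logits = student(token_ids)                       # (batch, num_classes)
    loss = F.mse_loss(student_logits, teacher_logits)         # logit-matching loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# toy batch: 8 utterances of 20 subword ids each, with teacher logits as soft targets
token_ids = torch.randint(1, 1000, (8, 20))
teacher_logits = torch.randn(8, 2)
print(distillation_step(token_ids, teacher_logits))
```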

Experimental Setup
We now describe the datasets used to train the teacher model and for distillation. We also define the evaluation metrics used to assess how well the student model mimics its teacher.

Datasets
We use 3 types of data: (i) teacher datasets (supervised IC training data); (ii) student datasets, unannotated utterances to train the student; and (iii) test data used to evaluate the teacher and student.
Teacher Datasets: Table 1 (a) shows the statistics of the supervised data used to train the teacher model.

Distillation Datasets: Table 1 (b) shows the statistics of the distillation data. We randomly sample a target number of utterances from each locale over a one-month period. The data is unlabelled. Using unsupervised data allows the KD process to transfer any of the Transformer's pretrained knowledge that may not overlap with our supervised set.
Test Datasets: Table 1 (c) shows the test datasets used to evaluate the performance of our teacher and student models. In total, our test set across all locales consists of 1.7M labelled instances.

Teacher and Student Configuration
Teacher Model: The teacher model T is based on the XLMR base model, with a total of 278M parameters, and is fine-tuned on the data from Table 1 (a) for our multilingual IC task. The model is trained by minimizing the cross-entropy loss using the AdamW optimizer (Loshchilov and Hutter, 2019) with a learning rate of lr = 3e-5.

Distilled Student Model: The student model S is described in Figure 1 and consists of a total of 103M parameters. It is trained on the data in Table 1 (b) by minimizing the loss in Equation (5). A dropout rate of 10% is applied to the embeddings and CNN filters for regularization. We fine-tune the pretrained embeddings, and apply learning rate warmup over the first 2 epochs to prevent catastrophic forgetting. We train for 50 epochs with the Adam optimizer, with an early stopping criterion of 3 consecutive epochs of non-decreasing loss.
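A rough sketch of the student training schedule described above, i.e. learning-rate warm-up over the first two epochs and early stopping after three consecutive epochs without a loss decrease. The linear warm-up shape, the base learning rate, and the run_epoch placeholder (which stands in for one pass over the distillation data with the MSE logit loss) are assumptions; CNNStudent refers to the earlier sketch:

```python
import torch

def run_epoch(model, optimizer) -> float:
    """Placeholder for one pass over the distillation data with the MSE logit loss;
    returns a dummy mean epoch loss so the sketch runs end to end."""
    return float(torch.rand(()))

student = CNNStudent()                                        # from the earlier sketch
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)   # base learning rate is an assumption

# Linear learning-rate warm-up over the first 2 epochs, constant afterwards.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda epoch: min(1.0, (epoch + 1) / 2))

best_loss, bad_epochs = float("inf"), 0
for epoch in range(50):
    epoch_loss = run_epoch(student, optimizer)
    scheduler.step()
    if epoch_loss < best_loss:
        best_loss, bad_epochs = epoch_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= 3:          # early stopping: 3 consecutive non-decreasing epochs
            break
```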

Baseline
Our main objective is minimizing the inference latency of Transformer models for IC. IC accuracy is not problematic for in-domain data, and most models achieve high performance (Larson et al., 2019).

QRNN: We therefore focus on comparing the inference time of the different approaches. We compare S to the QRNN proposed by Bradbury et al., and consider two configurations: (i) QRNN_4, with 4 ConvNet layers (as reported by Bradbury et al.), and (ii) QRNN_5, with 5 ConvNet layers, equivalent to the number of layers used in S.

Supervised Student Model: To assess whether distilling the teacher's knowledge into S using unlabelled data is needed when labelled training data is abundant, we additionally train a model identical to S using the supervised training data in Table 1 instead, which we denote S_sup. The training loss for S_sup is the cross-entropy loss.

Evaluation Metrics
Accuracy: We measure performance based on Precision (P) and Recall (R). Specifically, we compare the models at the threshold-agnostic Precision/Recall Break-Even Point (PR-BEP) (Joachims, 2005), the point where the precision and recall of the model are equal. To compare performance over all thresholds, we report PR-AUC (area under the PR curve), which is a meaningful metric for imbalanced tasks (Liu et al., 2019). Due to confidentiality, we report only the gap between S (or S_sup) and T, as their absolute difference.
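For reference, both metrics can be computed from model confidence scores with scikit-learn; a minimal sketch with synthetic labels and scores for illustration (average precision is used here as a common estimator of PR-AUC):

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])                    # toy CQ/NCQ gold labels
y_score = np.array([0.9, 0.2, 0.7, 0.4, 0.3, 0.6, 0.8, 0.1])   # model confidence for CQ

precision, recall, _ = precision_recall_curve(y_true, y_score)
pr_bep = precision[np.argmin(np.abs(precision - recall))]       # point where precision is closest to recall
pr_auc = average_precision_score(y_true, y_score)               # area under the PR curve (average precision)
print(pr_bep, pr_auc)
```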
Efficiency: We measure wall-clock time to compute the inference latency in milliseconds (ms). All reported latencies are averaged over 100 trials, computed on an m5.4xlarge instance.

Results

Model Accuracy
Overall Performance: Table 2 shows the performance difference between the teacher and student models. The overall PR-BEP gap of 0.1% across locales between T and S is negligible. Contrary to S, S_sup has a large gap to T, with an overall difference of 6%, in certain locales exceeding 30% in terms of PR-AUC. This gap remains large in the languages with the largest amounts of supervised data, but is much more prominent in those with little data. This highlights two main findings: (i) T, due to its Transformer architecture, has a superior learning capacity compared to directly training S_sup in a supervised manner; (ii) knowledge distillation allows us to successfully transfer the teacher's pretrained knowledge to the student, allowing the student to acquire knowledge not present in the labelled data and to achieve similar generalization as the teacher (cf. Appendix A).
Incremental Distillation Performance: Table 3 shows the gap in performance between the student and teacher models for varying amounts of data used to train the student model. The data from Table 1 (b) is sampled using stratified sampling, with the locales as the groups.
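For illustration, such locale-stratified subsampling takes a couple of lines; the sketch below assumes the distillation pool lives in a pandas DataFrame with a locale column (the column names and toy rows are assumptions, and the paper samples fractions such as 1%, 10%, and 40% of a much larger pool):

```python
import pandas as pd

# Toy stand-in for the unlabelled distillation pool (utterance text plus its locale).
pool = pd.DataFrame({
    "utterance": ["how many calories are in a banana?", "wie viel fett enthält hühnchen?",
                  "come si conservano le vongole in frigo", "cómo se hace un queque de yogur"],
    "locale": ["en_US", "de_DE", "it_IT", "es_ES"],
})

# Stratified sample: draw the same fraction from every locale group.
sample = pool.groupby("locale", group_keys=False).sample(frac=0.5, random_state=0)
```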
With only 1% of the data, the gap in terms of PR-BEP is 8.7 absolute percentage points. Increasing this to 10% or more closes the gap to less than 1%. Concretely, 10% represents 2.2M instances across all locales. In real-world settings it is reasonably cheap to obtain such amounts of unlabelled data.
These results indicate that, with appropriate data, the logit loss is highly effective for capturing the teacher's knowledge. The student, despite using a different tokenizer and subword embeddings, is able to match teacher performance. Relative to other methods, the logit loss is simpler to implement and faster to train. For the IC task, we did not need to distill internal model values (e.g. via a representation loss). We also did not use the supervised data for student training (e.g. with a cross-entropy loss); our finding is that a sufficiently large and representative unsupervised sample will contain utterances similar to those in the supervised set, as well as dissimilar ones, thus allowing the transfer of the knowledge represented by both the labelled data and the Transformer's pretrained knowledge.

Inference Latency
A drawback of deploying Transformers is their prohibitive inference latency, which is mainly impacted by: (i) model size, and (ii) the number of encoder layers.
Figure 2 shows the latency for different ablations of T (with varying numbers of encoder layers), together with the latency of S, which is the model with the lowest latency. Comparing T and S, our student model has nearly 18x lower inference latency, at only 2.7ms. This represents a drastic latency reduction, allowing us to process inputs extremely quickly.
For the teacher ablations, even for T_1 the inference latency is still higher than that of S, with an additional +1.24ms latency per utterance. Furthermore, pruning layers is not lossless in terms of performance, especially in this case where only one layer is retained (Fan et al., 2019). The bottom part of Figure 2 shows the gap of the different pruned teacher models T_l w.r.t. the full model T. The gap is large when fewer than 8 layers are used, with more than a 12% drop in PR-BEP. It is clear that there is no favourable trade-off between self-attention layer pruning and inference latency reduction.
Finally, comparing against the baselines QRNN_4 and QRNN_5, we note that the proposed student architecture, relying solely on ConvNets, results in a significantly lower inference latency. Our student architecture has 3.8x and 2.95x lower latency than QRNN_5 and QRNN_4, respectively. This significant difference in latency can be explained by the fact that the QRNN applies its convolutional operators at each timestep (each token in an utterance), which, although more efficient than LSTMs (due to parallelization), introduces a significant overhead over traditional ConvNet architectures.

Conclusions
We described an approach for distilling knowledge from a Transformer into a single multilingual CNN. To our knowledge, this is the first detailed exposition of cross-architecture KD into multilingual student models. We leverage the outlined framework to accurately serve predictions for our customers at speed, scale, low cost, and across all languages. Empirically, we showed how such a KD framework can be utilized in practice:

1. With sufficient unsupervised data, leveraging logits is an optimal distillation strategy for training smaller and more efficient student models, without significant performance loss.
2. KD allows smaller and more efficient models to mimic the performance of their teacher counterparts, which is not the case when similar architectures are trained directly on labelled data.
3. KD is highly preferable to other techniques such as pruning. Transformers with even a single encoder layer have higher inference latency, and the performance drop from pruning is large: T_1 has a 23% and 22% gap in terms of PR-BEP w.r.t. T and S, respectively.

4. For IC, a single multilingual CNN using multilingual subword embeddings can match the teacher's performance despite using a different tokenizer. It is highly efficient, decreasing latency by nearly 18x relative to the teacher.
5. Using as few as 2-3M distillation instances, S achieves performance highly comparable to that of T, with less than a 1% PR-BEP difference. The gap diminishes to 0.95% with just 40% of the distillation data, specifically 8.8M instances.
6. S achieves the same generalization power as T. On a held-out test set (unseen during training and distillation), the output probabilities of the two models have a very low KL divergence (cf. Appendix A). With increasing amounts of distillation data, the probability distributions become highly similar.
Figure 4: NCQ/CQ confidence distribution (x-axis), expressed as how likely an utterance is to be CQ. For NCQ this probability should ideally be close to zero, and vice-versa for CQ (close to one). The skewness score G1 measures the concentration of the probability mass for NCQ and CQ, respectively. For NCQ, the higher the score the better (positive range), whereas for CQ, the lower the score the better (negative range). Results are shown for the student models S_1%, S_10%, and S_50% (distilled with 1%, 10%, and 50% of the data, respectively), and for the teacher model T.

Figure 2: The upper plot shows the inference latency (in milliseconds) for the teacher models T_l (l ∈ {1, ..., 11}); T_1 retains a single encoder layer, with the other 11 layers pruned. The bottom plot shows the gap in terms of PR-BEP between the T_l models and the full teacher model. Note that T_1, whose inference time is closest to that of S (2.71ms latency), has a 22.8% and 22.5% drop in terms of PR-BEP w.r.t. T and S, respectively. For the QRNNs, the inference times are shown as the orange and yellow dashed lines, with latencies of QRNN_4 = 8.02ms and QRNN_5 = 10.02ms.

Figure 3: KL divergence of the confidence score distribution between teacher and student models on unseen data. With increasing amounts of distillation data, the probability distributions become highly similar.
Table 1 (a) shows details of the training data used for the teacher model. Utterances come from 7 different languages and 11 locales. The task is imbalanced, but for confidentiality the class distribution cannot be disclosed.

Table 2: PR-BEP absolute percentage difference to the teacher model. The gap between the student models (S and S_sup) and T is reported on the same test set. For S the gap of 0.2% is marginal, whereas for S_sup the gaps are highly significant according to a binomial proportion test (p-value < 0.01).

Table 3: PR-BEP performance of the student model trained on varying portions of the distillation data from Table 1 (b). Overall, with 1% of the data used for distillation, the student model has an average gap in terms of PR-BEP of 8.7%. With an increasing percentage of data used for distillation, the gap shrinks to 0.6% for 40% of the data.