Enhancing the generalization for Intent Classification and Out-of-Domain Detection in SLU

Intent classification is a major task in spoken language understanding (SLU). Since most models are built with pre-collected in-domain (IND) training utterances, their ability to detect unsupported out-of-domain (OOD) utterances has a critical effect in practical use. Recent works have shown that using extra data and labels can improve OOD detection performance, yet such data can be costly to collect. This paper proposes to train a model with only IND data while supporting both IND intent classification and OOD detection. Our method introduces a novel domain-regularized module (DRM) to reduce the overconfidence of a vanilla classifier, achieving better generalization in both cases. Moreover, DRM can be used as a drop-in replacement for the last layer in any neural network-based intent classifier, providing a low-cost strategy for a significant improvement. The evaluation on four datasets shows that our method, built on BERT and RoBERTa models, achieves state-of-the-art performance against existing approaches and the strong baselines we created for comparison.


Introduction
Spoken language understanding (SLU) systems play a crucial role in ubiquitous artificially intelligent voice-enabled personal assistants (PA). SLU needs to process a wide variety of user utterances and carry out users' intents, a.k.a. intent classification. Many deep neural network-based SLU models have recently been proposed and have demonstrated significant progress in classification accuracy (Guo et al., 2014; Liu and Lane, 2016; Zhang and Wang, 2016; Wang et al., 2018; Goo et al., 2018; Chen et al., 2019). These models usually apply the closed-world assumption, in which the SLU model is trained on predefined domains and expects to see the same data distribution during both training and testing. However, this assumption does not hold in the practical use of PA systems, where the system operates in a dynamic, open environment with personal expressions, new vocabulary, and unknown intents outside the design scope.
To address the challenges of open-world settings, previous works adopt varied strategies. Shen et al. (2018a, 2019c) use a cold-start algorithm to generate additional training data to cover a larger variety of utterances; this strategy relies on software developers to pre-build all possible skills. Shen et al. (2019b,a) introduce a SkillBot that allows users to build their own skills. Recently, Ray et al. (2018, 2019) and Shen et al. (2018b, 2019d) enable an SLU model to incorporate user personalization over time. However, the above approaches do not explicitly address unsupported user utterances/intents, leading to the catastrophic failures illustrated in Figure 1. Thus, it is critically desirable for an SLU system to correctly classify supported intents (in-domain (IND)) and reject unsupported ones (out-of-domain (OOD)).
A straightforward solution is to collect OOD data and train a supervised binary classifier on both IND and OOD data (Hendrycks et al., 2018). However, collecting a representative set of OOD data can be impractical due to the infinite compositionality of language, and arbitrarily selecting a subset incurs a selection bias, so the learned model might not generalize to unseen OOD data. Ryu et al. (2017, 2018) avoid learning with OOD data by using generative models (e.g., autoencoders and GANs) to capture the IND data distribution, then judge IND/OOD based on the reconstruction error or likelihood. Recently, Tan et al. (2019) utilizes a large training set to enable meta-learning for OOD detection. Zheng et al. (2020) generates pseudo-OOD data to learn the OOD detector. These approaches require additional data or training procedures beyond the intent classification task, introducing significant data collection effort or inference overhead. This paper proposes a neural network-based strategy that uses only IND utterances and their labels to learn both the intent classifier and the OOD detector. Our strategy modifies the structure of the classifier, introducing an extra branch as a regularization target. We call this structure a Domain-Regularized Module (DRM). The structure is probabilistically motivated and empirically leads to better generalization in both intent classification and OOD detection. Our analysis focuses more on the latter task, finding that DRM not only outputs a class probability that is a better indicator for judging IND/OOD, but also leads to a feature representation with less distribution overlap between IND and OOD data. More importantly, DRM is a simple drop-in replacement for the last linear layer, making it easy to plug into any off-the-shelf pre-trained model (e.g., BERT (Devlin et al., 2019)) to fine-tune for a target task.
The evaluation on four datasets shows that DRM can consistently improve upon previous state-of-the-art methods.

Problem Definition
In the application of intent classification, a user utterance is either an in-domain (IND) utterance (supported by the system) or an out-of-domain (OOD) utterance (not supported by the system). The classifier is expected to correctly (1) predict the intent of supported IND utterances; and (2) detect and reject unsupported OOD utterances.
The task is formally defined below. We are given a closed-world IND training set D_IND = {(x_i, y_i)}, where each utterance x_i and its intent class label y_i ∈ {1, ..., C}, for C predefined in-domain classes, is drawn from a fixed but unknown IND distribution P_IND(x, y). We aim to train an intent classifier on only the IND training data D_IND such that the model can perform: (1) Intent Classification: classify the intent label y of an utterance x if x is drawn from the same distribution P_IND as the training set D_IND; (2) OOD Detection: detect that an utterance x is an abnormal/unsupported sample if x is drawn from a different distribution P_OOD.

Related Work
OOD detection has been studied for many years (Hellman, 1970). One line of work explores its combination with intent classification by learning an SVM classifier on IND data and randomly sampled OOD data. Ryu et al. (2017) detects OOD by using a reconstruction criterion with an autoencoder. Ryu et al. (2018) learns an intent classifier with a GAN and uses the discriminator as the classifier for OOD detection. Zheng et al. (2020) leverages extra unlabeled data to generate pseudo-OOD samples with a GAN via auxiliary-classifier regularization. Tan et al. (2019) further incorporates the few-shot setting, learning sentence encodings with a prototypical network that is regularized with the OOD data outside a learning episode. Other researchers developed methods in computer vision based on rescaling the predicted class probabilities (ODIN) (Liang et al., 2017) or building a Gaussian model on the features extracted from the hidden layers of neural networks (Mahalanobis). Recently, Hsu et al. (2020) proposed Generalized-ODIN with decomposed confidence scores. However, both approaches heavily depend on input perturbation of images to achieve good performance. Unfortunately, such perturbation cannot be applied to the discrete utterance data in SLU.

Our Method
Our method is inspired by the decomposed confidence of Generalized-ODIN (Hsu et al., 2020), but we leverage the fact that the training data are all IND to introduce an extra regularization. This regularization leads to better generalization (lower classification error) on intent classification, in contrast to the original Generalized-ODIN, whose classification error slightly increases. Since the improved generalization is likely due to a more generalizable feature representation, we leverage this observation and provide a modified Mahalanobis score, which we call L-Mahalanobis, for transformer-based models to detect OOD data. In the following sections, we first describe the DRM and then elaborate on using the outputs of a DRM-equipped model to detect OOD data.

Domain-Regularized Module (DRM)
The motivation begins with introducing a domain variable d (d = 1 means IND, while d = 0 means OOD), following the intuition in (Hsu et al., 2020), and then rewriting the posterior of class y given x with domain d as follows:

p(y | d = 1, x) = p(y, d = 1 | x) / p(d = 1 | x) ≈ p(y | x) / p(d = 1 | x)    (1)

where the last step holds since p(y, d = 0 | x) is close to 0, given the intrinsic conflict between the IND classes y and the domain value d = 0 for OOD.

DRM Design
Motivated by Equation 1 above, we design the DRM to mitigate overconfidence by decomposing the final logits f into two branches. Figure 2 illustrates the architecture.

Domain Logits: f_d models p(d = 1 | x) before normalization. It projects the hidden state h to a scalar w.r.t. d:

f_d = h W_d    (2)

where W_d ∈ R^{|h|×1}. Since p(d = 1 | x) is a probability between 0 and 1, the next subsection describes how the domain loss trains f_d via the sigmoid function.

Classification Logits: f_c models the probability posterior p(y | x) before normalization. It follows the conventional linear projection from the hidden state h to the number of classes:

f_c = h W_c    (3)

where W_c ∈ R^{|h|×C}. At the end, we obtain the final logits f, representing p(y | d = 1, x), by putting f_d and f_c together following the dividend-divisor structure of Equation 1:

f = f_c / f_d    (4)

where each element of f_c is divided by the same scalar f_d.
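As a concrete illustration, the forward pass of Equation 4 can be sketched in a few lines of numpy. The function and variable names here are ours, not the authors' code, and a real model would compute the hidden state h with a BERT encoder rather than at random:

```python
import numpy as np

def drm_logits(h, W_c, W_d):
    """DRM forward sketch: split the classification head into a class
    branch f_c and a scalar domain branch f_d, then divide (Equation 4)."""
    f_c = h @ W_c                # classification logits, shape (C,)
    f_d = float(h @ W_d)         # domain logit, a single scalar
    return f_c / f_d, f_c, f_d   # final logits f = f_c / f_d

# toy example: hidden size 4, C = 3 intent classes
rng = np.random.default_rng(0)
h = rng.normal(size=4)
W_c = rng.normal(size=(4, 3))
W_d = rng.normal(size=(4, 1))
f, f_c, f_d = drm_logits(h, W_c, W_d)
```

Because every element of f_c is divided by the same scalar f_d, the decomposition leaves the argmax class unchanged while rescaling the confidence.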

DRM Training
We propose two training loss functions for a model with DRM. The first loss minimizes the cross-entropy between the predicted intent distribution and the ground-truth IND class label:

L_intent = − log p(f)_y    (5)

where p(f) is the softmax of the logits f:

p(f) = softmax(f) = softmax(f_c / f_d)    (6)

The second loss ensures that the domain component f_d is close to 1, since all utterances in the training set are IND. We first map f_d into (0, 1) with the sigmoid activation function; the loss then encourages sigmoid(f_d) to be close to 1 when training on IND utterances. To avoid very large values of f_d affecting the training convergence, we further clamp f_d at δ before it feeds into Equation 4:

L_domain = − log(sigmoid(min(f_d, δ)))    (7)

Thus, we sum the two losses to optimize the model:

L = L_intent + L_domain    (8)

Remarks: It is important to note that L_domain introduces extra regularization to mitigate the overconfidence of the standard posterior probability p(f). sigmoid(f_d) is not used to directly predict whether an utterance is IND or OOD.
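The combined objective can be sketched as follows. The binary-cross-entropy form of the domain loss is our reading of the description above, not the authors' released code, so treat it as an assumption:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def drm_loss(f_c, f_d, y, delta=3.0):
    """Sum of the intent cross-entropy on f = f_c / f_d and a domain
    loss pushing sigmoid(f_d) toward 1 on IND data; f_d is clamped at
    delta first to keep training stable (an assumed loss form)."""
    f_d = min(f_d, delta)              # clamp before the division
    f = f_c / f_d                      # final logits
    l_intent = -np.log(softmax(f)[y])  # cross-entropy with true label y
    l_domain = -np.log(sigmoid(f_d))   # target: sigmoid(f_d) -> 1
    return l_intent + l_domain

loss = drm_loss(np.array([2.0, -1.0, 0.5]), f_d=1.5, y=0)
```

Note how the clamp makes any f_d above δ behave identically, so the domain branch cannot grow without bound to satisfy the loss.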

IND Intent Classification Method
Following Equation 1 and our DRM design, it is straightforward to use the confidence score softmax(f) to predict the IND intent class, i.e., ŷ = argmax_y softmax(f)_y.
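A minimal sketch of this prediction rule, assuming the DRM outputs f_c and f_d have already been computed:

```python
import numpy as np

def predict_intent(f_c, f_d):
    """IND prediction from DRM outputs: argmax of softmax(f) with
    f = f_c / f_d. Since softmax is monotonic, the predicted class is
    simply the argmax of the final logits f."""
    f = f_c / f_d
    e = np.exp(f - f.max())
    p = e / e.sum()
    return int(np.argmax(p)), float(p.max())

label, conf = predict_intent(np.array([2.0, -1.0, 0.5]), 1.2)
```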

OOD Detection Methods
There are two types of strategies for using a classifier's outputs to perform OOD detection: one is based on confidence scores computed from the logits; the other is based on the features. Below, we describe how to compute different OOD scores with our DRM.

Confidence-based Methods
Recent work (Liang et al., 2017) has shown that the softmax outputs provide a good score for detecting OOD data. In our DRM model, we use the decomposed softmax outputs for the score. The logits f_c, which relate to the true posterior distribution in the open world, can be combined with several approaches:

DRM Confidence Score: Conf_DRM(x) = max_y softmax(f_c)_y

DRM ODIN Confidence Score: ODIN_DRM(x) = max_y softmax(f_c / T)_y, with large T = 1000 (Liang et al., 2017).

DRM Entropy Confidence Score: ENT_DRM(x) = − Σ_y softmax(f_c)_y log softmax(f_c)_y

OOD utterances have low Conf_DRM and ODIN_DRM scores and a high ENT_DRM score.
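The three confidence scores can be sketched as follows; the helper names are illustrative, and T = 1000 follows Liang et al. (2017):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def drm_confidence_scores(f_c, T=1000.0):
    """Conf_DRM, ODIN_DRM, and ENT_DRM computed from the decomposed
    classification logits f_c of a DRM head."""
    p = softmax(f_c)
    conf = p.max()                        # Conf_DRM: max softmax prob
    odin = softmax(f_c / T).max()         # ODIN_DRM: temperature-scaled
    ent = -(p * np.log(p + 1e-12)).sum()  # ENT_DRM: predictive entropy
    return conf, odin, ent

# a peaked (IND-like) vs. a flat (OOD-like) logit vector
c_ind, o_ind, e_ind = drm_confidence_scores(np.array([8.0, 0.5, 0.2]))
c_ood, o_ood, e_ood = drm_confidence_scores(np.array([1.1, 1.0, 0.9]))
```

As expected, the peaked logits yield higher confidence and lower entropy than the flat ones, which is exactly the separation the detector thresholds on.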

Feature-based Method
While our DRM confidence scores already outperform many existing methods (shown later in the experiments), we further design a feature-based Mahalanobis distance score, inspired by recent work on detecting OOD images.
We first recap the original approach, which consists of two parts: Mahalanobis distance calculation and input preprocessing. The Mahalanobis distance score models class-conditional Gaussian distributions, in the spirit of Gaussian discriminant analysis, on both low- and upper-level features of the deep classifier. The score on layer ℓ is computed as:

S^ℓ_Maha(x) = max_i −(f_ℓ(x) − μ_{ℓ,i})^T Σ_ℓ^{−1} (f_ℓ(x) − μ_{ℓ,i})

where f_ℓ(x) denotes the output features at the ℓ-th layer of the network, and μ_{ℓ,i} and Σ_ℓ are the class-mean representations and the covariance matrix. The overall score is the summation over layers:

S_Maha(x) = Σ_ℓ S^ℓ_Maha(x)

In addition, input preprocessing adds a small controlled noise to the test samples to enhance performance.
Although the Mahalanobis distance score can be applied to only the last feature layer without input preprocessing, S^last_Maha(x), the analysis in the original Mahalanobis paper (its Table 2) shows that either input preprocessing or the multi-layer scoring mechanism is required to achieve decent OOD detection performance. Unfortunately, neither mechanism is applicable to an intent classifier for SLU. First, unlike for image data, noise injection into discrete natural language utterances has been shown not to perform well. Second, in most cutting-edge intent classifier models, low- and upper-level network layers are quite different, so the direct application of the multi-layer Mahalanobis distance leads to much worse OOD detection performance.
Since BERT-based models have shown significant performance improvements for intent classification in SLU (Chen et al., 2019), we focus on designing the multi-layer Mahalanobis score for BERT-based classifier models. Existing BERT-based text classification models, such as BERT, RoBERTa, DistilBERT, ALBERT, etc., differ in their designs between the last transformer layer and the classification layer. Figure 3 shows our generic design of the Mahalanobis score computation (blue) for various BERT-based models.
Our design is based on extensive experiments and on common insights shared by different BERT-based models. Specifically, we use the features from the layers between the last transformer layer and the classification layer. We empirically found that the nonlinear tanh layer plays an important role. Thus, to map the features of each transformer layer and the last layer into the same semantic space, we pass the features of each layer through a tanh function and sum the per-layer scores to compute our Mahalanobis score:

S_L-Maha(x) = Σ_ℓ S^ℓ_Maha(tanh(f_ℓ(x)))

where f_ℓ and f_n are the features of the ℓ-th layer and the last layer n of a BERT-based intent classifier model. We refer to our proposed approach as L-Mahalanobis.
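A small numpy sketch of the per-layer score and its L-Mahalanobis aggregation. Which layers receive the tanh mapping and how the per-layer class means and covariances are estimated are assumptions of this sketch, not a reproduction of our implementation:

```python
import numpy as np

def maha_layer_score(feat, class_means, cov_inv):
    """Per-layer Mahalanobis score: max over classes of the negative
    squared Mahalanobis distance to each class mean."""
    d = feat - class_means                       # (C, dim) differences
    dists = np.einsum('cd,de,ce->c', d, cov_inv, d)
    return -dists.min()                          # closest class wins

def l_mahalanobis(layer_feats, layer_stats):
    """L-Mahalanobis sketch: tanh each layer's features to share the
    last layer's semantic space, then sum the per-layer scores."""
    return sum(
        maha_layer_score(np.tanh(f), means, cov_inv)
        for f, (means, cov_inv) in zip(layer_feats, layer_stats)
    )

# toy check: 2 classes in 2-D with identity covariance
means = np.array([[0.0, 0.0], [5.0, 5.0]])
near = maha_layer_score(np.array([0.1, 0.0]), means, np.eye(2))
far = maha_layer_score(np.array([2.5, 2.5]), means, np.eye(2))
total = l_mahalanobis([np.array([0.1, 0.0])], [(means, np.eye(2))])
```

A feature close to some class mean gets a score near zero (IND-like), while a feature far from every mean gets a strongly negative score (OOD-like).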
Experimental Evaluation

Datasets
We evaluate our proposed approach on three benchmark SLU datasets and one in-house SLU dataset. Table 1 provides an overview of all datasets.
Among these datasets, the recently released CLINC dataset serves as a benchmark for OOD detection in SLU. For the other three datasets, we treat them as mutually OOD due to their non-overlapping domains.
We crowdsourced the in-house Movie dataset, which contains common questions that users may ask in the movie domain. The dataset consists of 38 different intents (e.g., rating information, genre information, award information, show trailer) and 20 slots or entities (e.g., director, award, release year). It was collected as follows. First, linguistic experts generated example template queries for each intent, along with intent and slot descriptions. Next, a generation crowdsourcing job was launched in which a crowd worker was assigned a random intent, a combination of entities, and a few slots generally associated with the intent. To better understand the intent and slots, the worker was asked to review the intent and slot descriptions and the example template utterances. The worker's first task was to provide 3 different queries corresponding to the given intent that also contain the provided entities. The second task was to provide additional entities corresponding to the same slot type. A subsequent validation crowdsourcing job was launched in which these crowdsourced queries were rated by validation workers for accuracy with respect to the provided intent and entities. Each query was rated by 5 different validation workers, and the final validated dataset contains the subset of crowdsourced queries with high accuracy scores and high inter-rater agreement.

Implementation and Training Details
We implemented our method in PyTorch on top of the Hugging Face transformer library (Wolf et al., 2019), following the hyperparameters of the original models. For the only new hyperparameter δ, we experimented only on the CLINC dataset, varying δ from 2.2 to 4 with a uniform interval of 0.2 (10 values of δ), based on sigmoid(2.2) ≈ 0.9 and sigmoid(4) ≈ 0.982. We used δ = 3, which gives the best performance in our experiments, for all datasets. We train each model for 3 epochs using 4 NVIDIA Tesla V100 GPUs (16GB) per training run. We conducted experiments on two transformer-based models, BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019).
Remarks: All experiments only use IND data for both training and validation. We use the same hyperparameters in all datasets and validate the generalizability of our method.

IND Intent Classification Baselines
We consider the strongest baseline, BERT-Linear (the last layer is linear), fine-tuned from pre-trained BERT-based models (Chen et al., 2019).

OOD Detection Baselines
We consider the following existing OOD detection methods:
ConGAN (Ryu et al., 2018): a GAN-based model that generates OOD features from given sentence representations with an additional feature-matching loss. OOD utterances are expected to have low discriminator confidence scores.
Autoencoder (AE) (Ryu et al., 2017): first uses an LSTM-based classifier model to train sentence representations, then trains an autoencoder on those sentence embeddings. OOD utterances are expected to have a high reconstruction error.
ODIN (Liang et al., 2017): we only use the temperature scaling on logits. OOD utterances are expected to have a low scaled confidence score.
Generalized-ODIN (G-ODIN) (Hsu et al., 2020): we fine-tune pre-trained BERT models with the replaced last layer and use only the decomposed confidence. We evaluate all three variations proposed in the paper, h_I, h_E, and h_C, and report the best one. OOD utterances are expected to have a low decomposed confidence score.
Mahalanobis: we use only the feature of BERT's last layer to compute the Mahalanobis distance score. OOD utterances are expected to have a low score.
For ConGAN and AE, we evaluate the models from the original papers as well as customized BERT-based backbone models as strong baselines. Specifically, we customize En-ConGAN and En-AE as follows: En-ConGAN uses the BERT sentence representation as input; En-AE applies a BERT classifier model to train the sentence representations and then uses them to train an autoencoder. Thus, En-ConGAN and En-AE are not existing baselines.
Note that ERAEPOG (Zheng et al., 2020) and O-Proto (Tan et al., 2019) are not comparable since they require additional unlabeled data and labels. We only put the ERAEPOG results on CLINC dataset (from the original paper) for reference.

IND Intent Classification Metrics
We evaluate IND performance using the classification accuracy metric as in literature (Liu and Lane, 2016;Wang et al., 2018;Chen et al., 2019).

OOD Detection Metrics
We follow the evaluation metrics in the literature (Ryu et al., 2018; Liang et al., 2017). Let TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives. We use the following OOD evaluation metrics:
EER (lower is better): (Equal Error Rate) measures the error rate at which the false positive rate (FPR) equals the false negative rate (FNR). Here, FPR = FP/(FP+TN) and FNR = FN/(TP+FN).
FPR95 (lower is better): (False Positive Rate (FPR) at 95% True Positive Rate (TPR)) can be interpreted as the probability that an OOD utterance is misclassified as IND when the true positive rate (TPR) is as high as 95%. Here, TPR=TP/(TP+FN).
Detection Error (lower is better): measures the misclassification probability when TPR is 95%:

P_err = 0.5 (1 − TPR) + 0.5 FPR

where the threshold on the confidence score s is set so that TPR = 95%. We follow the same assumption that both IND and OOD examples have an equal probability of appearing in the test set.
AUROC (higher is better): (Area under the Receiver Operating Characteristic Curve) The ROC curve is a graph plotting TPR against the FPR=FP/(FP+TN) by varying a threshold.
AUPR (higher is better): (Area under the Precision-Recall Curve) The PR curve plots precision against recall by varying a threshold. Here, precision = TP/(TP+FP) and recall = TP/(TP+FN). AUPR-IN and AUPR-OUT are AUPR with IND and OOD samples specified as positive, respectively.

In the result tables, our best method (DRM+L-Mahalanobis) is significantly better than each baseline model (without *) with p-value < 0.01 (marked by §) and p-value < 0.05 (marked by †) using a t-test. All methods marked with * are our proposed methods.
Note that EER, detection error, AUROC, and AUPR are threshold-independent metrics.
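For concreteness, FPR95 and the detection error at TPR = 95% can be computed directly from raw confidence scores, as in the sketch below (assuming higher scores mean more IND-like; IND is the positive class):

```python
import numpy as np

def ood_metrics_at_95_tpr(s_ind, s_ood):
    """FPR95 and detection error: set the threshold so that 95% of IND
    scores are accepted (TPR = 0.95), then measure how many OOD scores
    are wrongly accepted; the detection error assumes an equal prior."""
    thresh = np.quantile(s_ind, 0.05)        # 95% of IND scores lie above
    fpr = float(np.mean(s_ood >= thresh))    # OOD wrongly accepted as IND
    det_err = 0.5 * (1 - 0.95) + 0.5 * fpr   # equal IND/OOD probability
    return fpr, det_err

# well-separated scores: no OOD utterance crosses the threshold
fpr, det_err = ood_metrics_at_95_tpr(
    np.linspace(1.0, 2.0, 100),   # IND confidence scores
    np.linspace(-1.0, 0.0, 100),  # OOD confidence scores
)
```

With perfectly separated scores, FPR95 is 0 and the detection error bottoms out at 0.025, the residual cost of rejecting 5% of IND utterances.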

Statistical Significance
We also evaluate the statistical significance between all baselines and our best result (DRM + L-Mahalanobis) on all the above metrics. We train each model 10 times with different PyTorch random seeds, and report the average results and t-test statistical significance.

IND Classification Results

Table 3 reports the IND intent classification results on each dataset, fine-tuned using BERT and RoBERTa pre-trained models. It is interesting to observe that all DRM-combined models consistently achieve better classification accuracy, with up to 0.8% improvement (over the reproduced "No joint" row of Table 3 in (Chen et al., 2019)) on the Snips dataset. This is because the domain loss forces sigmoid(f_d) close to 1 and therefore also slightly mitigates its impact on IND classification; thus, the true posterior distribution of the IND data is modeled more precisely. For both BERT and RoBERTa backbones, our DRM models (marked by *) are significantly better than conventional BERT-Linear classification models on all datasets with p-value < 0.05 (marked by †) using a t-test.

OOD Detection Results
Results on CLINC Dataset: Table 2 reports the OOD detection results on the CLINC dataset, covering all existing work and our enhanced baselines. We focus on analyzing the contribution of each of our proposed techniques, DRM and L-Mahalanobis. The first three rows report the performance of existing approaches based on the original designs in their papers (ERAEPOG, in grey, uses additional unlabeled data). Unfortunately, their performance is even worse than the simple confidence-based BERT fine-tuning baseline (row 5). Thus, we mainly compare our method with the strong baselines built on BERT and RoBERTa models.
For a given OOD detection method, its combination with DRM consistently performs better than with standard models; the improvement is at least 1-2% on all metrics against our enhanced baselines. Among all OOD detection approaches, our proposed L-Mahalanobis approach achieves the best performance for both linear and DRM-combined BERT and RoBERTa models. It is not surprising that our DRM method combined with the better pre-trained RoBERTa model achieves a larger OOD detection improvement. Note that our customized En-AE performs much better than most other methods since it incorporates the enhanced reconstruction capability of pre-trained BERT models. However, En-AE cannot utilize all BERT layers as our proposed L-Mahalanobis method does, resulting in worse performance.
In addition, DRM+L-Mahalanobis models are significantly better than existing methods and enhanced baselines with p-value < 0.01 on most metrics for both BERT and RoBERTa backbones.
Ablation Study on CLINC Dataset: We analyze how our two novel components, DRM model and L-Mahalanobis, impact the performance.
The rows with "DRM" in "Last Layer" column of Table 2 show the performance of DRM model. As one can see, for all OOD methods, DRM consistently performs better than the conventional "Linear" last layer. Specifically, the DRM and Confidence combo also outperforms its closest baseline G-ODIN. This validates the effectiveness of our disentangled logits design in DRM based on the mathematical analysis of overconfidence. It also shows that our new domain loss can indeed enhance the model awareness that all training data is IND.
The rows with "L-Mahalanobis" in the "OOD Method" column of Table 2 show the performance of the L-Mahalanobis score, which consistently outperforms the other OOD scores for a given model.

Results on Other Datasets: When taking each dataset as IND, we use the other two mutually exclusive datasets and CLINC OOD as the OOD datasets for evaluating OOD detection performance. As Table 4 shows, our method outperforms other approaches on both the Snips and Movie IND datasets. For the ATIS IND dataset, En-AE on the Snips OOD dataset achieves almost perfect performance. This is because ATIS and Snips are almost completely non-overlapping, and ATIS is well designed with carefully selected varieties and entities in the airline travel domain. When taking Snips as IND and ATIS as OOD, it is interesting to see that our method achieves better performance than En-AE. This is because Snips contains a large number of entities, so the reconstruction error is lower and less separable than for ATIS OOD utterances. For both the Snips and Movie IND datasets, DRM+L-Mahalanobis is significantly better than the baseline methods with p-value < 0.01 in most cases for all OOD datasets. For the ATIS IND dataset, DRM+L-Mahalanobis shows similar behavior except against En-AE, since it is easier to train an autoencoder for the ATIS IND dataset due to its carefully collected clean training utterances.

Qualitative Analysis
We provide a qualitative analysis by visualizing our two methods, DRM and L-Mahalanobis. Figure 4 visualizes the score distributions; from left to right, they imply that a larger entropy of the scores reflects better uncertainty modeling. Figure 5 visualizes the utterance representations learned with or without DRM. The red IND data are tightly clustered within classes (all 150 CLINC IND classes), while the blue OOD data spread arbitrarily. As one can see, the blue dots in Figure 5(b) have less overlap with the red dots, indicating that DRM helps learn utterance representations that better disentangle IND and OOD data.

Conclusion
This paper proposes using only IND utterances to conduct intent classification and OOD detection for SLU in an open-world setting. The proposed DRM has a two-branch structure that avoids overconfidence and achieves better generalization. The evaluation shows that our method achieves state-of-the-art performance on various SLU benchmark and in-house datasets for both IND intent classification and OOD detection. In addition, thanks to the generic nature of our DRM design and the recent extensive use of BERT across data modalities, our work can contribute to improving both in-domain classification robustness and out-of-domain detection robustness for various classification tasks such as image classification, sound classification, and vision-language classification.

Impact Statement
Our proposed method has been deployed in the domain classification SLU model of the Samsung Bixby voice assistant. Beyond SLU, our work could have a broader impact on other applications that can benefit from a more robust classification system. For example, our method can help a robot detect objects more accurately or stop safely by correctly identifying unknown objects, classify environmental sounds, or detect anomalous sounds. Moreover, by better detecting OOD samples that differ from the training data distribution, our method can help handle distributional shifts between training data and practical usage data.