Out-of-Scope Intent Detection with Self-Supervision and Discriminative Training

Out-of-scope intent detection is of practical importance in task-oriented dialogue systems. Since the distribution of outlier utterances is arbitrary and unknown in the training stage, existing methods commonly rely on strong assumptions about the data distribution, such as a mixture of Gaussians, to make inference, resulting in either complex multi-step training procedures or hand-crafted rules such as confidence threshold selection for outlier detection. In this paper, we propose a simple yet effective method to train an out-of-scope intent classifier in a fully end-to-end manner by simulating the test scenario in training, which requires no assumption on data distribution and no additional post-processing or threshold setting. Specifically, we construct a set of pseudo outliers in the training stage, by generating synthetic outliers using inlier features via self-supervision and sampling out-of-scope sentences from easily available open-domain datasets. The pseudo outliers are used to train a discriminative classifier that can be directly applied to and generalize well on the test task. We evaluate our method extensively on four benchmark dialogue datasets and observe significant improvements over state-of-the-art approaches. Our code has been released at https://github.com/liam0949/DCLOOS.


Introduction
Conversational systems are becoming an indispensable component of a variety of AI applications, acting as an interactive interface that improves user experience. Language understanding is essential for conversational systems to provide appropriate responses to users, and intent detection is usually the first step of language understanding. The primary goal is to identify diverse intentions behind user utterances, which is often formalized as a classification task. However, intent classes defined during training are inevitably inadequate to cover all possible user intents at the test stage due to the diversity and randomness of user utterances. Hence, out-of-scope (or unknown) intent detection is essential, which aims to develop a model that can accurately identify known (seen in training) intent classes while detecting the out-of-scope classes that are not encountered during training.
Due to the practical importance of out-of-scope intent detection, recent efforts have attempted to solve this problem by developing effective intent classification models. In general, previous works approach this problem by learning decision boundaries for known intents and then using some confidence measure to distinguish known and unknown intents. For example, LMCL (Lin and Xu, 2019) learns the decision boundaries with a margin-based optimization objective, and SEG (Yan et al., 2020b) assumes the known intent classes follow a mixture-of-Gaussians distribution. After learning the decision boundaries, an off-the-shelf outlier detection algorithm such as LOF (Breunig et al., 2000) is commonly employed to derive confidence scores (Yan et al., 2020b; Shu et al., 2017; Lin and Xu, 2019; Hendrycks and Gimpel, 2017). If the confidence score of a test sample is lower than a predefined threshold, it is identified as an outlier.
However, it may be problematic to learn decision boundaries solely based on the training examples of known intent classes. First, if there are sufficient training examples, the learned decision boundaries can be expected to generalize well on known intent classes, but not on the unknown. Therefore, extra steps are required in previous methods, such as using an additional outlier detection algorithm at the test stage or adjusting the confidence threshold by cross-validation. On the other hand, if there are not sufficient training examples, the learned boundaries may not generalize well on either known or unknown intents. As a result, these methods often underperform when not enough training data is given. Hence, it is important to provide learning signals of unknown intents at the training stage to overcome these limitations.
In contrast to previous works, we adopt a different approach by explicitly modeling the distribution of unknown intents. In particular, we construct a set of pseudo out-of-scope examples to aid the training process. We hypothesize that in the semantic feature space, real-world outliers can be well represented by two types: "hard" outliers that are geometrically close to the inliers and "easy" outliers that are distant from the inliers. For the "hard" ones, we construct them in a self-supervised manner by forming convex combinations of the features of inliers from different classes. For the "easy" ones, the assumption is that they are unrelated to the known intent classes, so they can be used to simulate the randomness and diversity of user utterances. They can be easily constructed using public datasets. For example, in our experiments, we randomly collect sentences from datasets of other NLP tasks such as question answering and sentiment analysis as open-domain outliers.
In effect, by constructing pseudo outliers for the unknown class during training, we form a consistent (K + 1) classification task (K known classes + 1 unknown class) for both training and test. Our model can be trained with a cross-entropy loss and directly applied to test data for intent classification and outlier detection without requiring any further steps. As shown in Figure 1 (better view in color and enlarged), our method can learn better utterance representations, which make each known intent class more compact and push the outliers away from the inliers. Our main contributions are summarized as follows.
• We propose a novel out-of-scope intent detection approach by matching training and test tasks to bridge the gap between fitting to training data and generalizing to test data.
• We propose to efficiently construct two types of pseudo outliers by using a simple self-supervised method and leveraging publicly available auxiliary datasets.
• We conduct extensive experiments on four real-world dialogue datasets to demonstrate the effectiveness of our method and perform a detailed ablation study.
Related Work

Out-of-Distribution Detection
Early studies on outlier detection often adopt unsupervised clustering methods to detect malformed data (Hodge and Austin, 2004; Chandola et al., 2009; Zimek et al., 2012). In recent years, a substantial body of work has been directed towards improving the generalization capacity of machine learning models on out-of-distribution (OOD) data (Ruff et al., 2021; Hendrycks et al., 2020a). Hendrycks and Gimpel (2017) find that simple statistics derived from the output softmax probabilities of deep neural networks can be helpful for detecting OOD samples. Following this work,  propose to use temperature scaling and add small perturbations to input images to enlarge the gap between in-scope and OOD samples.  propose to add a Kullback-Leibler divergence term in the loss function to encourage assigning lower maximum scores to OOD data.
Recently, there is a line of work that employs synthetic or real-world auxiliary datasets to provide learning signals for improving model robustness under various forms of distribution shift (Goodfellow et al., 2015; Orhan, 2019; Hendrycks et al., 2019). In particular, Hendrycks et al. (2018) propose to leverage large-scale public datasets to represent outliers during training time and form a regularization term based on that. This idea is similar to our proposal of constructing open-domain outliers, but we use a simpler, end-to-end, (K+1)-way discriminative training procedure without any regularization term or threshold parameter.

Out-of-Scope Intent Detection
While Hendrycks et al. (2020b) find that pretrained transformer-based models like BERT are intrinsically more robust to OOD data, they suggest that there are still margins for improvement. Therefore, we build our model on top of BERT to improve intent detection under significant distribution shift. Previous methods for out-of-scope (or out-of-distribution) intent detection are commonly threshold-based, where models output a decision score and then compare it with a threshold that is predefined or selected by cross-validation.
There are mainly three branches of related work. The first group uses a confidence score which determines the likelihood of an utterance being out-of-scope. For example, Shu et al. (2017) build m binary sigmoid classifiers for m known classes respectively and select a threshold to reject OOD inputs whose probabilities are lower than the threshold across all m classifiers. Similar to the OOD data generation method used in , Ryu et al. (2018) employ GAN (Goodfellow et al., 2014) to generate simulated OOD examples with the generator and learn to reject simulated OOD examples with the discriminator.
The second group identifies out-of-scope sentences through reconstruction loss. For example, Ryu et al. (2017) build an autoencoder to encode and decode in-scope utterances and obtain reconstruction loss by comparing input embeddings with decoded ones. Out-of-scope utterances result in higher reconstruction loss.
The third group leverages off-the-shelf outlier detection algorithms such as local outlier factor (LOF) (Breunig et al., 2000), one-class SVM (Schölkopf et al., 2001), robust covariance estimators (Rousseeuw and Driessen, 1999), and isolation forest (Liu et al., 2008) to detect out-of-scope examples. Utterance embeddings belonging to a specific class will be mapped to the corresponding cluster (usually modeled by a Gaussian distribution) while out-of-scope samples will be pushed away from all in-scope clusters. Examples of this kind include SEG (Yan et al., 2020a) and LMCL (Lin and Xu, 2019). Very recently, Zhang et al. (2021) propose to learn adaptive decision boundaries after pre-training instead of using off-the-shelf outlier detection algorithms.
In addition, some other work focuses on out-of-scope detection in few-shot scenarios. Tan et al. (2019) leverage independent source datasets as simulated OOD examples to form a hinge loss term.  propose to pretrain BERT by a natural language understanding task with large-scale training data to transfer useful information for few-shot intent detection.
Finally, for our proposal of constructing synthetic outliers, the most similar method is Mixup proposed by . However, their method is designed for data augmentation to enhance in-distribution performance and requires corresponding combinations in the label space (Thulasidasan et al., 2019).

Methodology
Problem Statement In a dialogue system, given K predefined intent classes $S_{known} = \{C_i\}_{i=1}^{K}$, an unknown intent detection model aims at predicting the category of an utterance u, which may be one of the known intents or an out-of-scope intent $C_{oos}$. Essentially, it is a (K + 1)-way classification problem at the test stage. At the training stage, a set of N labeled utterances $D^{tr}_{l} = \{(u_i, c_i)\}_{i=1}^{N}$ with $c_i \in S_{known}$ is provided for training. Previous methods typically train a K-way classifier for the known intents.
Overview of Our Approach The mismatch between the training and test tasks, i.e., K-way classification vs. (K + 1)-way classification, leads to the use of strong assumptions and additional complexity in previous methods. Inspired by recent practice in meta learning to simulate test conditions in training (Vinyals et al., 2016), we propose to match the training and test settings. In essence, as shown in Figure 2, we formalize a (K + 1)-way classification task in the training stage by constructing out-of-scope samples via self-supervision and from open-domain data. Our method simply trains a (K + 1)-way classifier without making any assumption on the data distribution. After training, the classifier can be readily applied to the test task without any adaptation or post-processing. In the following, we elaborate on the details of our proposed method, including representation learning, construction of pseudo outliers, and discriminative training.

Representation Learning
We employ BERT (Devlin et al., 2019), a deep Transformer network, as the text encoder. Specifically, we take the d-dimensional output vector of the special classification token [CLS] as the representation of an utterance u, i.e., $h = \mathrm{BERT}(u) \in \mathbb{R}^{d}$.

Construction of Outliers
We construct two different types of pseudo outliers to be used in the training stage: synthetic outliers that are generated by self-supervision, and open-domain outliers that can be easily acquired.
Synthetic Outliers by Self-Supervision To improve the generalization ability of the unknown intent detection model, we propose to generate "hard" outliers in the feature space, which may have similar representations to the inliers of known intent classes. We hypothesize that those outliers may be geometrically close to the inliers in the feature space. Based on this assumption, we propose a self-supervised method to generate the "hard" outliers using the training set $D^{tr}_{l}$.
Specifically, in the feature space, we generate synthetic outliers by using convex combinations of the features of inliers from different intent classes:
$$h^{oos} = \theta\, h^{\beta} + (1 - \theta)\, h^{\alpha},$$
where $h^{\beta}$ and $h^{\alpha}$ are the representations of two utterances which are randomly sampled from different intent classes in $D^{tr}_{l}$, i.e., $c^{\beta} \neq c^{\alpha}$, and $h^{oos}$ is the synthetic outlier. For example, $\theta$ can be sampled from a uniform distribution $U(0, 1)$. In this case, when $\theta$ is close to 0 or 1, it will generate "harder" outliers that only contain a small proportion of mix-up from different classes. In essence, "hard" outliers act like support vectors in SVM (Cortes and Vapnik, 1995), and "harder" outliers could help to train a more discriminative classifier.
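The convex-combination construction described above can be sketched in plain Python. This is only an illustration under stated assumptions: features are represented as lists of floats, and the helper names `mix_features` and `sample_synthetic_outliers` are hypothetical rather than taken from the released code.

```python
import random

def mix_features(h_beta, h_alpha, theta):
    """Convex combination theta * h_beta + (1 - theta) * h_alpha of two feature vectors."""
    return [theta * b + (1.0 - theta) * a for b, a in zip(h_beta, h_alpha)]

def sample_synthetic_outliers(features, labels, num_outliers, rng):
    """Mix randomly chosen inlier features from two *different* classes.

    features: list of feature vectors (lists of floats), one per inlier
    labels:   list of class labels aligned with `features`
    rng:      a random.Random instance (assumed here for reproducibility)
    """
    by_class = {}
    for h, c in zip(features, labels):
        by_class.setdefault(c, []).append(h)
    classes = sorted(by_class)
    if len(classes) < 2:
        raise ValueError("need inliers from at least two classes")

    outliers = []
    for _ in range(num_outliers):
        c_beta, c_alpha = rng.sample(classes, 2)   # two distinct classes
        h_beta = rng.choice(by_class[c_beta])
        h_alpha = rng.choice(by_class[c_alpha])
        theta = rng.random()                       # theta drawn uniformly from [0, 1)
        outliers.append(mix_features(h_beta, h_alpha, theta))
    return outliers
```

Because the mixing happens on already-computed features, generating many outliers costs only a few vector operations per sample, which is what makes a large synthetic set cheap to build.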
The generated outliers $h^{oos}$ are assigned to the class of $C_{oos}$, the (K + 1)-th class in the feature space, forming a training set $D^{tr}_{co} = \{(h^{oos}_i, c_i = C_{oos})\}_{i=1}^{M}$. Notice that since the outliers are generated in the feature space, it is very efficient to construct a large outlier set $D^{tr}_{co}$.
Open-Domain Outliers In practical dialogue systems, user input can be arbitrary free-form sentences. To simulate real-world outliers and provide learning signals representing them in training, we propose to construct a set of open-domain outliers, which can be easily obtained. Specifically, the set of free-form outliers $D_{fo}$ can be constructed by collecting sentences from various public datasets that are disjoint from the training and test tasks. There are many datasets available, including the question answering dataset SQuAD 2.0 (Rajpurkar et al., 2018), the sentiment analysis datasets Yelp (Meng et al., 2018) and IMDB (Maas et al., 2011), and dialogue datasets from different domains.
In the feature space, $D^{tr}_{fo} = \{(h^{oos}_i, c_i = C_{oos})\}_{i=1}^{H}$. Both synthetic outliers and open-domain outliers are easy to construct. As will be demonstrated in Section 4, both of them are useful, but synthetic outliers are much more effective than open-domain outliers in improving the generalization ability of the trained (K + 1)-way intent classifier.

Discriminative Training
After constructing the pseudo outliers, in the feature space, our training set $D^{tr}$ now consists of a set of inliers $D^{tr}_{l}$ and two sets of outliers $D^{tr}_{co}$ and $D^{tr}_{fo}$, i.e., $D^{tr} = D^{tr}_{l} \cup D^{tr}_{co} \cup D^{tr}_{fo}$ and $|D^{tr}| = N + M + H$. Therefore, in the training stage, we can train a (K + 1)-way classifier with the intent label set $S = S_{known} \cup \{C_{oos}\}$, which can be directly applied in the test stage to identify unknown intents and classify known ones. In particular, we use a multilayer perceptron network, $\Phi(\cdot)$, as the classifier in the feature space. The selection of the classifier is flexible, and the only requirement is that it is differentiable. Then, we train our model using a cross-entropy loss:
$$\mathcal{L} = -\frac{1}{|D^{tr}|} \sum_{i=1}^{|D^{tr}|} \log \frac{\exp\left(\Phi(h_i)^{c_i} / \tau\right)}{\sum_{j \in S} \exp\left(\Phi(h_i)^{j} / \tau\right)},$$
where $\Phi(h_i)^{c_i}$ refers to the output logit of $\Phi(\cdot)$ for the ground-truth class $c_i$, and $\tau \in \mathbb{R}^{+}$ is an adjustable scalar temperature parameter.
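As a rough illustration of the training objective, the sketch below computes a temperature-scaled softmax cross-entropy over K + 1 classes in plain Python. It assumes the standard form in which logits are divided by τ before the softmax and the loss is averaged over the batch; the function names are hypothetical, and the out-of-scope class is assumed to occupy index K.

```python
import math

def temperature_cross_entropy(logits, target, tau=0.1):
    """Negative log-probability of `target` under softmax(logits / tau)."""
    scaled = [z / tau for z in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    log_norm = m + math.log(sum(math.exp(s - m) for s in scaled))
    return log_norm - scaled[target]

def batch_loss(batch_logits, targets, tau=0.1):
    """Average loss over a batch of (K + 1)-way logit vectors."""
    losses = [temperature_cross_entropy(z, t, tau)
              for z, t in zip(batch_logits, targets)]
    return sum(losses) / len(losses)

# Example with K = 2 known classes; index 2 plays the role of the out-of-scope class.
example_logits = [[2.0, 0.1, -1.0],   # an inlier of class 0
                  [0.0, 0.2, 1.5]]    # a pseudo outlier
example_targets = [0, 2]
example_loss = batch_loss(example_logits, example_targets, tau=0.1)
```

A small τ sharpens the softmax, so confidently wrong predictions are penalized more heavily than they would be with τ = 1.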

Experiments
In this section, we present the experimental results of our proposed method on the targeted task of unknown intent detection. Given a test set comprised of known and unknown intent classes, the primary goal of an unknown intent detection model is to assign correct intent labels to utterances in the test set. Notice that the unknown intent label $C_{oos}$ is also included as a special class for prediction.

Datasets and Baselines
We evaluate our proposed method on four benchmark datasets as follows, three of which are newly released dialogue datasets designed for intent detection. The statistics of the datasets are summarized in Table 2.
CLINC150 (Larson et al., 2019) is a dataset specially designed for out-of-scope intent detection, which consists of 150 known intent classes from 10 domains. The dataset includes 22,500 in-scope queries and 1,200 out-of-scope queries. For the in-scope ones, we follow the original splitting, i.e., 15,000, 3,000, and 4,500 for training, validation, and testing respectively. For the out-of-scope ones, we group all of the 1,200 queries into the test set.
Banking (Casanueva et al., 2020) is a fine-grained intent detection dataset in the banking domain. It consists of 9,003, 1,000, and 3,080 user queries in the training, validation, and test sets respectively.
M-CID (Arora et al., 2020) is a recently released dataset related to Covid-19. We use the English subset of this dataset, referred to as M-CID-EN, in our experiments, which covers 16 intent classes. The splitting of M-CID-EN is 1,258 for training, 148 for validation, and 339 for test.
We extensively compare our method with the following unknown intent detection methods.
• Maximum Softmax Probability (MSP) (Hendrycks and Gimpel, 2017) employs the confidence score derived from the maximum softmax probability to predict the class of a sample. The idea under the hood is that the lower the confidence score is, the more likely the sample is of an unknown intent class.
• DOC (Shu et al., 2017) constructs m 1-vs-rest sigmoid classifiers for m seen classes respectively. It uses the maximum probability from these classifiers as the confidence score to conduct classification.
• SEG (Yan et al., 2020a) models the intent distribution as a margin-constrained Gaussian mixture distribution and uses an additional outlier detector, local outlier factor (LOF) (Breunig et al., 2000), to achieve unknown intent detection.
• LMCL (Lin and Xu, 2019) learns discriminative embeddings with a large margin cosine loss. It also uses LOF as the outlier detection algorithm.
• Softmax (Yan et al., 2020a) uses a softmax loss to learn discriminative features based on the training dataset, which also requires an additional outlier detector such as LOF for detecting the unknown intents.
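To make the contrast with threshold-based baselines concrete, the following is a minimal sketch of the maximum-softmax-probability decision rule. The threshold value and the `oos_label` sentinel are assumptions chosen for illustration; in practice the threshold has to be tuned, which is exactly the extra step a (K + 1)-way classifier avoids.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def msp_predict(logits, threshold, oos_label=-1):
    """Return the arg-max class, or `oos_label` if its probability is below `threshold`."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best if probs[best] >= threshold else oos_label
```

For instance, with a threshold of 0.5, a sharply peaked logit vector is accepted as a known class, while a nearly flat one is rejected as out-of-scope.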

Experimental Setup and Evaluation Metrics
To compare with existing methods, we follow the setting in LMCL (Lin and Xu, 2019). Specifically, for each dataset, we randomly sample 75%, 50%, and 25% of the intent classes from the training set as the known classes to conduct training, and we set aside the rest as the unknown classes for test.
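The class-sampling protocol can be sketched as follows. This is an illustrative reading of the setup, not the authors' script: the rounding rule for the number of known classes, the `"oos"` label string, and the helper names are all assumptions.

```python
import random

def sample_known_classes(all_classes, known_ratio, rng):
    """Randomly pick a fraction of the intent classes to treat as known."""
    classes = sorted(set(all_classes))
    num_known = max(1, round(len(classes) * known_ratio))  # rounding rule is an assumption
    return set(rng.sample(classes, num_known))

def filter_training_set(examples, known):
    """Keep only training examples whose label is a known class."""
    return [(u, c) for u, c in examples if c in known]

def relabel_test_set(examples, known, oos_label="oos"):
    """Map every test label outside the known set to the out-of-scope label."""
    return [(u, c if c in known else oos_label) for u, c in examples]
```

Training and validation data are filtered with `filter_training_set`, so the model never sees utterances from the held-out classes, while `relabel_test_set` turns those classes into the single out-of-scope class at test time.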
Notice that for training and validation, we only use data within the chosen known classes and do not expose our model to any test-time outliers. Unless otherwise specified, in each training batch, we keep the ratio of inliers, open-domain outliers, and self-supervised outliers roughly as 1 : 1 : 4. This setting is empirically chosen and affected by the memory limit of the NVIDIA 2080 Ti GPU, which we use for conducting the experiments. The number of pseudo outliers can be adjusted according to different environments, and a larger number of self-supervised outliers typically takes more time to converge.
We use PyTorch (Paszke et al., 2019) as the backend to conduct the experiments. We use the pretrained BERT model (bert-base-uncased) provided by Wolf et al. (2019) as the encoder for utterances. We use the output vector of the special classification token [CLS] as the utterance embedding and fix its dimension as 768 by default throughout all of our experiments. To ensure a fair comparison, all baselines and our model use the same encoder.
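The 1 : 1 : 4 batch composition can be illustrated with a small helper that derives the number of pseudo outliers from the number of inliers and assigns them the out-of-scope label. The function name and the convention that the out-of-scope class takes index K are assumptions made for this sketch.

```python
def build_batch_labels(inlier_labels, num_known, ratio=(1, 1, 4)):
    """Return labels for one training batch plus the pseudo-outlier counts.

    inlier_labels: class indices in [0, num_known) for the inliers in the batch
    num_known:     K, so the out-of-scope class is assigned index K (an assumption)
    ratio:         inliers : open-domain outliers : self-supervised outliers
    """
    r_in, r_open, r_synth = ratio
    n = len(inlier_labels)
    num_open = n * r_open // r_in
    num_synth = n * r_synth // r_in
    oos_index = num_known
    labels = list(inlier_labels) + [oos_index] * (num_open + num_synth)
    return labels, num_open, num_synth
```

With 100 inliers per batch, this yields 100 open-domain and 400 self-supervised outliers, all labeled with the (K + 1)-th class.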
Table 3: Macro f1-score of the known classes and f1-score of the unknown class with different proportions of seen classes. For each setting, the best result is marked in bold.
For model optimization, we use AdamW provided by Wolf et al. (2019) to fine-tune BERT and Adam proposed by Kingma and Ba (2015) to train the MLP classifier Φ(·). We set the learning rate for BERT as 1e-5 as suggested by Devlin et al. (2019). For the MLP classifier, the learning rate is fixed as 1e-4. Notice that the fine-tuning of BERT is conducted simultaneously with the training of the classifier Φ(·) with the same cross-entropy loss. The MLP classifier Φ(·) has a two-layer architecture with [1024, 1024] as hidden units. The temperature parameter τ is selected by cross-validation and set as 0.1 in all experiments. Following LMCL (Lin and Xu, 2019), we use overall accuracy and macro f1-score as evaluation metrics. All results reported in this section are the average of 10 runs with different random seeds, and each run is stopped once performance on the validation set reaches a plateau. For baselines, we follow their original training settings except for using the aforementioned BERT as the text encoder.
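For reference, the evaluation metrics can be computed as in the sketch below, which reports overall accuracy, macro f1 over all classes, macro f1 over the known classes only, and f1 of the out-of-scope class. It is a plain-Python illustration; the `"oos"` label string and the function names are assumptions.

```python
def f1_per_class(y_true, y_pred):
    """Per-class F1 computed as 2*TP / (2*TP + FP + FN)."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores

def summarize(y_true, y_pred, oos_label="oos"):
    """Overall accuracy, macro F1 (all classes), macro F1 (known classes), and F1 of the OOS class."""
    acc = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    scores = f1_per_class(y_true, y_pred)
    known = [s for label, s in scores.items() if label != oos_label]
    macro_all = sum(scores.values()) / len(scores)
    macro_known = sum(known) / len(known) if known else 0.0
    return acc, macro_all, macro_known, scores.get(oos_label, 0.0)
```

Reporting the known-class and out-of-scope scores separately makes it visible when a method gains on one side at the expense of the other.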

Result Analysis
We present our main results in Table 1 and Table 3. Specifically, Table 1 gives results in overall accuracy and macro f1-score for all classes including the outlier class, while Table 3 shows results in macro f1-score for the known classes and f1-score for the outlier class respectively. It can be seen that, on all benchmarks and in almost every setting, our model significantly outperforms the baselines. As shown in Table 3, our method achieves favorable performance on both unknown and known intent classes simultaneously.
It is worth mentioning that the large improvements of our method in scenarios with small labeled training sets (25% and 50% settings) indicate its great potential in real-life applications, since a practical dialogue system often needs to deal with a larger proportion of outliers than inliers due to differences in user demographics, users' unfamiliarity with the platform, and the limited set of intent classes recognized by the system (especially at the early development stage). More importantly, referring to Table 3, as the proportion of known intents increases, it can be seen that the performance gains of the baselines mainly lie in the known classes. In contrast, our method can strike a better balance between the known and unknown classes without relying on an additional outlier detector, margin tuning, or threshold selection, demonstrating its high effectiveness and generality. Take the Softmax baseline as an example: in the 75% case of CLINC150, it achieves a slightly higher result than our model on the known classes but a substantially lower result on the unknown ones.

Effect of Pseudo Outliers
We conduct an ablation study on the effectiveness of the two kinds of pseudo outliers and summarize the results in Table 4. The first row of the three settings (25%, 50%, and 75%) stands for training solely with the labeled examples of CLINC150 without using any pseudo outliers. In general, self-supervised synthetic outliers and open-domain outliers both lead to positive effects on classification performance. For each setting, comparing the second row with the third, we can observe that the synthetic outliers produced by convex combinations lead to a much larger performance gain than that of pre-collected open-domain outliers. Finally, combining them for training leads to the best results, as shown in the fourth row of each setting.
Next, we conduct experiments to study the impact of varying the number of the two kinds of pseudo outliers separately, as shown in Figure 3. We first fix the number of open-domain outliers as zero and then increase the number of self-supervised outliers. The results are displayed in Figure 3 (a), (b) and (c). In particular, as the number of self-supervised outliers grows, the performance first increases quickly and then grows slowly. On the other hand, we fix the number of self-supervised outliers as zero and then increase the number of open-domain outliers. The results are shown in Figure 3 (d), (e) and (f), where it can be seen that dozens of open-domain outliers can already bring significant improvements, though the gain is much smaller compared to that of the self-supervised outliers.
Finally, we investigate the impact of the number of self-supervised outliers on overall intent detection accuracy with both the number of inliers and the number of open-domain outliers fixed as 100 per training batch. As shown in Figure 4, we increase the number of self-supervised outliers from 0 to 5000. Note that 400 is the default setting used in Table 1 and Table 3. We can see that comparable results can be obtained for a wide range of numbers. However, when the number grows to 5000, the performance exhibits a significant drop. We hypothesize that as the number increases, the generated synthetic outliers may be less accurate, because some convex combinations may fall within the scope of known classes.
To summarize, self-supervised outliers play a much more important role than open-domain outliers for unknown intent classification. Self-supervised outliers not only provide better learning signals for the unknown intents, but also impose an important positive effect on the known ones. For the open-domain outliers, if used alone, they can only provide limited benefit. But in combination with the self-supervised ones, they can further enhance the performance.

Selection of Open-Domain Outliers
To demonstrate the flexibility of our method in selecting open-domain outliers as described in Section 3.2, we train our model on CLINC150 using open-domain outliers from different sources. The results are summarized in Table 5. Specifically, Open-bank and Open-stack stand for using  (Rajpurkar et al., 2018), Yelp (Meng et al., 2018), and IMDB (Maas et al., 2011). It can be seen that the performance of our model is insensitive to the selection of open-domain outliers.

Efficiency
We provide a quantitative comparison on the training and test efficiency for our method and the baselines, by calculating the average time (in seconds) for training per epoch and the total time for testing under the 75% setting. Here, we only compare with the strongest baselines. As shown in Figure 5, even with the pseudo outliers, the training time of our method is comparable to that of the baselines. Importantly, in the test stage, our method demonstrates significant advantages in efficiency, which needs much less time to predict intent classes for all samples in the test set.

Conclusion
We have proposed a simple, effective, and efficient approach for out-of-scope intent detection by overcoming the limitation of previous methods via matching train-test conditions. Particularly, at the training stage, we construct self-supervised and open-domain outliers to improve model generalization and simulate real outliers in the test stage. Extensive experiments on four dialogue datasets show that our approach significantly outperforms state-of-the-art methods. In the future, we plan to investigate the theoretical underpinnings of our approach and apply it to more applications.