Modeling Discriminative Representations for Out-of-Domain Detection with Supervised Contrastive Learning

Detecting Out-of-Domain (OOD) or unknown intents from user queries is essential in a task-oriented dialog system. A key challenge of OOD detection is to learn discriminative semantic features. Traditional cross-entropy loss only focuses on whether a sample is correctly classified, and does not explicitly distinguish the margins between categories. In this paper, we propose a supervised contrastive learning objective that minimizes intra-class variance by pulling together in-domain intents belonging to the same class and maximizes inter-class variance by pushing apart samples from different classes. Besides, we employ an adversarial augmentation mechanism to obtain pseudo diverse views of a sample in the latent space. Experiments on two public datasets prove the effectiveness of our method in capturing discriminative representations for OOD detection.


Introduction
Detecting Out-of-Domain (OOD) or unknown intents from user queries is an essential component of a task-oriented dialog system (Gnewuch et al., 2017; Akasaki and Kaji, 2017; Tulshan and Dhage, 2018; Shum et al., 2018). It aims to know when a user query falls outside the range of predefined supported intents so as to avoid performing wrong operations. Different from normal intent detection tasks, we do not know the exact number of unknown intents in practical scenarios and can barely annotate extensive OOD samples. The lack of real OOD examples leads to poor prior knowledge about these unknown intents, making it challenging to identify OOD samples in a task-oriented dialog system.
Previous methods of OOD detection can be generally classified into two types: supervised and unsupervised OOD detection. Supervised OOD detection (Scheirer et al., 2013; Fei and Liu, 2016; Kim and Kim, 2018; Larson et al., 2019; Zheng et al., 2020; Zeng et al., 2021b) means that there are extensive labeled OOD samples in the training data. In contrast, unsupervised OOD detection (Bendale and Boult, 2016; Hendrycks and Gimpel, 2017; Shu et al., 2017; Lee et al., 2018; Ren et al., 2019; Lin and Xu, 2019; Zeng et al., 2021a) means no labeled OOD samples are available apart from labeled in-domain data. Specifically, for supervised OOD detection, Fei and Liu (2016) and Larson et al. (2019) form a (N+1)-class classification problem where the (N+1)-th class represents the unseen intents. Further, Zheng et al. (2020) uses labeled OOD data to generate an entropy regularization term that enforces the predicted distribution of OOD inputs to be closer to the uniform distribution. However, these methods heavily rely on large-scale, time-consuming labeled OOD data. Compared to these supervised methods, unsupervised OOD detection first learns discriminative intent representations via in-domain (IND) data, then employs detection algorithms, such as Maximum Softmax Probability (MSP) (Hendrycks and Gimpel, 2017), Local Outlier Factor (LOF) (Lin and Xu, 2019), and Gaussian Discriminant Analysis (GDA), to compute the similarity of features between OOD samples and IND samples. In this paper, we focus on unsupervised OOD detection.
A key challenge of unsupervised OOD detection is to learn discriminative semantic features via IND data. We hope to cluster the same type of IND intents more tightly and separate different types of IND intents further. Traditional softmax loss (Hendrycks and Gimpel, 2017) only focuses on whether a sample is correctly classified and does not explicitly distinguish the margins between categories. Further, Lin and Xu (2019) propose a large margin cosine loss (LMCL) (Wang et al., 2018) which maximizes the decision margin in the latent space. LMCL forces the model to not only classify correctly but also maximize inter-class variance and minimize intra-class variance. Following a similar motivation, we aim to pull intents belonging to the same class together while simultaneously pushing apart samples from different classes to further model discriminative semantic features.
In this paper, we propose a supervised contrastive learning (SCL) model to learn discriminative semantic intent representation for OOD detection. SCL aims to minimize intra-class variance by pulling together IND intents belonging to the same class and maximize inter-class variance by pushing apart samples from different classes. Empirical results demonstrate the effectiveness of discriminative representation for OOD detection. Besides, to enhance the diversity of data augmentation in SCL, we employ an adversarial attack mechanism to obtain pseudo hard positive samples in the latent space by computing model-agnostic adversarial worst-case perturbations to the inputs. Our contributions are three-fold: (1) To the best of our knowledge, we are the first to apply supervised contrastive learning to OOD detection. (2) Compared to cross-entropy (CE) loss, SCL+CE can maximize inter-class variance and minimize intra-class variance to learn discriminative semantic representation. (3) Extensive experiments and analysis on two public datasets demonstrate the effectiveness of our method.

Methodology
Overall Architecture Fig 1 shows the overall architecture of our proposed method. As Fig 1(a) displays, we first train an IND intent classifier using CE or SCL+CE objectives in the training stage. Then in the test stage, we extract the intent feature of a test sample and employ the detection algorithms MSP (Hendrycks and Gimpel, 2017), LOF (Lin and Xu, 2019) or GDA to detect OOD. Fig 1(b) demonstrates the effectiveness of our method in capturing discriminative intent representations, where SCL+CE can maximize inter-class variance and minimize intra-class variance.
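As a sketch of the test-stage detection step, the MSP rule can be written in a few lines. This is an illustrative implementation, not the paper's code; the encoder producing the logits is assumed, and the 0.5 threshold follows the dev-set choice reported in the appendix:

```python
import numpy as np

def msp_is_ood(logits, threshold=0.5):
    """Maximum Softmax Probability check: flag a query as OOD when the
    max softmax probability falls below the threshold."""
    z = logits - logits.max()          # shift for numerical stability
    p = np.exp(z) / np.exp(z).sum()    # softmax over IND intent classes
    return bool(p.max() < threshold)   # True -> predict OOD
```

A confidently classified IND query yields a peaked softmax and passes the check, while a query the classifier is unsure about produces a flat distribution and is flagged as OOD.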
Supervised Contrastive Learning We first review the classic cross-entropy (CE) loss and its improved version, large margin cosine loss (LMCL), and then explain our supervised contrastive loss (SCL) in detail. Given an IND sample x_i and its intent label y_i, we adopt a BiLSTM (Hochreiter and Schmidhuber, 1997) or BERT (Devlin et al., 2019) encoder to get the intent representation s_i. The CE loss and LMCL are defined as follows:

$\mathcal{L}_{CE} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{W_{y_i}^{\top}s_i}}{\sum_{j}e^{W_j^{\top}s_i}}$ (1)

$\mathcal{L}_{LMCL} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{\tau(\cos(W_{y_i},s_i)-m)}}{e^{\tau(\cos(W_{y_i},s_i)-m)}+\sum_{j\neq y_i}e^{\tau\cos(W_j,s_i)}}$ (2)

where N denotes the number of training samples, y_i is the ground-truth class of the i-th sample, τ is the temperature factor, W_j is the weight vector of the j-th class, and m is the cosine margin. Compared to CE, LMCL adds a normalized decision margin on the negative classes and forces the model to explicitly distinguish the positive class from the negative classes. Our experiment in Section 3.2 shows LMCL can slightly improve the performance of OOD detection. To further model discriminative intent representations, motivated by recent contrastive learning work (Khosla et al., 2020; Gunel et al., 2020), we propose a supervised contrastive learning objective to minimize intra-class variance and maximize inter-class variance:

$\mathcal{L}_{SCL} = -\frac{1}{N}\sum_{i=1}^{N}\frac{1}{N_{y_i}-1}\sum_{j=1}^{N}\mathbb{1}_{i\neq j}\,\mathbb{1}_{y_i=y_j}\log\frac{e^{s_i\cdot s_j/\tau}}{\sum_{k=1}^{N}\mathbb{1}_{i\neq k}\,e^{s_i\cdot s_k/\tau}}$

where N_{y_i} is the total number of examples in the batch that have the same label as y_i and 1 is an indicator function. Note that we only perform SCL on the IND data since we focus on unsupervised OOD detection, where no labeled OOD data exists. As Fig 1(b) shows, SCL pulls together IND intents belonging to the same class and pushes apart samples from different classes, which helps recognize OOD intents near the decision boundary.
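A minimal NumPy sketch of the SCL objective above, computed over one batch (the encoder producing `feats` is assumed; representations are L2-normalized so the dot product matches the cosine form):

```python
import numpy as np

def supervised_contrastive_loss(feats, labels, tau=0.1):
    """Batch-level SCL: each anchor is pulled toward same-label samples
    (positives) and pushed away from all other samples in the batch."""
    feats = np.asarray(feats, dtype=float)
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = feats @ feats.T / tau                 # temperature-scaled similarities
    n = len(labels)
    total = 0.0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue                            # anchor with no positive pair
        others = [k for k in range(n) if k != i]
        log_denom = np.log(np.sum(np.exp(sim[i, others])))
        # average log-softmax over the N_{y_i} - 1 positives of anchor i
        total += -sum(sim[i, j] - log_denom for j in positives) / len(positives)
    return total / n
```

Tightly clustered same-class features yield a lower loss than scattered ones, which is exactly the intra-class compactness the objective rewards.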
In the implementation, we first pre-train the intent classifier using SCL, then finetune the model using CE or LMCL, both on the IND data. We compare iterative training and joint training in the appendix.
Adversarial Augmentation Chen et al. (2020) has proved the necessity of data augmentation for contrastive learning. However, there is no simple and effective augmentation strategy in the NLP area, and existing strategies require much handcrafted engineering. Thus, we apply adversarial attacks (Goodfellow et al., 2015; Kurakin et al., 2016; Jia and Liang, 2017; Zhang et al., 2019) to generate pseudo positive samples and increase the diversity of views for contrastive learning. Specifically, we compute the worst-case perturbation δ that maximizes the original cross-entropy loss:

$\delta = \arg\max_{\|\delta\|\leq\epsilon}\mathcal{L}_{CE}(\theta, x+\delta)$

where θ represents the parameters of the intent classifier, x denotes a given sample, and ε is the norm bound of the perturbation δ. We apply Fast Gradient Value (FGV) (Rozsa et al., 2016) to approximate the perturbation δ:

$\delta \approx \epsilon\,\frac{g}{\|g\|}, \quad g = \nabla_{x}\mathcal{L}_{CE}(\theta, x)$

We normalize g and then use a small ε to ensure the approximation is reasonable. Finally, we obtain the pseudo augmented sample x_adv = x + δ in the latent space. The pseudo samples are applied to augment positive views per anchor in SCL. The ablation study in Section 3.3 shows adversarial augmentation significantly improves the performance of SCL for OOD detection.
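A toy sketch of the FGV step on a linear softmax classifier. The paper perturbs latent representations of a BiLSTM/BERT encoder via backpropagation; here the analytic gradient of a small linear model stands in for that, purely for illustration:

```python
import numpy as np

def ce_loss(W, x, y):
    """Cross-entropy of a linear softmax classifier with logits W @ x."""
    z = W @ x
    z = z - z.max()
    return -(z[y] - np.log(np.exp(z).sum()))

def fgv_augment(W, x, y, eps=0.5):
    """Fast Gradient Value: take the raw gradient of the CE loss w.r.t.
    the input, normalize it, and scale by the norm bound eps."""
    z = W @ x
    p = np.exp(z - z.max())
    p = p / p.sum()
    onehot = np.zeros_like(p)
    onehot[y] = 1.0
    g = W.T @ (p - onehot)            # analytic d(CE)/dx for this toy model
    delta = eps * g / np.linalg.norm(g)
    return x + delta                  # pseudo "hard positive" view of x
```

By construction the perturbed sample sits in the loss-increasing direction, so it serves as a harder positive view of the same intent for SCL.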

Setup
Datasets We use two benchmark OOD datasets, CLINC-Full and CLINC-Small (Larson et al., 2019). We report IND metrics, Accuracy (Acc) and F1, and OOD metrics, Recall and F1. OOD Recall and F1 are the main evaluation metrics in this paper.

Baselines We adopt LSTM and BERT as our intent classifiers and compare SCL with CE and LMCL. Since SCL alone cannot classify in-domain intents directly, we first pre-train the classifier using SCL, then finetune the model using CE or LMCL, both on the IND data. We use three OOD detection algorithms, MSP, LOF and GDA, to verify the generalization capability of SCL. We present dataset statistics, implementation details, and results on MSP and LOF in the appendix.

Main Results
Tab 1 shows the main results on CLINC-Full and CLINC-Small. SCL+CE outperforms CE on OOD metrics, and the results confirm SCL also helps IND intent detection. The difference between OOD and IND improvements is probably attributed to metric scale and data imbalance in the original test set. Besides, SCL gains higher improvements on CLINC-Small than on CLINC-Full, which displays the advantage of our approach in the few-shot scenario (see details in Section 3.3). SCL also gets consistent improvements on BERT, by 2.60% (Recall) and 1.86% (F1) on CLINC-Full OOD metrics, and 0.84% (Acc) and 0.59% (F1) on CLINC-Full IND metrics, substantiating that our method is model-agnostic across different OOD detection architectures.

Analysis
Analysis of IND feature distribution. We analyze the representation distribution of IND data on the CLINC-Full dataset from two perspectives, intra-class and inter-class. We choose SCL+CE based on GDA to perform the analysis. Tab 2 shows the statistics of intra-class variance, which indicate how tightly intra-class data representations cluster. Specifically, we average the variances of each sample's normalized representation with the same intent label to its cluster center in the test set as the cluster intra-class variance, then report min/max/mean/median values over all cluster intra-class variances. Results show SCL effectively decreases intra-class variances, especially in terms of max and mean values, which confirms SCL can tighten intra-class intent representations. Fig 2 shows the inter-class distances. We average dot-product distances between each class center and its k nearest class centers, then average the results of all classes as the inter-class distance. The X-axis denotes the value of k. We observe a significant increase for SCL+CE compared to CE. When k is smaller, the increase is more obvious. It verifies SCL can maximize inter-class variance and distinguish intent classes. We also provide visualization analysis in the appendix. In summary, SCL can pull together IND intents belonging to the same class and push apart samples from different classes, which makes representations more discriminative.

Effect of IND Training Data Size. Tab 3 shows the effect of IND training data size. We randomly choose training data with a certain proportion from CLINC-Full IND data and use the original test set for evaluation. We use the LSTM+GDA setting. Results show SCL+CE consistently outperforms CE. Besides, with the decrease of training data size, the relative improvements gradually increase. It proves SCL has strong robustness for improving OOD detection, especially in the few-shot scenario.

Analysis of Adversarial Perturbation Norm.
Fig 3 shows the effect of the adversarial perturbation norm on OOD detection performance. We conduct the experiments on the CLINC-Full dataset, using LSTM and GDA. The X-axis denotes the value of ε. The CE+GDA dashed line means no SCL pre-training, and ε = 0.0 in the SCL+CE+GDA solid line means no adversarial augmentation. In general, both SCL and adversarial augmentation contribute to the improvements, and ε ∈ (1.0, 2.0) achieves better performance. Compared with the baseline without SCL, the SCL+CE method with a smaller adversarial perturbation can still obtain better results, though lower than the results with an optimal range of perturbation, while large norms tend to damage the effect of SCL. Our method performs well across a broad range of adversarial perturbations and is insensitive to this hyperparameter.

Parameter Analysis. As our proposed SCL is a method involving contrastive learning, we analyze batch sizes and representation dimensions to further verify its effectiveness; the results are presented in Table 4. We conduct experiments on the CLINC-Full dataset, using LSTM with the SCL+CE objective for training and GDA for detection. With the increase of batch size and representation dimension, both in-domain and OOD metrics are slightly improved. However, compared with the gains brought by our method itself, these improvements are relatively limited. In general, our proposed method is not sensitive to these hyperparameters and shows the expected effect under a wide range of reasonable settings.

Feature Visualization. As shown in Fig 4, we extract several groups of similar classes for PCA visualization analysis. The three pictures in the upper part represent training using only CE, while the three pictures in the lower part use SCL+CE for training. In the same column, we sample the same classes for observation. It is worth noting that the scale of each image has been adjusted adaptively in order to display all the data.
The actual distances can be read from the axis ticks. After SCL is added, the distance between similar classes is significantly expanded, and the data in the same class are more closely clustered.
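The intra-class variance and inter-class distance statistics used in the feature-distribution analysis above can be sketched as follows. The exact normalization and distance conventions (L2-normalized features, 1 minus dot product between normalized centers) are our assumptions, intended only to illustrate the computation:

```python
import numpy as np

def intra_class_variances(feats, labels):
    """Per-class mean squared distance of normalized features to their
    class center (the paper reports min/max/mean/median over classes)."""
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    variances = {}
    for c in sorted(set(labels)):
        fc = feats[[i for i, l in enumerate(labels) if l == c]]
        center = fc.mean(axis=0)
        variances[c] = float(np.mean(np.sum((fc - center) ** 2, axis=1)))
    return variances

def mean_knn_center_distance(feats, labels, k=1):
    """Average distance from each normalized class center to its k nearest
    other centers, then averaged over all classes (as in Fig 2)."""
    centers = []
    for c in sorted(set(labels)):
        m = feats[[i for i, l in enumerate(labels) if l == c]].mean(axis=0)
        centers.append(m / np.linalg.norm(m))
    dists = []
    for i, ci in enumerate(centers):
        d = sorted(1.0 - ci @ cj for j, cj in enumerate(centers) if j != i)
        dists.append(np.mean(d[:k]))
    return float(np.mean(dists))
```

Lower per-class variances correspond to tighter clusters, and a larger mean k-nearest-center distance corresponds to better-separated classes, matching the trends reported in Tab 2 and Fig 2.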

Conclusion
In this paper, we focus on unsupervised OOD detection, where no labeled OOD data exists. To learn discriminative semantic intent representations via in-domain data, we propose a novel supervised contrastive learning loss (SCL). SCL aims to minimize intra-class variance by pulling together in-domain intents belonging to the same class and maximize inter-class variance by pushing apart samples from different classes. Experiments and analysis confirm the effectiveness of SCL for OOD detection. We hope to provide new guidance for future OOD detection work.

Broader Impact
Task-oriented dialog systems have demonstrated remarkable performance across a wide range of applications, with the promise of a significant positive impact on how people work and live. However, in scenarios where information is complex and rapidly changing, models usually face input that is meaningfully different from the typical examples encountered during training. Current models are prone to make unfounded predictions on these inputs, which may affect human judgment and thus impair the safety of models in practical applications. In domains with the greatest potential for societal impact, such as navigation or medical diagnosis, models should be able to detect potentially unknown OOD inputs and be robust to high-entropy inputs to avoid catastrophic errors. This work proposes a novel unsupervised OOD detection method that uses supervised contrastive learning to learn discriminative semantic intent representations. The effectiveness and robustness of the model are significantly improved by adding a supervised contrastive learning pre-training stage, which takes a step towards the ultimate goal of enabling the safe real-world deployment of task-oriented dialog systems in safety-critical domains. The experimental results are reported on standard benchmark datasets for reproducibility.

B Baseline Details
We compare many types of unsupervised OOD detection models. The model proposed in this paper can be divided into a training stage and a detection stage. For each model, LSTM or BERT, we use different detection methods to verify its performance. The innovation of this paper focuses mainly on the training stage. Due to space limitations, we do not introduce the detection methods in detail in the main body; we supplement the relevant content as follows. MSP (Maximum Softmax Probability) (Hendrycks and Gimpel, 2017) applies a threshold on the maximum softmax probability, where the threshold is set to 0.5 according to the dev set. LOF (Local Outlier Factor) (Lin and Xu, 2019) uses the local outlier factor to detect unknown intents. The motivation is that if an example's local density is significantly lower than that of its k-nearest neighbors, it is more likely to be an unknown intent. GDA (Gaussian Discriminant Analysis) is a generative distance-based classifier for out-of-domain detection in Euclidean space. It estimates the class-conditional distribution on the feature spaces of DNNs via Gaussian discriminant analysis to avoid over-confidence problems and uses the Mahalanobis distance to measure the confidence score of whether a test sample belongs to OOD. GDA is the state-of-the-art detection method, so we adopt it as our main detection algorithm. We also report MSP and LOF results in Section D.
4: https://github.com/clinc/oos-eval
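A compact sketch of GDA-style detection with a tied covariance and Mahalanobis confidence scores. The small ridge term and the score convention (negative distance to the closest class mean) are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def fit_gda(feats, labels):
    """Fit class-conditional Gaussians with a shared (tied) covariance
    over IND intent features."""
    classes = sorted(set(labels))
    means = {c: feats[[i for i, l in enumerate(labels) if l == c]].mean(axis=0)
             for c in classes}
    centered = np.stack([feats[i] - means[labels[i]] for i in range(len(labels))])
    cov = centered.T @ centered / len(labels)
    # small ridge keeps the covariance invertible in low-data settings
    precision = np.linalg.inv(cov + 1e-6 * np.eye(cov.shape[0]))
    return means, precision

def confidence_score(x, means, precision):
    """Negative Mahalanobis distance to the closest class mean; lower
    scores suggest the input is OOD."""
    return max(-float((x - m) @ precision @ (x - m)) for m in means.values())
```

Thresholding `confidence_score` on a dev set then separates IND queries (near some class mean) from OOD queries (far from every class mean).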

C Implementation Details
We use the public pre-trained 300-dimension GloVe embeddings (Pennington et al., 2014). We compare different combinations of the two training stages on the CLINC-Full dataset, using LSTM and the GDA detection method, in Table 7. CE is the baseline that only uses the cross-entropy loss function to train the feature extractor. SCL+CE follows the paradigm of pre-training first and then finetuning, which achieves the best performance. Besides, we try two different combinations to explore the relationship between the two training stages. CE+SCL means that we first conduct training to minimize cross-entropy loss, and then conduct supervised contrastive learning. The results show that the subsequent SCL leads to a decline in metrics, especially on in-domain data. This is because SCL, while optimizing the representation distribution, compromises the mapping relationship with labels. Multitask means optimizing the two losses simultaneously. This setting leads to mutual interference between the two tasks, which affects convergence and damages the performance and stability of the model. In general, SCL should be used as a pre-training method and CE as a finetuning method. The best results are achieved by first using SCL to learn discriminative representations and then finetuning the model with CE.