Disentangling Confidence Score Distribution for Out-of-Domain Intent Detection with Energy-Based Learning

Detecting Out-of-Domain (OOD) or unknown intents from user queries is essential in a task-oriented dialog system. Traditional softmax-based confidence scores are susceptible to the overconfidence issue. In this paper, we propose a simple but strong energy-based score function to detect OOD, where the energy scores of OOD samples are higher than those of IND samples. Further, given a small set of labeled OOD samples, we introduce an energy-based margin objective for supervised OOD detection to explicitly distinguish OOD samples from IND ones. Comprehensive experiments and analysis show that our method helps disentangle the confidence score distributions of IND and OOD data.


Introduction
Detecting Out-of-Domain (OOD) or unknown intents from user queries is crucial to a task-oriented dialog system (Akasaki and Kaji, 2017; Tulshan and Dhage, 2018; Shum et al., 2018; Lin and Xu, 2019; Xu et al., 2020; Zeng et al., 2021a; Wu et al., 2022b). It can avoid performing wrong operations and provide potential directions for future development when an input query falls outside the range of predefined intents. Since the exact number of unknown intents in practical scenarios is hard to know and annotate, the lack of real OOD examples makes it challenging to identify these samples in dialog systems.
Depending on whether labeled OOD samples are available, previous OOD detection work can generally be classified into two types: unsupervised (Bendale and Boult, 2016; Hendrycks and Gimpel, 2017; Shu et al., 2017; Lee et al., 2018; Ren et al., 2019; Lin and Xu, 2019; Xu et al., 2020; Zeng et al., 2021a,b; Wu et al., 2022a) and supervised (Fei and Liu, 2016; Kim and Kim, 2018; Larson et al., 2019a; Zheng et al., 2020). The former first learns an in-domain (IND) intent classifier using only labeled IND data and then estimates the confidence score of a test query. For example, Maximum Softmax Probability (MSP) (Hendrycks and Gimpel, 2017) uses the maximum softmax probability as the confidence score and regards an intent as OOD if the score is below a fixed threshold. The assumption is that OOD intents should produce a lower softmax probability than IND ones. However, neural networks can produce arbitrarily high softmax confidence even for such abnormal OOD samples (Guo et al., 2017; Liang et al., 2018), as shown in Figures 1 and 2, which we call overconfidence. Further, a distance-based method, Gaussian discriminant analysis (GDA) (Xu et al., 2020), uses the maximum Mahalanobis distance (Mahalanobis, 1936) to all in-domain class centroids as the confidence score. Compared to MSP, GDA achieves better OOD performance but requires expensive computation for the Mahalanobis distance. In this paper, we aim to use simple logit-based confidence scores for both higher performance and efficiency. For supervised OOD detection, Fei and Liu (2016); Larson et al. (2019a) form an (N+1)-class classification problem where the (N+1)-th class represents the OOD intents, and Zheng et al. (2020) uses labeled OOD data to generate an entropy regularization term. But these methods require numerous labeled OOD intents to get superior results. We focus on using fewer labeled OOD data (e.g., 20 or 30 samples) to achieve comparable or even better performance.
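As a concrete reference for the MSP baseline just described, here is a minimal plain-Python sketch; the 0.5 threshold is illustrative only, since in practice the threshold is tuned on validation data:

```python
import math

def msp_score(logits):
    """Maximum Softmax Probability: the confidence score used by MSP.

    `logits` is a hypothetical list of per-class logits from the IND classifier.
    """
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]  # shift by max for numerical stability
    total = sum(exps)
    return max(e / total for e in exps)

def msp_is_ood(logits, threshold=0.5):
    # A query is flagged as OOD when its top softmax probability falls below the threshold.
    return msp_score(logits) < threshold
```

Note that a single very large logit drives `msp_score` toward 1.0 regardless of the other classes, which is exactly the overconfidence failure mode discussed above.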
In this paper, we propose an energy-based score function to detect OOD in an unsupervised manner. The energy-based score function maps each query to a single energy scalar, which is lower for IND samples and higher for OOD samples, based on energy theory (LeCun et al., 2006). We first train an in-domain intent classifier on IND data, then replace the original softmax layer with the energy-based score function. Our method can not only mitigate the issue of overconfident softmax probabilities but also avoid expensive post-processing computation. Further, given a small portion of labeled OOD samples, we propose an energy-based margin objective to explicitly distinguish OOD samples from IND samples. Our contributions are threefold: (1) We propose an energy-based learning method for OOD intent detection to achieve higher performance and efficiency. Previous energy-based work (LeCun et al., 2006; Zhai et al., 2016; Grathwohl et al., 2020; Liu et al., 2020b; Kaur et al., 2021) aims to build a function E(x): R^D -> R which maps a sample x to a single scalar called the energy. Given a data point x in R^D, the energy function can be defined as follows:

E(x) = -T log sum_{y'} exp(-E(x, y') / T)    (1)

where T is the temperature parameter and E(x, y') is the marginal energy over label y'. Essentially, energy scores can be transferred to the likelihood probability:

p(y | x) = exp(-E(x, y) / T) / sum_{y'} exp(-E(x, y') / T)    (2)

For OOD detection, since we focus on detection algorithms for the test stage in this paper, we train the same BiLSTM in-domain intent classifier f(x) on IND data as Lin and Xu (2019) in the training stage. Then, given a test query, we simply use the negative logits from the intent classifier to represent E(x, y), i.e., E(x, y) = -f_y(x). Therefore, the energy score function Eq 1 can be formulated as:

E(x) = -T log sum_{i=1}^{K} exp(f_i(x) / T)    (3)

where K is the number of IND intent classes and f_i(x) is the logit of x belonging to the i-th class. We simply apply a threshold to the energy score to decide whether a test query is OOD. Intuitively, the energy score works for OOD detection because higher energy represents a lower likelihood of occurrence according to LeCun et al. (2006).

Energy-based Margin Objective. Given a small set of labeled OOD samples, we propose an energy-based margin objective to explicitly widen the energy gap between OOD and IND. We aim to learn more discriminative representations for energy score distributions in the training stage. The energy margin loss is formulated as:

L_margin = E_{x_ind, x_ood} [ max(0, m - (E(x_ood) - E(x_ind))) ]

where m is the energy margin and E is the energy score of IND or OOD samples in the train set. Then in the test stage, we still use the energy score to detect OOD. The analysis in Section 4.1 shows the effectiveness of the margin loss over unsupervised OOD.
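The two pieces above can be sketched in a few lines: the energy score is a log-sum-exp over the classifier's logits, and the margin objective is a hinge on the IND/OOD energy gap. The hinge form here is an assumption consistent with the contrastive margin the paper describes, not a verbatim reproduction of its loss; m = 19.0 follows the best-performing margin reported later:

```python
import math

def energy_score(logits, T=1.0):
    """Energy score E(x) = -T * log(sum_i exp(f_i(x) / T)) over the K IND logits.

    Lower energy suggests IND; a query whose energy exceeds a threshold is
    flagged as OOD.
    """
    shift = max(v / T for v in logits)  # shift for numerical stability
    return -T * (shift + math.log(sum(math.exp(v / T - shift) for v in logits)))

def energy_margin_loss(e_ind, e_ood, m=19.0):
    """Hypothetical contrastive hinge: push E(x_ood) above E(x_ind) by margin m.

    e_ind / e_ood are energies of IND / OOD training samples in a batch.
    """
    total = 0.0
    for ei in e_ind:
        for eo in e_ood:
            total += max(0.0, m - (eo - ei))  # zero once the gap exceeds m
    return total / (len(e_ind) * len(e_ood))
```

A confident IND query with one large logit has very negative energy, while flat OOD logits give energy near -log K, so thresholding the energy is the entire test-time procedure.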

Datasets
We use two public benchmark OOD datasets, CLINC-Full and CLINC-Small (Larson et al., 2019b). We show the detailed statistics of these datasets in Table 2.

Metrics
We report both OOD metrics, Recall and F1-score (F1), and in-domain metrics, F1-score (F1) and Accuracy (ACC). Since we aim to improve the performance of detecting out-of-domain intents from user queries, OOD Recall and F1 are the main evaluation metrics in this paper.
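The OOD metrics above treat OOD as the positive class. A minimal sketch of how such Recall/F1 could be computed follows; note that the paper's IND F1 is averaged over many intent classes, which this toy helper does not reproduce:

```python
def recall_f1(gold, pred, positive="ood"):
    """Recall and F1 treating `positive` as the positive class.

    `gold` and `pred` are parallel lists of labels such as "ood" / "ind".
    """
    tp = sum(1 for g, p in zip(gold, pred) if g == positive and p == positive)
    fp = sum(1 for g, p in zip(gold, pred) if g != positive and p == positive)
    fn = sum(1 for g, p in zip(gold, pred) if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return recall, f1
```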

Baselines
For detection algorithms, we use MSP, LOF, and GDA as baselines. For training objectives, we use N+1, Entropy, and Bound as baselines. We present dataset statistics, baselines, and implementation details in the appendix. We will release our code after blind review.

Main Results
Table 1 shows the main results. We conduct the experiments on the CLINC-Full dataset, using Margin+Energy for supervised OOD. The X-axis denotes the value of the margin m. Results show that m = 19.0 achieves the best performance and is robust to minor changes.

Conclusion
Traditional softmax-based OOD detection methods are susceptible to the overconfidence issue. Therefore, we propose a novel energy-based score function to mitigate the issue of softmax overconfidence. To use labeled OOD data, we further introduce an energy-based margin objective to explicitly distinguish the energy score distributions of OOD from IND. Experiments and analysis confirm the effectiveness of our energy-based method for OOD detection. For future work, we hope to explore theoretical concepts of energy and provide new guidance.

A Baseline Details
We perform the main experiments under two settings: unsupervised OOD detection and supervised OOD detection. For unsupervised OOD detection, we compare our proposed energy detection algorithm with other methods: MSP (Maximum Softmax Probability) (Hendrycks and Gimpel, 2017), LOF (Local Outlier Factor) (Lin and Xu, 2019), and GDA (Gaussian Discriminant Analysis) (Xu et al., 2020). For supervised OOD detection, we also compare our proposed energy-based margin objective with Entropy (Zheng et al., 2020) and N+1 (Fei and Liu, 2016; Larson et al., 2019a). Note that the Margin and Entropy objectives are used in the training stage; we still need a detection algorithm (MSP, GDA, or Energy) in the test stage. We supplement the relevant baseline details as follows. MSP (Maximum Softmax Probability) (Hendrycks and Gimpel, 2017) uses the maximum softmax probability as the confidence score and regards an intent as OOD if the score is below a fixed threshold. LOF (Local Outlier Factor) (Lin and Xu, 2019) uses the local outlier factor to detect unknown intents. The motivation is that if an example's local density is significantly lower than that of its k-nearest neighbors, it is more likely to be an unknown intent. GDA (Gaussian Discriminant Analysis) (Xu et al., 2020) is a generative distance-based classifier for out-of-domain detection in Euclidean space. They estimate the class-conditional distributions on the feature space of DNNs via Gaussian discriminant analysis to avoid overconfidence problems and use the Mahalanobis distance to measure the confidence score of whether a test sample belongs to OOD. GDA is the state-of-the-art detection method to date; our proposed energy score still significantly outperforms it.
Note that LOF and GDA both require additional post-processing modules to estimate density or distance, which induces expensive computation. We conduct a performance comparison of inference time in Table 3. Since SCL only adds a pre-training loss alongside CE and also uses GDA for detection, its inference time is equal to GDA's. Entropy (Zheng et al. (2020)) uses labeled OOD data to generate an entropy regularization term that enforces the predicted distribution of OOD inputs to be closer to the uniform distribution:

L_entropy = -H(p_theta(y | x_ood))

where H is the Shannon entropy of the predicted distribution and p_theta(y | x_ood) is the predicted distribution of the input OOD utterance x_ood.
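The entropy regularizer can be sketched as follows; the sign convention (minimizing the negative entropy so that OOD predictions move toward uniform) is an assumption consistent with the description above, not Zheng et al.'s exact formulation:

```python
import math

def entropy_regularizer(probs_ood):
    """Negative Shannon entropy of one predicted distribution over IND classes.

    `probs_ood` is p(y | x_ood) for a labeled OOD utterance. Minimizing -H
    drives the prediction toward the uniform distribution, where H peaks.
    """
    h = -sum(p * math.log(p) for p in probs_ood if p > 0)  # Shannon entropy
    return -h  # smallest (most negative) when the prediction is uniform
```

For example, a uniform prediction over 4 classes attains the minimum value -log 4, while a peaked prediction is penalized with a value near 0.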
Bound (Liu et al. (2020b)) uses a regularization loss defined in terms of energy to further widen the energy gap:

L_bound = E_{x_ind}[ (max(0, E(x_ind) - m_ind))^2 ] + E_{x_ood}[ (max(0, m_ood - E(x_ood)))^2 ]

where E is the energy score of IND or OOD samples in the train set. This learning objective uses two squared hinge losses with two hyper-parameters, m_ind and m_ood. Note that Bound aims at OOD image classification and relies on two independent energy bounds. Instead, our proposed Margin constructs a contrastive energy margin between IND intents and OOD intents to better disentangle the energy distributions.
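A sketch of the two-sided squared-hinge bound makes the contrast with the single contrastive margin concrete; the default bound values below are illustrative assumptions, not values from Liu et al. (2020b):

```python
def energy_bound_loss(e_ind, e_ood, m_ind=-25.0, m_ood=-7.0):
    """Two independent squared hinges: penalize IND energies above m_ind
    and OOD energies below m_ood.

    e_ind / e_ood are energy scores of IND / OOD training samples.
    """
    l_ind = sum(max(0.0, e - m_ind) ** 2 for e in e_ind) / len(e_ind)
    l_ood = sum(max(0.0, m_ood - e) ** 2 for e in e_ood) / len(e_ood)
    return l_ind + l_ood
```

Unlike the contrastive margin, each term here depends only on one side's energies, so the two bounds must be tuned jointly to leave a usable gap between them.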

B Implementation Details
We use the public pre-trained 300-dimension GloVe embeddings (Pennington et al., 2014, https://github.com/stanfordnlp/GloVe) to embed tokens. We use a two-layer BiLSTM as the feature extractor and set the dimension of hidden states to 128. The dropout rate is fixed at 0.5. We use the Adam optimizer (Kingma and Ba, 2014) to train our model with a learning rate of 1e-3. In the training stage, we use the standard cross-entropy loss for unsupervised OOD and cross-entropy plus the energy-guided margin loss for supervised OOD. Besides, in the supervised OOD scenario, we employ restriction-oriented random sampling. Specifically, we guarantee that IND and OOD samples are both included in each batch to facilitate computing the margin loss. We train for up to 200 epochs with an early-stopping patience of 15.
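The restriction-oriented random sampling could look roughly like the following; `ood_per_batch` and the exact batching scheme are assumptions for illustration, since the paper does not specify them:

```python
import random

def mixed_batches(ind_samples, ood_samples, batch_size, ood_per_batch=4, seed=0):
    """Yield batches that always contain both IND and OOD samples,
    so the margin loss can be computed in every step.
    """
    rng = random.Random(seed)
    ind = ind_samples[:]
    rng.shuffle(ind)
    n_ind = batch_size - ood_per_batch
    for i in range(0, len(ind) - n_ind + 1, n_ind):
        # fill most of the batch with IND, then inject a few OOD samples
        batch = ind[i:i + n_ind] + rng.sample(ood_samples, ood_per_batch)
        rng.shuffle(batch)
        yield batch
```

With only 20-30 labeled OOD samples available, resampling them into every batch (rather than iterating over them once per epoch) is what keeps the margin term well-defined throughout training.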
For our proposed energy-guided margin loss, we set the margin m to 19.0 and the temperature T to 0.8. We use the best OOD F1 score on the validation set to calculate the threshold adaptively. Each reported result is averaged over 5 runs under the same setting. Training lasts about 2 minutes for unsupervised OOD and 4 minutes for supervised OOD on a single Tesla T4 GPU (16 GB of memory). The model has 3.05M trainable parameters on average.
C A Theoretical Proof of Energy Score vs Softmax Score

In this section, we give a theoretical proof of why the energy score outperforms the softmax score. Supposing we get the output logits from the intent classifier, we can represent MSP as follows:

log MSP(logits) = log max softmax(logits)
                = log max_i [ exp(logits_i) / sum_i exp(logits_i) ]
                = log [ exp(max(logits)) / sum_i exp(logits_i) ]
                = max(logits) - log sum_i exp(logits_i)    (7)

where logits_i represents the i-th value of the vector logits. Recall the energy definition:

Energy(logits) = -T log sum_i exp(logits_i / T)    (8)

Here we set T to 1. Therefore, we get the following equation:

log MSP(logits) = max(logits) + Energy(logits)    (9)

where max(logits) acts as a regularization term. If the output logits have a high maximum value, the max(logits) term offsets the energy term, so log MSP can stay high even when the energy indicates an unlikely input. Therefore, the energy score can better mitigate the overconfidence issue than the softmax score.
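The identity in Eq (9) is easy to verify numerically with T = 1; this small script recomputes both sides on arbitrary logits:

```python
import math

def log_msp(logits):
    # log of the maximum softmax probability, Eq (7)
    lse = math.log(sum(math.exp(v) for v in logits))
    return max(logits) - lse

def energy(logits):
    # energy score with T = 1, Eq (8)
    return -math.log(sum(math.exp(v) for v in logits))

def identity_gap(logits):
    # Eq (9): log MSP(logits) - (max(logits) + Energy(logits)) should be 0
    return log_msp(logits) - (max(logits) + energy(logits))
```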

Figure 3: The overall architecture of our proposed method.
(2) We propose an energy-based margin objective to distinguish the energy distributions of OOD and IND samples. (3) Extensive experiments and analysis on two benchmarks demonstrate the effectiveness of our method.

Methodology

Overall Architecture. Fig 3(a) shows the overall architecture of our proposed method. We first train an in-domain intent classifier using IND data in the training stage. Then, in the test stage, we extract the intent feature of a test query and employ a detection algorithm, MSP (Hendrycks and Gimpel, 2017) or Energy, to detect OOD. Fig 3(b) demonstrates the effectiveness of our method in distinguishing OOD distributions from IND.

Energy-based Score Function. To mitigate the issue of overconfident softmax probability in MSP, we propose an energy-based score function to push apart the score distributions of OOD and IND samples. We first briefly review energy theory (LeCun et al., 2006) and then explain our proposed energy-based score function for OOD detection.

Figure 6: Effect of the number of labeled OOD samples.

Figure 7: Effect of the energy temperature T.

Table 2: Statistics of the CLINC datasets.
Distribution of softmax scores vs. energy scores.

They both contain 150 in-domain intents across 10 domains. The difference is that CLINC-Small has fewer in-domain training examples than CLINC-Full. Note that both datasets include a fixed set of labeled OOD data, but we do not use it for training.