Energy-based Unknown Intent Detection with Data Manipulation

Unknown intent detection aims to identify out-of-distribution (OOD) utterances whose intents never appeared in the training set. In this paper, we propose using energy scores for this task, as the energy score is theoretically aligned with the density of the input and can be derived from any classifier. However, high-quality OOD utterances are required during the training stage to shape the energy gap between OOD and in-distribution (IND) utterances, and such utterances are difficult to collect in practice. To tackle this problem, we propose a data manipulation framework to Generate high-quality OOD utterances with importance weighTs (GOT). Experimental results show that the energy-based detector fine-tuned by GOT achieves state-of-the-art results on two benchmark datasets.


Introduction
Unknown intent detection is a realistic and challenging task for dialogue systems. Detecting out-of-distribution (OOD) utterances is critical when deploying dialogue systems in an open environment. It helps dialogue systems gain a better understanding of what they do not know, which prevents them from yielding unrelated responses and improves the user experience.
A simple approach for this task relies on the softmax confidence score and achieves promising results (Hendrycks and Gimpel, 2017). The softmax-based detector classifies the input as OOD if its softmax confidence score is smaller than a threshold. Nevertheless, further works demonstrate that using the softmax confidence score can be problematic, as the score for OOD inputs can be arbitrarily high (Louizos and Welling, 2017). Another appealing approach is to use generative models to approximate the distribution of in-distribution (IND) training data and use the likelihood score to detect OOD inputs. However, Ren et al. (2019) and Gangal et al. (2019) find that likelihood scores derived from such models are problematic for this task, as they can be confounded by background components in the inputs.
In this paper, we propose using energy scores for unknown intent detection. The benefit is that energy scores are theoretically well aligned with the density of the inputs and hence more suitable for OOD detection: inputs with higher energy scores have lower densities and can be classified as OOD by the energy-based detector. Moreover, energy scores can be derived from any pre-trained classifier without re-training. Nevertheless, the energy gap between IND and OOD utterances might not always be optimal for differentiation. Thus we need auxiliary OOD utterances to explicitly shape the energy gap between IND and OOD utterances during the training stage (Liu et al., 2020). This poses a new challenge: the variety of possible OOD utterances is almost infinite, and it is impossible to sample all of them to create the gap. Zheng et al. (2019) demonstrate that OOD utterances akin to IND utterances, such as those sharing the same phrases or patterns, are more effective, yet such high-quality OOD utterances are difficult and expensive to collect in practice.
To tackle this problem, we propose a data manipulation framework, GOT, to generate high-quality OOD utterances together with importance weights. GOT generates OOD utterances by perturbing IND utterances locally, which keeps the generated utterances close to IND. Specifically, GOT contains three modules: (1) a locating module that locates intent-related words in IND utterances; (2) a generating module that generates OOD utterances by replacing intent-related words with desirable candidate words, evaluated in two aspects: whether the candidate word is suitable given the context, and whether it is irrelevant to IND; (3) a weighting module that reduces the weights of potentially harmful generated utterances. Figure 1 illustrates the overall process of GOT. Experiments show that the generated weighted OOD utterances can further improve the performance of the energy-based detector for unknown intent detection. Our code and data will be available at: https://github.com/yawenouyang/GOT.
To summarize, the key contributions of this paper are as follows: • We propose using energy scores for unknown intent detection. We conduct experiments on real-world datasets, CLINC150 and SNIPS, to show that the energy score achieves performance comparable to strong baselines.
• We put forward a new framework GOT to generate high-quality OOD utterances and reweight them. We demonstrate that GOT can further improve the performance of the energy score by explicitly shaping the energy gap and achieves state-of-the-art results.
• We show the generality of GOT by applying generated weighted OOD utterances to fine-tune the softmax-based detector, and the fine-tuned softmax-based detector can also yield significant improvements.
Related Work

Lane et al. (2006), Manevitz and Yousef (2007), and Dai et al. (2007) address OOD detection for the text-mining task. Recently, this problem has attracted growing attention from researchers (Tur et al., 2014; Ryu et al., 2017; Shu et al., 2017). Hendrycks and Gimpel (2017) present a simple baseline that utilizes the softmax confidence score to detect OOD inputs. Shu et al. (2017) create a binary classifier and calculate a confidence threshold for each class. Some distance-based methods (Oh et al., 2018; Lin and Xu, 2019; Yan et al., 2020) are also used to detect unknown intents, as OOD utterances deviate strongly from IND utterances in their local neighborhood. Simultaneously, with the advancement of deep generative models, learning such a model to approximate the distribution of the training data has become possible. However, Ren et al. (2019) find that likelihood scores derived from these models can be confounded by background components, and propose a likelihood ratio method to alleviate this issue. Gangal et al. (2019) reformulate and apply this method to unknown intent detection. Different from these methods, we introduce the energy score for this task. Liu et al. (2020) prove that the energy score is theoretically aligned with the density of the input and can be derived from any classifier without re-training, hence desirable for our task. We further propose a data manipulation framework to generate high-quality OOD utterances to shape the energy gap between IND and OOD utterances.
Note that there are related works that also generate OOD samples to improve OOD detection performance. Lee et al. (2018) generate OOD samples with a Generative Adversarial Network (GAN) (Goodfellow et al., 2014), and Zheng et al. (2019) explore this method for unknown intent detection. However, there are two major distinctions between our study and these works. First, they generate OOD utterances from continuous latent variables, which cannot be easily interpreted; in contrast, our framework generates utterances by performing local replacements on IND utterances, which is more interpretable to humans. Second, our framework additionally contains a weighting module to reweight the generated utterances. Our work is also inspired by Cai et al. (2020), which proposes a framework to augment IND data, while our framework aims to generate OOD data.

Preliminary
In this section, we formalize the unknown intent detection task. We then introduce the energy score and discuss its superiority and limitations for this task.

Problem Formulation
Given a training dataset D_in^train = {(u^(i), y^(i))}, where u^(i) is an utterance and y^(i) ∈ Y_in = {y_1, y_2, ..., y_K} is its intent label, unknown intent detection aims to detect, at test time, whether the intent of a given utterance belongs to the known intents Y_in. In general, unknown intent detection is an OOD detection task. The essence of all methods is to learn a score function that maps each utterance u to a single scalar that is distinguishable between IND and OOD utterances.

Energy-based OOD Detection
An energy-based model (LeCun et al., 2006) builds an energy function E(u) that maps an input u to a scalar called the energy score (i.e., E : R^D → R). Using the energy function, the probability density p(u) can be expressed as:

p(u) = exp(−E(u)/T) / Z,    (1)

where Z = ∫ exp(−E(u)/T) du is the normalizing constant, also known as the partition function, and T is the temperature parameter. Taking the logarithm of both sides of (1), we get:

log p(u) = −E(u)/T − log Z.    (2)

Since Z is constant for all inputs u, we can ignore the term log Z and find that the negative energy function −E(u) is in fact linearly aligned with the log-likelihood function, which is desirable for OOD detection (Liu et al., 2020).
The energy-based model has a connection with a softmax-based classifier. For a classification problem with K classes, a parametric function f maps each input u to K real-valued numbers (i.e., f : R^D → R^K), known as logits. The logits are used to parameterize a categorical distribution via a softmax function:

p(y|u) = exp(f_y(u)/T) / Σ_{y′} exp(f_{y′}(u)/T),    (3)

where f_y(u) indicates the y-th index of f(u), i.e., the logit corresponding to the y-th class label. These logits can be reused to define an energy function without changing f (Liu et al., 2020; Grathwohl et al., 2020):

E(u; f) = −T · log Σ_y exp(f_y(u)/T).    (4)

Accordingly, a classifier can be reinterpreted as an energy-based model, which also means the energy score can be derived from any classifier. Due to its consistency with density and its accessibility, we introduce the energy score for unknown intent detection; utterances with higher energy scores can be viewed as OOD. Mathematically, the energy-based detector G can be described as:

G(u; δ) = OOD if E(u; f) > δ, else IND,    (5)

where δ is the threshold. Although the energy score can be easily computed from the classifier, the energy gap between IND and OOD samples might not always be optimal for differentiation. To solve this problem, Liu et al. (2020) propose an energy-bounded learning objective to further widen the energy gap. Specifically, the training objective of the classifier combines the standard cross-entropy loss with a regularization loss:

L = E_{(u,y)∼D_in^train}[−log F_y(u)] + λ · L_energy,    (6)

where F(u) is the softmax output and λ is the auxiliary loss weight. The regularization loss is defined in terms of energy:

L_energy = E_{u∼D_in^train}[max(0, E(u) − m_in)^2] + E_{u∼D_out^train}[max(0, m_out − E(u))^2],    (7)

which utilizes both labeled IND data D_in^train and auxiliary unlabeled OOD data D_out^train. This term separates the energy scores of IND and OOD samples using two squared hinge losses with the margin hyper-parameters m_in and m_out.
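As a concrete illustration, the energy score derived from logits and the energy-bounded regularization can be sketched in a few lines of Python. This is a minimal sketch, not the paper's implementation: the function names are our own, and simple batch averages stand in for the expectations in the training objective.

```python
import math

def energy_score(logits, T=1.0):
    """E(u; f) = -T * log sum_y exp(f_y(u) / T); lower energy ~ higher density."""
    m = max(z / T for z in logits)  # subtract the max for numerical stability
    return -T * (m + math.log(sum(math.exp(z / T - m) for z in logits)))

def is_ood(logits, delta, T=1.0):
    """Energy-based detector G: flag the utterance as OOD when E(u) > delta."""
    return energy_score(logits, T) > delta

def energy_reg_loss(ind_energies, ood_energies, m_in=-8.0, m_out=-5.0):
    """Squared hinge terms pushing IND energies below m_in and OOD energies above m_out."""
    l_in = sum(max(0.0, e - m_in) ** 2 for e in ind_energies) / max(len(ind_energies), 1)
    l_out = sum(max(0.0, m_out - e) ** 2 for e in ood_energies) / max(len(ood_energies), 1)
    return l_in + l_out
```

With margins m_in = −8 and m_out = −5 (the values used later in the experiments), the regularization is zero whenever IND samples already have energy below −8 and OOD samples above −5, so it only fires where the gap is too narrow.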
Ideally, one would have to sample all types of OOD utterances to create the gap, which is impossible in practice. Zheng et al. (2019) demonstrate that OOD utterances akin to IND utterances can be more effective, but they are also more difficult to collect. To address this problem, we propose a data manipulation framework that generates these high-quality OOD utterances and assigns each generated utterance an importance weight to reduce the impact of potentially bad generations.

Approach
In this section, we will introduce our data manipulation framework GOT in detail. GOT aims to generate high-quality OOD utterances by replacing intent-related words in IND utterances, and then assign a weight to each generated OOD utterance. Eventually, the weighted OOD utterances can be used to shape the energy gap.

Locating Module
Since not all words in an utterance are meaningful (stop words, for instance), replacing such words when generating OOD utterances may not change the intent. It is more efficient and effective to replace intent-related words. Hence, we design an intent-related score function S to measure how related a word w is to an intent y:

S(w, y) = Σ_{u∈D_y^train} Σ_j I(w_j = w) · log( p(w | w_<j, y) / p(w | w_<j) ),    (8)

where D_y^train is the subset of D_in^train containing utterances with intent y, I is the indicator function, w_j is the j-th word in u, and w_<j = w_1, ..., w_{j−1}.
Given w and y, the intent-related score is the sum of the log-likelihood ratios over all occurrences of w in D_y^train. If w is related to y, w tends to occur more frequently in D_y^train than other words. For each occurrence of w, i.e., where w_j equals w, p(w|w_<j, y) should be higher than p(w|w_<j), as the former is additionally conditioned on the related intent y while the latter is not, resulting in a higher S(w, y). In contrast, if w is not related to y, p(w|w_<j, y) is much less likely to exceed p(w|w_<j), or w tends to have a lower frequency in D_y^train; hence S(w, y) is likely to be small. Therefore, S(w, y) can serve as a valid score function to measure how related a word w is to an intent y. With the help of S, given an utterance and its intent label, the locating module calculates the intent-related score for each word in the utterance, and a word whose score is larger than a given threshold is viewed as an intent-related word.
Implementation: We use two generative models to estimate p(w_j|w_<j, y) and p(w_j|w_<j) separately. Specifically, we train a class-conditional language model (Yogatama et al., 2017) on D_in^train to estimate p(w_j|w_<j, y), as shown in Figure 2. To predict the word w_j, we combine the hidden state h_j with the intent embedding from a learnable label embedding matrix E_y, then pass the result through a fully connected (FC) layer and a softmax layer to estimate the word distribution. During training, the input is an utterance with its intent from D_in^train, and the training objective is to maximize the conditional likelihood of the utterances. To estimate p(w_j|w_<j), we directly use pre-trained GPT-2 (Radford et al., 2019) without tuning. Note that the whole training process only needs D_in^train and does not need auxiliary supervised data.
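The intent-related score in Equation 8 can be sketched as follows. This is an illustrative sketch: `p_cond` and `p_uncond` are hypothetical callables standing in for the class-conditional language model and the unconditional model (GPT-2), respectively.

```python
import math

def intent_related_score(w, utterances_with_intent_y, p_cond, p_uncond):
    """S(w, y): sum of log-likelihood ratios log[p(w | w_<j, y) / p(w | w_<j)]
    over every occurrence of w in the utterances labeled with intent y.

    p_cond(word, prefix)   -> p(word | prefix, y), from the class-conditional LM
    p_uncond(word, prefix) -> p(word | prefix),    from the unconditional LM
    """
    score = 0.0
    for u in utterances_with_intent_y:       # u is a list of tokens
        for j, w_j in enumerate(u):
            if w_j == w:                     # indicator I(w_j = w)
                prefix = u[:j]
                score += math.log(p_cond(w, prefix)) - math.log(p_uncond(w, prefix))
    return score
```

Words whose score exceeds the intent-related word threshold are then treated as intent-related by the locating module.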

Generating Module
After detecting intent-related words in the utterance u, for each intent-related word w_t, the generating module aims to select replacement words from the vocabulary to replace w_t and obtain OOD utterances. We design a candidate score function Q to measure the desirability of a candidate word c:

Q(c; u, w_t) = log p(c | w_<t, w_>t) − log Σ_{y∈Y_in} p(y) · p(c | w_<t, y).    (9)

The first term on the right-hand side is the log-likelihood of c conditioned on the context of w_t; the higher it is, the more suitable c is given the context. The second term is the negative log of the average likelihood of c conditioned on the IND labels and the previous context; the higher it is, the less relevant c is to IND utterances. Therefore, if c has a high candidate score, it fits the context well and has a low density under the IND utterance distribution, and thus can be selected as the replacement word for w_t. The resulting generated OOD utterance is:

û = {w_<t, c, w_>t}.    (10)

Implementation: Similar to the locating module, we do not need auxiliary supervised data to train the generating module. We use the same class-conditional language model mentioned in Section 4.1 to estimate p(c|w_<t, y), and p(y) is the label ratio in the training set. To estimate p(c|w_<t, w_>t), we use pre-trained BERT (Devlin et al., 2018) without tuning.
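The candidate scoring and top-K selection can be sketched as below. The callables `p_mlm` (a masked-LM probability, playing the role of BERT), `p_cond` (the class-conditional LM), and `p_prior` (training-set label ratios) are hypothetical interfaces, not the paper's actual code.

```python
import math

def candidate_score(c, prefix, suffix, intents, p_mlm, p_cond, p_prior):
    """Q(c; u, w_t): context fit minus relevance to IND intents.

    p_mlm(c, prefix, suffix) -> p(c | w_<t, w_>t), from a masked LM
    p_cond(c, prefix, y)     -> p(c | w_<t, y),    from the class-conditional LM
    p_prior(y)               -> p(y), the training-set label ratio
    """
    context_fit = math.log(p_mlm(c, prefix, suffix))
    ind_relevance = math.log(sum(p_prior(y) * p_cond(c, prefix, y) for y in intents))
    return context_fit - ind_relevance

def top_k_candidates(vocab, k, prefix, suffix, intents, p_mlm, p_cond, p_prior):
    """Return the k candidate words with the highest Q."""
    ranked = sorted(vocab,
                    key=lambda c: candidate_score(c, prefix, suffix, intents,
                                                  p_mlm, p_cond, p_prior),
                    reverse=True)
    return ranked[:k]
```

A word that fits the context equally well but is far less likely under any IND intent wins the ranking, which is exactly the behavior Equation 9 encodes.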

Weighting Module
Since we cannot ensure the generation process is perfect, given a generated OOD utterance set D_out^gen = {û^(i)}_{i=1}^M, there might be some unfavorable utterances that are useless or even harmful for tuning the classifier: fitting them would decrease the classifier's generalization ability. The weighting module aims to assign such utterances small weights.
We first use Equation 6 as the loss function to train a classifier, taking D_out^gen as D_out^train. Then we calculate the influence value φ ∈ R (Wang et al., 2020) for each generated utterance û. The influence value approximates the influence of removing this utterance on the loss at validation samples. An utterance with a positive φ implies that its removal will reduce the validation loss and strengthen the classifier's generalization ability; thus we should assign it a small weight.* In particular, given φ, we calculate the weight α by min-max normalizing the influence values with max φ and min φ, the maximum and minimum influence values over D_out^gen, and mapping larger values to smaller weights, where γ ∈ R+ is used to make the weight distribution flat or steep.
Implementation: We still do not need auxiliary supervised data for this module. The validation loss is the cross-entropy loss on the validation set. * Details about how to calculate the influence value can be found in Koh and Liang (2017) and Wang et al. (2020).
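The weighting step can be sketched as follows. Since the exact normalization is not reproduced here, we use one plausible min-max-normalized exponential form consistent with the description above (larger, more harmful influence values get smaller weights; γ flattens or sharpens the distribution). Treat the concrete formula as an assumption.

```python
import math

def importance_weights(phis, gamma=20.0):
    """Map influence values to weights: larger (more harmful) influence -> smaller weight.

    A min-max-normalized exponential is one plausible form consistent with the
    paper's description; gamma controls how flat or steep the weights are.
    """
    lo, hi = min(phis), max(phis)
    span = hi - lo if hi > lo else 1.0   # guard against identical influence values
    return [math.exp(-gamma * (p - lo) / span) for p in phis]
```

The utterance with the minimum influence keeps weight 1, while the most harmful one is damped by a factor of exp(−γ).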

Algorithm 1 Data Manipulation Process
Input: training set D_in^train, intent-related score function S, candidate score function Q, intent-related word threshold, candidate number K, weight term γ
Output: generated weighted OOD utterance set D_out^gw
1: D_out^gen = {}  # generated OOD utterances without weights
2: for (u, y) ∈ D_in^train do
3:   for w_j ∈ u do
4:     if S(w_j, y) > threshold then
5:       C = top-K candidates c ranked by Q(c; u, w_j)
6:       for c ∈ C do
7:         û = {w_<j, c, w_>j}
8:         add û to D_out^gen
9: compute the influence value φ(û) for each û ∈ D_out^gen
10: compute the weight α(û) from φ(û) with γ
11: D_out^gw = {(û, α(û)) : û ∈ D_out^gen}
12: return D_out^gw

Overall Data Manipulation Process
We summarize the process of GOT in Algorithm 1. Line 4 shows that w_j is viewed as an intent-related word for y if S(w_j, y) is greater than the intent-related word threshold. Line 5 shows that we generate K replacement words with the top-K values of Q(c; u, w_j).
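The generation loop of Algorithm 1 can be sketched compactly in Python. This is a sketch under assumed interfaces: `S` and `Q_topk` wrap the locating and generating modules, and the weighting step is omitted for brevity.

```python
def generate_ood(train_set, S, Q_topk, threshold, K):
    """Algorithm 1 sketch: replace each intent-related word with top-K candidates.

    train_set: iterable of (tokens, intent) pairs
    S(w, y)              -> intent-related score of word w for intent y
    Q_topk(tokens, j, K) -> the K best replacement words for position j
    """
    generated = []
    for tokens, y in train_set:
        for j, w_j in enumerate(tokens):
            if S(w_j, y) > threshold:                # w_j is intent-related
                for c in Q_topk(tokens, j, K):
                    generated.append(tokens[:j] + [c] + tokens[j + 1:])
    return generated
```

Each IND utterance can thus yield up to K generated OOD utterances per intent-related word, which is why the total generation count per intent is controlled separately in the experiments.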

Shaping the Energy Gap with GOT
After obtaining the weighted OOD utterance set D_out^gw, we can explicitly shape the energy gap with it, pushing IND utterances toward smaller energy scores and OOD utterances toward higher energy scores. Specifically, we redefine the regularization loss in Equation 6 as follows and use it to re-train the classifier:

L_energy = E_{u∼D_in^train}[max(0, E(u) − m_in)^2] + E_{(û,α)∼D_out^gw}[α · max(0, m_out − E(û))^2].    (12)

At test time, we calculate the energy score of an utterance by Equation 4 and identify whether it is OOD by Equation 5.
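The weighted variant of the regularization term can be sketched as below; as before, this is an illustrative batch-average stand-in for the expectations, with our own function name.

```python
def weighted_energy_reg_loss(ind_energies, ood_energy_weight_pairs,
                             m_in=-8.0, m_out=-5.0):
    """Weighted regularization sketch: the OOD hinge term for each generated
    utterance is scaled by its importance weight alpha."""
    l_in = sum(max(0.0, e - m_in) ** 2 for e in ind_energies) / max(len(ind_energies), 1)
    l_out = sum(a * max(0.0, m_out - e) ** 2
                for e, a in ood_energy_weight_pairs) / max(len(ood_energy_weight_pairs), 1)
    return l_in + l_out
```

A generated utterance with weight near zero thus contributes almost nothing to the gap-shaping objective, which is how the weighting module neutralizes bad generations.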

Datasets
To evaluate the effectiveness of the energy score and our proposed framework, we conducted experiments on two public datasets: • CLINC150 (Larson et al., 2019): this dataset covers 150 intent classes over ten domains. It also provides OOD utterances that do not fall into any of the system's supported intents, avoiding the need to split off unknown intents manually.
• SNIPS (https://github.com/snipsco/nlu-benchmark) (Coucke et al., 2018): this dataset is a personal voice assistant dataset that contains seven intent classes. SNIPS does not explicitly include OOD utterances, so we kept two classes, SearchCreativeWork and SearchScreeningEvent, as unknown intents. Table 1 provides summary statistics for the two datasets. Note that the training set and validation set do not include OOD utterances.

Metrics
We used four common metrics for OOD detection to measure performance. AUROC (Davis and Goadrich, 2006), AUPR In, and AUPR Out (Manning et al., 1999) are threshold-independent evaluations; higher values are better. FPR95 is the false positive rate (FPR) when the true positive rate (TPR) is 95%; lower values are better.
Considering the smaller proportion of OOD utterances in the test set on two datasets, AUPR Out is more informative here.
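FPR95 is simple to compute from raw detection scores. The sketch below is our own illustration (not tied to any evaluation library) and assumes higher scores mean "more IND".

```python
def fpr_at_95_tpr(ind_scores, ood_scores):
    """FPR95: fraction of OOD samples wrongly accepted as IND at the threshold
    that keeps 95% of IND samples (scores are 'IND-ness': higher = more IND)."""
    ind = sorted(ind_scores, reverse=True)
    # threshold retaining 95% of the IND samples
    cut = ind[max(int(0.95 * len(ind)) - 1, 0)]
    fp = sum(1 for s in ood_scores if s >= cut)
    return fp / len(ood_scores)
```

For the energy-based detector, one would pass negative energies as the scores, since lower energy indicates IND.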

Baselines
We introduce the following classifier-based methods as baselines: • MSP (Hendrycks and Gimpel, 2017) trains a classifier with IND utterances and uses the softmax confidence score to detect OOD utterances.
• DOC (Shu et al., 2017) trains a binary classifier for each IND intent and uses maximum binary classifier output to detect OOD utterances.
• Mahalanobis (Lee et al., 2018) trains a classifier with the softmax loss and uses the Mahalanobis distance of the input to the nearest class-conditional Gaussian distribution to detect OOD utterances.
• LMCL (Lin and Xu, 2019) trains a classifier with large margin cosine loss (LMCL) and uses the local outlier factor (LOF) to detect OOD utterances.
• SEG (Yan et al., 2020) also uses LOF on utterance representations; in training, it uses a semantic-enhanced large-margin Gaussian mixture loss.

Implementation Details
For a fair comparison, all classifiers used in the above methods and ours are pre-trained BERT (Devlin et al., 2018) with a multi-layer perceptron (MLP). We select parameter values based on validation accuracy. For the energy score, we follow Liu et al. (2020) and set T to 1, λ to 0.1, m_in to −8, and m_out to −5. For the influence value, we focus on changes to the MLP parameters and use stochastic estimation (Koh and Liang, 2017) with a scaling term of 1000 and a damping term of 0.003. For the LMCL implementation, we set the nearest neighbor number to 20, the scaling factor s to 30, and the cosine margin m to 0.35, as recommended by Lin and Xu (2019). For SEG, we follow Yan et al. (2020) and set the margin to 1 and the trade-off parameter to 0.5. For our framework, we set the candidate number K to 2 and the weight term γ to 20. In particular, for CLINC150, we set the intent-related word threshold to 150 and generate 100 weighted utterances per intent; for SNIPS, we set the threshold to 1500 and generate 1800 weighted utterances per intent. The different settings for the two datasets are due to the different numbers of training utterances per intent.

Results and Analysis
In this part, we show the results of the different methods on the two datasets and offer some further analysis. As shown in Table 2, we can observe that:
• The energy score achieves comparable results on both datasets. Note that on the SNIPS dataset, the advantages of the energy score are less obvious. The reason is that SNIPS is not as challenging as CLINC150: most methods achieve good results on it, e.g., an AUPR Out greater than 0.9.
• Energy + GOT achieves better results on both datasets than the raw energy score, indicating that our generated weighted OOD utterances effectively shape the energy gap and make IND and OOD utterances more distinguishable.
• We also report ablation study results. "w/o weighting" is the energy score tuned by the generated OOD utterances without reweighting. We can see a decrease in performance on both datasets, which shows the advantage of the weighting module (p-value < 0.005).

Effect of Hyper-parameters
During the training process, we find that performance is sensitive to two hyper-parameters: the auxiliary loss weight λ and the number of generated weighted OOD utterances per intent. We conduct two experiments to demonstrate their effects separately, choosing the CLINC150 dataset as it is the more challenging one, as mentioned before.
Auxiliary Loss Weight: We vary the auxiliary loss weight λ from 0 to 0.5 in intervals of 0.1 to observe its impact. Results are shown in Figure 3 (left). As the auxiliary loss weight increases, the performance first increases and then decreases. λ = 0.1 achieves the highest AUPR Out of 0.914 and outperforms λ = 0 by 1.7% (AUPR Out). The results suggest that although shaping the energy gap can improve performance, there is a trade-off between optimizing the regularization loss and optimizing the cross-entropy loss.
Number of OOD Utterances: We study the effect of the number of generated weighted utterances per intent by varying it from 0 to 100 in intervals of 20.
Results are shown in Figure 3 (right). On the whole, AUPR Out increases as more OOD utterances are incorporated into training. The performance improves even with a small number of generated utterances, which indicates the necessity of explicitly shaping the energy gap.

Compare with Wikipedia Sentences
An easy way to obtain OOD utterances is from a Wikipedia corpus. We investigate the effect of using Wikipedia sentences as OOD utterances to shape the energy gap on the CLINC150 dataset. The 14,750 Wikipedia sentences are from Larson et al. (2019).
As shown in Table 3, we observe that these sentences cannot improve the performance and even have a negative effect (we experimented with several hyper-parameter settings; this is the best result we could obtain). Inspecting these Wikipedia sentences, we find that they have little relevance to IND utterances. Therefore, simply using Wikipedia sentences is unrepresentative and ineffective for shaping the energy gap.

GOT for Softmax-based Detector
As mentioned in Section 1, OOD inputs may receive a high softmax confidence score from the softmax-based detector. To tackle this problem, Lee et al. (2018) replace the cross-entropy loss with the confidence loss. The confidence loss adds a Kullback-Leibler (KL) divergence term to the original cross-entropy loss, which makes the model less confident on OOD inputs by pushing their predictive distributions closer to uniform.
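A minimal sketch of such a confidence loss is shown below. This is our own illustration, not the referenced implementation: `confidence_loss` and the trade-off weight `beta` are assumed names, and the KL term is KL from the uniform distribution to the OOD predictive distribution.

```python
import math

def confidence_loss(ind_probs_and_labels, ood_probs, beta=1.0):
    """Cross entropy on IND plus a KL(Uniform || p) term that pushes OOD
    predictive distributions toward uniform (beta is an assumed trade-off weight)."""
    ce = -sum(math.log(p[y]) for p, y in ind_probs_and_labels) \
         / max(len(ind_probs_and_labels), 1)
    kl = 0.0
    for p in ood_probs:
        k = len(p)
        kl += sum((1.0 / k) * math.log((1.0 / k) / pi) for pi in p)
    kl /= max(len(ood_probs), 1)
    return ce + beta * kl
```

The KL term is zero when OOD predictions are already uniform and grows as they become confident, which is exactly the penalty that discourages high softmax confidence on OOD inputs.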
To verify the generality of GOT, we directly use the generated weighted OOD utterances to fine-tune the softmax-based detector with the confidence loss. The results are shown in Table 4. Our MSP + GOT yields a significant improvement, outperforming MSP by 8.9% (AUPR Out). Figure 4 provides an intuitive presentation: the softmax confidence scores of OOD utterances from MSP form a smooth distribution (Figure 4, left), whereas those from MSP + GOT concentrate on small values (Figure 4, right). Overall, the softmax confidence score is more distinguishable between IND and OOD after tuning with GOT.

Case Study for GOT
We sample some intents and showcase generated weighted OOD utterances in Table 5. We can observe that the intent-related words located by our locating module are diverse and are not limited to words appearing in the intent label. The replacement words fit the context well, and the intent of the generated utterance is indeed changed in most cases. Admittedly, GOT may produce bad generations, such as replacing "benefits" with "services" in the second utterance, which leaves the generated utterance in-domain. Fortunately, the weighting module assigns such utterances lower weights to reduce their potential harm.

Conclusion and Future Work
In this paper, we propose using energy scores for unknown intent detection and provide empirical evidence that the energy-based detector is comparable to strong baselines. To shape the energy gap, we propose a data manipulation framework, GOT, to generate high-quality OOD utterances and assign them importance weights. We show that the energy-based detector tuned by GOT can achieve state-of-the-art results. We further employ the generated weighted utterances to fine-tune the softmax-based detector and also achieve improvements.
In the future, we will explore more operations, such as insertion, drop, etc., to enhance the diversity of generated utterances.