OutFlip: Generating Out-of-Domain Samples for Unknown Intent Detection with Natural Language Attack

Out-of-domain (OOD) input detection is vital in a task-oriented dialogue system, since accepting an unsupported input could lead the system to give an incorrect response. This paper proposes OutFlip, a method to generate out-of-domain samples automatically, using only the in-domain training dataset. A white-box natural language attack method, HotFlip, is revised to generate out-of-domain samples instead of adversarial examples. Our evaluation results showed that integrating OutFlip-generated out-of-domain samples into the training dataset can significantly improve an intent classification model's out-of-domain detection performance.


Introduction
Intent classification is crucial for task-oriented dialogue systems such as Google DialogFlow or Amazon Lex. It is vital for an intent classifier not only to map an input utterance into the correct label but also to detect out-of-domain (OOD) inputs. An accepted OOD input will lead the dialogue system to give erroneous responses.
Approaches for OOD detection in text classification fall into two major categories. Outlier detection approaches (Fei and Liu, 2016; Hendrycks and Gimpel, 2017; Shu et al., 2017; Lin and Xu, 2019; Xu et al., 2020) try to find the boundaries of the known classes in feature space. They need no labeled OOD dataset, but it is hard for them to deal with boundary cases. (n + 1)-way classification approaches (Kim and Kim, 2018; Larson et al., 2019; Ryu et al., 2018; Zheng et al., 2020) train classifiers for OOD detection using (pseudo-)labeled OOD samples. In practice, it is difficult and expensive to collect a large number of labeled OOD samples in an open-world environment. Our implementation will soon be available at https://github.com/kakaoenterprise/OutFlip.
This paper proposes OutFlip, a method to generate OOD samples from in-domain training dataset automatically. For a given training dataset T and a reference intent classification model M which is trained with T , the OutFlip generates a set of OOD samples O. The generated OOD samples O could be used to train M iteratively to improve its OOD detection performance. Since the OutFlip does not require any modifications to the model architecture, it could be used with other OOD detection approaches to further improve the OOD detection performance.
The generated OOD samples should satisfy two conditions. First, they should be "hard enough": if the generated examples are too easy to distinguish from the in-domain intents, they will be useless for training the OOD detector. Second, they should not belong to any in-domain intent. For a given reference model M and a set of in-domain labels I, this can be formulated as finding a sentence x_o with truth label y ∉ I whose model classification ŷ is in I. From this point of view, the OOD sample generation task can be considered a variant of a natural language attack on model M; the goal of a natural language attack on M is to find an x_a with truth label y ∈ I and model classification ŷ ≠ y. We revised HotFlip (Ebrahimi et al., 2018), a natural language attack method, to generate such OOD samples.
Our evaluation results showed that the generated OOD samples could significantly improve the OOD detection performance of the reference models. We also showed that applying OutFlip together with other OOD detection approaches further improves a model's OOD detection performance. The evaluation results also suggest that the generated OOD samples can be used to train models other than the reference model and improve their OOD detection performance.
Our contributions are summarized as follows: • We proposed OutFlip, a simple and efficient OOD sample generation method using only in-domain training samples.
• We experimentally showed the effectiveness of our proposed approach using the intent classification benchmarks.
• We showed that the generated OOD samples could also improve the OOD detection performances of models other than the reference model.

Related Work
Previous OOD detection works can be classified into two major categories. Outlier detection approaches find the boundaries of known classes in feature space. Fei and Liu (2016) compute a center for each class and transform each document into a vector of similarities to the centers; a binary classifier is then built on the transformed data for each class. For deep learning-based systems, Hendrycks and Gimpel (2017) proposed the baseline of using the softmax score as a threshold. Shu et al. (2017) trained the intent classifier with a sigmoid function and used the standard distribution to set each class's score threshold. Lin and Xu (2019) first trained the classifier using Large Margin Cosine Loss (LMCL) (Nalisnick et al., 2018), and then applied the Local Outlier Factor (Breunig et al., 2000) to detect OOD inputs. Another line of work proposed a semantic-enhanced Gaussian mixture model to gather vectors of the same class closely. Xu et al. (2020) calculated the mean and covariance of the training samples for each class and used the Mahalanobis distance as the distance function.
(n + 1)-way classification approaches train the intent classifier with one additional class, where the (n + 1)-th class represents the unseen intent. Kim and Kim (2018) proposed joint learning for in-domain and out-of-domain speeches. Larson et al. (2019) manually collected OOD samples to train intent classifiers. Ryu et al. (2018) generated OOD feature vectors using a generative adversarial network (Goodfellow et al., 2014) to train an OOD detector. Since the approach proposed in Ryu et al. (2018) only works in a continuous feature space, it depends heavily on the feature encoder that transforms inputs into feature vectors. Zheng et al. (2020) also generated OOD feature vectors, but additionally used unlabeled examples to further enhance classification performance. Although the (n + 1)-way classification approaches are easy to adopt without modifying the classification model, it is extremely costly and time-consuming to collect appropriate OOD samples.
The proposed OutFlip automatically generates OOD samples using only the in-domain training set, significantly reducing the cost of manually collecting OOD samples. In addition, OutFlip does not depend on the feature encoder.
The goal of an adversarial attack in text classification is to fool a given text classification model M by generating an adversarial example x_a with truth label y and model classification ŷ ≠ y. Many successful attacks first take a correctly classified example x and replace its important words or characters to obtain an adversarial sample x_a. In a white-box scenario, the attacker has access to the target model's structure; thus, the important words or characters can easily be selected by inspecting the gradient of model M. HotFlip (Ebrahimi et al., 2018) estimates the best character change by maximizing a first-order approximation of the change in the loss.
In a black-box scenario, the attacker is not aware of the model or training data; the attacker is only capable of querying the target model with supplied inputs and obtaining the output predictions and their confidence scores. Alzantot et al. (2018) randomly selects a word from sentence x and selects a suitable replacement word that has a similar semantic meaning. Jin et al. (2020) proposed a word importance score, which is used to find the word to be replaced. Li et al. (2020) applied BERT (Devlin et al., 2018) pre-trained language model to find a replacement word.
The proposed OutFlip first extracts important words using the algorithm proposed in Jin et al. (2020), and then applies a variant of HotFlip (Ebrahimi et al., 2018) to generate OOD samples that the given reference model finds hard to distinguish from the in-domain intents.

Proposed Approach
In this section, the proposed OOD sample generation approach OutFlip is described in more detail.

HotFlip
We first introduce the white-box adversarial example generation method HotFlip (Ebrahimi et al., 2018). Let M be a text classification model, V the word vocabulary, x = (x_1; ...; x_n) a sentence of n words where x_i ∈ {0, 1}^|V| denotes the one-hot vector representing the i-th word, and L_M(x, y) the loss of M on input x with true output y. For a given sentence x, a flip of the i-th word from w_a to w_b is represented by the vector

    v_{i,b} = (0, ..., 0; (0, ..., -1, ..., 1, ..., 0)_i; 0, ..., 0)    (1)

where the -1 and 1 are in the positions corresponding to words w_a and w_b in the word vocabulary, respectively. A first-order approximation of the change in the loss L_M(x, y) can be obtained from the directional derivative along this vector:

    ΔL_M ≈ ∇_x L_M(x, y)^T · v_{i,b} = ∂L_M/∂x_i^{(b)} − ∂L_M/∂x_i^{(a)}    (2)

HotFlip then chooses the flip with the biggest increase in loss:

    max_{i, w_b} ( ∂L_M/∂x_i^{(b)} − ∂L_M/∂x_i^{(a)} )
    subject to Sim(w_a, w_b) ≥ T_sim and POS(w_a) = POS(w_b)    (3)

where T_sim is a similarity threshold between two words, and POS(w_a) is the part-of-speech tag of w_a. The two constraints ensure that x_a is semantically similar to the original input x. With Equation 3, HotFlip determines both the flip position i and the replacement word w_b.
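To make the scoring concrete, the sketch below scores every candidate flip with the first-order approximation above. It is an illustration, not the authors' implementation: the gradient matrix, the similarity matrix, and the POS tags are assumed to be precomputed stand-ins for the real model quantities.

```python
import numpy as np

def hotflip_best_flip(grad, x_ids, sim, pos, t_sim=0.5):
    """Pick the (position, replacement) flip that maximizes the
    first-order loss increase, subject to similarity/POS constraints.

    grad  : (n, |V|) gradient of the loss w.r.t. each one-hot word vector
    x_ids : length-n list of current word ids
    sim   : (|V|, |V|) cosine-similarity matrix between word embeddings
    pos   : length-|V| POS tag per vocabulary word
    """
    best, best_gain = None, -np.inf
    n, vocab = grad.shape
    for i in range(n):
        a = x_ids[i]
        for b in range(vocab):
            if b == a:
                continue
            # constraints of Eq. 3: similar word with the same POS tag
            if sim[a, b] < t_sim or pos[a] != pos[b]:
                continue
            gain = grad[i, b] - grad[i, a]  # first-order Δloss of the flip
            if gain > best_gain:
                best, best_gain = (i, b), gain
    return best, best_gain
```

A full attack would apply the chosen flip and repeat until the model's prediction changes.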

OutFlip
For a given reference model M and an in-domain sample x with true output y, the main idea of OutFlip is to flip the most important word of x, w_M(x), to a semantically different word while minimizing the change in the loss L_M(x, y). By doing so, OutFlip expects to obtain a sample x_o whose truth label differs from that of x, while the model classifications of x_o and x remain the same.
The word importance score proposed in Jin et al. (2020) is defined as follows:

    I(x_i) = o_y(M, x) − o_y(M, x\x_i)    (4)

where y is the truth label of x, o_y(M, x) is the logit output of the target model M for label y, and x\x_i is the sentence after masking x_i. The most important word w_M(x) is defined as the word with the largest importance score in x.
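The masking-based score can be sketched as below; `logit_fn` is a hypothetical stand-in for the reference model's logit output, and `'[MASK]'` for whatever masking scheme the model uses.

```python
def word_importance(logit_fn, tokens, y):
    """Importance of each word = drop in the true-class logit when
    that word is masked out (Jin et al., 2020 style score).

    logit_fn : callable mapping a token list to a vector of logits
    tokens   : list of words in the sentence
    y        : index of the truth label
    """
    base = logit_fn(tokens)[y]
    scores = []
    for i in range(len(tokens)):
        masked = tokens[:i] + ['[MASK]'] + tokens[i + 1:]
        scores.append(base - logit_fn(masked)[y])
    return scores  # w_M(x) is the token maximizing this score
```

The most important word is then `tokens[argmax(scores)]`.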
For each in-domain label y of the training dataset T, we define the Core Class Tokens (CCT) C_T(y) as the top five most frequent w_M(x) among the training samples with truth label y. Since the importance score is calculated with the reference model M, OutFlip could select a wrong token as the most important one due to model error. If OutFlip flips such a word, the generated sentence's truth label will remain unchanged, yielding an erroneous OOD sample. To prevent such a case, OutFlip simply disregards x during the OOD generation process if w_M(x) ∉ C_T(y).
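A minimal sketch of the CCT construction, assuming the most important word of each training sample has already been extracted; the function name and input format are illustrative, not from the paper.

```python
from collections import Counter

def core_class_tokens(samples, top_k=5):
    """Build C_T(y): label -> set of the top-k most frequent most-important
    words among that label's training samples.

    samples : list of (most_important_word, truth_label) pairs
    """
    counts = {}
    for word, label in samples:
        counts.setdefault(label, Counter())[word] += 1
    return {lbl: {w for w, _ in c.most_common(top_k)}
            for lbl, c in counts.items()}
```

A sample x is kept for OOD generation only if its most important word is in `core_class_tokens(...)[y]`.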
In summary, OutFlip chooses the replacement word w_b using the following equation:

    min_{w_b} ( ∂L_M/∂x_i^{(b)} − ∂L_M/∂x_i^{(a)} )
    subject to Sim(w_a, w_b) < T_sim and w_a ∈ C_T(y)    (5)

where w_a = w_M(x) and i is its position. Since the generated OOD samples need not be fluent, the part-of-speech condition of Equation 3 is removed. To generate more diverse samples, OutFlip randomly chooses w_b among the top 1% of the vocabulary in ascending order of the loss change. We used cosine similarity as the similarity measure.
The truth label of a generated sample x_o could, by chance, be an in-domain label different from y. OutFlip therefore checks whether the model classification of x_o remains the same as that of x; if the classification result changes, OutFlip disregards x_o. Algorithm 1 shows the pseudocode of the proposed OutFlip.
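The replacement selection can be sketched as follows, assuming the gradient row for the flip position and the cosine similarities to w_a are precomputed; the function name and arguments are illustrative.

```python
import numpy as np

def outflip_replacement(grad_i, a, sim, t_sim=0.3, top_frac=0.01, rng=None):
    """Choose a replacement word for the most important word a:
    keep only words semantically *dissimilar* to a (sim < t_sim),
    rank them by first-order loss change (ascending), and sample
    uniformly from the top 1% to diversify the generated samples.

    grad_i : (|V|,) gradient of the loss w.r.t. the one-hot at position i
    a      : id of the word being replaced
    sim    : (|V|,) cosine similarity of every vocabulary word to a
    """
    rng = rng or np.random.default_rng(0)
    delta = grad_i - grad_i[a]             # Δloss of flipping a -> b
    cand = np.where(sim < t_sim)[0]        # dissimilar words only
    cand = cand[cand != a]
    order = cand[np.argsort(delta[cand])]  # smallest loss change first
    k = max(1, int(len(order) * top_frac))
    return int(rng.choice(order[:k]))
```

With a tiny vocabulary, `top_frac` degenerates to a single candidate, so the smallest-Δloss word is returned deterministically.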

Iteratively Populating OOD samples
The reference model M can be trained iteratively with the generated OOD samples to improve its OOD detection performance. Figure 1 shows the overall framework. In each iteration, the set of generated OOD samples O is randomly split into a training and a dev set and used for the next training iteration.
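The iteration loop might be sketched as below, with `fit` and `generate` as placeholders for model training and OutFlip generation; the train/dev split fraction is a parameter here.

```python
import random

def iterative_outflip(train, dev, fit, generate,
                      iterations=3, dev_frac=0.1, seed=0):
    """Iteratively populate OOD samples: train the reference model,
    generate OOD samples against it, fold them back into the
    train/dev sets, and retrain.

    fit      : callable (train, dev) -> trained model
    generate : callable (model, train) -> list of generated OOD sentences
    """
    rng = random.Random(seed)
    model = fit(train, dev)
    for _ in range(iterations):
        ood = generate(model, train)
        rng.shuffle(ood)
        cut = int(len(ood) * (1 - dev_frac))      # split off a dev portion
        train = train + [(s, 'OOD') for s in ood[:cut]]
        dev = dev + [(s, 'OOD') for s in ood[cut:]]
        model = fit(train, dev)                   # next training iteration
    return model, train, dev
```

Because each iteration regenerates OOD samples against the freshly retrained model, the samples get progressively harder for that model.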
Since OutFlip requires no change to the model architecture, it can be applied independently, alongside other OOD detection algorithms.

Experiments
In this section, experimental settings and evaluation results are shown.

Datasets
Experiments are conducted on three real task-oriented dialogue datasets: SNIPS (Coucke et al., 2018), ATIS (Hemphill et al., 1990), and the Kakao dialogue corpus (Choi et al., 2020). SNIPS and ATIS are well-known English benchmarks; the Kakao dialogue corpus is a Korean intent classification benchmark, which we use to evaluate whether the proposed OutFlip can be applied to a different language. Table 1 summarizes the statistics of the datasets. The ATIS dataset is highly imbalanced: more than 70% of the samples belong to one class, while three classes have fewer than 10 samples. The SNIPS and Kakao datasets are relatively balanced. Since the SNIPS dataset does not have a test set, we randomly selected 30% of the training set and used it as the test set.

Baselines
We implemented two sentence encoders to show the generality of the proposed approach. The LSTM (Hochreiter and Schmidhuber, 1997)-based encoder applies a one-layer BiLSTM with output dimension 128 to the word embeddings of the given input, followed by a self-attention layer with attention dimension 10 to obtain the feature vector. The CNN-based encoder applies the algorithm proposed in Kim (2014); more precisely, one-dimensional convolutions with kernel sizes 2, 3, 4, and 5 and filter size 32 are applied on top of the word embeddings, and the results are max-pooled to obtain the feature vector. For both encoders, a dense layer is applied to the feature vector to obtain the logit of each class. We also implemented three baseline OOD detection systems, as follows:

1. Maximum Softmax Probability (MSP) (Hendrycks and Gimpel, 2017) considers the maximum softmax probability of a sample as the rejection score. If the probability is below a certain threshold, the sample is classified as OOD. We used a threshold of 0.5, as the authors suggested.

2. Deep Open Classification (DOC) (Shu et al., 2017) replaces softmax with sigmoid activation as the final layer to calculate the score for each class separately. It also calculates a per-class threshold through a statistical approach.

3. Large Margin Cosine Loss (LMCL) (Lin and Xu, 2019) replaces the softmax loss with the large margin cosine loss (Nalisnick et al., 2018) to force the model to maximize inter-class variance and minimize intra-class variance. After training, it applies the Local Outlier Factor (LOF) (Breunig et al., 2000) to the training feature vectors to detect outliers as OOD.
We set the scaling factor s = 30 and the cosine margin m = 0.35, following the authors.
By combining two feature encoders and three baseline OOD detection systems, we implemented eight baseline reference models, six with an OOD detection system and two without.

Experimental Setup
Word embeddings are initialized with GloVe (Pennington et al., 2014) pre-trained word vectors. We downloaded the pre-trained embeddings containing 1.9M words trained on 42B tokens from the author's homepage. For Korean, Korean pre-trained GloVe embedding vectors proposed in Choi et al. (2020) are used. The dimensions of both pretrained embeddings are 300.
Following the evaluation settings of Fei and Liu (2016), Shu et al. (2017), and Lin and Xu (2019), we removed some classes from the train/dev set during training and integrated them back during testing. We varied the number of known intents in the training dataset among 25%, 50%, and 75% of the intents, and used all intents for testing. Known intents are chosen by weighted random sampling without replacement over the training set. Note that the samples belonging to unknown intents are removed during training and validation.
Following Fei and Liu (2016); Shu et al. (2017); Lin and Xu (2019), macro F1 score is used to evaluate the models. For each known intent selection, the F1 score for each class is calculated separately. Then the results are macro-averaged across all classes. We reported the average of 10 random known intent selections for each evaluation.
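For reference, the macro F1 computation described above amounts to the following plain re-implementation (not the paper's evaluation script):

```python
def macro_f1(y_true, y_pred, labels):
    """Compute the F1 score for each class separately, then take the
    unweighted (macro) average across all classes."""
    f1s = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

Unlike micro averaging, this weights rare classes (such as the small ATIS intents) equally with frequent ones.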
For each OutFlip iteration, 90% of the generated OOD samples are added to the training set, and the remaining 10% are added to the dev set; the populated train/dev sets are used for the next training iteration. The Adam optimizer (Kingma and Ba, 2015) with an initial learning rate of 0.001 is used to train the model, with a training batch size of 128. Exponential learning rate decay with a decay rate of 0.8 is applied every two epochs. After each epoch, the trained classifier is evaluated against the dev set, and training stops when the dev accuracy has not improved for five consecutive epochs.

Evaluation Results
Table 2 shows the evaluation results of the proposed OutFlip and the baseline systems. As can be observed from the table, the proposed OutFlip outperforms the other baselines when the number of known intents is small. A small number of known intents is the case closest to real-world applications, since in an open-world environment the number of unknown intents is much larger than the number of known intents. OutFlip also gives comparable results on the ATIS and SNIPS corpora with larger numbers of known intents.

For the Kakao corpus, the OutFlip performance is lower than the other baselines. To find the reason, we compared the quality of the English and Korean GloVe embeddings. We used 4 out of the 14 categories in the word analogy corpus (Mikolov et al., 2013) for a fair comparison: capital-common-countries, capital-world, currency, and a subset of family. We removed all syntactic questions, since they cannot be translated one-to-one into Korean words; part of the family category is removed for the same reason. We also removed categories that give an advantage to the English pre-trained embeddings; for example, the city-in-state category is removed because it contains relationships between US cities and US states. The remaining 6,168 questions are manually translated into Korean. Table 3 shows the evaluation results of the English and Korean GloVe vectors on our subset of the word analogy corpus. As can be observed, the accuracy of the Korean GloVe vectors is much lower than that of the English GloVe vectors. Since OutFlip relies on the cosine similarities between pre-trained embedding vectors to generate OOD samples, the quality of the embedding vectors is critical to the OOD generation performance.
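The word analogy evaluation follows the standard vector-offset protocol of Mikolov et al. (2013): answer "a is to b as c is to ?" with the nearest neighbor of b − a + c. A minimal sketch, with a toy unit-normalized embedding table as an assumption:

```python
import numpy as np

def analogy_accuracy(emb, questions):
    """Answer 'a : b :: c : ?' by cosine nearest neighbor to b - a + c,
    excluding the three question words.

    emb       : dict word -> unit-normalized vector
    questions : list of (a, b, c, expected_answer) tuples
    """
    words = list(emb)
    mat = np.stack([emb[w] for w in words])
    correct = 0
    for a, b, c, expected in questions:
        target = emb[b] - emb[a] + emb[c]
        scores = mat @ (target / np.linalg.norm(target))
        for w in (a, b, c):                  # never predict a question word
            scores[words.index(w)] = -np.inf
        correct += words[int(np.argmax(scores))] == expected
    return correct / len(questions)
```

Accuracy on such questions is a rough but language-agnostic proxy for how well cosine similarity in the embedding space reflects semantics.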
Next, we applied the proposed OutFlip to the other OOD detection baselines to see whether it could further improve their performance. Table 4 shows the evaluation results. In most cases, applying OutFlip to the other OOD detection approaches resulted in a performance improvement. The improvement was significant when the dataset is balanced and the number of known intents is small.
We conducted a set of experiments to find the best OutFlip iteration number and T_sim value. Figures 2 and 3 show the OutFlip performance on the benchmark datasets with varying T_sim values and OutFlip iteration numbers, respectively. Increasing the T_sim value makes OutFlip generate more challenging examples, but the chance of generating wrong OOD samples also increases. Figure 2 suggests that a balanced dataset like SNIPS can easily recover from the errors introduced by large T_sim values. In contrast, the ATIS dataset's macro F1 score decreases with increasing T_sim values when many intents are known; since 3 out of the 18 ATIS intents have fewer than ten sentences, one or two erroneous OOD samples can lead to a performance drop. The macro F1 score on the balanced Kakao dataset does not increase for T_sim values larger than 0.3; since the quality of the Korean GloVe vectors is relatively low, large T_sim values introduce more errors than on the English datasets.
As can be observed from Figure 3, in most cases the macro F1 score converges within two to three OutFlip iterations. Additional OutFlip iterations give small or no performance improvements for the balanced datasets, and decrease the macro F1 score for the unbalanced ATIS dataset by introducing more errors.
One advantage of OutFlip is that the generated OOD samples can be used to train and improve the OOD detection performance of models other than the reference model, without applying additional OutFlip iterations. We trained BERT-base and BERT-large models (Devlin et al., 2018) on the ATIS and SNIPS benchmarks. As in the previous experiments, the unknown intents are removed during training and integrated back during testing. In addition, while training the BERT models, we added OOD samples generated with the reference models OutFlip_cnn and OutFlip_lstm, using three OutFlip iterations and a T_sim value of 0.3. Table 5 shows the evaluation results of the BERT models trained with the OutFlip-generated OOD samples. As can be observed from the table, the OutFlip-generated OOD samples significantly improved the OOD detection performance of the BERT models, regardless of the reference model used to generate the OOD samples.

Error Analysis
We randomly selected 200 samples from the OOD examples generated by OutFlip_cnn with three iterations and T_sim = 0.3 on the ATIS dataset, when 75% of the intents are known. The number of newly generated OOD samples per iteration is also reported. The entity extraction results show that, for the intent atis_flight, 797 out of 4,334 samples contain the entity "Denver". In one case, the OutFlip-generated sentence accidentally belongs to another in-domain intent, but due to a reference model error, OutFlip fails to remove the generated sentence: the training instance "Can you list the cheapest round trip fare from Orlando to Kansas City" (truth label atis_airfare) is converted to the sentence "Can you list the cheapest round trip airplane from Orlando to Kansas City" (truth label atis_flight), but the reference model classifies the converted sentence as atis_airfare. Since the classification result remains the same, OutFlip considers the generated sentence a "hard-enough" OOD sample.
The ATIS dataset allows an instance to have multiple labels; two or more labels are assigned to 23 ATIS training instances. OutFlip fails to handle those instances properly: the one remaining error case is generated from a training instance with two assigned labels.

Conclusion
In this paper, we proposed OutFlip, a method to generate OOD samples using only the in-domain training dataset. Our evaluation results showed that the proposed OutFlip can significantly improve the OOD detection performance of an intent classification model by iteratively generating difficult OOD samples. Since OutFlip does not require any modifications to the model architecture, it can be used together with other OOD detection approaches to further improve OOD detection performance. We also showed that the generated OOD samples can be used to train and improve the OOD detection performance of models other than the reference model, without applying additional OutFlip iterations.
So far, we have focused only on generating difficult OOD samples that can fool the reference model. However, generating meaningful OOD samples could also be beneficial, since the dialogue engine developer could then inspect the generated OOD samples to find new intents. As future work, we will focus on generating meaningful, fluent OOD samples.