Estimating Soft Labels for Out-of-Domain Intent Detection

Out-of-Domain (OOD) intent detection is important for practical dialog systems. To alleviate the issue of lacking OOD training samples, some works propose synthesizing pseudo OOD samples and directly assigning one-hot OOD labels to these pseudo samples. However, these one-hot labels introduce noise into the training process because some "hard" pseudo OOD samples may coincide with In-Domain (IND) intents. In this paper, we propose an adaptive soft pseudo labeling (ASoul) method that can estimate soft labels for pseudo OOD samples when training OOD detectors. Semantic connections between pseudo OOD samples and IND intents are captured using an embedding graph. A co-training framework is further introduced to produce the resulting soft labels following the smoothness assumption, i.e., close samples are likely to have similar labels. Extensive experiments on three benchmark datasets show that ASoul consistently improves the OOD detection performance and outperforms various competitive baselines.


Introduction
Intent detection is essential for dialogue systems, and current methods usually achieve high performance under the closed-world assumption (Shu et al., 2017), i.e., data distributions are static, and only a fixed set of intents is considered. However, such an assumption may not hold in practice, where we usually confront an open world (Fei and Liu, 2016), i.e., unknown intents that are not seen in training may emerge. It is therefore necessary to equip dialogue systems with Out-of-Domain (OOD) detection abilities so that they can accurately classify known In-Domain (IND) intents while rejecting unknown OOD intents (Yan et al., 2020a; Shen et al., 2021).
A major challenge for OOD detection is the lack of OOD samples (Xu et al., 2020a). In most applications, it is hard, if not impossible, to collect OOD samples from the test distribution before training (Du et al., 2021). To tackle this issue, various studies try to synthesize pseudo OOD samples in the training process. Existing methods include distorting IND samples (Choi et al., 2021; Shu et al., 2021; Ouyang et al., 2021), using generative models (Ryu et al., 2018; Zheng et al., 2020a), or even mixing up IND features (Zhou et al., 2021a; Zhan et al., 2021). Promising results are reported by training a (k+1)-way classifier (k IND classes + 1 OOD class) using these pseudo OOD samples (Geng et al., 2020). This classifier can classify IND intents while detecting the OOD intent, since inputs that fall into the OOD class are regarded as OOD inputs.
Previous studies directly assign one-hot OOD labels to pseudo OOD samples when training the (k+1)-way classifier (Shu et al., 2021; Chen and Yu, 2021). However, this scheme brings noise to the training process because "hard" pseudo OOD samples, i.e., OOD samples that are close to IND distributions, may carry IND intents (Zhan et al., 2021) (see Figure 1). Indiscriminately assigning one-hot OOD labels ignores the semantic connections between pseudo OOD samples and IND intents. Moreover, this issue becomes more severe as most recent studies are dedicated to producing hard pseudo OOD samples (Zheng et al., 2020a; Zhan et al., 2021), since these samples are reported to better facilitate OOD detectors (Lee et al., 2017). Collisions between pseudo OOD samples and IND intents will therefore become more common.
We argue that ideal labels for pseudo OOD samples should be soft labels that allocate non-zero probabilities to all intents (Hinton et al., 2015; Müller et al., 2019). Specifically, we demonstrate in §3.2 that pseudo OOD samples generated by most existing approaches should be viewed as unlabeled data since they may carry both IND and OOD intents. Soft labels help capture the semantic connections between pseudo OOD samples and IND intents. Moreover, using soft labels also conforms to the smoothness assumption, i.e., samples close to each other are likely to receive similar labels. This assumption lays a foundation for modeling unlabeled data in various previous works (Luo et al., 2018; Van Engelen and Hoos, 2020).
In this study, we propose an adaptive soft pseudo labeling (ASoul) method that can estimate soft labels for given pseudo OOD samples and thus help to build better OOD detectors. Specifically, we first construct an embedding graph using supervised contrastive learning to capture semantic connections between pseudo OOD samples and IND intents. Following the smoothness assumption, a graph-smoothed label is produced for each pseudo OOD sample by aggregating nearby nodes on the graph. A co-training framework with two separate classification heads is introduced to refine these graph-smoothed labels. Concretely, the prediction of one head is interpolated with the graph-smoothed label to produce the soft label used to enhance its peer head. The final OOD detector is formulated as a (k+1)-way classifier with adaptive decision boundaries.
Extensive experiments on three benchmark datasets demonstrate that ASoul can be used with a wide range of OOD sample generation approaches and consistently improves the OOD detection performance. ASoul also helps achieve new State-of-the-art (SOTA) results on benchmark datasets. Our major contributions are summarized as follows:

1. We propose ASoul, a method that can estimate soft labels for given pseudo OOD samples. ASoul conforms to the important smoothness assumption for modeling unlabeled data by assigning similar labels to close samples.

2. We construct an embedding graph to help capture the semantic connections between pseudo OOD samples and IND intents. A co-training framework is further introduced to produce the resulting soft labels with the help of two separate classification heads.

3. We conduct extensive experiments on three benchmark datasets. The results show that ASoul consistently improves the OOD detection performance, and it obtains new SOTA results.
Related Work

Pseudo OOD Sample Generation: Some works try to tackle OOD detection problems by generating pseudo OOD samples. Generally, four categories of approaches have been proposed: 1. Phrase Distortion (Chen and Yu, 2021): OOD samples are generated by replacing phrases in IND samples; 2. Feature Mixup (Zhan et al., 2021): OOD features are directly produced by mixing up IND features (Zhang et al., 2018); 3. Latent Generation (Marek et al., 2021): OOD samples are drawn from the low-density area of a latent space; 4. Open-domain Sampling (Hendrycks et al., 2018): data from other corpora are directly used as pseudo OOD samples. With these pseudo OOD samples, the OOD detection task can be formalized as a (k+1)-way classification problem (k is the number of IND intents). Our method can be combined with all the above OOD generation approaches to improve the OOD detection performance.
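Among the four categories, Feature Mixup is the simplest to illustrate. The following is a minimal sketch rather than the exact procedure of Zhan et al. (2021): the Beta-distributed mixing coefficients and the random choice of mixing partners are illustrative assumptions.

```python
import numpy as np

def mixup_pseudo_ood(ind_feats, alpha=2.0, rng=None):
    """Sketch of Feature Mixup: synthesize pseudo OOD features as convex
    combinations of IND features. ind_feats is an (n, d) array of encoded
    IND utterances; Beta(alpha, alpha) coefficients are an assumption.
    """
    rng = rng or np.random.default_rng(0)
    n = ind_feats.shape[0]
    lam = rng.beta(alpha, alpha, size=(n, 1))     # per-sample mixing weight
    perm = rng.permutation(n)                     # random mixing partners
    # Convex combinations fall between IND clusters, which is exactly
    # where "hard" pseudo OOD samples are expected to live.
    return lam * ind_feats + (1.0 - lam) * ind_feats[perm]
```

Note that when the two mixed features come from different IND classes, the result tends to be a "hard" pseudo OOD sample close to both class distributions, which is the case our soft-labeling scheme targets.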
Soft Labeling: Estimating soft labels for inputs has been applied in a wide range of studies such as knowledge distillation (Hinton et al., 2015; Gou et al., 2021; Zhang et al., 2020), confidence calibration (Müller et al., 2019; Wang et al., 2021), and domain shift (Ng et al., 2020). However, few studies utilize this approach in OOD detection. Existing approaches only attempt to assign dynamic weights (Ouyang et al., 2021) or soft labels (Cheng et al., 2022) to IND samples. Our method ASoul is the first attempt to estimate soft labels for pseudo OOD samples.
Semi-Supervised Learning: Our work is also related to semi-supervised learning (SSL) since both attempt to utilize unlabeled data and share the same underlying smoothness assumption (Wang and Zhou, 2017; Lee et al., 2013; Li et al., 2021). Moreover, the co-training framework in ASoul also helps to enforce the low-density assumption (a variant of the smoothness assumption) (Van Engelen and Hoos, 2020; Chen et al., 2022).
Hard pseudo OOD samples, i.e., samples that lie close to IND distributions, are more efficient in improving the OOD detection performance (Lee et al., 2018a; Zheng et al., 2020a). Promising performances are obtained using these hard samples on various benchmarks (Zhan et al., 2021; Shu et al., 2021).
However, we notice that hard pseudo OOD samples used in previous approaches may coincide with IND samples and carry IND intents. Besides Figure 1, we further demonstrate this issue by visualizing pseudo OOD samples produced by Zhan et al. (2021). Specifically, pseudo OOD samples are synthesized using convex combinations of IND features. Figure 2 shows the results on the Banking dataset (Casanueva et al., 2020) when 25% of intents are randomly selected as IND intents. It can be seen that some pseudo OOD samples fall into the cluster of IND intents, and thus it is improper to assign one-hot OOD labels to these samples.
The above issue is also observed in other pseudo OOD sample generation approaches. Specifically, we implement the phrase distortion approach proposed by Shu et al. (2021) and employ crowd-sourced workers to annotate 1,000 generated pseudo OOD samples. Results show that up to 39% of the annotated samples carry IND intents (see Appendix A for more examples).

Overview
In this study, we build the OOD intent detector following three steps: 1. Construct a set of pseudo OOD samples D_P; 2. Estimate a soft label for each sample x ∈ D_P; 3. Obtain a (k+1)-way classifier and learn a decision boundary for each class to build an OOD detector. A testing input x is identified as OOD if x belongs to the OOD intent I_{k+1} or x is out of all decision boundaries.
Before applying ASoul, we assume a set of pseudo OOD samples D_P has already been generated using existing approaches. Figure 3 shows an overview of ASoul. Specifically, a shared utterance encoder f encodes each input x ∈ D_I ∪ D_P into a representation, and an embedding projection head h constructs an embedding graph on these representations. A co-training framework is also implemented using two (k+1)-way classification heads g_1 and g_2, and the prediction of one head is used to enhance the soft labels of the peer head.
Note that ASoul is independent of the specific method used to produce the pseudo OOD samples in D_P. In this study, we test various approaches to obtain D_P.

Embedding Graph
Embedding Space: An embedding space is maintained in ASoul to capture the semantics of input samples. Specifically, for an input x_i, an encoder f is used to convert x_i into a representation vector, and a projection head h then maps f(x_i) into an L2-normalized embedding z_i = h[f(x_i)] to construct the embedding space. To capture better semantic representations, a supervised contrastive loss (Khosla et al., 2020; Gunel et al., 2020) L_ctr is optimized on labeled IND samples in D_I:

L_{ctr} = \sum_{x_i \in D_I} \frac{-1}{|S(i)|} \sum_{x_p \in S(i)} \log \frac{\exp(\Phi(x_i) \cdot \Phi(x_p) / t)}{\sum_{x_a \in A(i)} \exp(\Phi(x_i) \cdot \Phi(x_a) / t)}, (1)

in which S(i) represents the samples that share the same label with x_i in the current batch, A(i) represents all samples in the current batch except x_i, Φ maps an input x to its corresponding embedding (i.e., Φ(x) = h[f(x)]), and t > 0 is a scalar that controls the separation of classes. L_ctr captures the similarities between examples in the same class and contrasts them with examples from different classes (Gunel et al., 2020).
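As a concrete reference, the supervised contrastive loss above can be sketched as follows. This is a naive O(n²) NumPy version over a single batch; the normalization follows Khosla et al. (2020), and the batch construction is simplified.

```python
import numpy as np

def supervised_contrastive_loss(z, labels, t=0.1):
    """Sketch of the supervised contrastive loss L_ctr.

    z:      (n, d) L2-normalized embeddings z_i = h[f(x_i)].
    labels: (n,) integer IND labels.
    t:      temperature controlling the separation of classes.
    """
    n = z.shape[0]
    sim = z @ z.T / t  # pairwise similarities Phi(x_i) . Phi(x_j) / t
    loss = 0.0
    for i in range(n):
        pos = [p for p in range(n) if p != i and labels[p] == labels[i]]  # S(i)
        others = [a for a in range(n) if a != i]                          # A(i)
        if not pos:
            continue  # no same-class partner in this batch
        log_denom = np.log(np.sum(np.exp(sim[i, others])))
        # average negative log-probability of pulling x_i toward S(i)
        loss += -np.mean([sim[i, p] - log_denom for p in pos])
    return loss / n
```

In practice this loss would be computed on GPU with vectorized masks; the loop form here only mirrors the summation structure of the equation.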
Graph-Smoothed Label: After obtaining the embedding space, we construct a fully connected undirected embedding graph G using samples in D_IP = D_I ∪ D_P. Specifically, we first map each sample x ∈ D_IP into an embedding z, i.e., z = Φ(x), and then use all these embeddings as nodes of G. Every two nodes z_i and z_j in G are linked with an edge. Moreover, we also assign a prior label l_p(x) ∈ R^{k+1} to each sample x ∈ D_IP to represent its annotation, i.e., for an IND sample x ∈ D_I, l_p(x) is defined as the one-hot label corresponding to y, and for a pseudo OOD sample x ∈ D_P, l_p(x) is defined as the one-hot OOD label corresponding to I_{k+1}.
For each OOD sample x ∈ D_P, a graph-smoothed label l_g(x) is obtained by aggregating adjacent nodes on G. Specifically, to conform to the smoothness assumption, we try to minimize the following distance when determining l_g(x):

\min_{l_g(x)} \; \alpha \, d(l_g(x), l_p(x)) + (1 - \alpha) \sum_{x_j \in D_{IP}, x_j \neq x} w_j(x) \, d(l_g(x), l_p(x_j)), \quad w_j(x) = \frac{\exp(\Phi(x) \cdot \Phi(x_j) / \tau)}{\sum_{x_a \in D_{IP}, x_a \neq x} \exp(\Phi(x) \cdot \Phi(x_a) / \tau)}, (2)

where 0 ≤ α ≤ 1 is a scalar, d is a distance function, and τ > 0 is a scalar temperature. The second term in Eq. 2 enforces the smoothness assumption by encouraging l_g(x) to have similar labels to its nearby samples, whereas the first term tries to keep l_g(x) close to its original annotation l_p(x).
For simplicity, we implement d as the Euclidean distance here, and thus minimizing Eq. 2 yields:

l_g(x) = \alpha \, l_p(x) + (1 - \alpha) \sum_{x_j \in D_{IP}, x_j \neq x} w_j(x) \, l_p(x_j). (3)

Note that the result derived in Eq. 3 follows most previous graph-smoothing approaches in semi-supervised learning (Van Engelen and Hoos, 2020). To the best of our knowledge, we are the first to apply this scheme to OOD detection tasks.
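Assuming the edge weights are a temperature-scaled softmax over embedding similarities (our reading of the aggregation step; the exact weighting may differ), the graph-smoothed label for one pseudo OOD sample can be computed as:

```python
import numpy as np

def graph_smoothed_label(z_x, l_p_x, z_all, l_p_all, alpha=0.11, tau=0.1):
    """Sketch of the graph-smoothed label l_g(x): a convex combination of
    the prior label l_p(x) and a similarity-weighted average of the prior
    labels of the other nodes on the embedding graph.

    z_x:     (d,) embedding of x;  l_p_x: (k+1,) its one-hot prior label.
    z_all:   (n, d) embeddings of the other samples in D_IP.
    l_p_all: (n, k+1) their prior labels.
    """
    sim = z_all @ z_x / tau            # similarities to adjacent nodes
    w = np.exp(sim - sim.max())        # stabilized softmax numerator
    w = w / w.sum()                    # temperature-scaled edge weights
    neighbor_term = w @ l_p_all        # aggregate nearby nodes on G
    return alpha * l_p_x + (1.0 - alpha) * neighbor_term
```

Because the weights sum to one and every prior label is a distribution, l_g(x) remains a valid (k+1)-way distribution: OOD samples near an IND cluster automatically receive probability mass on that IND intent.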

Co-Training Framework
To further enforce the smoothness assumption, a co-training framework is introduced in ASoul to learn better soft labels using l_g(x). Specifically, we implement two classification heads g_1 and g_2 on top of the shared encoder f. Each classification head g_i maps the output of f to a (k+1)-dimensional distribution, i.e., g_i[f(x)] ∈ R^{k+1} (i = 1, 2), and a classification loss is optimized on IND samples:

L_{cls}^{IND} = \sum_{i=1,2} \mathbb{E}_{(x,y) \sim D_I} \, CE(l_p(x), g_i[f(x)]), (4)

in which CE measures the cross-entropy between two distributions. Besides optimizing L_{cls}^{IND}, a co-training process is implemented to refine l_g(x) for each x ∈ D_P. Specifically, a soft label l_s^1(x) (or l_s^2(x)) is produced by interpolating l_g(x) with the prediction of the classification head g_1 (or g_2), and the resulting soft label is used to optimize the other head g_2 (or g_1). Concretely, the following co-training loss is optimized:

l_s^i(x) = \beta \, l_g(x) + (1 - \beta) \, g_i[f(x)], \quad L_{co}^{OOD} = \mathbb{E}_{x \sim D_P} \left[ CE(l_s^1(x), g_2[f(x)]) + CE(l_s^2(x), g_1[f(x)]) \right], (5)

where 0 ≤ β ≤ 1 is a weight scalar. Different dropout masks are used in g_1 and g_2 to promote the diversity required by co-training. Note that, as indicated by Lee et al. (2013) and Chen et al. (2022), the co-training loss L_{co}^{OOD} favors low-density separation between classes, and thus it helps to enforce the low-density assumption when training g_i.
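A minimal sketch of the co-training objective for a single pseudo OOD sample, assuming the soft label is a convex interpolation with weight β as described above:

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    """CE(p, q): cross-entropy of prediction q against target p."""
    return -np.sum(p * np.log(q + eps))

def co_training_loss(pred1, pred2, l_g, beta=0.9):
    """Sketch of the co-training loss for one x in D_P.

    pred1, pred2: (k+1,) predicted distributions of heads g_1 and g_2.
    l_g:          (k+1,) graph-smoothed label of x.
    The exact interpolation form is our assumption.
    """
    l_s1 = beta * l_g + (1 - beta) * pred1  # soft label built from g_1 ...
    l_s2 = beta * l_g + (1 - beta) * pred2  # ... and from g_2
    # each head is supervised by the soft label built from its peer head
    return cross_entropy(l_s1, pred2) + cross_entropy(l_s2, pred1)
```

Since each head's target partly comes from its peer, disagreements between the two heads are gradually smoothed out, which is the mechanism that pushes decision boundaries toward low-density regions.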
The overall training loss for our method is:

L = L_{ctr} + L_{cls}^{IND} + L_{co}^{OOD}. (6)

OOD Detection
In the inference phase, we directly use the averaged prediction of g_1 and g_2 to implement the OOD detector g(y|x) ∈ R^{k+1}.
Moreover, an adaptive decision boundary (ADB) is learnt on top of g(y|x) to further reduce the open space risk (Zhou et al., 2022; Shu et al., 2021). Specifically, we follow the approach of Zhang et al. (2021) to obtain a central vector c_i and a decision boundary scalar b_i for each intent class I_i ∈ I ∪ {I_{k+1}}. In the testing phase, the label y for each input x is obtained as:

y = I_{k+1} if \hat{y} = I_{k+1} or \|f(x) - c_{\hat{y}}\| > b_{\hat{y}}, otherwise y = \hat{y}, where \hat{y} = \arg\max_{I_i} g(y = I_i | x). (7)

In this way, we can classify IND intents while rejecting the OOD intent.
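A sketch of the resulting decision rule, assuming distances to class centroids are compared against learned boundary radii (a simplified reading of the ADB idea of Zhang et al. (2021); the real method selects the class from the classifier prediction rather than the nearest centroid):

```python
import numpy as np

def detect(feature, centroids, boundaries, k):
    """Sketch of (k+1)-way OOD detection with adaptive decision boundaries.

    feature:    (d,) penultimate feature of input x.
    centroids:  (k+1, d) central vectors c_i.
    boundaries: (k+1,) boundary radii b_i.
    Returns the predicted class index; index k stands for the OOD
    intent I_{k+1}.
    """
    dists = np.linalg.norm(centroids - feature, axis=1)
    pred = int(np.argmin(dists))  # closest class centroid
    # reject as OOD when x is assigned to the OOD class or falls
    # outside the decision boundary of its predicted class
    if pred == k or dists[pred] > boundaries[pred]:
        return k
    return pred
```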

Implementation Details
Our encoder f is implemented using BERT (Devlin et al., 2018) with a mean-pooling layer. The projection head h and classification heads g_1 and g_2 are implemented as two-layer MLPs with the LeakyReLU activation (Xu et al., 2020b). The AdamW and Adam (Kingma and Ba, 2014) optimizers are used to fine-tune BERT and all the heads with learning rates of 1e-5 and 1e-4, respectively. We use τ = 0.1, α = 0.11, and β = 0.9 in all experiments. All results reported in our paper are averages of 10 runs with different random seeds. See Appendix B for more implementation details. Note that ASoul only introduces little computational overhead compared to the vanilla BERT model (see Appendix D), and we detail how to choose important hyperparameters for ASoul in Appendix E.
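For concreteness, the heads and the mean-pooling step might look as follows; the hidden sizes, initialization, and the NumPy stand-in for BERT's token-level outputs are illustrative assumptions.

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)

class TwoLayerHead:
    """Sketch of a projection/classification head: a two-layer MLP with
    LeakyReLU, as described in the implementation details. Dimensions and
    initialization are illustrative."""
    def __init__(self, d_in, d_hidden, d_out, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0, 0.02, size=(d_in, d_hidden))
        self.W2 = rng.normal(0, 0.02, size=(d_hidden, d_out))

    def __call__(self, x):
        return leaky_relu(x @ self.W1) @ self.W2

def mean_pool(token_states, mask):
    """Mean-pooling over the encoder's token states (the output of f).
    token_states: (seq, d);  mask: (seq,) with 1.0 for real tokens."""
    return (token_states * mask[:, None]).sum(0) / mask.sum()
```

In the full model, `mean_pool` would be applied to BERT's last hidden states, and two `TwoLayerHead` instances with independent dropout masks would serve as g_1 and g_2.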

Experiment Setups and Baselines
Following (Zhang et al., 2021; Zhan et al., 2021), we conduct experiments on three benchmark datasets: Banking, CLINC150, and StackOverflow. Moreover, we also applied the above pseudo OOD sample generation approaches with the previous SOTA method that uses one-hot labeled pseudo OOD samples (Shu et al., 2021). Specifically, a (k+1)-way classifier is trained by optimizing the cross-entropy loss on D_I ∪ D_P using one-hot labels, and the ADB approach presented in §4.4 is used to construct the OOD detector.
We also compared our method to other competitive OOD detection baselines: MSP (Hendrycks and Gimpel, 2017) utilizes the maximum Softmax probability of a k-way classifier to detect OOD inputs; DOC (Shu et al., 2017); ODIST (Shu et al., 2021) generates pseudo OOD samples using a pre-trained language model.
For fair comparisons, all baselines are implemented with the code released by their authors and use BERT as the backbone. For threshold-based baselines, 100 OOD samples are used in validation to determine the thresholds used for testing. See Appendix C for more details about the baselines.

Results
Table 2 shows the OOD detection performance associated with different pseudo OOD sample generation approaches. Specifically, results marked with "ASoul" measure the performance of our method, while results marked with "Onehot" correspond to the performance of the previous SOTA method (Shu et al., 2021) that uses one-hot labeled samples. We can observe that: 1. ASoul consistently outperforms its one-hot labeled counterpart by large margins. This validates our claim that ASoul can be used to improve the OOD detection performance with different pseudo OOD sample generation approaches; 2. "Hard" pseudo OOD samples yielded by FM lead to sub-optimal performance when assigned one-hot labels (i.e., FM+Onehot generally under-performs PD+Onehot), while FM achieves the best performance when combined with ASoul. This demonstrates that assigning one-hot labels to hard pseudo OOD samples introduces noise into the training process, and ASoul helps to alleviate this noise; 3. Although OOD samples yielded by the open-domain sampling approach are usually disjoint from the training task, they still benefit from ASoul. We suppose this is because the soft labels produced by ASoul prevent the OOD detector from becoming over-confident, which is important for improving the OOD detection performance.
Table 3 shows the performance of all baselines and our best method FM+ASoul. It can be seen that FM+ASoul significantly outperforms all baselines and achieves SOTA results on all three datasets. This validates the effectiveness of ASoul in improving the OOD detection performance. We can also observe large improvements from ASoul when the labeled IND datasets are small (i.e., in the 25% and 50% settings). This demonstrates the potential of ASoul in practical scenarios, particularly in the early phases of development, when we usually need to handle a large number of OOD inputs with limited IND intents (Zhan et al., 2021).

Ablation Study
Ablation studies were performed to verify the effect of each component in ASoul. We tested the following variants: 1. ASoul-CT removes the co-training framework, i.e., only one classification head g_1 is implemented without the co-training process. In this variant, the loss shown in Eq. 6 is optimized by removing g_2 and setting β = 1 in Eq. 5; 2. ASoul-GS removes the graph-smoothed labels, i.e., the embedding graph is not constructed. In this variant, the losses shown in Eq. 4 and 5 are optimized, and l_g(x) in Eq. 5 is replaced with the one-hot prior label l_p(x); 3. USoul employs uniformly distributed soft labels for samples in D_P. In this variant, the soft label l_s^i(x) in Eq. 5 is obtained by uniformly reallocating a small portion of the OOD probability to IND intents; 4. KnowD implements a knowledge distillation process to obtain soft labels, i.e., a k-way IND intent classifier is first trained on D_I, and its predictions are interpolated with the one-hot OOD label to obtain the soft label l_s^i(x) in Eq. 5. All the above variants are tested with two approaches to produce D_P: PD and FM. Results in Table 4 indicate that our method outperforms all ablation models. We can further observe that: 1. Soft labels obtained using the alternative approaches degrade the model performance by a large margin. This shows the effectiveness of the soft labels produced by ASoul; 2. Graph-smoothed labels bring the largest improvement compared to other components. This further proves the importance of modeling semantic connections between OOD samples and IND intents.

Feature Visualization
To further demonstrate the effectiveness of ASoul, we visualized the features learnt in the penultimate layer of OOD detectors trained using one-hot labels or soft labels. We use the best-performing pseudo OOD sample generation approach (i.e., FM) in this analysis. Results shown in Figure 4 demonstrate that the soft labels produced by ASoul help the OOD detector learn better representations compared to one-hot labels: the learnt feature space is smoother, and the representations for IND and OOD samples are more separable. This validates our claim that ASoul helps to conform to the smoothness assumption and improves the OOD detection performance.

Conclusion
In this paper, we first analyze the limitation of existing OOD detection approaches that use one-hot labeled pseudo OOD samples. Then we propose a method, ASoul, that can estimate soft labels for given pseudo OOD samples and use these soft labels to train better OOD detectors. An embedding graph is constructed to produce graph-smoothed labels that capture the semantic connections between OOD samples and IND intents. A co-training framework further refines these graph-smoothed labels. Experiments demonstrate that our method can be combined with different pseudo OOD sample generation approaches, and it helps achieve SOTA results on three benchmark datasets. In the future, we plan to apply our method to other tasks, such as Text-to-SQL parsing (Hui et al., 2021; Wang et al., 2022; Qin et al., 2022) or lifelong learning (Dai et al., 2022).

Limitations
We identify the major limitation of this work as its input modality. Specifically, our method is limited to textual inputs and ignores inputs in other modalities such as vision, audio, or robotic features. These modalities provide valuable information that can be used to build better OOD detectors. Fortunately, with the help of multi-modal pre-training models (Radford et al., 2021; Zheng et al., 2022), we can obtain robust features well aligned across different modalities. In future work, we will try to model multi-modal contexts for OOD detection and explore better pseudo OOD sample generation approaches. Another limitation of this work is the pre-trained model used in our experiments: a model pre-trained on dialogue corpora is expected to yield better performance (He et al., 2022c,a,b; Zhou et al., 2021b; Wang et al., 2020; Zheng et al., 2020b). Moreover, it is reported that better OOD detection performance can be obtained if we can extract more robust features for IND tasks (Vaze et al., 2021). Our method can be readily applied to other feature extractors that perform better on dialogues.

Ethics Statement
This work does not present any direct ethical issues. In the proposed work, we seek to develop a general method for OOD intent detection, and we believe this study leads to intellectual merits that benefit the reliable application of NLU models. All experiments are conducted on open datasets.

A small τ makes the weight distribution over the embedding graph sharper (concentrating on nearest neighbors), while a large τ forms smoother distributions.
Results are shown in Figure 7 (left). As the temperature increases, the OOD detection performance tends to decrease; τ = 0.1 achieves the highest F1-ALL score of 84.11%. This suggests that a small temperature makes ASoul focus more on nearby neighbors and thus gain better performance. Dropout Rate: We compare the performance under different dropout rates applied to the classification heads, varying the rate from 0 to 0.7 with an interval of 0.1.
Results are shown in Figure 7 (right). The performance first increases and then decreases as the dropout rate increases. In the beginning, a higher dropout rate introduces more of the diversity required by co-training, and thus the OOD detection performance improves. However, an overly high dropout rate introduces too much noise into the co-training process and thus degrades the OOD detection performance.

F More Evaluation Metrics
We also calculate micro F1-scores over all intents (IND and OOD) for our best-performing method, FM+ASoul, and one of our strongest baselines, Outlier, on the CLINC150 dataset. As shown in Table 9, FM+ASoul still outperforms the baseline on the micro F1-score.

Figure 1 :
Figure 1: A pseudo OOD sample generated by distorting IND inputs (see more examples in Appendix A). Compared to the one-hot OOD label, the soft label produced by ASoul is more suitable for this pseudo OOD sample since it carries some IND intents.
Figure 2: t-SNE visualization of pseudo OOD samples generated by feature mixup (Zhan et al., 2021) on the Banking dataset under the 25% setting.It can be seen that some pseudo OOD samples coincide with IND samples.See more analyses in Appendix A.

Figure 3 :
Figure 3: An overview of ASoul. Specifically, an embedding space is obtained using an encoder f and a projection head h by optimizing a supervised contrastive loss L_ctr on labeled IND data. A graph-smoothed label l_g(x) conforming to the smoothness assumption is constructed. l_g(x) is further used in a co-training framework, in which two classification heads g_1 and g_2 are maintained. The prediction of one head is interpolated with l_g(x) to enhance the other head.
IND features; 3. Latent Generation (LG): follows Zheng et al. (2020a) to decode pseudo OOD samples from a latent space; 4. Open-domain Sampling (OS): follows Zhan et al. (2021) to use sentences from other corpora as OOD samples. Each approach mentioned above is associated with one of the four categories listed in §2.
employs k 1-vs-rest Sigmoid classifiers and uses the maximum predictions to detect OOD intents; OpenMax (Bendale and Boult, 2016) fits a Weibull distribution to the logits and re-calibrates the confidences with an OpenMax layer; LMCL (Lin and Xu, 2019) introduces a large margin cosine loss to maximize the decision margin and uses LOF as the OOD detector; ADB (Zhang et al., 2021) learns adaptive decision boundaries for OOD detection; Outlier (Zhan et al., 2021) mixes convex interpolated outliers and open-domain outliers to train a (k+1)-way classifier; SCL (Zeng et al., 2021a) uses a supervised contrastive learning loss to separate IND and OOD features; GOT (Ouyang et al., 2021) shapes an energy gap between IND and OOD samples.

Figure 4
Figure 4: t-SNE visualization of learnt features on the test set of CLINC150 under the 25% setting.

Figure 7 :
Figure 7: Effect of τ in graph-based smoothing (left) and dropout rate in co-training (right) with FM+ASoul on CLINC150 under the 25% setting.
D_I = {(x_i, y_i)} only contains IND samples, i.e., x_i is an input, and y_i ∈ I is the label of x_i.

Table 2 :
Performance of ASoul when combined with different OOD sample generation approaches. Best results in each setting are bolded. The best-performing ASoul-based method significantly outperforms other baselines with p-value < 0.05 (t-test) in each setting.

Table 3 :
Performance of ASoul and baselines.Best results among each setting are bolded.All improvements of our method over baselines are significant with p-value < 0.05 (t-test).

Table 5 :
Case study of generated OOD samples with ODIST on the CLINC150 dataset.

Table 6 :
Case study of generated OOD samples with ODIST on the StackOverflow dataset.

Table 9 :
Performances of Outlier and FM+ASoul on the CLINC150 dataset under the metric of micro F1-score over all intents (IND and OOD intents).