Novel Slot Detection: A Benchmark for Discovering Unknown Slot Types in the Task-Oriented Dialogue System

Existing slot filling models can only recognize pre-defined in-domain slot types from a limited slot set. In practical applications, a reliable dialogue system should know what it does not know. In this paper, we introduce a new task, Novel Slot Detection (NSD), in the task-oriented dialogue system. NSD aims to discover unknown or out-of-domain slot types to strengthen the capability of a dialogue system based on in-domain training data. Besides, we construct two public NSD datasets, propose several strong NSD baselines, and establish a benchmark for future work. Finally, we conduct exhaustive experiments and qualitative analysis to comprehend key challenges and provide new guidance for future directions.


Introduction
Slot filling plays a vital role in understanding user queries in personal assistants such as Amazon Alexa, Apple Siri, and Google Assistant. It aims at identifying a sequence of tokens and extracting semantic constituents from the user query. Given a large-scale pre-collected training corpus, existing neural models (Mesnil et al., 2015; Liu and Lane, 2015, 2016; Goo et al., 2018; Haihong et al., 2019; He et al., 2020b,d; Yan et al., 2020; Louvan and Magnini, 2020) have been actively applied to slot filling and achieved promising results.
Existing slot filling models can only recognize pre-defined entity types from a limited slot set, which is insufficient in practical application scenarios. A reliable slot filling model should not only predict the pre-defined slots but also detect potential unknown slot types to know what it does not know, which we call Novel Slot Detection (NSD) in this paper. NSD is particularly crucial in deployed systems, both to avoid performing the wrong action and to discover potential new entity types for future development and improvement. We display an example in Fig 1.

Figure 1: An example of Novel Slot Detection in the task-oriented dialogue system. Without NSD, the dialogue system gives the wrong response since it misunderstands the unknown slot "is this my world" as the in-domain playlist type. In contrast, NSD recognizes "is this my world" as NS and the system gives a fallback response. Meanwhile, with human-in-the-loop annotation, the system can increase its functions or skills.

In this paper, we define a Novel Slot (NS) as a slot type that is not included in the pre-defined slot set. NSD aims to discover potential new or out-of-domain entity types to strengthen the capability of a dialogue system based on in-domain pre-collected training data. There are two lines of previous work related to NSD: out-of-vocabulary (OOV) recognition (Liang et al., 2017a; Zhao and Feng, 2018; Hu et al., 2019; He et al., 2020c,d; Yan et al., 2020; He et al., 2020e) and out-of-domain (OOD) intent detection (Lin and Xu, 2019; Larson et al., 2019; Xu et al., 2020a; Zeng et al., 2021b,a).

Table 1: Comparison between slot filling and novel slot detection. In the novel slot detection labels, we consider "album" as an unknown slot type that is out of the scope of the pre-defined slot set. Meanwhile, "artist", belonging to the in-domain slot types, still needs to be recognized as in the original slot filling task.

OOV recognition aims to recognize unseen slot values in the training set for pre-defined slot types, using character embeddings (Liang et al., 2017a), copy mechanisms (Zhao and Feng, 2018), few/zero-shot learning (Hu et al., 2019; He et al., 2020e; Shah et al., 2019), transfer learning (Chen and Moschitti, 2019; He et al., 2020c,b), and background knowledge (Yang and Mitchell, 2017; He et al., 2020d), etc. Compared to OOV recognition, our proposed novel slot detection task focuses on detecting unknown slot types, not just unseen values. NSD faces the challenges of both OOV words and insufficient context semantics (see analysis in Section 6.2), which greatly increases the complexity of the task. Another line of related work is OOD intent detection (Hendrycks and Gimpel, 2017; Lee et al., 2018; Lin and Xu, 2019; Ren et al., 2019; Zheng et al., 2020; Xu et al., 2020a), which aims to know when a query falls outside the range of pre-defined supported intents. The main difference is that NSD detects unknown slot types at the token level while OOD intent detection identifies out-of-domain intent queries. NSD requires a deep understanding of the query context and is prone to the label bias of O (see analysis in Section 5.3.1), making it challenging to identify unknown slot types in the task-oriented dialog system.
In this paper, we first introduce a new and important task, Novel Slot Detection (NSD), in the task-oriented dialogue system (Section 2.2). NSD plays a vital role in avoiding wrong actions and discovering potential new entity types for the future development of dialogue systems. Then, we construct two public NSD datasets, Snips-NSD and ATIS-NSD, based on the original slot filling datasets Snips (Coucke et al., 2018) and ATIS (Hemphill et al., 1990) (Section 2.2). From the perspective of practical application, we consider three dataset construction strategies: Replace, Mask, and Remove. Replace labels the novel slot values with all O in the training set. Mask labels them with all O and additionally masks the novel slot values. Remove is the strictest strategy, where all queries containing novel slots are removed. We dive into the details of the three construction strategies in Section 3.2 and perform a qualitative analysis in Section 5.3.1. Besides, we propose two evaluation metrics, span-level F1 and token-level F1, in Section 3.4, following the slot filling task. Span F1 considers the exact matching of a novel slot span, while Token F1 focuses on prediction accuracy on each word of a novel slot span. We compare the two metrics and propose a new metric, restriction-oriented span evaluation (ROSE), to combine the advantages of both in Section 5.3.3. Then, we establish a fair benchmark and propose extensive strong baselines for NSD in Section 4. Finally, we perform exhaustive experiments and qualitative analysis to shed light on the challenges current approaches face on NSD in Sections 5.3 and 6.
Our contributions are three-fold: (1) We introduce a Novel Slot Detection (NSD) task in the task-oriented dialogue system. NSD helps avoid wrong actions and discovers potential new entity types for increasing the functions of dialogue systems. (2) We construct two public NSD datasets and establish a benchmark for future work. (3) We conduct exhaustive experiments and qualitative analysis to comprehend key challenges and provide new guidance for future NSD work.

Slot Filling
Given a sentence X = {x_1, ..., x_n} with n tokens, the slot filling task is to predict a corresponding tag sequence Y = {y_1, ..., y_n} in BIO format, where each y_i can take three types of values: B-slot_type, I-slot_type, and O. Here, "B" and "I" stand for the beginning and intermediate word of a slot, and "O" means the word does not belong to any slot. Slot filling assumes y_i ∈ y, where y denotes a pre-defined slot set of size M. Current approaches typically model slot filling as a sequence labeling problem using RNNs (Liu and Lane, 2015, 2016; Goo et al., 2018) or pre-trained language models.
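As a concrete illustration of the BIO scheme, the following sketch (a generic helper of our own, not code from the paper) recovers typed spans from a tag sequence:

```python
def bio_to_spans(tags):
    """Extract (slot_type, start, end) spans from a BIO tag sequence.

    `end` is exclusive. A "B-x" tag opens a span, "I-x" continues it,
    and "O" (or a mismatched "I-" tag) closes the current span.
    """
    spans, start, cur = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            if cur is not None:
                spans.append((cur, start, i))
            cur, start = tag[2:], i
        elif tag.startswith("I-") and cur == tag[2:]:
            continue
        else:  # "O" or an I- tag that does not match the open span
            if cur is not None:
                spans.append((cur, start, i))
            cur, start = None, None
    if cur is not None:
        spans.append((cur, start, len(tags)))
    return spans

# "play is this my world by sizzla"
tags = ["O", "B-album", "I-album", "I-album", "I-album", "O", "B-artist"]
print(bio_to_spans(tags))  # [('album', 1, 5), ('artist', 6, 7)]
```

Slot filling models are trained to emit such tag sequences; span extraction like this is what the span-level evaluation in Section 3.4 operates on.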

Novel Slot Detection
We refer to the above training data D as in-domain (IND) data. Novel slot detection aims to identify unknown or out-of-domain (OOD) slot types via IND data while correctly labeling in-domain data.

Table 2: Comparison between three processing strategies in the training set. We consider "album" as an unknown slot type and "-" denotes the sentence is removed from the training data.
We denote the unknown slot type as NS and in-domain slot types as IND in the following sections. Note that we do not distinguish between B-NS and I-NS and unify them as NS, because we empirically find that existing models can hardly discriminate B and I for an unknown slot type. We provide a detailed analysis in Section 5.3.3. We show an example of NSD in Table 1. The challenges of recognizing NS come from two aspects: O tags and in-domain slots. On the one hand, models need to learn entity information to distinguish NS from O tags. On the other hand, they need to discriminate NS from the other slot types in the pre-defined slot set. We provide a detailed error analysis in Section 6.1.

Dataset
Since there are no existing NSD datasets, we construct two new datasets based on two widely used slot filling datasets, Snips (Coucke et al., 2018) and ATIS (Hemphill et al., 1990). We first briefly introduce Snips and ATIS, then elaborate on data construction and processing in detail, and display the statistics of our NSD datasets, Snips-NSD and ATIS-NSD. Finally, we define two evaluation metrics for the NSD task, Span F1 and Token F1.

Data Construction and Processing
For the Snips and ATIS datasets, we keep some slot classes in training as unknown and integrate them back during testing, following (Fei and Liu, 2016; Shu et al., 2017; Lin and Xu, 2019). We randomly select part of the slot types in Snips and ATIS as unknown slots (5%, 15%, and 30% in this paper). Note that the original train/val/test split is fixed.
Considering class imbalance, we perform weighted sampling where the probability of choosing a class is proportional to its number of examples, similar to (Lin and Xu, 2019). To avoid randomness in the experiment results, we report the average result over 10 runs. After we choose the unknown slot types, a critical problem is how to handle sentences containing these unknown slot types in the training set. For OOD intent detection, we just need to remove such sentences from the training and validation sets. However, for Novel Slot Detection, a sentence may contain both in-domain slots and unknown slots, which makes tackling unknown slots at the token level nontrivial. We need to balance the performance of recognizing unknown slots and in-domain slots. Therefore, we propose three different processing strategies: (1) Replace: label the novel slot values with all O in the training set; (2) Mask: label them with all O and mask the novel slot values; (3) Remove: remove all queries containing novel slots. We display examples of the three strategies in Table 2. For the val and test sets, we label the unknown slot values with all NS while keeping the in-domain labeling fixed. Note that NS tags only exist in the val and test sets, not in the training set. Besides, we keep the original in-domain slots fixed to evaluate the performance on both NS and in-domain slots. We aim to simulate the practical scenario where we can hardly know what the unknown slots are. All three strategies have practical significance, but Remove is the most suitable for real-world scenarios: dialog systems are first trained on a dataset labeled by human annotators and then deployed, and novel slot types appear gradually during interaction with real users. Therefore, we consider that the training set does not contain sentences with potential novel slots; in other words, Remove is the most suitable strategy for NSD in real applications. What's more, Section 5.3.1 demonstrates that Remove performs best while the others suffer from severe model bias towards O tags.
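A minimal sketch of the three strategies, under our reading of Table 2 (the function name and signature are ours, not the released dataset-construction code):

```python
def apply_strategy(tokens, tags, unknown, strategy, mask_token="[MASK]"):
    """Apply a training-set processing strategy for chosen unknown slot types.

    Replace: relabel unknown-slot tokens as "O" but keep the words.
    Mask:    relabel as "O" and replace the words with a mask token.
    Remove:  drop the whole sentence if it contains any unknown slot.
    Returns (tokens, tags), or None if the sentence is removed.
    """
    is_unknown = [t != "O" and t.split("-", 1)[1] in unknown for t in tags]
    if strategy == "Remove":
        return None if any(is_unknown) else (tokens, tags)
    new_tokens = [mask_token if u and strategy == "Mask" else w
                  for w, u in zip(tokens, is_unknown)]
    new_tags = ["O" if u else t for t, u in zip(tags, is_unknown)]
    return new_tokens, new_tags

toks = ["play", "we", "are", "young", "by", "fun"]
tags = ["O", "B-album", "I-album", "I-album", "O", "B-artist"]
print(apply_strategy(toks, tags, {"album"}, "Replace"))
print(apply_strategy(toks, tags, {"album"}, "Mask"))
print(apply_strategy(toks, tags, {"album"}, "Remove"))  # None
```

Note how Mask keeps the sentence's context while destroying the novel slot's surface form, whereas Remove discards both, which is what the analysis in Section 5.3.1 contrasts.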
Therefore, we adopt Remove as the main strategy in this paper. Table 4 shows the detailed statistics of Snips-NSD-15% constructed by the Remove strategy, where we choose 15% of the classes in the training data as unknown slots [4]. Combining Table 3 and Table 4, we find that the Remove strategy removes 28.70% of the queries in the original Snips training set, hence increasing the percentage of OOV words from 5.95% to 8.51%. Unknown slot values account for 12.29% of total slot values in the test set.

Metrics
The traditional slot filling task uses Span F1 [5] for evaluation. Span F1 considers the exact span matching of an unknown slot span. However, we find in Section 5.3.3 that this metric is too strict for NSD models. In practical applications, we only need to coarsely mine part of the words of unknown slots, and then send the queries containing potential unknown slot tokens to human annotators, which effectively reduces labor and improves efficiency. Therefore, we define a more reasonable metric, Token F1, which focuses on word-level matching of a novel slot span. We also propose a new metric, Restriction-Oriented Span Evaluation (ROSE), for a fair comparison in Section 5.3.3.

[4] Since different proportions of unknown slots have different statistics, here we only display the results of Snips-NSD-15% for brevity.
[5] https://www.clips.uantwerpen.be/conll2000/chunking/conlleval.txt
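The difference between the two metrics can be sketched as follows (a simplified illustration for the NS label; the official evaluation uses the conlleval script):

```python
def prf(tp, n_pred, n_gold):
    """Precision/recall/F1 from counts (0.0 when undefined)."""
    p = tp / n_pred if n_pred else 0.0
    r = tp / n_gold if n_gold else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def token_f1(gold_tags, pred_tags, label="NS"):
    """Word-level F1 for one label: every token is scored independently."""
    tp = sum(g == p == label for g, p in zip(gold_tags, pred_tags))
    return prf(tp, pred_tags.count(label), gold_tags.count(label))

def span_f1(gold_spans, pred_spans):
    """Exact-match F1 over (label, start, end) spans, as in conlleval."""
    tp = len(set(gold_spans) & set(pred_spans))
    return prf(tp, len(pred_spans), len(gold_spans))

gold = ["O", "NS", "NS", "NS", "O"]
pred = ["O", "NS", "NS", "O", "O"]              # 2 of 3 NS tokens found
print(token_f1(gold, pred))                      # 0.8
print(span_f1([("NS", 1, 4)], [("NS", 1, 3)]))   # 0.0: boundary missed
```

The example shows why Span F1 is much lower: a single missed boundary token zeroes the span score while Token F1 still rewards the partial hit.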

Methodology
In this section, we introduce the NSD models proposed in this paper and illustrate the differences between the various parallel approaches during the training and test stage.

Overall Framework
The overall structure of the model is shown in Fig 2. In the training stage, we train either a multi-class classifier or a binary classifier using different training objectives. We use the public BERT-large (Devlin et al., 2019) embedding layer and a BiLSTM-CRF (Huang et al., 2015) for token-level feature extraction. In the test stage, we use the typical neural multi-class classifier to predict the in-domain slot labels. Meanwhile, we use a detection algorithm, MSP or GDA, to identify novel slot tokens. Finally, we override the labels of the slot tokens that are detected as NS. In terms of training objectives, detection algorithms, and distance strategies, we compare different variants as follows.
Training objective. For in-domain slots, we propose two training objectives. Multiple classifier refers to the traditional slot filling objective, which performs token-level multi-class classification over the BIO tags (Ratinov and Roth, 2009). Binary classifier performs token-level binary classification that only distinguishes whether a token belongs to a slot entity or to O.
Detection algorithm. We apply MSP (Hendrycks and Gimpel, 2017) or GDA as the detection algorithm to identify novel slot tokens based on the in-domain model's outputs.
Distance strategy. GDA detection is based on the distances between a target and each slot representation cluster. In the original GDA, when the minimum distance is greater than a certain threshold, the target is predicted to be a novel slot. We propose a novel strategy named Difference, which uses the maximum distance minus the minimum distance: when the difference value of a target is less than a threshold, it is predicted as a novel slot. Both thresholds are obtained by optimizing the NSD metrics on the validation set.
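A rough sketch of the detection step, assuming per-token softmax probabilities for MSP and per-class Mahalanobis distances for GDA (the variable names and the exact distance form are our assumptions, not the authors' code):

```python
import numpy as np

def msp_detect(probs, threshold):
    """MSP: flag a token as NS when its maximum softmax probability is low."""
    return probs.max(axis=-1) < threshold

def gda_detect(feats, class_means, cov_inv, threshold, strategy="Minimum"):
    """Distance-based detection over per-class Mahalanobis distances.

    Minimum:    NS if the distance to the nearest class exceeds `threshold`.
    Difference: NS if (max distance - min distance) is below `threshold`,
                i.e. the token is not clearly closer to any one class.
    """
    # dists[i, c] = Mahalanobis distance of token i to the centroid of class c
    diffs = feats[:, None, :] - class_means[None, :, :]
    dists = np.sqrt(np.einsum("ncd,de,nce->nc", diffs, cov_inv, diffs))
    if strategy == "Minimum":
        return dists.min(axis=-1) > threshold
    return dists.max(axis=-1) - dists.min(axis=-1) < threshold

feats = np.array([[0.1, 0.0], [2.0, 0.0]])   # two token features
means = np.array([[0.0, 0.0], [4.0, 0.0]])   # two class centroids
# second token is far from every class -> flagged as NS
print(gda_detect(feats, means, np.eye(2), 1.0, "Minimum"))
```

With an identity covariance this reduces to Euclidean distance; the Difference strategy instead flags tokens that sit ambiguously between clusters.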

Implementation Details
We use the public pre-trained BERT-large-uncased model to embed tokens, which has 24 layers, 1024 hidden states, 16 heads, and 336M parameters. The hidden size of the BiLSTM layer is set to 128. Adam is used for optimization with an initial learning rate of 2e-5. The dropout rate is fixed at 0.5, and the batch size is 64. We train the model only on in-domain labeled data. Training uses early stopping with a patience of 10. We use the best F1 scores on the validation set to calculate the MSP and GDA thresholds adaptively. Each experiment is run 10 times under the same setting, and we report the average value. The training stage of our model lasts about 28 minutes on a single Tesla T4 GPU (16 GB of memory).
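The adaptive threshold selection mentioned above can be sketched as a sweep over candidate thresholds on validation data (a hypothetical helper; the paper does not spell out the exact search procedure):

```python
def choose_threshold(scores, is_ns, candidates):
    """Pick the detection threshold that maximizes NS F1 on validation data.

    `scores`: per-token novelty scores (higher = more likely NS).
    `is_ns`:  gold booleans marking which validation tokens are NS.
    """
    def f1_at(t):
        pred = [s >= t for s in scores]
        tp = sum(p and g for p, g in zip(pred, is_ns))
        prec = tp / sum(pred) if sum(pred) else 0.0
        rec = tp / sum(is_ns) if sum(is_ns) else 0.0
        return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return max(candidates, key=f1_at)

scores = [0.1, 0.2, 0.8, 0.9, 0.3]
gold = [False, False, True, True, False]
print(choose_threshold(scores, gold, [0.15, 0.5, 0.85]))  # 0.5
```

The same routine applies to both MSP (score = 1 - max softmax probability) and GDA (score = minimum or difference distance), since each reduces novelty detection to thresholding a scalar per token.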

Main Results
Tables 5 and 6 show the experiment results of seven different models on the two benchmark datasets Snips-NSD and ATIS-NSD constructed by the Remove strategy. We report both NSD and IND results using Span F1 and Token F1. We compare these models from three perspectives: detection method, objective, and distance strategy. The analysis of the effect of the proportion of unknown slot types is described in Section 5.3.2.

Detection Method: MSP vs GDA. Under the same objective setting, GDA performs better than MSP on both IND and NSD, especially on NSD. We argue that GDA models the posterior distribution on the representation space of the feature extractor and avoids the issue of overconfident predictions (Guo et al., 2017; Liang et al., 2017b, 2018). Besides, comparing Snips-NSD and ATIS-NSD, NSD Token F1 scores on ATIS-NSD are much higher than on Snips-NSD, but there is no significant difference in NSD Span F1 scores. The reason is that Snips-NSD has a higher average entity length (1.83) than ATIS-NSD (1.29), making it harder to detect the exact NS span.

Objective: Binary vs Multiple. Under all settings, Multiple outperforms Binary by a large margin on both datasets in both IND and NSD metrics. For MSP, combining Multiple and Binary gets higher F1 scores. Specifically, the Binary classifier is used to calculate the confidence of a token belonging to a non-O type, which can judge whether the token belongs to an entity and distinguish NS from type O. On the other hand, we use the Multiple classifier to calculate the confidence of tokens being of type NS, to distinguish NS from all pre-defined non-O slot types. For GDA, we do not combine Multiple and Binary because of poor performance; Multiple alone achieves the best results on all the IND and NSD F1 scores. We suppose multi-class classification can better capture semantic features than binary classification.

Distance Strategy: Minimum vs Difference.
We find that under the Binary setting, the Difference strategy outperforms Minimum on both datasets on the NSD metrics. But under the Multiple setting, there is no consistent superiority between the two distance strategies. For example, Difference outperforms Minimum on the NSD metrics on ATIS-NSD, opposite to the results on Snips-NSD. We argue that the behavior of different distance strategies is closely related to the objective setting and dataset complexity. We leave the theoretical analysis to future work.

Table 7 displays the IND and NSD metrics of the three different dataset processing strategies on Snips-NSD using the same model, GDA+Multiple+Minimum.

Effect of Different Data Processing Strategies
In this section, we dive into the analysis of the effects of the different data processing strategies. The results show that the Replace strategy gets poor performance on NSD, which proves that labeling unknown slots as O tags severely misleads the model. The Mask and Remove strategies are more reasonable since they remove unknown slots from the training data. Their main difference is that Mask only deletes token-level information, while Remove also eliminates the contextual information. For NSD on all datasets, Remove gains significantly better performance than Mask: by 9.06% (5%), 7.83% (15%), and 4.56% (30%) on Token F1, and by 8.57% (5%), 7.12% (15%), and 6.5% (30%) on Span F1. We argue that the remaining context is still misleading even though the novel slot tokens are not directly trained on in the Mask strategy. Besides, Mask does not conform to the real NSD scenario. Generally, Remove is the most suitable strategy for NSD in real applications and achieves the best performance. Fig 3 displays

New Metric: ROSE
The previous results show that Span F1 is much lower than Token F1. The reason is that Span F1 is a strict metric: the model needs to correctly predict all NS tokens and the exact boundary. This is difficult for NSD models due to the lack of supervised information. In fact, NSD models only need to mark some tokens in the span of a novel slot and send the whole sequence containing the NS tokens back to humans; a small number of token omissions or misjudgments is acceptable. Therefore, to meet a reasonable NSD scenario, we propose a new metric, restriction-oriented span evaluation (ROSE), to evaluate span prediction performance under different restrictions. First, we do not punish the situation where token predictions exceed the span. Then, we consider a span correct when the number of correctly predicted tokens is greater than a settable proportion p of the span length. We take the average of the ROSE score and the original Span F1 to avoid the model obtaining an outstanding result through over-long predictions. The results on Snips with 15% novel slots are shown in Figure 4. As the degree of restriction increases, the metrics tend to decline. This indicates that the model can mostly identify more than half of the tokens in spans. To make a comprehensive evaluation, we define ROSE-mean, namely the mean of ROSE-25%, ROSE-50%, ROSE-75%, and ROSE-100%. We present results for part of the proposed models in Table 8.
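Under our reading, the core of ROSE at restriction p can be sketched as span-hit counting (a simplified, hypothetical version that omits the averaging with Span F1 described above):

```python
def rose_recall(gold_spans, pred_ns, p):
    """ROSE at restriction p: a gold NS span counts as a hit when at least
    a proportion p of its tokens are predicted as NS. Predictions outside
    gold spans are not penalized here.

    `gold_spans`: list of (start, end) gold NS spans, end exclusive.
    `pred_ns`:    per-token 0/1 predictions for the NS label.
    """
    hits = 0
    for start, end in gold_spans:
        covered = sum(pred_ns[i] for i in range(start, end))
        if covered >= p * (end - start):
            hits += 1
    return hits / len(gold_spans) if gold_spans else 0.0

# Two gold NS spans of length 3 over a 10-token sentence.
gold = [(1, 4), (6, 9)]
pred = [0, 1, 1, 0, 0, 0, 1, 0, 0, 1]  # covers 2/3 of span 1, 1/3 of span 2
print(rose_recall(gold, pred, 0.50))   # 0.5: only the first span passes
print(rose_recall(gold, pred, 0.25))   # 1.0: both pass at the looser level
```

Sweeping p over {0.25, 0.5, 0.75, 1.0} and averaging would give a ROSE-mean-style summary; at p = 1.0 the criterion coincides with full token coverage of the span.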

Analysis of Single Unknown Slot
To analyze the relationship between NSD performance and a single specific slot, we calculate the token and span metrics treating each single slot type as the unknown slot, and show the top five and bottom five results by Token F1 score in Table 9. We find that the slots with better performance often account for a larger percentage of the dataset, such as Object name or Entity name. They also tend to have a larger value space, such as TimeRange, Music item, or Artist. These characteristics allow the semantic representations of these slots to be distributed over a large area rather than clustered tightly together. We consider this distribution more reasonable because, in a real application scenario, novel slots are diverse and their distribution tends to be diffuse. Performance on these types also shows that the NSD models we propose can generalize to a reasonable data setting.

Analysis for Relationship of Multiple Unknown Slots
In order to explore the effect of inter-slot relationships on NSD, we conduct experiments in which two types are mixed as novel slots. Some of the results are shown in Table 10.

Discussion
In this section, we empirically divide all the error samples into three categories. Each type of problem contains two aspects, corresponding to NSD precision and recall, respectively. We present the relative proportions of the error types in Table 11, using the Snips dataset with 5% novel slots and the GDA+Multiple+Minimum model. For each error type, we present an example in Table 12 to describe its characteristics and analyze the causes. Then, we dive into identifying the key challenges and finally propose possible solutions for future work.

Error Analysis
Tag O. Tag O is the largest and most widely distributed type in the dataset, and it generally covers independent function tokens. Therefore, it is easy to confuse with other types during identification, and the confusion is more serious for novel slots, which lack supervised learning. We observe that tokens with the O label detected as novel slots usually occur near spans, and the function words inside a span labeled as a novel slot have a probability of being predicted as O. We consider this kind of problem to be related to the context.
In-domain slots. Confusion with in-domain slots is the most common type of error, and similar slots account for a large part of it. Due to vocabulary overlap or shared similar contexts, the model often tends to be overconfident and predicts similar slot labels. We analyze this phenomenon in Table 10: when similar types are treated as novel slots at the same time, NSD performance rises significantly. We employ a generative classification method, GDA, compared with the traditional MSP method, to make full use of data features and alleviate the problem.

Challenges
Based on the above analysis, we summarize the current challenges faced by the NSD task:
Function tokens. Articles, prepositions, and other words that act as connectives in a sequence. They are usually labeled with type O, but are also found in some long-span slots, such as Movie name. This can lead to confusion between O and novel slots when such a slot is the target of NSD.
Insufficient context. Correct slot detection often depends on the context, and this supervised information is missing for novel slots. Models can only conduct NSD on tokens using the original embeddings or representations trained in other contexts, which can lead to bias in the semantic modeling of the novel slot.
Dependencies between slots. There are some semantic overlaps or inclusion relationships in the slot definitions of the current benchmark slot filling datasets. As a result, the semantic features are not sufficiently discriminative, and some outlier tokens of in-domain slots are easily confused with novel slots.
Open vocabulary slots. Open vocabulary slots are a special kind of slot: their definition is usually broad and can be further subdivided, and their value range is wide. The representation distribution of open vocabulary slots tends to be diffuse and uneven, which can mislead NSD.

Future Directions
For tag O, a possible solution is to use a binary model to assist in distinguishing O from non-O function tokens; we provide a simple method in this paper and leave further optimization to future work. Then, to decouple the dependencies between slots, it is critical to learn more discriminative features for in-domain data; contrastive learning or prototypical networks are expected to help. Besides, in the traditional slot filling task, the open vocabulary slot problem has been studied for a long time and has accumulated many achievements. Adaptively combining and improving relevant methods for the NSD task is also an important direction for our future research.

Related Work
OOV Recognition. OOV recognition aims to recognize unseen slot values in the training set for pre-defined slot types, using character embeddings (Liang et al., 2017a), copy mechanisms (Zhao and Feng, 2018), few/zero-shot learning (Hu et al., 2019; Shah et al., 2019), transfer learning (Chen and Moschitti, 2019; He et al., 2020c), and background knowledge (Yang and Mitchell, 2017; He et al., 2020d), etc. Our proposed NSD task focuses on detecting unknown slot types, not just unseen values.
OOD Intent Detection. Lee et al. (2018); Lin and Xu (2019); Xu et al. (2020a) aim to know when a query falls outside the range of pre-defined supported intents. Generally, they first learn discriminative intent representations via in-domain (IND) data, then employ detection algorithms, such as Maximum Softmax Probability (MSP) (Hendrycks and Gimpel, 2017), Local Outlier Factor (LOF) (Lin and Xu, 2019), and Gaussian Discriminant Analysis (GDA), to compute the similarity of features between OOD samples and IND samples. Compared to our proposed NSD, the main difference is that NSD detects unknown slot types at the token level while OOD intent detection identifies sentence-level OOD intent queries.

Conclusion
In this paper, we define a new task, Novel Slot Detection (NSD), provide two public datasets for it, and establish a benchmark. Further, we analyze the problems of NSD through multi-angle experiments and extract the key challenges of the task. We provide several strong models for these problems and offer possible solutions for future work.

Broader Impact
Dialog systems have demonstrated remarkable performance across a wide range of applications, with the promise of a significant positive impact on how people work and live. The first step of a dialog system is to identify the user's key points. In practical industrial scenarios, users may make queries that fall outside the scope of the system-supported slot types. Previous dialogue systems ignore this problem, which leads to wrong operations and limits the system's development. In this paper, we first propose to detect not only pre-defined slot types but also potential unknown or out-of-domain slot types, using MSP and GDA methods. Through exhaustive experiments and qualitative analysis, we also discuss several major challenges in Novel Slot Detection for future work. The effectiveness and robustness of the model are significantly improved by adding Novel Slot Detection, which takes a step towards the ultimate goal of enabling the safe real-world deployment of dialog systems in safety-critical domains. The experimental results are reported on standard benchmark datasets for reproducibility.