Redwood: Using Collision Detection to Grow a Large-Scale Intent Classification Dataset

Dialog systems must be capable of incorporating new skills via updates over time in order to reflect new use cases or deployment scenarios. Similarly, developers of such ML-driven systems need to be able to add new training data to an already-existing dataset to support these new skills. In intent classification systems, problems can arise if training data for a new skill’s intent overlaps semantically with an already-existing intent. We call such cases collisions. This paper introduces the task of intent collision detection between multiple datasets for the purposes of growing a system’s skillset. We introduce several methods for detecting collisions, and evaluate our methods on real datasets that exhibit collisions. To highlight the need for intent collision detection, we show that model performance suffers if new data is added in such a way that does not arbitrate colliding intents. Finally, we use collision detection to construct and benchmark a new dataset, Redwood, which is composed of 451 categories from 13 original intent classification datasets, making it the largest publicly available intent classification benchmark.


Introduction
As task-oriented dialog systems like Alexa and Siri have become more and more pervasive, tools enabling developers to build custom dialog systems have followed suit. Such tools, like Microsoft's Luis, Twilio's Autopilot, Rasa, and Google's DialogFlow, enable engineers and dialog designers to craft dialog systems composed of intents, or core categories of competencies or skills in which the system is knowledgeable and to which the system can respond intelligently. New intents may be added periodically to the dialog system as part of its development and maintenance cycle, or dialog system models may be combined together (e.g., Clarke et al. (2022)).
These phenomena may occur especially in real-world deployments, where datasets for dialog models may be developed, grown, and modified by large (and even disparate) teams over the span of a project's lifetime. Furthermore, dialog system models and their corresponding training datasets are sometimes offered as-a-service or "off-the-shelf" to dialog system builders who might not be fully familiar with the breadth or scope of the pre-existing dataset or model. If the builder adds a new intent to the dataset that overlaps with an existing intent, then the re-trained model's performance can suffer. As such, there is a need for tools and algorithms to help detect when a new intent overlaps (that is, collides) with an already-existing intent category.
In this paper, we introduce the challenge of intent collision detection, and develop several algorithms for determining whether a candidate intent category collides with another intent category. To do so, we curate and release a meta-dataset of 722 intents from 13 existing datasets. This graph-like meta-dataset consists of annotations indicating tuples of colliding intent pairs (examples of colliding intents can be seen in Table 1). We then introduce several collision detection algorithms and evaluate them on this meta-dataset.
We also use intent collision detection to build Redwood, a new intent classification dataset of 451 intent categories. Redwood is built by combining 13 smaller datasets. As a comparison, we also build Redwood-naïve, which is constructed by naïvely joining together all 13 datasets without arbitrating colliding intents. We find that classifier performance on Redwood-naïve is substantially worse than on Redwood, showcasing the negative effect of not addressing intent collisions in data.
Upon official release, Redwood will by far be the largest openly available intent classification dataset.

Table 1: Examples of colliding intents and associated queries.
Dataset | Samples
Snips | how cold is it in princeton junction; will it be chilly in fiji at ten pm; is it foggy in shelter island
 | give me the 7 day forecast; what's the temperature like in tampa; will it rain today
MTOP | what is the weather in new york today; how much is it going to rain tomorrow; give me the weather for march 13th
Slurp | set alarm tomorrow at 6 am; make an alarm for 4pm; set a wake up call for 10 am
MTOP | can you set a warning alarm for 7pm; set an alarm for monday at 5pm; make an alarm for the 5th
Clinc-150 | wake me up at noon tomorrow; set my alarm for getting up; i need you to set alarm for

Related Work
The Collision Detection Task. We discuss three areas of work related to our proposed intent collision detection task: generalized zero-shot learning, open set classification, and out-of-distribution sample identification. In generalized zero-shot learning, a model is trained with data from a set of "seen" label classes (e.g., intents) and, during inference, must identify test samples as belonging to either a "seen" label class or an "unseen" class for which the model has limited auxiliary knowledge (e.g., descriptions of unseen classes, but no concrete training examples).
Both open set classification and out-of-distribution sample identification refer to the modeling task of either classifying inference samples among label classes seen during training or identifying that a sample belongs to an unknown or undefined label class. Slot-filling models that are trained on B/I/O tags naturally predict the unknown class label as O tags, but for intent classifiers the task is much more challenging since it requires curating viable training data for an out-of-distribution category (i.e., it is challenging to know in advance what types of out-of-distribution inputs a system might encounter).
Our proposed task of intent collision detection differs from the aforementioned tasks because "inference" samples need not be considered one at a time, but can instead be grouped together into entire candidate intent categories. This enables considering entirely different modeling tasks like those discussed in Section 3.3. Nevertheless, both our meta-dataset of intent collisions and Redwood allow for the evaluation of both zero-shot and generalized zero-shot learning models, and the Redwood intent classification dataset includes a substantial number of out-of-distribution samples for evaluating open set classification and out-of-distribution sample detection.
Dataset Combination. Dataset combination has been used in other fields beyond dialog systems and conversational AI. For instance, several speech recognition datasets were combined together to form the SpeechStew dataset. As there are no target labels analogous to intents in automatic speech recognition, the creators of SpeechStew did not have to consider collisions among intent categories.

Detecting Collisions
In this section we discuss our proposed challenge, intent collision detection. We begin with a motivating example showing why detecting collisions is important, as well as a formal problem statement. Then, we introduce and evaluate several collision detection baselines on our meta-dataset.

Motivating Example
As a motivating example, suppose our intent classification system has been trained on the Clinc-150 dataset (Larson et al., 2019b), an intent classification dataset consisting of 150 intents. [5] The Clinc-150 dataset includes an intent called weather, which is meant to handle weather-related queries such as "what's the weather like today" and "tell me the weather in New York." Suppose further that a new developer or a new team attempts to update the intent classifier with new data that contains a new intent category, such as the get_weather intent from the HWU dataset [6] (Liu et al., 2019). In such a scenario, there are now training data samples that overlap substantially, but that are labeled with different intents (weather vs. get_weather in this example). Thus, upon updating the model by training on HWU's get_weather data, the predictions for weather-related inference queries might be split between these two intents. This disparity can also cause unintended consequences downstream in production models, such as calls to database systems that are triggered based on the user's intent. Indeed, when we train a BERT classifier on the original Clinc-150 training set, the accuracy on the weather test set is 100%. When we add HWU's get_weather intent to Clinc-150 to create a new 151st intent and re-train the BERT classifier, we observe an accuracy score of 60% on the weather test set. This performance drop is a symptom of having added an intent category that collides with another intent category.
[5] In this paper, dataset names are in italics and intent names are in teletype font. Example queries are in italics and in quotes if they appear in-line.
[6] Recall from Section 1 that such updates from new teams or new developers may be from routine perfective maintenance during a model's lifetime.
Such a model, which was trained on colliding intents, could cause unexpected behavior on downstream events, especially if the weather and get_weather intents trigger different business logic workflows or system responses. We note that, while in this example the colliding weather and get_weather intent names are quite similar, other colliding pairs like Snips' search_screening_event and MetalWoz's movie_listings do not have lexically similar intent names, precluding straightforward string matching of intent names.

Problem Statement
In this subsection, we formally define our collision detection problem. We first consider a scenario in which we have two intent classification datasets, $A$ and $B$, where $A_i \in A$ and $B_j \in B$ refer to specific intent categories in each. We say that intent categories $A_i$ and $B_j$ collide if there exist a sufficient number of queries in $A_i$ that semantically overlap with a sufficient number of queries in $B_j$. This semantic overlap can occur when a developer attempts to add new intent categories to a starting training dataset; in that case, an intent classification model trained on the combined dataset $A \cup B$ may classify queries belonging to $A_i$ as $B_j$ (and vice versa).
As an example, suppose we have an intent classifier built from a starting dataset such as Clinc-150, which, among other things, contains a weather intent category for weather-related inquiries (cf. Section 3.1). Suppose further that we seek to grow this starting dataset by adding datapoints from a candidate dataset such as HWU, which contains a get_weather intent category (see Section 3.1). If we naïvely combine these two datasets together, a resulting intent classifier will classify some queries from the original weather category into the newly-added get_weather category because these two categories are semantically similar. Table 1 illustrates several example colliding intents and associated queries. Our approach addresses these collisions by detecting their prevalence and quantifying their impact automatically, aiding developers in improving the quality of their datasets and scope of their dialogue systems.
Because the notion of semantic overlap can differ from category to category and dataset to dataset, we observe several classes of relationships among colliding intent categories in practice. In particular, intent collisions can be simple-pairwise, transitive, or hierarchical. In the simple-pairwise case, two intents collide with each other only, and not with any other intent in either dataset. However, we also observe transitivity within intent classes. Figure 1 illustrates example utterances within intent classes a, b, and c, where all intent classes are transitively related to one another in a cycle.
Lastly, we observe non-transitive hierarchies among colliding intents. In this case, a broad intent category from one dataset can collide with two or more intent categories that do not relate to each other. Figure 2 shows a hypothetical intent class x consisting of general banking queries, including balance inquiries and transfer requests, while classes y and z consist solely of balance inquiries and transfer requests, respectively. Here, because class x is broader than y and z, each of y and z collides with x, but y and z do not collide with each other. Our approach can help developers reveal such cases when managing datasets, and we consider these collision relationships in the creation of our Redwood dataset.
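The three relationship types can be illustrated on a small collision graph. The sketch below is only an illustration under stated assumptions, not the procedure used to build the meta-dataset: it assumes collisions are given as undirected pairs, treats a connected group of three or more intents as transitive when every pair in it collides, and as hierarchical otherwise. The function names and the intent identifiers (e.g., clinc.weather) are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def collision_components(pairs):
    """Build an undirected collision graph and return its connected components."""
    graph = defaultdict(set)
    for a, b in pairs:
        graph[a].add(b)
        graph[b].add(a)
    seen, components = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:  # iterative depth-first search
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        components.append(component)
    return components, graph


def classify_component(component, graph):
    """Label a component (assumed rule): pair, fully connected, or neither."""
    if len(component) == 2:
        return "simple-pairwise"
    fully_connected = all(
        b in graph[a] for a, b in combinations(sorted(component), 2)
    )
    return "transitive" if fully_connected else "hierarchical"


# Hypothetical collision pairs mirroring the cases described in the text.
pairs = [
    ("clinc.weather", "hwu.get_weather"),  # two intents colliding only with each other
    ("a", "b"), ("b", "c"), ("c", "a"),    # cycle, as in Figure 1
    ("x", "y"), ("x", "z"),                # broad x collides with y and z, as in Figure 2
]
components, graph = collision_components(pairs)
labels = {frozenset(c): classify_component(c, graph) for c in components}
```

Under these assumptions, the banking example from Figure 2 is labeled hierarchical because y and z are connected only through x.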

Approaches
We introduce two approaches for detecting collisions: Classifier Confusion and Data Coverage.
Classifier Confusion. A column of a confusion matrix charts the distribution of predictions of a classifier for data in a particular category. We call such a distribution the classification distribution.
We adapt this notion for our first collision detection approach, which identifies a candidate intent $A$ as colliding with $B \in C$ if a classifier model trained on dataset $C$ produces a classification distribution $d$ for the queries in $A$ such that $\frac{\max(d)}{\mathrm{sum}(d)} > \tau$, where $\tau$ is a threshold set by the developer. We call this ratio the classifier collision score.
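As a minimal sketch of this score, assume we already have the labels that a classifier trained on C predicted for each query of the candidate intent A. The helper names below are hypothetical, and the classifier itself is left out; only the ratio max(d) / sum(d) and the threshold rule come from the text.

```python
from collections import Counter


def classifier_collision_score(predicted_labels):
    """Return (most frequently predicted intent, max(d) / sum(d))."""
    if not predicted_labels:
        raise ValueError("need at least one prediction")
    d = Counter(predicted_labels)  # classification distribution for the candidate
    top_intent, top_count = d.most_common(1)[0]
    return top_intent, top_count / sum(d.values())


def detect_collision(predicted_labels, tau):
    """Return the colliding intent if the score exceeds tau, else None."""
    top_intent, score = classifier_collision_score(predicted_labels)
    return (top_intent if score > tau else None), score


# Hypothetical predictions for ten queries of a candidate get_weather intent.
preds = ["weather"] * 8 + ["calendar", "timer"]
print(detect_collision(preds, tau=0.5))
```

Returning the most frequently predicted intent alongside the score is a design choice of this sketch: it names the existing intent in C that the candidate most likely collides with.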
Data Coverage. We define the coverage of one intent B over another intent A as Here, sim(a, b) computes the similarity between two phrases a and b (for instance, sim(a, b) could be the cosine similarity between two phrase embeddings or the Jaccard similarity between n-gram sets). The coverage metric can be used to detect if two intents collide using a threshold rule. In other words, A and B collide if Coverage(A,B) > κ, where κ is a threshold chosen by the developer. We call the coverage metric the coverage score.
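To make the threshold rule concrete, the sketch below uses one plausible form of coverage: for each query a in A, take the best similarity to any query b in B, then average over A. This form is an assumption made for illustration and should not be read as the exact definition used in our experiments. The similarity function is likewise an assumed variant, Jaccard similarity over the set of all 1- to N-grams, and all function names are hypothetical.

```python
def ngrams(text, max_n=3):
    """Set of all word n-grams of order 1..max_n (whitespace tokenization)."""
    tokens = text.lower().split()
    return {
        tuple(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    }


def jaccard_sim(a, b, max_n=3):
    """Jaccard similarity between the n-gram sets of two phrases."""
    grams_a, grams_b = ngrams(a, max_n), ngrams(b, max_n)
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0


def coverage(intent_a, intent_b, sim=jaccard_sim):
    """Assumed form: mean over a in A of max over b in B of sim(a, b)."""
    if not intent_a or not intent_b:
        raise ValueError("both intents need at least one query")
    return sum(max(sim(a, b) for b in intent_b) for a in intent_a) / len(intent_a)


def collide(intent_a, intent_b, kappa, sim=jaccard_sim):
    """Threshold rule from the text: collide if Coverage(A, B) > kappa."""
    return coverage(intent_a, intent_b, sim) > kappa


# Hypothetical queries for three intents.
weather = ["what is the weather today", "will it rain today"]
get_weather = ["what is the weather today", "is it going to rain"]
alarm = ["set an alarm for noon"]
```

With these toy queries, coverage(weather, get_weather) is well above coverage(weather, alarm), which is zero because the two intents share no n-grams. Any sentence embedder with cosine similarity could be passed in place of jaccard_sim through the sim argument.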

The Collision Meta-Dataset
We constructed a graph-like dataset that indicates the collision relationships between intents. To build this dataset, we reviewed all intents from all of the datasets listed in Table 2 to check for collisions with other intents. We developed a ground truth set of tuples indicating whether two intents collide among these datasets. Figure 3 shows the structure of the intent collision meta-dataset, and Table 2 displays the number of collisions that occur relative to each individual dataset. The meta-dataset includes the three types of collisions defined in Section 3.2.

Experimental Evaluation
Implementation Details. We evaluate our intent collision detection methods on our newly-created collision meta-dataset. For evaluating the classifier confusion approach, we train a multi-class intent classifier on each individual dataset (except the single-intent datasets) and then run inference on all other intents from the other datasets. We compute and report the classifier confusion score for each run. In our experiments, we use a linear SVM classifier with bag-of-words feature representations.
For evaluating the data coverage approach, we first sample a nearly equal number of colliding and non-colliding intent pairs from the collision meta-dataset. We then compute the coverage scores for the selected pairs using several sentence representation and similarity metrics. We use the SBERT library's SBERT-NLI and SBERT-miniLM sentence embedders (Reimers and Gurevych, 2019) along with cosine similarity. Additionally, we also use n-gram-based similarity, where a and b are queries from two intents and N = 3 in our experiments. For both the data coverage and classifier confusion experiments, we only consider intents that have at least 10 queries. For the collision detection experiments, we used all 285 collision pairs and sampled 300 non-colliding pairs since there are substantially more non-colliding pairs. The classifier confusion approach does not compare intents in a pairwise manner, and instead compares a dataset (i.e., a classifier trained on a dataset) against a single intent at a time. We run a classifier on all multi-intent datasets, which yielded a total of 400 collision pairs and 6,802 non-collision pairs for the classifier confusion experiments.
Metrics. While in actual application settings, a user may wish to use thresholds for τ and κ (defined earlier in Section 3.3) to determine whether intents collide, we evaluate both classifier confusion and coverage methods in a threshold-free manner using the AUC score. The AUC score allows us to judge each method's ability to distinguish collisions versus non-collisions; an AUC score of 1.0 means perfect separability between collisions and non-collisions, while an AUC score of 0.5 means a method is unable to distinguish between colliding and non-colliding intents.
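The threshold-free evaluation can be sketched with the rank-based view of AUC: the probability that a randomly chosen collision receives a higher score than a randomly chosen non-collision, with ties counted as one half. The function name and the example scores below are hypothetical and are not results from our experiments.

```python
def auc_score(collision_scores, non_collision_scores):
    """AUC as the fraction of (collision, non-collision) pairs ranked correctly."""
    if not collision_scores or not non_collision_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for pos in collision_scores:
        for neg in non_collision_scores:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(collision_scores) * len(non_collision_scores))


# Made-up scores: collisions tend to score higher than non-collisions.
print(auc_score([0.9, 0.4], [0.5, 0.1]))
```

This pairwise form is quadratic in the number of scores, which is acceptable for a few hundred pairs; a sort-based implementation would be preferable at larger scale.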

Results
Data Coverage. Figure 4 charts coverage scores and confusion scores for various approaches. In Figure 4 (a) and (b), the coverage approaches tend to return higher coverage scores for collisions and lower coverage scores for non-collisions, which aligns with our expectations given our definition of the coverage metric and assuming the similarity metric used in the coverage computation is effective. The AUC scores allow us to quantitatively judge the performance of the various coverage-based approaches: in Table 3, the SBERT-miniLM embedding method yields the highest AUC score, and interestingly the n-gram-based coverage method performs second best, with the SBERT-NLI embedding method in third.
Classifier Confusion. Figure 4 (c) charts classifier confusion scores for the SVM-based classifier confusion approach. Our results demonstrate that actual intent collisions typically yield high classifier confusion scores, while non-collisions yield lower confusion scores. Visually, however, Figure 4 (c) seems to indicate that the classifier confusion approach is less effective than the coverage-based approaches. This is made more apparent by the AUC score in Table 3. We note that the data coverage and classifier confusion AUC scores are not directly comparable as they use different evaluation settings. Nonetheless, the difference in performance scores does lead us to conclude that the data coverage approach is more effective.
In sum, these experimental results demonstrate that the two intent collision detection approaches introduced here are effective in detecting collisions among real datasets, with the data coverage approach being the stronger of the two.

Building the Redwood Dataset
With tools addressing the problem of intent collision detection in hand, we now turn our attention to combining the individual datasets from Table 2 together to form a single large-scale intent classification dataset, Redwood. This section discusses the construction of Redwood and a companion out-of-scope evaluation set, and then evaluates several benchmark intent classifiers on the dataset. These datasets and associated evaluations demonstrate the consequences of leaving colliding intents unaddressed, providing a valuable resource for the community to improve intent classification models.

Data
In-Scope Data. After creating the collision meta-dataset, a natural extension was to combine the datasets together to form Redwood. We used the collision meta-dataset to help inform us of which intents could be combined, and which intents could stand alone in Redwood. In some cases, we removed intents that caused hierarchical collisions, as sometimes joining together intents from a hierarchical collision produced an intent that was too broad. We included only those intents that have at least 50 queries, and the resulting Redwood consists of 451 total intents and 62,216 queries. Following the terminology used in Larson et al. (2019b), we consider these 451 intents to be in-scope.

Dataset | N. Samples
Vertanen (2017) | 2067
Clinc-150 | 1200
Total | 3267
By way of comparison, we also produced a "naïve" version of Redwood, called Redwood-naïve, where all the intents from the datasets listed in Table 2 were joined together without using collision detection or any other method of arbitrating or correcting colliding intents. Like the original Redwood, we included only intents that have at least 50 queries, and capped each intent at a maximum of 150 queries. Redwood-naïve consists of 619 intents and 85,746 total queries. All versions of Redwood were split into train and test splits per intent: 85% training, 15% testing.
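The filtering, capping, and per-intent splitting described above can be sketched as follows. The minimum of 50 queries, the cap of 150, and the 85/15 split come from the text; the shuffling, the fixed random seed, the rounding down of the training share, and the function name are assumptions of this sketch.

```python
import random


def build_splits(intent_to_queries, min_queries=50, max_queries=150,
                 train_percent=85, seed=0):
    """Drop small intents, cap large ones, and split each intent into train/test."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = {}, {}
    for intent, queries in sorted(intent_to_queries.items()):
        if len(queries) < min_queries:
            continue  # too few queries to keep this intent
        sampled = list(queries)
        rng.shuffle(sampled)
        sampled = sampled[:max_queries]  # cap the intent size
        cut = len(sampled) * train_percent // 100  # integer math, rounds down
        train[intent], test[intent] = sampled[:cut], sampled[cut:]
    return train, test


# Hypothetical toy data: one large, one medium, and one too-small intent.
data = {
    "weather": [f"weather query {i}" for i in range(200)],
    "alarm": [f"alarm query {i}" for i in range(60)],
    "rare": [f"rare query {i}" for i in range(10)],
}
train, test = build_splits(data)
```

Splitting within each intent, rather than over the pooled queries, keeps every retained intent represented in both the training and test sets.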
Out-of-Scope Data. In contrast to in-scope, out-of-scope queries are those that do not belong to any of the in-scope intents. Considering out-of-scope queries in an evaluation of intent classification models is important because such queries occur in production settings, where end users cannot be expected to know the full range of intents when interacting with a conversational AI system. We include a collection of 3,267 out-of-scope queries in addition to the Redwood corpus. Redwood's out-of-scope data originates from the following sources: Clinc-150 dataset, which itself includes a set of out-of-scope queries; and Vertanen (2017), a crowdsourced dialog dataset from which we use the first dialog turns. We reviewed all candidate out-of-scope queries, removing those that were actually in-scope. Examples of queries from the Redwood dataset are shown in Table 4.

Benchmark Evaluation
Models. We benchmark intent classification performance using the MobileBERT model (Sun et al., 2020) as implemented in the HuggingFace library (Wolf et al., 2020). The MobileBERT implementation uses a softmax function to convert logits into a probability vector p, from which we can obtain confidence scores for each intent. These confidence scores can be used to predict whether a query is in- or out-of-scope, according to a decision threshold t given by
$$\text{decision}(p) = \begin{cases} \text{in-scope} & \text{if } \max(p) \geq t \\ \text{out-of-scope} & \text{otherwise} \end{cases}$$
Such decision rules were used in Hendrycks and Gimpel (2017) and Larson et al. (2019b).
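A confidence-threshold rule of this kind can be sketched without any model code, assuming the classifier's raw logits are available. The sketch takes the maximum softmax probability as the confidence score and treats a query as in-scope when that score is at least t; the direction and inclusiveness of the comparison, as well as the function names, are assumptions here.

```python
import math


def softmax(logits):
    """Convert logits to a probability vector (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_with_threshold(logits, intents, t):
    """Return (predicted intent or "out-of-scope", confidence score)."""
    p = softmax(logits)
    confidence = max(p)
    if confidence >= t:  # assumed rule: in-scope when max(p) >= t
        return intents[p.index(confidence)], confidence
    return "out-of-scope", confidence


# Hypothetical intents and logits.
intents = ["weather", "alarm", "timer"]
print(predict_with_threshold([5.0, 0.0, 0.0], intents, t=0.7))
print(predict_with_threshold([1.0, 1.0, 1.0], intents, t=0.7))
```

The first call has one dominant logit and is assigned to an intent, while the second spreads probability evenly across intents and falls below the threshold.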
Metrics and Experiments. We measure intent classifier accuracy on in-scope data without considering out-of-scope inputs. We also measure each model's ability to distinguish in-scope and out-of-scope queries by computing the AUC between in- and out-of-scope confidence scores. In this way, we use AUC to measure how separable in- and out-of-scope queries are based on their confidence scores without having to select a confidence threshold t. An AUC score of 0.5 (chance level) implies the model cannot distinguish in- versus out-of-scope inputs. An AUC of 1.0 indicates the model can perfectly separate inputs.

Results
Model performance on Redwood-naïve and Redwood is shown in Table 6. First, we notice that the intent classifiers perform reasonably well on the in-scope classification task, with MobileBERT classifying queries with 91% accuracy. The models also perform well on the out-of-scope task, and discriminate between in- and out-of-scope queries with AUC scores of 0.921 and 0.928 on the Clinc-150 and Vertanen (2017) out-of-scope evaluation sets.
The bottom half of Table 6 presents model performance when trained and tested on Redwood-naïve. In this case, model performance is substantially worse than that of models trained on the carefully-crafted Redwood dataset, confirming our hypothesis from Section 3.1 that model performance suffers if trained on data with colliding intents.
We drill deeper into the impact of intent collisions on models trained on Redwood-naïve in Table 7, which charts per-intent accuracy based on the number of other intents that collide with that intent. This table groups intents based on the number of collisions, and we see that on average, intents with no collisions exhibit higher accuracy than intents with collisions. In general, colliding intents lead to degraded accuracy: intents with one or more collisions have accuracy that is around 10 or more points lower than the no-collision group, with the exception of the 6-collision group. The average accuracy of the 6-collision group on Redwood-naïve is indeed surprising, and we posit that the MobileBERT model, a high-capacity transformer model, can learn the nuances of each individual intent, even if they do semantically collide.
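The grouping behind this kind of analysis can be sketched as below, assuming per-intent accuracies and a deduplicated list of undirected collision pairs are available. The function name, intent names, and accuracy values are made up for illustration and are not results from Table 7.

```python
from collections import defaultdict


def accuracy_by_collision_count(per_intent_accuracy, collision_pairs):
    """Average per-intent accuracy, grouped by how many intents each collides with."""
    counts = defaultdict(int)
    for a, b in collision_pairs:  # assumes each undirected pair appears once
        counts[a] += 1
        counts[b] += 1
    groups = defaultdict(list)
    for intent, accuracy in per_intent_accuracy.items():
        groups[counts.get(intent, 0)].append(accuracy)
    return {n: sum(accs) / len(accs) for n, accs in sorted(groups.items())}


# Made-up accuracies and collisions for four hypothetical intents.
per_intent = {"a": 1.0, "b": 0.5, "c": 0.75, "d": 0.25}
pairs = [("b", "c"), ("c", "d")]
print(accuracy_by_collision_count(per_intent, pairs))
```

Each key of the returned dictionary is a collision count, and each value is the mean accuracy of the intents in that group.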

Conclusion and Future Work
This paper introduces the task of intent collision detection when constructing or updating an intent classification model's dataset to incorporate additional intents. Using 13 individual datasets, we constructed a meta-dataset to track intent collisions between the datasets, and then introduced and evaluated two intent collision detection techniques and found that both perform effectively at the collision detection task.
To help measure and address this problem, we constructed Redwood, a large-scale intent classification dataset consisting of 451 intents and over 60,000 queries. We used Redwood to benchmark several intent classification models on the task of in-scope query prediction and out-of-scope detection. The new Redwood dataset is the largest publicly available intent classification benchmark, in terms of number of intents, and is available at github.com/gxlarson/redwood. Future work will include annotating slots to extend Redwood to joint intent classification and slot-filling, and it is likely that new tools will have to be developed for doing so. Additionally, using the collision detection methods introduced in this paper, Redwood can be periodically updated with new intents whenever other new intent classification datasets are published.