IDAS: Intent Discovery with Abstractive Summarization

Intent discovery is the task of inferring latent intents from a set of unlabeled utterances, and is a useful step towards the efficient creation of new conversational agents. We show that recent competitive methods in intent discovery can be outperformed by clustering utterances based on abstractive summaries, i.e., “labels”, that retain the core elements while removing non-essential information. We contribute the IDAS approach, which collects a set of descriptive utterance labels by prompting a Large Language Model, starting from a well-chosen seed set of prototypical utterances, to bootstrap an In-Context Learning procedure to generate labels for non-prototypical utterances. The utterances and their resulting noisy labels are then encoded by a frozen pre-trained encoder, and subsequently clustered to recover the latent intents. For the unsupervised task (without any intent labels), IDAS outperforms the state-of-the-art by up to +7.42% in standard cluster metrics for the Banking, StackOverflow, and Transport datasets. For the semi-supervised task (with labels for a subset of intents), IDAS surpasses two recent methods on the CLINC benchmark without even using labeled data.


Introduction
Intent classification is ubiquitous in conversational modelling. To that end, finetuning Large Language Models (LLMs) on task-specific intent data has proven very effective (Casanueva et al., 2020; Zhang et al., 2021d). However, such finetuning requires manually annotated (utterance, intent) pairs as training data, which are time-consuming and thus expensive to acquire. Companies often have an abundance of utterances relevant to the application area of their interest, e.g., those exchanged between customers and support agents, but manually annotating them remains costly. Consequently, intent discovery aims to recover latent intents without using any such manually annotated utterances, by partitioning a given set of (unlabeled) utterances into clusters, where utterances within a cluster should share the same conversational goal or intent.
Prior works typically (i) train an unsupervised sentence encoder to map utterances to vectors, after which these are (ii) clustered to infer latent intents. Such unsupervised encoder training is achieved largely under the assumption that utterances with similar encodings convey the same intent. For instance, by iteratively clustering and updating the encoder with supervision from the cluster assignments (Xie et al., 2016a; Caron et al., 2018a; Hadifar et al., 2019; Zhang et al., 2021c), or by retrieving utterances with similar encodings and using them as positive pairs to train the encoder with contrastive learning (Zhang et al., 2021a, 2022).
Yet, it remains unclear which particular features cause utterance representations to be similar. Various noisy features unrelated to the underlying intents, e.g., syntax, n-gram overlap, or nouns, may contribute to making utterances similar, leading to sentence encoders whose vector encodings may inadequately represent the underlying intents.
Different from prior works that train unsupervised encoders, we use a pre-trained encoder without requiring any further finetuning, since we propose making utterances more (dis)similar in the textual space by abstractly summarizing them into concise descriptions, i.e., "labels", that preserve their core elements while removing non-essential information. We hypothesize that these core elements better represent intents and prevent non-intent-related information from influencing the vector similarity. Table 1 illustrates how labels retain the intent-related information by discarding irrelevant aspects such as syntax and nouns.
This paper introduces Intent Discovery with Abstractive Summarization (IDAS in short), whereby the label generation process builds upon recent advancements in In-Context Learning (ICL) (Brown et al., 2020). In ICL, an LLM is prompted with an instruction including a small number of (input, output) demonstrations of the task at hand. ICL has been shown to be effective at few-shot learning without additional LLM finetuning (Min et al., 2022a,b). However, intent discovery is unsupervised and therefore lacks the annotated (utterance, label) demonstrations required for ICL. To overcome this limitation, our proposed IDAS proceeds in four steps. First, a subset of diverse prototypical utterances representative of distinct latent intents is identified by performing an initial clustering and selecting the utterances closest to each cluster's centroid, for which an LLM is then prompted to generate a short descriptive label. Second, labels for the remaining non-prototypical utterances are obtained by retrieving the n utterances most similar to the input utterance from the continually expanding set of utterances with already generated labels (initialized with just the prototypes), and using those n neighbors as ICL demonstrations to generate the input utterance's label. Third, as the generated labels may still turn out too general or noisy, utterances and their labels are combined into a single vector representation using a frozen pre-trained encoder. Finally, K-means clusters the combined encodings to recover the latent intents.
We compare our IDAS approach with the state-of-the-art in unsupervised intent discovery on Banking (Casanueva et al., 2020), StackOverflow (Xu et al., 2015), and a private dataset from a transport company, to assess IDAS's effectiveness in practice. We show that IDAS substantially outperforms the state-of-the-art, with average improvements in cluster metrics of +3.94%, +2.86%, and +3.34% in Adjusted Rand Index, Normalized Mutual Information, and Cluster Accuracy, respectively. Further, IDAS surpasses two semi-supervised intent discovery methods on CLINC (Larson et al., 2019) despite not using any ground truth annotations.

Related Work
Statistical approaches: Early, more general short text clustering methods employ statistical methods such as tf-idf (Sparck Jones, 1972) to map text to vectors. Yet, the sparsity of these encodings prevents similar texts, phrased with different synonyms, from being assigned to the same cluster. To specifically mitigate this synonym effect, external features have been used to enrich such sparse vectors, e.g., with WordNet (Miller, 1995) synonyms or lexical chains (Hotho et al., 2003; Wei et al., 2015), or Wikipedia titles or categories (Banerjee et al., 2007; Hu et al., 2009).
Neural sentence encoders: Rather than relying on external knowledge sources, neural approaches pre-train sentence encoders in a self-supervised way (Kiros et al., 2015;Gao et al., 2021), or with supervision (Conneau et al., 2017;Reimers and Gurevych, 2019;Gao et al., 2021), to produce dense general-purpose vectors that better capture synonymy and semantic relatedness.
Unsupervised intent discovery: Since general-purpose neural encoders may fail to capture domain-specific intent information, intent discovery solutions have shifted towards unsupervised sentence encoders specifically trained on the domain data at hand. For instance, Xu et al. (2015) train a self-supervised Convolutional Neural Network, and use it to encode and cluster utterances with K-means. Zhang et al. (2022) adopt the same 2-step approach, but instead pre-train the encoder with contrastive learning, where utterances with similar vector encodings are retrieved to serve as positive pairs. A more common strategy is to cluster and train the encoder end-to-end, either by (i) iteratively clustering utterances and updating the encoder with supervision from the cluster assignments (Xie et al., 2016a; Caron et al., 2018b; Hadifar et al., 2019), or (ii) simultaneously clustering utterances and updating the encoder's weights with a joint loss criterion (Yang et al., 2017a; Zhang et al., 2021a).
As an alternative strategy to make utterances more (dis)similar based on the intents they convey, we employ an LLM to summarize utterances into labels that retain both the utterances' core elements and domain-specific information as encoded in the LLM's weights. Since our generated labels should increase the (dis)similarity of (un)related utterances in the input space, rather than directly in the vector space, we use a frozen pre-trained encoder, thus deviating from the above methods that train unsupervised encoders.
Semi-supervised intent discovery: Similar to our current work, the aforementioned methods focus on unsupervised intent discovery. In the related but different semi-supervised intent discovery task, a fraction of the latent intents is assumed to be known, i.e., the "Known Class Ratio". Annotated data from these known intents is exploited to improve the detection of both known and unknown intent utterances, e.g., by optimizing a cluster loss with pairwise constraints derived from utterances of the same known intent (Lin et al., 2020). Alternative 2-step approaches first pre-train encoders with supervision from known intent utterances, then either directly encode and cluster utterances with K-means (Shen et al., 2021), or further refine the encoder on the unlabeled utterances. The latter refinement can be achieved through contrastive learning (Zhang et al., 2022) or by iteratively clustering and updating the encoder (Zhang et al., 2021b,c).
In-context learning: The core idea of ICL (Brown et al., 2020) is to perform tasks through inference, i.e., without updating parameters, by prompting an LLM with the string concatenation comprising (i) a task instruction, (ii) a small set of (input, output) demonstrations, and (iii) the input. We implement IDAS's label generation process with ICL, as it has been shown to substantially outperform zero-shot approaches without demonstrations (Min et al., 2022a,b; Chen et al., 2022). However, since we focus on unsupervised intent discovery and thus lack annotated (utterance, label) demonstrations, we bootstrap the set of demonstrations with automatically retrieved "prototypes". Rather than selecting demonstrations randomly, Liu et al. (2022) found that it is more effective to pick demonstrations similar to the input utterance, which we thus do. Note that alternative selection methods are possible (Rubin et al., 2022; Sorensen et al., 2022).

Methodology
Task formulation: Let {(x_i, y_i) | i = 1…N} be a dataset of N utterances x ∈ X from the set of natural language expressions X, with corresponding intents y chosen from a set of K possible intents Y = {y_i | i = 1…K}. Given the utterances without the intents, D_x = {x_i | i = 1…N}, intent discovery aims to infer Y from D_x by mapping utterances x_i to vectors E(x_i) with encoder E : X → R^d, based on which the utterances are partitioned into clusters {C_i | i = 1…K}, such that clustered utterances (e.g., x_{i,j}, x_{k,j} ∈ C_j) share the same intent (y_{i,j} = y_{k,j}), while utterances from different clusters (e.g., x_{i,j} ∈ C_j and x_{k,l} ∈ C_l, C_l ≠ C_j) have distinct intents (y_{i,j} ≠ y_{k,l}).
Overview: As summarized in Fig. 1, to infer latent intents IDAS (1) identifies a subset of diverse "prototypes", P ⊂ D_x, representative of the latent intents (§3.1); then (2) independently summarizes them into labels, which are further used to also generate labels for the remaining non-prototypical utterances x ∈ D_x \ P, by retrieving from the subset M of utterances that already have labels (initially P) the set N_n(x) of the n utterances most similar to x as ICL demonstrations for generating the label of x (§3.2); further (3) encodes utterances and their labels into a single vector representation with a frozen pre-trained encoder (§3.3); and finally (4) infers the latent intents by performing K-means on the combined representations (§3.4).

Step 1: Initial Clustering
The objective of this step is to identify a diverse set of prototypes, P ⊂ D_x, that in Step 2 will be automatically labeled by an LLM and serve as initial demonstrations for generating the labels of non-prototypical utterances. It is therefore important to choose prototypes p ∈ P that each represent a distinct latent intent y ∈ Y, and that collectively cover as many of the latent intents as possible. We assume a similarity function s : R^d × R^d → R between two vector representations of utterances, and use it to retrieve prototypes by performing an initial clustering on the utterances in D_x, in the vector representation space induced by encoder E. From each identified cluster, we then select as prototype the utterance whose vector representation is closest to the cluster's centroid.
Formally, the utterances in D_x are first encoded with E and then partitioned into K clusters {C_i | i = 1…K}, for which the respective centroids c_i ∈ R^d and prototypes p_i ∈ D_x are calculated as

c_i = (1/|C_i|) Σ_{x ∈ C_i} E(x),    p_i = argmax_{x ∈ C_i} s(E(x), c_i).
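Concretely, Step 1 can be sketched as below. This is a minimal illustration of prototype selection using scikit-learn's KMeans and cosine similarity as s; the function name `select_prototypes` and all parameter choices are our own illustrative assumptions, not the authors' released code.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_prototypes(embeddings: np.ndarray, k: int, seed: int = 0) -> np.ndarray:
    """Cluster utterance encodings E(x) into k clusters and return, per cluster,
    the index of the utterance closest (by cosine similarity) to the centroid."""
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(embeddings)
    prototypes = []
    for i in range(k):
        members = np.where(km.labels_ == i)[0]
        centroid = km.cluster_centers_[i]
        # cosine similarity s(E(x), c_i) between member encodings and the centroid
        sims = embeddings[members] @ centroid / (
            np.linalg.norm(embeddings[members], axis=1) * np.linalg.norm(centroid) + 1e-12
        )
        prototypes.append(members[np.argmax(sims)])
    return np.asarray(prototypes)
```

Each returned index identifies one prototype utterance p_i to be labeled by the LLM in Step 2.1.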

Step 2: Label Generation
Step 2.1: Prototype Labeling To generate label ℓ_i for prototype p_i, we employ an LLM and provide it with an instruction (inst) such as "describe the question in a maximum of 5 words". The LLM then generates a concise description of the prototype p_i, which we use as its label ℓ_i. Mathematically,

ℓ_i = argmax_ℓ P(ℓ | inst, p_i),

where P denotes the probability distribution of the LLM, and ℓ_i represents the token sequence t_i^1, …, t_i^l output by the LLM.
Step 2.2: Label Generation with ICL To generate label ℓ for a non-prototypical utterance x ∈ D_x \ P, IDAS utilizes ICL by conditioning an LLM on the prompt, i.e., the string concatenation of (i) an instruction inst, e.g., "classify the question into one of the labels", (ii) a set of n demonstrations of (utterance, label) pairs {(x_i, ℓ_i) | i = 1…n}, and (iii) the utterance x itself. Formally, the label is the token sequence generated by the LLM that maximizes the probability given the prompt:

ℓ = argmax_ℓ P(ℓ | inst, (x_1, ℓ_1), …, (x_n, ℓ_n), x).

Since unsupervised intent discovery lacks manually annotated demonstrations, IDAS uses a continually expanding set of utterances with automatically generated labels, denoted by M. Initially, M = P, with P the set of prototypes from Step 2.1. An utterance x with newly generated label ℓ is added to M, such that it can serve as a demonstration for the remaining unlabeled utterances.
Typically, ICL uses a small set of n demonstrations, (i) due to the limit on the number of input tokens of LLMs, and (ii) because performance does not improve for larger numbers of demonstrations (Min et al., 2022c). Moreover, Liu et al. (2022) found that selecting demonstrations as samples similar to the test input, rather than choosing them randomly, substantially boosts ICL's performance. Therefore, IDAS adopts KATE (Liu et al., 2022) by first mapping the utterances in M to vectors with encoder E, and then using the similarity function s to select the n utterances from M most similar to E(x), denoted by N_n(x) ⊂ M, as demonstrations for input utterance x.
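The retrieval-and-prompt step can be sketched as follows; the prompt template, the "Question:/Label:" formatting, and the helper names are our own illustrative assumptions, not the exact format used by the authors (their full prompts appear in Appendix A.2):

```python
import numpy as np

def nearest_demos(x_vec, memory_vecs, memory_pairs, n=8):
    """KATE-style selection: return the n (utterance, label) pairs from the
    memory M whose encodings are most cosine-similar to the input encoding."""
    m = np.asarray(memory_vecs)
    sims = m @ x_vec / (np.linalg.norm(m, axis=1) * np.linalg.norm(x_vec) + 1e-12)
    top = np.argsort(-sims)[:n]
    return [memory_pairs[i] for i in top]

def build_icl_prompt(inst, demos, utterance):
    """Concatenate instruction, (utterance, label) demonstrations, and the input."""
    lines = [inst]
    for u, lab in demos:
        lines.append(f"Question: {u}\nLabel: {lab}")
    lines.append(f"Question: {utterance}\nLabel:")
    return "\n\n".join(lines)
```

The LLM's completion of the final `Label:` line would then become the label ℓ, and the pair (x, ℓ) is appended to M.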
Note that while we use "classify" in the instruction, we do not consider the prototypical labels generated in Step 1 as a fixed label set (i.e., verbalizers). Rather, label ℓ for non-prototypical utterance x is the token sequence generated directly by the LLM. As a result, labels for non-prototypical utterances may still differ from those generated for the prototypes. In particular, the LLM can generate new labels for input utterances that represent intents for which no prototypes have been identified yet, and which thus have no ICL demonstrations of the latent intent. This minimizes error propagation from Step 1. On the other hand, when the LLM considers that a demonstration likely shares the same latent intent as the input utterance, the "classify" instruction should encourage the LLM to generate a copy of that demonstration's label, which in turn minimizes variation among generated labels of utterances with the same latent intent.

Step 3: Encoding Utterances and Labels
After Step 2, each utterance x ∈ D_x has an associated generated label ℓ ∈ M. We use the pre-trained encoder E to encode the utterances and their corresponding labels into separate vectors E(x) and E(ℓ), which are then averaged into the combined representation:

ϕ(x, ℓ) = (E(x) + E(ℓ)) / 2.    (1)

(Note that utterances could also be represented just by their label encoding E(ℓ), yet such generated labels could be noisy or overly general.) We further contribute a non-parametric smoothing method that (i) aims to suppress features that are specific to individual utterances and thus potentially less representative of the underlying intents, while (ii) enhancing those features that are shared across utterances and thus more likely to be representative of the latent intents. We therefore represent utterance x as the average of the vector encodings of the n′ most similar utterances N_{n′}(x, ℓ) to x, including x itself:

ϕ_SMOOTH(x, ℓ) = (1/n′) Σ_{(x′, ℓ′) ∈ N_{n′}(x, ℓ)} ϕ(x′, ℓ′).    (2)

We automatically determine n′ as the value that maximizes the average silhouette score (Rousseeuw, 1987) among all samples, which for sample i is given by

silhouette(i) = (b(i) − a(i)) / max(a(i), b(i)),

where a(i) is the average distance of sample i to all other samples in its cluster, and b(i) is the average distance of sample i to all samples in the neighboring cluster nearest to i.
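The smoothing of Eq. (2) and the silhouette-based choice of n′ might look as follows. This is a sketch under our own naming assumptions (`smooth_representations`, `pick_n_prime`), taking the combined encodings ϕ(x, ℓ) of Eq. (1) as input, using cosine similarity for neighbor retrieval, and relying on scikit-learn for K-means and the silhouette score:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def smooth_representations(combined: np.ndarray, n_prime: int) -> np.ndarray:
    """Replace each combined encoding phi(x, l) by the mean of its n' most
    cosine-similar encodings, including itself (Eq. 2)."""
    normed = combined / (np.linalg.norm(combined, axis=1, keepdims=True) + 1e-12)
    sims = normed @ normed.T                       # pairwise cosine similarities
    idx = np.argsort(-sims, axis=1)[:, :n_prime]   # each row's top-n' neighbors (self first)
    return combined[idx].mean(axis=1)

def pick_n_prime(combined, k, candidates=range(5, 46)):
    """Pick the n' whose smoothed K-means clustering maximizes the average
    silhouette score over all samples."""
    best_n, best_score = None, -1.0
    for n_p in candidates:
        reps = smooth_representations(combined, n_p)
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(reps)
        score = silhouette_score(reps, labels)
        if score > best_score:
            best_n, best_score = n_p, score
    return best_n
```

With n′ = 1 the smoothed representation reduces to the original combined encoding, since each point is its own nearest neighbor under cosine similarity.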

Step 4: Final intent discovery
To finally infer the latent intents, we represent each utterance x ∈ D_x with its label ℓ as ϕ_SMOOTH(x, ℓ), and apply K-means clustering, setting K to the ground truth number of latent intents |Y|, following Hadifar et al. (2019); Zhang et al. (2021a,c, 2022).
Experimental Setup

Datasets
We evaluate our IDAS approach on two widely adopted intent classification datasets, CLINC (Larson et al., 2019) and Banking (Casanueva et al., 2020), as well as the StackOverflow topic classification dataset (Xu et al., 2015). We also use a private dataset from a transportation company. Table 2 summarizes dataset statistics.

Baselines
On Banking, StackOverflow, and our Transport dataset, we compare IDAS against the state-of-the-art in unsupervised intent discovery, i.e., the MTP-CLNN model (Zhang et al., 2022) that outperforms prior unsupervised methods, such as DEC (Xie et al., 2016b), DCN (Yang et al., 2017b), and DeepCluster (Caron et al., 2018b). As the MTP-CLNN model is pre-trained on the annotated training data of CLINC, directly comparing against it would be unfair. Instead, we compare our approach on CLINC with two state-of-the-art semi-supervised intent discovery methods, DAC (Zhang et al., 2021c) and SCL+PLT (Shen et al., 2021). Note that, compared to the semi-supervised setting, our unsupervised setting without annotations is the more challenging one. We report results of DAC and SCL+PLT with an increasing "Known Class Ratio" (KCR) of 25%, 50%, and 75%, using the annotated data for the known intents of Shen et al. (2021).

Metrics

Following Zhang et al. (2022), we assess cluster performance by comparing the predicted clusters to the ground truth intents using (i) the Adjusted Rand Index (ARI) (Steinley, 2004), (ii) Normalized Mutual Information (NMI), and (iii) Cluster Accuracy (ACC) based on the Hungarian algorithm (Kuhn, 1955). Since IDAS's label generation process may depend on the order in which utterances occur, we perform the label-generating Steps 1-2 five times, shuffling the utterance order. We further conduct the final clustering Step 4 with 10 different seeds for each of those 5 label generation runs, to account for variation incurred by K-means. For each dataset, we then report means and standard deviations across these runs.
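Cluster Accuracy with Hungarian matching can be computed as sketched below (ARI and NMI are available directly in scikit-learn as `adjusted_rand_score` and `normalized_mutual_info_score`); this is our own illustrative implementation, not the authors' evaluation script:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def cluster_accuracy(y_true, y_pred):
    """Cluster accuracy: fraction of samples correctly assigned under the best
    one-to-one mapping between predicted clusters and gold intents (Kuhn, 1955)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    k = int(max(y_true.max(), y_pred.max())) + 1
    counts = np.zeros((k, k), dtype=int)
    for t, p in zip(y_true, y_pred):
        counts[p, t] += 1                        # co-occurrence of cluster p and intent t
    rows, cols = linear_sum_assignment(-counts)  # Hungarian matching, maximize matches
    return counts[rows, cols].sum() / len(y_true)
```

For example, predicted clusters that are a pure relabeling of the gold intents score 1.0.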

Implementation
Encoder: We use the same pre-trained encoder E in all steps of our approach, i.e., to (i) retrieve prototypes (§3.1), (ii) mine the n demonstrations N_n(x) for utterance x (§3.2), and (iii) encode utterances with their labels using Eqs. (1)-(2) (§3.3). To rule out performance differences stemming purely from the encoder, we employ the same pre-trained encoder as the baseline we compare with: we use the MTP encoder for Banking, StackOverflow, and Transport, where we compare to MTP-CLNN (Zhang et al., 2022), and the SBERT encoder paraphrase-mpnet-base-v2 (i.e., SMPNET) (Reimers and Gurevych, 2019) for CLINC, where we compare to DAC (Zhang et al., 2021c) and SCL+PLT (Shen et al., 2021).
Language models and prompts: IDAS uses the text-davinci-003 GPT-3 model (Ouyang et al., 2022) as its LLM for label generation. We adopt the OpenAI playground default values, except for the temperature, which we set to 0 to minimize variation among generated labels of utterances with the same latent intent. To generate prototypical labels (§3.2), we use the instruction "Describe the domain question in a maximum of 5 words", where the domain is banking, chatbot, or transport for the corresponding dataset. Since StackOverflow is a topic rather than an intent classification dataset, we adopt a slightly different prototypical prompt. To generate labels for non-prototypical utterances with ICL (§3.2), we use "Classify the domain question into one of the provided labels" for all 4 datasets. See Appendix A.2 for full prompts and examples.
Nearest neighbor retrieval: The function s is implemented with cosine similarity. We use n = 8 demonstrations N_n(x) to generate label ℓ for utterance x (§3.2), based on Min et al. (2022c) and Lyu et al. (2022), who report that further increasing n does not improve ICL's performance. The number of smoothing samples n′ is determined by running the final K-means (§3.4) multiple times with n′ ranging from 5 to 45, and selecting the value that maximizes the average silhouette score.

Main Results
In unsupervised clustering, no labels are available and thus there is only a test set, used to evaluate the model's induced clusters against gold standard labels (Xie et al., 2016a; Yang et al., 2017a; Hadifar et al., 2019; Zhang et al., 2021a). In the semi-supervised intent detection setting, intent labels are available for a subset of intents: there is an additional labeled training set, which can be exploited, e.g., for (pre-)training a sentence encoder.
Zhang et al. (2022) evaluated their MTP and MTP-CLNN models by (pre-)training the encoder on an unlabeled training set different from the test set where (new) intent clusters are induced, i.e., they evaluate on a held-out test set unseen during any (pre-)training phase. Since IDAS trains no encoder, we perform Steps 1-4 on the (unlabeled) test set, following (Xie et al., 2016a; Yang et al., 2017a; Hadifar et al., 2019; Zhang et al., 2021a). To ensure a fair comparison, we also consider an MTP-CLNN that uses that same test set in (pre-)training its encoder (i.e., as the D_unlabeled defined in Zhang et al. (2022); results marked by ♠ in Table 3). Note that the test sets for a particular dataset are identical across all reported results.

First, we compare IDAS against the state-of-the-art in the unsupervised setting, i.e., MTP-CLNN, with results reported in Table 3. Both in the original settings of Zhang et al. (2022) (keeping the test data unseen during training, ♢) as well as when using the unlabeled test data in training MTP(-CLNN) (♠), our IDAS significantly surpasses it, with gains averaged over the three datasets of +3.19-3.94%, +1.79-2.86%, and +1.96-3.34% in ARI, NMI, and ACC, respectively. We further find that IDAS consistently outperforms MTP-CLNN on all metrics and datasets, except for Banking, where IDAS and MTP-CLNN perform similarly (when compared in similar settings, i.e., both using the unlabeled test data in the training phase). Note that both IDAS and MTP-CLNN perform worse on StackOverflow and Banking in our settings (♠) compared to the original results of Zhang et al. (2022) (♢), likely because in the case of ♠ the MTP(-CLNN) encoders were trained on a substantially lower number of samples, i.e., only 5.5% for StackOverflow (1,000 for ♠ vs. 18,000 for ♢) and 34% for Banking (3,080 for ♠ vs. 9,016 for ♢).
Second, we assess IDAS's performance in the semi-supervised task setting, where a subset of intents has labeled data. Note, however, that IDAS does not use the labels for those utterances in any way. The results for CLINC presented in Table 4 show that IDAS outperforms both the semi-supervised SCL+PLT and DAC methods for KCRs of 25% and 50%. Notably, for a KCR of 50%, IDAS surpasses SCL+PLT and DAC with improvements in the range of 5.77-6.76%, 1.61-2.32%, and 4.78-4.89% in ARI, NMI, and ACC, respectively. Even for KCR = 75%, it performs only slightly worse than DAC, further confirming IDAS's effectiveness.

Ablations
Below, we investigate the impact on IDAS's performance of (i) the encoding strategies from §3.3, and (ii) ICL from §3.2. The results for each ablation are averaged over 5 runs, with the utterance orders corresponding to those used for the main results, i.e., with IDAS's default parameter values. Due to computation budget constraints, we only provide ablations on StackOverflow for (ii), since it requires GPT-3. For (i), we report results for Banking, StackOverflow, Transport, and CLINC.
Inferring the number of smoothing neighbors: Smoothing requires selecting the number of neighbors n′. Our proposed IDAS selects the value of n′ ∈ {5, . . ., 45} that yields the highest silhouette score. To assess the effect of that chosen n′ value, we plot the ARI, NMI, and ACC scores for varying n′ in Fig. 2. We observe that the ARI, NMI, and ACC scores obtained with the automatically inferred n′ are nearly identical to the best achievable performance, demonstrating that the silhouette score is an effective heuristic for selecting a suitable number of smoothing neighbors.

Random vs. nearest neighbor demonstrations:
IDAS employs KATE (Liu et al., 2022) to select the n ICL demonstrations most similar to x, i.e., N_n(x), for generating x's label (§3.2). To evaluate KATE's effectiveness for intent discovery, we present results for IDAS where the n (= 8) demonstrations are instead selected randomly. We find that nearest-neighbor selection outperforms both random selection and using no demonstrations (No ICL, n = 0). This follows the intuition that the LLM can pick a label from one of the n-NN instances, which likely shares an intent with the utterance to be labeled, thus effectively limiting label variation and improving clustering performance.
Varying the number of ICL demonstrations: We generate labels (1) without ICL, adopting the static prompt used for generating the prototypical labels, without any demonstrations, and (2) with ICL for varying numbers of demonstrations n ∈ {1, 2, . . ., 16}. Table 6 shows that (i) using any number of demonstrations leads to superior performance compared to using none (No ICL); (ii) for small numbers of demonstrations (n = 1, 2, or 4), no significant differences are found; and (iii) the best performance is achieved with more demonstrations, i.e., 8 or 16. Consistent with the results of Min et al. (2022c) and Lyu et al. (2022), increasing n from 8 to 16 does not yield further improvements, thus confirming that n = 8 demonstrations is a good default value.
Overestimating the number of prototypes: Following Hadifar et al. (2019); Zhang et al. (2021a,c, 2022), IDAS assumes a known number K of intents, both for the initial clustering (Step 1, retrieving prototypes, §3.1) and for the final clustering (Step 4, recovering latent intents, §3.4). While K can be estimated from a subset of utterances, determining it exactly is difficult. Unlike MTP-CLNN (Zhang et al., 2022), IDAS does not assume that the number of samples of each latent intent is known. To probe the robustness of IDAS's label generation to an incorrect number of prototypes, we conduct the initial K-means clustering with twice the gold number of intents. The K×2 row in Table 6 shows that this results in only a minor performance drop, indicating that IDAS's label generation process is sufficiently robust to such overestimation. In fact, we hypothesize that having multiple prototypes representing the same intent is less harmful than an insufficient number of prototypes, or incorrectly selected prototypes that do not accurately represent each intent.
Conclusion

Unlike existing methods that train unsupervised sentence encoders, our IDAS approach employs a frozen pre-trained encoder, since it increases the (dis)similarity of (un)related utterances in the textual space by abstractly summarizing utterances into "labels". Our experiments demonstrate that IDAS substantially outperforms the current state-of-the-art in unsupervised intent discovery across multiple datasets (i.e., Banking, StackOverflow, and our private Transport), and surpasses two recent semi-supervised methods on CLINC, despite not using any labeled intents at all. Our findings suggest that our alternative strategy of abstractly summarizing utterances (using a general-purpose LLM) is more effective than the dominant paradigm of training unsupervised encoders (specifically on dialogue data), and may thus open up new perspectives for novel intent discovery methods. Since our generated labels provide a better measure of intent-relatedness, we hypothesize that they could also enhance the performance of existing methods that train unsupervised encoders, e.g., by (i) reducing the number of false positive contrastive pairs for MTP-CLNN (Zhang et al., 2022), or (ii) improving the purity of clusters induced by methods that iteratively cluster utterances and update the encoder with (self-)supervision from cluster assignments (Xie et al., 2016a; Caron et al., 2018b; Hadifar et al., 2019). To facilitate such follow-up work, we release our generated labels for the Banking, StackOverflow, and CLINC datasets.

Limitations
Our work is limited in the following senses. First, all presented results relied on the ground truth number of intents to initialize the number of clusters, both for conducting K-means to retrieve prototypes (§3.1) and for inferring latent intents (§3.4). In practice, however, the ground truth number of intents is unknown and needs to be estimated by examining a subset of utterances. Our ablation in §5.2 investigated the impact of overestimating the number of ground truth intents by a factor of two, and found that IDAS's performance did not degrade much. While we did not explore this for the final K-means to infer latent intents, future work could investigate cluster algorithms that do not require the number of clusters as input, e.g., DBSCAN (Ester et al., 1996), Mean shift (Comaniciu and Meer, 2002), or Affinity propagation (Frey and Dueck, 2007). Second, we generated labels with the GPT-3 (175B) text-davinci-003 model, which may be prohibitively expensive and slow to run for very large corpora. In our initial experiments, we tried smaller models such as text-curie-001, text-babbage-001, and text-ada-001, as well as Flan-T5-XL (Chung et al., 2022), but found that the generated labels were of lower quality compared to those of text-davinci-003. In future work, it would thus be interesting to further explore how to more effectively exploit such smaller and/or open-source language models.

Ethics Statement
Since IDAS automatically recovers intents from utterances, e.g., those exchanged between users and support agents, any prejudices present in these utterances may become apparent, or even amplified, in the intents inferred by our model, since IDAS does not eliminate such prejudices. Hence, when designing conversational systems based on such inferred intents, extra care should be taken to prevent these prejudices from carrying over to conversational systems deployed in the wild.
Moreover, since IDAS's label generation process relies on LLMs, biases that exist in the data used to train these LLMs may be reinforced, leading to generated labels that may discriminate against or be harmful to certain demographics.

A Appendix
In §A.1, we analyze how using a more powerful pre-trained sentence encoder affects the cluster performance of IDAS. Additionally, we present and discuss the prompts in §A.2, and conduct a qualitative analysis of the labels generated by our IDAS approach in §A.3. Finally, in §A.4, we provide a brief overview of the implementation details of our experiments.
A.1 Effect of using a more powerful encoder

Here, we assess the impact of using a more powerful frozen pre-trained encoder on the clustering performance of IDAS. Specifically, we provide results of the four encoding strategies using the SBERT encoder all-mpnet-base-v2 (Reimers and Gurevych, 2019) in Table 7. The overall results, presented in the three rightmost columns as the average of the scores across the three datasets, show that each encoding strategy for all-mpnet-base-v2 (bottom half of the table) consistently improves upon the corresponding results for the encoders used in our previous main results (as repeated here in the top rows). However, the label-only encoding strategy (E(ℓ)) achieves similar results for different encoders, likely because the labels already are a short, disambiguated version of their associated utterances. Conversely, the other three strategies that exploit the original utterances x deliver substantially better results for all-mpnet-base-v2, as the stronger encoder can more effectively disambiguate utterances based on their latent intents, thus improving cluster performance. Notably, using all-mpnet-base-v2 for the smoothing strategy (ϕ_SMOOTH(x, ℓ)), compared to using MTP (Banking, StackOverflow) or paraphrase-mpnet-base-v2 (CLINC), results in gains of +3.88%, +2.08%, and +2.93% in ARI, NMI, and ACC, respectively. These results validate that employing more powerful pre-trained sentence encoders can further improve cluster performance out-of-the-box. It should be noted that, due to limitations in computation budget, we only replaced the encoder in Step 4 to induce intent clusters. However, we anticipate that using all-mpnet-base-v2 also for Steps 1-2 could yield additional improvements.

A.2 Prompts
Figures 3-4 present the static prompts used to generate prototypical labels in Step 2.1 (§3.2) without demonstrations, as well as the ICL prompts for generating labels of non-prototypical utterances in Step 2.2 (§3.2). One advantage of instructing LLMs is the ability to specify additional information in the prompts. When clustering topic datasets, there typically is a general understanding of the broad topic according to which utterances should be partitioned, and this topic can be specified in the prompts used to instruct the LLM. Since StackOverflow pertains to topics rather than intents, we adopted a more specific prototypical label generation prompt that instructs the LLM to directly summarize the utterances based on the "technology" they refer to. While this approach may not be effective for intent discovery (i.e., a single conversational dataset can contain intents from multiple topics as well as non-topic intents), we speculate that it could be applied to other topic classification datasets, e.g., News or Biomedical, where a prototypical prompt could instruct the LLM to identify the "news category" or the "medical drug", "disease", etc. We defer exploring IDAS for topic clustering beyond StackOverflow to future work.
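Structurally, an ICL prompt of the kind used in Step 2.2 consists of an instruction, a list of (utterance, label) demonstrations, and the query utterance. The sketch below is purely illustrative: the instruction wording and the "Utterance:"/"Label:" framing are hypothetical placeholders, not the paper's actual prompts (which are given in Figures 3-4).

```python
def build_icl_prompt(instruction, demonstrations, query):
    """Assemble a generic in-context learning prompt from labeled
    demonstrations. The field names and framing are illustrative
    placeholders, not the exact prompt format used by IDAS."""
    parts = [instruction]
    for utterance, label in demonstrations:
        parts.append(f"Utterance: {utterance}\nLabel: {label}")
    # The trailing "Label:" cues the LLM to complete the label for the query.
    parts.append(f"Utterance: {query}\nLabel:")
    return "\n\n".join(parts)
```

Topic-specific guidance, e.g., summarizing by "technology" for StackOverflow, would simply be folded into the `instruction` string.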

A.3 Qualitative Analysis
We conduct a qualitative analysis of IDAS's generated labels. Tables 8-10 show the generated labels for a subset of the clusters induced in Step 4 for the StackOverflow, Banking, and CLINC datasets, respectively. For each presented cluster, we report (i) the generated labels with their associated counts in that cluster, and (ii) the majority gold intent, i.e., the most prevalent gold intent among utterances in that cluster, together with the number of utterances in that cluster belonging to the majority gold intent.
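Computing the per-cluster statistics reported in Tables 8-10 amounts to counting generated labels and gold intents within each induced cluster. A minimal sketch (function and variable names are ours, not from the paper's code):

```python
from collections import Counter

def cluster_report(cluster_ids, generated_labels, gold_intents):
    """For each induced cluster, return (i) the counts of generated labels
    and (ii) the majority gold intent with its count in that cluster."""
    report = {}
    for c in set(cluster_ids):
        idx = [i for i, cid in enumerate(cluster_ids) if cid == c]
        label_counts = Counter(generated_labels[i] for i in idx)
        intent, count = Counter(gold_intents[i] for i in idx).most_common(1)[0]
        report[c] = {"labels": label_counts, "majority_intent": (intent, count)}
    return report
```

For the first row of Table 8, this report would show the label "Magento" with count 47 in a 49-utterance cluster.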
Main findings: Overall, Tables 8-10 reveal that there is little variation among generated labels within a given cluster. For the majority of clusters, the most frequently occurring generated label has a notably higher count than the other generated labels; e.g., the first row in Table 8 shows that the label "Magento" is generated for 47 out of 49 utterances in that cluster. These findings further support our main hypothesis that abstractive summarization increases the similarity in the input space of utterances with the same latent intent. Given the low variation across generated labels within clusters, we hypothesize that our generated labels could also make clusters easier to interpret than utterance-only clustering, thereby potentially reducing the time required for manually inspecting clusters in real-world settings.
Slightly specific labels: While most clusters clearly contain a single label that appears much more frequently than the others, there are some clusters, e.g., pto_request, plug_type, reminder_update, and calories for CLINC (Table 10), where this is not the case. However, a closer examination of these clusters reveals that the labels still exhibit low variation, since they share the same syntactic and lexical structure. For instance, the plug_type cluster's generated labels mostly follow the "Plug Converter ⟨noun adjunct⟩" pattern, with only the noun adjunct being specific to the utterance from which the label is generated. Note that for our intent discovery purpose, these slightly more specific labels do not negatively impact cluster performance, as long as there is high overlap in syntactic and lexical structure among the generated labels.
Overly general labels: Although some utterances are summarized into slightly more specific labels, others may be summarized into overly general ones. For instance, in the Banking cluster exchange_via_app (Table 9), the label "Foreign currency exchange" appears 25 times. However, 6 of those 25 utterances do not have exchange_via_app as their gold intent, despite having received the same generated label as the other 19 utterances that do. This occurs because generated labels corresponding to higher-level intents may be assigned to utterances that belong to different intents but share that common higher-level intent. For instance, the utterances "Can this app help me exchange currencies?" and "I want to make a currency exchange to EU" have the respective gold intents exchange_via_app and fiat_currency_support, yet both are summarized into the higher-level label "Foreign currency exchange". In contrast to generated labels that are slightly too specific, overly general labels can adversely affect cluster performance, as they may incorrectly group together utterances that belong to different intents but share a common high-level intent.

A.4 Implementation Details
For all presented experiments, the utterances are encoded (Steps 1, 3-4) on a 2.6 GHz 6-Core Intel Core i7 CPU, using a frozen pre-trained sentence encoder. Similarly, both the initial and the final K-means clustering, used to respectively retrieve prototypes (Step 1) and infer latent intents (Step 4), are conducted on CPU. We adopt the K-means implementation of scikit-learn (Pedregosa et al., 2011) with default parameter values, i.e., using the algorithm of Lloyd (1982) and n_init = 10.
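For illustration, the sketch below is a minimal numpy re-implementation of the defaults named above (Lloyd's algorithm with n_init random restarts); the experiments themselves use scikit-learn's `KMeans`, not this code.

```python
import numpy as np

def lloyd_kmeans(X, k, n_init=10, n_iter=100, seed=0):
    """Minimal Lloyd's algorithm with n_init random restarts, mirroring
    scikit-learn's default behavior (illustration only)."""
    rng = np.random.RandomState(seed)
    best_inertia, best_labels = np.inf, None
    for _ in range(n_init):
        # Initialize centers from randomly chosen points.
        centers = X[rng.choice(len(X), k, replace=False)]
        for _ in range(n_iter):
            # Assign each point to its nearest center.
            dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
            labels = dists.argmin(1)
            # Recompute centers; keep the old center if a cluster is empty.
            new_centers = np.array([
                X[labels == j].mean(0) if np.any(labels == j) else centers[j]
                for j in range(k)
            ])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        inertia = dists.min(1).sum()
        if inertia < best_inertia:
            best_inertia, best_labels = inertia, labels
    return best_labels
```

The equivalent scikit-learn call is simply `KMeans(n_clusters=k, n_init=10).fit_predict(X)`.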

Fig. 2: Inferring the number of smoothing neighbors n′. The vertical lines represent the automatically determined number of smoothing neighbors corresponding to the highest silhouette score (sil).
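The selection procedure visualized in Fig. 2 can be sketched as follows. Here `silhouette` is a minimal Euclidean re-implementation of the mean silhouette coefficient, and `embed_fn`/`cluster_fn` are hypothetical placeholders standing in for the smoothing and clustering steps; both names are ours.

```python
import numpy as np

def silhouette(X, labels):
    """Minimal mean silhouette coefficient (Euclidean), illustration only."""
    labels = np.asarray(labels)
    D = np.sqrt(((X[:, None] - X[None]) ** 2).sum(-1))
    n = len(X)
    scores = []
    for i in range(n):
        same = (labels == labels[i]) & (np.arange(n) != i)
        a = D[i][same].mean()  # mean intra-cluster distance
        b = min(D[i][labels == c].mean() for c in set(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

def pick_n_smooth(candidates, embed_fn, cluster_fn):
    """Choose n' maximizing the silhouette score of the induced clustering.
    embed_fn(n) returns smoothed embeddings for n neighbors; cluster_fn
    returns cluster labels. Both are placeholders, not IDAS's actual code."""
    best_n, best_score = None, -np.inf
    for n in candidates:
        X = embed_fn(n)
        score = silhouette(X, cluster_fn(X))
        if score > best_score:
            best_n, best_score = n, score
    return best_n
```

In practice, scikit-learn's `silhouette_score` would serve the same purpose.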

Table 1: Illustration based on GPT-3 and CLINC (Larson et al., 2019), demonstrating how abstractively summarizing utterances retains the core elements while removing non-intent-related information. The example in the bottom block, where apr is labeled as interest rate inquiry, exemplifies the broad domain knowledge captured by LLMs.

Table 3: Comparison against the unsupervised state-of-the-art. ♢: results from Zhang et al. (2022). ♠: results from (pre-)training MTP(-CLNN) on the test set (rather than a distinct unlabeled training set). The best model is typeset in bold and the runner-up is underlined. ∆MTP-CLNN values are the absolute gains of our IDAS.

Table 5: Effect of the encoding strategies.

Table 6 shows a substantial improvement of KATE over the random selection method, where the latter only marginally outperforms IDAS without any demonstrations.

Table 7: Effect of using a more powerful sentence encoder. The first four rows show the main results presented in §5.1, i.e., with the MTP encoder for Banking and StackOverflow, and with paraphrase-mpnet-base-v2 for CLINC. The last four rows show the results of performing the final clustering (Step 4) with the all-mpnet-base-v2 encoder.