ClusterPrompt: Cluster Semantic Enhanced Prompt Learning for New Intent Discovery



Introduction
New Intent Discovery (NID) aims to automatically identify novel intent categories that are not defined or observed beforehand. It plays a critical role in task-oriented dialogue systems with the ability to discern newly emerging user preferences (Liao et al., 2023a,b), thereby providing high-quality services (Lin et al., 2020; Zhang et al., 2021c, 2023). Different from traditional intent classification (E et al., 2019; Chen et al., 2019; Wang et al., 2021), the key challenges of NID lie in how to properly transfer prior knowledge from existing intents to discover new intents and how to efficiently solicit semantic evidence from user utterances.
Existing NID methods can be divided into two categories: unsupervised and semi-supervised. In unsupervised NID, researchers mainly focus on extracting better utterance features to assist clustering (Padmasundari and Bangalore, 2018; Shi et al., 2018), but they tend to ignore the prior knowledge contained in labeled data. Thus, in semi-supervised NID, various methods train representation learning models to facilitate knowledge transfer between labeled and unlabeled data, then perform clustering on utterance representations in a two-stage fashion (Zhang et al., 2023; Zhou et al., 2022). In view of knowledge transfer, some works first pre-train an in-domain intent classifier and then gradually update it with clustered pseudo-labels (Lin et al., 2020; Zhang et al., 2021c), while others formulate contrastive learning objectives to optimize model parameters (Mou et al., 2022b; Zhang et al., 2022).
The essence of these two-stage methods lies in learning discriminative semantic features for utterances by minimizing intra-class variance and maximizing inter-class variance in the first stage. Hence, the similarity or dissimilarity relations between utterances or clusters are emphasized. For instance, pair-wise similarities are used as pseudo supervision in Lin et al. (2020), and various contrastive learning objectives naturally enforce such relations (Wei et al., 2022; Mou et al., 2022a). However, the clustering process in the second stage can be easily distorted in favor of labeled data and dominant intent categories, resulting in the in-domain over-fitting problem shown in Figure 1. Another risk is that such relation distortion obscures the semantic meaning of intent clusters, leading to less meaningful new intents.
In this paper, we thus propose a Cluster semantic enhanced Prompt Learning (CsePL) method for NID with two stages. Specifically, we leverage semantic knowledge to regulate both stages, which are formulated as Intent Cluster Representation Learning (ICRL) and Prompting for Intent Discrimination (PID). In the ICRL stage, besides using two-level contrastive learning objectives to learn compact and closely connected regions for intents in feature space, we align the intent cluster representations with their corresponding label semantics. This enables the model to learn stable semantic features and semantic-aware intent cluster representations. In the following PID stage, we employ the learned intent cluster representations as soft prompt initializations and integrate them into input utterances to facilitate new intent discrimination. Given that the new inputs encompass the semantics of all intents, the prompting mechanism encourages the model to focus on matching the utterances with their inherent semantic meaning, thus reducing the dominance of existing intents. We evaluate the proposed CsePL on three widely-used datasets. It outperforms state-of-the-art (SOTA) methods in various aspects.
To summarize, our contributions are three-fold:
• In view of the in-domain over-fitting and meaningless new intent problems, we reiterate the importance of semantic knowledge in utterances and intent clusters for NID.
• We propose two-level contrastive learning objectives with label semantic alignment for learning semantic-aware intent cluster representations, and leverage the soft prompting mechanism to enhance the usage of semantic knowledge in intent discrimination.
• Experiments show that CsePL not only gains significant improvements over SOTA methods, but also suggests meaningful intent labels and enables early detection of new intents.

Related Work
New Intent Discovery. Identifying new intents is key to adaptable conversational agents for better dialogue state understanding (Zhang et al., 2019; Liao et al., 2021). Previous research on NID can be predominantly categorized into two types: unsupervised and semi-supervised. For the former, early approaches (Cheung and Li, 2012; Li et al., 2013) primarily relied on statistical features of the unlabeled data to cluster similar queries for discovering new user intents. Subsequently, some studies (Xie et al., 2016; Yang et al., 2017; Shi et al., 2018; Hadifar et al., 2019) have endeavored to leverage deep neural networks to learn robust representations conducive to new intent clustering. However, none of these fully leveraged supervised signals, such as existing intent labels. To address this, recent studies (Lin et al., 2020; Zhang et al., 2021b,c; Mou et al., 2022b; Zhang et al., 2022) have extended NID to a semi-supervised setting to achieve prior knowledge transfer, in which the labeled data is incorporated into the training process to assist new intent clustering. For example, Wei et al. (2022) and Zhang et al. (2023) first pre-trained a backbone model with the supervision of the limited labeled data. Then, they employed the pre-trained backbone to generate pseudo labels for the unlabeled data, directing the model to discern novel intents. Additionally, different from those pseudo-labeling-based semi-supervised methods, Mou et al. (2022a) and Zhang et al. (2022) sought to directly optimize utterance representations with the aid of supervised data. They formulated distinct contrastive learning objectives to learn discriminative utterance representations, facilitating similar-utterance clustering and establishing distinct boundaries for new intent clusters.
However, all these methods overemphasize relations such as similarity or dissimilarity among utterances for better clustering effects, while neglecting the semantics inside utterances and intents. Similar to the trivial solution in clustering (Yang et al., 2017; Caron et al., 2018; Ji et al., 2019; Zhang et al., 2021a; Zheng et al., 2023), this brings the problems of in-domain over-fitting and meaningless new intents due to data distortion. In our work, we enhance the model with semantic knowledge and use soft prompts to detect new intents.

Prompt Learning. Prompt learning is a new NLP paradigm for leveraging pre-trained language models (PLMs) (Brown et al., 2020; Sanh et al., 2022; Deng et al., 2023), which reformulates downstream tasks by inserting task-specific instructions into the input to align them with pre-training tasks. Early works (Jiang et al., 2020; Shin et al., 2020; Yuan et al., 2021; Ben-David et al., 2022) mainly utilized discrete hand-crafted or automatically searched prompts to acquire knowledge from PLMs. However, since discrete prompts are hard to optimize, recent works (Li and Liang, 2021; Lester et al., 2021; Liu et al., 2021; Gu et al., 2022; Hou et al., 2022) in prompt learning made efforts to optimize soft prompts in the continuous embedding space, which is more flexible and performs well in various downstream tasks. Here we explore utilizing both discrete prompts and soft prompts to leverage semantic knowledge for NID.
Contrastive Learning. Contrastive learning is a popular and effective approach to learn discriminative representations in both computer vision and NLP tasks (Chen et al., 2020; He et al., 2020; Fang and Xie, 2020; Carlsson et al., 2021; Giorgi et al., 2021; Gao et al., 2021; Wu et al., 2022a,b; Ye et al., 2022; Guo et al., 2022). The primary intuition of contrastive learning is to pull positive pairs together in feature space while pushing negative pairs apart. Motivated by its superior performance, contrastive learning has also been adopted for intent recognition in recent works (Zhang et al., 2021d; Mou et al., 2022a; Wei et al., 2022; Zhang et al., 2022, 2023). We leverage it to help learn better utterance and intent cluster representations.
The CsePL Approach

Problem Formulation
Let C_k and C_u denote the known intent set and the unknown intent set, respectively. Given a labeled dataset covering intents in C_k and an unlabeled dataset D_unlabeled, the goal of NID is to identify potential unknown intents in C_u from D_unlabeled and classify each input x_i into its corresponding intent y_i ∈ C_k ∪ C_u. Here we focus on the semi-supervised NID setting.

Model Overview
The proposed CsePL model, illustrated in Figure 2, consists of two stages for discovering new intents. The first stage is Intent Cluster Representation Learning (ICRL), while the second stage is Prompting for Intent Discrimination (PID). We introduce these two stages in the following subsections one by one. In general, the ICRL stage is designed to solicit meaningful intent cluster representations. To achieve this, we reorganize the user utterances using a unified hand-crafted discrete prompt, and the prompted utterances are then provided as inputs to a BERT-based backbone for feature extraction.
Then, two-level contrastive learning objectives with label semantic alignment are applied to optimize the model parameters. Given the learned intent cluster representations, the PID stage targets discriminating intents for all utterances in an all-intents-aware situation. Hence, we construct soft prompts with these learned semantic-aware intent cluster representations as initializations and perform further fine-tuning to guide the pre-trained backbone to discriminate intents. During inference, we send all test utterances to the trained PID model to extract intents-aware utterance representations, and then conduct K-means to predict their intent categories.

Intent Cluster Representation Learning
Different from previous approaches that emphasize relations among utterances or clusters, in this stage we aim to enhance our model with the semantic knowledge inside utterances and intents to learn meaningful intent cluster representations.
To achieve this, we first employ a prompt learning method to extract representations for the input utterances in both the labeled known-intent data and D_unlabeled. As aforementioned, it is an effective approach to leverage PLMs to extract semantic information inside utterances. Given an input utterance x_i, we convert x_i to x_prompt_i by inserting a unified hand-crafted discrete prompt into it, yielding "[CLS] The intent [MASK] is in: x_i", where "[CLS] The intent [MASK] is in:" are hand-crafted discrete prompt tokens. We tried different token designs and empirically chose these for best performance. We send x_prompt_i to the PLM, and regard the extracted representation z_i at the "[MASK]" position as the representation of input x_i.
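As a rough illustration, the discrete prompt construction can be sketched as follows. The template string is the one given above; the helper names and the whitespace split in the usage check are our own simplifications, since a real pipeline would use the PLM's tokenizer:

```python
# Hypothetical sketch of the discrete prompt construction described above.

def build_prompted_utterance(utterance: str) -> str:
    """Wrap an utterance with the unified hand-crafted discrete prompt."""
    return f"[CLS] The intent [MASK] is in: {utterance}"

def mask_position(tokens: list) -> int:
    """Index of the [MASK] token, whose hidden state serves as z_i."""
    return tokens.index("[MASK]")

prompted = build_prompted_utterance("how do I top up my card?")
# the hidden state at mask_position(...) would be taken as z_i
```

In practice, the [CLS] and [MASK] markers would be inserted by the tokenizer itself rather than written into the raw string.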
Then, we conduct two-level contrastive learning with label semantic alignment to optimize the obtained utterance representations for learning meaningful intent cluster representations. At the utterance level, we conduct both supervised and unsupervised contrastive learning to learn more accurate utterance representations. At the cluster level, we enforce each cluster center to be far away from other clusters' utterances and close to its own cluster members. Beyond these, we further use label semantics to regulate the intent cluster center representations via a contrastive learning objective.
Utterance-level Contrastive Loss. Inspired by the contrastive learning scheme (Khosla et al., 2020) under the supervised setting, we optimize the model to bring utterances under the same intent together while dispersing utterances from different intents. Let y_i denote the ground-truth intent label for an utterance x_i and Y(i) denote all utterances sharing the same intent label as x_i; the utterance-level supervised contrastive loss is L_scl = -(1/N) Σ_i (1/|Y(i)|) Σ_{p∈Y(i)} log [ exp(f_i·f_p/τ_1) / Σ_{j≠i} exp(f_i·f_j/τ_1) ], where f_i = ϕ(z_i), τ_1 denotes the temperature in the contrastive loss, and ϕ is a normalizing projection.
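For concreteness, a minimal NumPy sketch of the standard supervised contrastive loss (Khosla et al., 2020) that this objective instantiates is shown below; the function name and the averaging over anchors are our assumptions, not the paper's released code:

```python
import numpy as np

def supcon_loss(feats: np.ndarray, labels: np.ndarray, tau: float = 0.07) -> float:
    """Supervised contrastive loss over one mini-batch.

    feats:  (n, d) L2-normalized utterance representations f_i = phi(z_i).
    labels: (n,) integer intent labels. For each anchor i, positives are all
    other utterances with the same label; the denominator ranges over j != i.
    """
    n = feats.shape[0]
    sim = feats @ feats.T / tau                      # pairwise similarities
    np.fill_diagonal(sim, -np.inf)                   # exclude self-contrast
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    loss, count = 0.0, 0
    for i in range(n):
        pos = (labels == labels[i]) & (np.arange(n) != i)
        if pos.any():
            loss += -log_prob[i, pos].mean()         # average over positives
            count += 1
    return loss / max(count, 1)
```

A batch where same-intent utterances already have identical representations yields a near-zero loss, while scrambled labels yield a large one.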
Beyond the label supervision, we also conduct unsupervised contrastive learning to help learn utterance representations, similar to Equation 1. For the utterance x_i, its positive sample is derived from its dropout-augmented view. Meanwhile, the negative samples for x_i are taken from all other utterances and their corresponding augmentations within the same mini-batch.
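A simplified sketch of this unsupervised objective, in the SimCSE style, is given below; for brevity it draws negatives only from the second view's rows rather than from both views as described above:

```python
import numpy as np

def unsup_contrastive_loss(view1: np.ndarray, view2: np.ndarray,
                           tau: float = 0.07) -> float:
    """In-batch InfoNCE over dropout-augmented pairs.

    view1[i] and view2[i] are representations of the same utterance under two
    dropout masks; every other row of view2 acts as a negative for anchor i.
    """
    sim = view1 @ view2.T / tau
    # for anchor i, the positive is the diagonal entry sim[i, i]
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return float(-np.diag(log_prob).mean())
```

When the two views of each utterance agree, the diagonal dominates and the loss is small; mismatched pairings drive it up.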
Moreover, since intent labels are high-quality supervision signals, we also adopt a cross-entropy loss on labeled utterances to optimize the model.
Cluster-level Contrastive Loss. At the cluster level, we optimize a contrastive objective to pull each cluster representation close to its utterance members while pushing it away from the utterances of other intent clusters. We conduct K-means on D_unlabeled at the beginning of each training epoch to obtain intent clusters. Given a cluster S_i = {x_1, x_2, ..., x_q}, we directly define its center as c_i = (1/q) Σ_{x_j ∈ S_i} f_j. The cluster-level contrastive learning objective is then L_ccl = -(1/N) Σ_i log [ exp(f_i·c_i/τ_2) / Σ_j exp(f_i·c_j/τ_2) ], where c_i is the intent cluster center corresponding to utterance x_i, c_j represents any other intent cluster center, and τ_2 denotes the temperature in this contrastive loss.
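The cluster centers and the cluster-level objective can be sketched as follows, assuming hard K-means assignments are already given; the member-to-center form used here is our simplification of the paper's description:

```python
import numpy as np

def cluster_centers(feats: np.ndarray, assign: np.ndarray, k: int) -> np.ndarray:
    """Cluster center c_i as the mean of its member representations."""
    return np.stack([feats[assign == c].mean(axis=0) for c in range(k)])

def cluster_contrastive_loss(feats: np.ndarray, assign: np.ndarray,
                             centers: np.ndarray, tau: float = 0.1) -> float:
    """Pull each utterance toward its own cluster center, push from others."""
    sim = feats @ centers.T / tau                    # (n, k) similarities
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return float(-log_prob[np.arange(len(assign)), assign].mean())
```

Assigning each utterance to its nearest center gives a small loss; deliberately swapped assignments give a much larger one, which is what drives the centers apart during training.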
Semantic Alignment. For the labeled training data, we investigate soliciting semantic information from both utterances and intent labels, and optimize the model with label semantic alignment. The intuition is that label semantic features are more stable in the feature space and are not influenced by the distribution of the training data.
Aligning the known intent cluster representations to such stable label semantic features protects the model from the in-domain over-fitting problem and enables it to give suggestions about intent labels. To achieve this, we also adopt a contrastive learning objective for label semantic alignment. Specifically, for each known intent y_i ∈ C_k, we first use the BERT-based model to obtain embeddings for the tokens in the intent label y_i. As each intent label may contain multiple tokens, we directly apply mean pooling to obtain the intent label representation l_i = (1/|y_i|) Σ_j BertEmb(t_j), where |y_i| is the number of tokens in the intent label y_i, BertEmb(·) is the embedding projection of the BERT-based model, and t_j denotes the j-th token in y_i. The label semantic alignment loss is then defined as L_sa = -(1/|C_k|) Σ_i log [ exp(c_i·l_i/τ_3) / Σ_j exp(c_i·l_j/τ_3) ]. As a result, the overall loss in the ICRL stage combines the cross-entropy loss L_ce with the two utterance-level contrastive losses, the cluster-level contrastive loss, and the label semantic alignment loss, where {α, β, λ, η} are hyper-parameters that modulate the respective contributions of the distinct losses.
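The mean-pooled label representation and a plausible form of the alignment objective can be sketched as below; treating the other known-intent labels as negatives for each cluster center is our assumption about the contrastive formulation:

```python
import numpy as np

def label_representation(token_embs: np.ndarray) -> np.ndarray:
    """Mean-pool the token embeddings of a multi-token intent label (l_i)."""
    return token_embs.mean(axis=0)

def semantic_alignment_loss(centers: np.ndarray, label_reps: np.ndarray,
                            tau: float = 0.1) -> float:
    """Contrastively align each known-intent cluster center with its own
    label representation, against the other known labels as negatives."""
    sim = centers @ label_reps.T / tau               # (|C_k|, |C_k|)
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return float(-np.diag(log_prob).mean())
```

Because the label representations are fixed by the embedding table, minimizing this loss anchors the cluster centers to stable semantic targets rather than to the training-data distribution.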

Prompting for Intent Discrimination
In the former stage, we trained the model with different contrastive learning objectives to learn meaningful intent cluster representations. In the PID stage, we apply soft prompt learning to exploit these learned representations for efficient intent discrimination. It is an effective and flexible approach to leverage the semantic knowledge of PLMs and demonstrates strong performance across a range of downstream tasks. Specifically, we utilize the learned intent cluster representations as soft prompt initializations. Given an input utterance x_i, we first use the pre-trained backbone to convert x_i into a sequence of token embeddings E_i. Then, we insert the sequence of soft prompt vectors {c_1, c_2, ..., c_K} into E_i to construct the prompted input E_prompt_i, where c_i is the intent cluster representation learned in the ICRL stage. The prompted input E_prompt_i is sent to the pre-trained backbone to extract the representation of x_i. In this stage, we regard the extracted representation at the "[CLS]" position, after a normalizing projection, as the utterance representation h_i. It is noteworthy that each derived intent cluster representation acts as an independent soft prompt token in the prompted input, which reveals the semantics of all intent candidates to the model. To update the model in this stage, following the work of Zhang et al. (2022), we optimize a contrastive learning objective that mines neighboring utterance representations, pulling them together and pushing distant ones away in the feature space. Concretely, for each utterance x_i, the utterances in its neighbor set N(i) serve as positives while the remaining in-batch utterances serve as negatives, with τ_4 denoting the temperature. Similar to Zhang et al. (2022), we select the 50 utterances most similar to x_i in the feature space as its neighbors. Here n is the mini-batch size and each utterance is accompanied by an augmented version. During training, we update the neighbor utterance set N(i) every few epochs to guide the model to form clear cluster boundaries for new intent discovery.
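The soft prompt injection itself amounts to an embedding-level concatenation. The minimal sketch below prepends the cluster representations; the exact insertion position relative to the [CLS] embedding is an implementation detail we assume:

```python
import numpy as np

def build_prompted_input(token_embs: np.ndarray,
                         cluster_reps: np.ndarray) -> np.ndarray:
    """Prepend the learned intent cluster representations as independent
    soft prompt tokens to the utterance's token embeddings.

    token_embs:   (seq_len, d) embeddings of the utterance tokens.
    cluster_reps: (k, d) cluster representations from the ICRL stage.
    """
    # every cluster representation becomes one soft prompt vector, exposing
    # all candidate intent semantics to the model at once
    return np.concatenate([cluster_reps, token_embs], axis=0)
```

In a real implementation these prompt vectors would be trainable parameters of the model, initialized from the ICRL-stage centers and updated during PID fine-tuning.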

Datasets
We conduct experiments to evaluate the performance of our CsePL on three widely-used NID datasets. Banking77 (Casanueva et al., 2020) is a fine-grained dataset collected from banking dialogues with 77 intents. Clinc150 (Larson et al., 2019) contains 22,500 samples with 150 intents across 10 domains. StackOverflow (Xu et al., 2015) is collected from Kaggle.com and includes 20,000 samples over 20 intents.
In the experiments, we retain the same division of Banking77, Clinc150, and StackOverflow as delineated in Zhang et al. (2023). More experimental details can be found in Appendix A.2.

Evaluation Metrics
We adopt three commonly used metrics to evaluate the clustering performance: Normalized Mutual Information (NMI), Adjusted Rand Index (ARI), and Accuracy (ACC). To evaluate ACC, we use the Hungarian algorithm (Kuhn, 1955) to construct the mapping between predicted clusters and ground-truth intent categories. Note that ACC is the most important evaluation metric in our experiments.
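The Hungarian-matched clustering accuracy can be computed with SciPy's assignment solver, for instance as in this sketch (the helper name is ours):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """ACC via the Hungarian algorithm: find the one-to-one cluster-to-label
    mapping that maximizes the number of correctly assigned utterances."""
    k = max(y_true.max(), y_pred.max()) + 1
    cost = np.zeros((k, k), dtype=int)
    for t, p in zip(y_true, y_pred):
        cost[p, t] += 1                              # co-occurrence counts
    rows, cols = linear_sum_assignment(cost, maximize=True)
    return cost[rows, cols].sum() / len(y_true)
```

A perfect clustering whose cluster indices are merely permuted relative to the ground truth still scores ACC = 1.0 under this mapping.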

NID Performance Comparison
We present the main performance comparison results in Table 1, where the best results are highlighted in bold. Generally speaking, our proposed CsePL achieves significant improvements over the previous baselines. Here, we present the result analyses from the following aspects:
Our proposed CsePL learns cluster-friendly semantics for discovering new intents: It can be seen that the proposed CsePL significantly outperforms baselines such as USNID and MTP-CLNN, achieving new SOTA performance on the three NID datasets. For example, compared with USNID, our CsePL improves ACC by 3.57%, ARI by 1.82%, and NMI by 0.29% on the Banking77 dataset with a 75% known intent rate. It is worth mentioning that Banking77 is a fine-grained dataset collected from banking dialogues, whose utterances and intent labels contain rich semantic knowledge. This demonstrates that CsePL can mine valuable and cluster-friendly semantic knowledge from the training data to enhance new intent clustering.
Interestingly, we find that although the proposed CsePL tends to perform worse than the strongest baseline USNID on the NMI metric under the 25% known intent rate, this result is less stable, as evidenced by the large p-value. We find that given different random seeds, CsePL performs either better or worse than USNID on the 25% StackOverflow setting. This might be due to differences in the set of known intents selected.
Our proposed CsePL reduces in-domain over-fitting for NID: We can observe that our CsePL maintains its superior performance when confronted with the predominance of larger-scale labeled data. For example, on the Clinc150 dataset with a 75% known intent rate, CsePL surpasses the best-performing baseline USNID by margins of 3.1% in ACC, 2.11% in ARI, and 0.16% in NMI. It is worth noting that the Clinc150 dataset encompasses 150 distinct intents and possesses a greater number of labeled training utterances for known intents. This shows the efficacy of CsePL in effectively mitigating in-domain over-fitting.
Effect of different known intent rates: From Table 1, we can observe that the performances of all NID methods gradually decrease as the known intent rate goes down. The lower the known intent rate, the less labeled data is available for guiding model training, which makes prior knowledge transfer for discovering new intents more difficult. However, as the known intent rate decreases, our proposed CsePL achieves more substantial improvements. For example, on the Banking77 dataset with a 25% known intent rate, CsePL achieves a 5.21% ACC improvement over USNID, while the ACC improvements are 3.67% and 3.57% at known intent rates of 50% and 75%, respectively. These results further indicate that our proposed CsePL generalizes better in NID.

Suggesting New Intent Labels
In order to demonstrate that our proposed CsePL can solicit semantic knowledge from existing utterances and intent labels to suggest new intent cluster labels, we select four known intents and four unknown intents from C_k and C_u respectively in the Banking77 dataset with the 25% known intent rate, and utilize their intent cluster representations learned in the ICRL stage to search for the most related tokens in the whole BERT-based vocabulary. We use cosine similarity for the ranking and only present the most relevant tokens after filtering special tokens such as "[MASK]" and "[CLS]".
As reported in Table 2, for the known intents, the intent cluster representations learned by our CsePL can exactly retrieve the label tokens or semantically similar tokens from the BERT-based vocabulary. For example, given the intent top up failed, all tokens appearing in the intent label are retrieved by the intent cluster representation as the most related tokens. For the intent supported cards and currencies, our proposed CsePL can search for and distinguish semantically similar tokens such as cash and money as the relevant tokens. This suggests that the intent cluster representations derived by our CsePL can accurately capture the semantics associated with their respective intent labels.
It is noteworthy that CsePL is also capable of providing meaningful label suggestions for unknown intents. For instance, our CsePL picks up the token exchange from the vocabulary as the most relevant token for the intent exchange charge in C_u. Even for the unknown intent wrong amount of cash received with more complex semantics, the semantically similar tokens receiving and money are retrieved as related tokens by CsePL. This shows the ability of CsePL in suggesting meaningful unknown intent labels.

Early Detection of New Intents
Early detection is a critical requirement for new intent discovery methods. To demonstrate this, we compare the performances of different methods when only a few utterances are available for each unknown intent. The results are reported in Table 3. We can observe that when the utterances for each unknown intent are limited, all methods perform worse than before, but the proposed CsePL significantly outperforms the other two methods. This indicates the ability of CsePL to discriminate new intents at an early stage, and again signals the importance of leveraging semantic knowledge.

Detailed Analysis
In this subsection, we conduct a detailed analysis to explore the impact of each key component of CsePL: 1) CsePL w/o PID: we entirely remove the PID stage and only use the model trained in the ICRL stage for NID. 2) CsePL w/o SemanticAlign: we remove the label semantic alignment (Equation 4) from the ICRL stage during training. 3) We also analyze the effect of the predicted cluster number K.
Note that we present results exclusively for Banking77. Other datasets exhibit similar patterns, but we omit them due to space limitations.

Effect of Prompt Discrimination
We compare the full CsePL with the CsePL w/o PID variant, and further investigate various soft prompt initialization techniques during the PID stage. We leave the details of the experimental results in Appendix A.3.

Effect of Semantic Alignment
We also compare the performance of the model with the semantic alignment process removed from the ICRL stage against the standard CsePL to explore the contribution of the semantic alignment. We find that eliminating the semantic alignment for intent cluster representation learning degrades new intent discovery performance. For example, without utilizing the semantics, ACC drops from 71.06% to 68.75% on the Banking77 dataset with the 25% known intent rate. This demonstrates the importance of the semantic alignment process.

Effect of Estimating Cluster Number K
We have been assuming the cluster number K as a given hyper-parameter, in the same fashion as the baselines. However, in practical dialogue systems the number of clusters is unknown, and it is important to predict K for new intent discovery. Following the work of Zhang et al. (2021c), we predict the cluster number K via an estimation algorithm. More details can be found in Appendix A.4. We present the model performances for different cluster numbers K in Table 5. We can observe that although the performances of both our CsePL and the SOTA baseline USNID decline with an inaccurate cluster number K, the proposed CsePL still achieves significant performance improvements over USNID. This shows that the proposed CsePL is more robust with respect to the estimated cluster number K.

NID Representation Visualisation
In order to more intuitively analyze the effect of the proposed CsePL on representation learning, we present t-SNE visualizations comparing the leading baseline USNID and our CsePL approach, as illustrated in Figure 3. The USNID visualization reveals that data points of the unknown intent are distorted and dispersed within two known intent clusters, unable to verify identity and verify my identity. This results in the in-domain over-fitting problem. Furthermore, this dispersion undermines the meaning of the newly learned intent cluster, as it encompasses instances from three distinct intents. Conversely, the visualization for CsePL demonstrates how the label semantic alignment effectively aligns the intent cluster representations with the semantics of their corresponding labels. This renders the unknown intent cluster, why verify identity, more coherent and less distorted. Additionally, with reduced noise in this cluster, its meaning becomes more discernible.

Error Analysis
In this subsection, we conduct an error analysis to delve into the problem of in-domain over-fitting and to evaluate the effectiveness of our proposed CsePL method. In Table 6, we present the ratio of unknown intent samples that the model wrongly classified as known intents. Additionally, we highlight the percentage of utterances that originate from known intents but were inaccurately predicted. We can observe that the leading baseline USNID incorrectly classifies 9.1% of utterances with unknown intents as known intents. This misclassification rate is nearly twice that of known intents being predicted inaccurately, which stands at 4.6%. This implies that the presence of known intent data can excessively sway the clustering procedure, leading to the in-domain over-fitting problem. Compared with USNID, the clustering outcomes derived from CsePL demonstrate a diminished influence of the known intent data, leading to a notable reduction in both ratios.

Conclusion
In this paper, we reemphasized the importance of semantic knowledge in new intent discovery and proposed a Cluster semantic enhanced Prompt Learning (CsePL) method. Specifically, we designed two-level contrastive learning with label semantic alignment for intent cluster representation learning, and a soft prompting method to leverage the learned intent cluster representations for NID. Experimental results on three public datasets demonstrate the effectiveness of CsePL. Extensive analyses further show that CsePL not only significantly outperforms the existing baselines, but also suggests new intent labels and detects the appearance of new intents at an early stage.

Acknowledgement
This research is supported by the Ministry of Education, Singapore, under its AcRF Tier 2 Funding (Proposal ID: T2EP20123-0052). Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of the Ministry of Education, Singapore.

Limitations
We discuss the limitations from the following perspectives: (1) Usage of LLMs. Recently, large language models (LLMs) such as ChatGPT and GPT-4 have exhibited outstanding performance on various NLP tasks, showing their abundance of semantic knowledge. Though BERT has the advantage of relatively low resource consumption, we will look into how to leverage the knowledge of LLMs for better NID.
(2) New intent labels. Although our method has shown potential in suggesting new intent labels, we plan to further investigate the possibility of generating the whole label directly, which would be more useful.
(3) Early detection. Early detection is critical in deployed systems. We plan to look further into this aspect and conduct comprehensive experiments to test the limit of how early our method can work.

Figure 1 :
Figure 1: The overview of the in-domain over-fitting problem in NID and our label semantic alignment.

Figure 2 :
Figure 2: The overall architecture of our proposed CsePL framework for new intent discovery. The left part denotes the ICRL stage and the right part is the PID stage, where [utterance] is the original utterance, and {c_1, c_2, c_3, ...} are the soft prompts initialized by all the learned intent cluster representations.
|C_k| is the number of known intents and τ_3 denotes the temperature. Note that the cluster indices generated by K-means are permuted randomly in different training epochs. To tackle this pseudo-label inconsistency problem and provide high-quality supervision signals, we also conduct cluster alignment following the work of Zhang et al. (2021c).

Table 1 :
Main performance results on new intent discovery across three public datasets. † denotes p-value < 0.01 and * denotes p-value < 0.05 under t-test; unmarked results have p-value > 0.05.

Table 2 :
Intent label suggestions: the related tokens that appear in the intent label are marked in red, while the related tokens that have similar meaning to the intent label tokens are highlighted in bold.

Table 3 :
Results of early detection of new intents. For each unknown intent, only 20 utterances are available.

Table 5 :
Effect of estimating cluster number K.

Table 6 :
Ratio of wrongly predicted intents.UPK denotes utterances that belong to Unknown intents but are inaccurately Predicted as Known intents.KPE represents utterances that originate from Known intents but are Predicted Erroneously.