IntenDD: A Unified Contrastive Learning Approach for Intent Detection and Discovery

Identifying intents from dialogue utterances forms an integral component of task-oriented dialogue systems. Intent-related tasks are typically formulated either as a classification task, where the utterances are classified into predefined categories, or as a clustering task, when new and previously unknown intent categories need to be discovered from these utterances. Further, intent classification may be modeled in a multiclass (MC) or multilabel (ML) setup. While these are typically modeled as separate tasks, we propose IntenDD, a unified approach leveraging a shared utterance encoding backbone. IntenDD uses an entirely unsupervised contrastive learning strategy for representation learning, where pseudo-labels for the unlabeled utterances are generated based on their lexical features. Additionally, we introduce a two-step post-processing setup for the classification tasks using modified adsorption: first, the residuals in the training data are propagated, followed by label smoothing, both modeled in a transductive setting. Through extensive evaluations on various benchmark datasets, we find that our approach consistently outperforms competitive baselines across all three tasks. On average, IntenDD reports percentage improvements of 2.32%, 1.26%, and 1.52% in the respective metrics for few-shot MC classification, few-shot ML classification, and intent discovery.


Introduction
Intents form a core natural language understanding component in task-oriented dialogue (ToD) systems. Intent detection and discovery not only have immense utility but are also challenging due to numerous factors. Intent classes vary vastly from one use case to another, and often arise out of business needs specific to a particular product or organization. Further, modeling requirements might necessitate considering fine-grained and semantically similar concepts as separate intents (Zhang et al., 2021c). Overall, intent-related tasks are typically expected to be scalable and resource-efficient, to quickly bootstrap to new tasks and domains; lightweight and modular, for maintainability across domains; and expressive, to handle large, related, and often overlapping intent scenarios (Vulić et al., 2022; Zhang et al., 2021a).
INTENDD is a unified framework for intent detection and discovery from dialogue utterances in ToD systems. The framework enables the modeling of various intent-related tasks such as intent classification, both multiclass and multilabel, as well as intent discovery, both unsupervised and semi-supervised. In intent detection (classification), we expect every class to have a few labeled instances, say 5 or 10. In intent discovery, however, not all classes are expected to have labeled instances, and the data may even be completely unlabeled.
Recent intent-related models focus on contrastive representation learning, owing to the limited availability of labeled data and the presence of a semantically similar and fine-grained label space (Kumar et al., 2022; Zhang et al., 2021c). Similarly, a common utterance encoder forms the backbone of INTENDD, irrespective of the task. The utterance encoder is learned by updating the parameters of a general-purpose pre-trained encoder using a two-step contrastive representation learning process. First, we adapt the pre-trained encoder using unlabeled utterances from various publicly available intent datasets. Second, we update the parameters of the encoder using utterances from the target dataset, on which the task needs to be performed, making the encoder specialize on the corpus. Here, we use both labeled and unlabeled utterances from the target dataset, where pseudo-labels are assigned to the latter.
For intent classification, both multiclass and multilabel, INTENDD consists of a three-step pipeline: a classifier that uses the representation from the encoder as its feature representation, followed by two post-processing steps in a transductive setting. Specifically, a multilayer perceptron-based classifier is trained by stacking it on top of the utterance representation from our encoder. The post-processing steps treat the target corpus as a graph in a transductive setting. The first post-processing step propagates the residual errors in the training data to their neighbors. The second further performs label smoothing by propagating the labels obtained from the previous step. Both steps use Modified Adsorption, an iterative algorithm that enables tighter control over the information propagated through a node (Talukdar and Pereira, 2010).
Major contributions: INTENDD reports performance improvements over competitive baselines in all the tasks and settings we experimented with, including multiclass and multilabel classification in few-shot and high-data settings, and unsupervised and semi-supervised intent discovery. Our two-step post-processing setup for intent classification leads to statistically significant performance improvements over our base model. While existing intent models focus primarily on better representation learning and data augmentation, we show that classical transductive learning approaches can improve the performance of intent models even in fully supervised settings. Finally, we show that careful construction of the graph structure in a transductive learning setting, in terms of both edge formation and edge weighting, can further improve our results.

INTENDD
INTENDD consists of a two-step representation learning module, a classification module, and an intent discovery module. We elaborate on each of these modules in this section.

Continued Pretraining
We start with a standard general-purpose pre-trained model and use it as a cross-encoder for continued pretraining (Gururangan et al., 2020). We follow Zhang et al. (2021c) for our pretraining phase, where the model parameters are updated using a combination of a token-level masked language modeling loss and a sentence-level self-supervised contrastive loss. For a sentence $x_i$, we obtain a masked version $\tilde{x}_i$, where a few tokens of $x_i$ are randomly masked; masking is dynamic, so each sentence has different masked positions across training epochs. For a batch of $K$ sentences, we compute the contrastive loss (Wu et al., 2020; Liu et al., 2021) as

$$\mathcal{L}_{sscl} = -\frac{1}{K} \sum_{i=1}^{K} \log \frac{\exp\left(\mathrm{sim}(h_i, \tilde{h}_i)/\tau\right)}{\sum_{j=1}^{K} \exp\left(\mathrm{sim}(h_i, \tilde{h}_j)/\tau\right)} \quad (1)$$

In $\mathcal{L}_{sscl}$, $h_i$ is the representation of the sentence $x_i$ and $\tilde{h}_i$ is the representation of $\tilde{x}_i$; $\tau$ is a temperature parameter that controls the penalty on negative samples, and $\mathrm{sim}(\cdot,\cdot)$ denotes the cosine similarity between two vectors. The final loss is computed as $\mathcal{L}_{pretraining} = \mathcal{L}_{sscl} + \lambda \mathcal{L}_{mlm}$, where $\mathcal{L}_{mlm}$ is the token-level masked language modeling loss and $\lambda$ is a weight hyperparameter.
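As a concrete illustration, the batch-level contrastive term can be sketched in plain Python. This is a minimal sketch over toy vectors: in the actual model, `H` and `H_masked` would be encoder outputs for the original and masked views of the batch.

```python
import math

def cos(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def sscl_loss(H, H_masked, tau=0.1):
    """Self-supervised contrastive loss: each sentence's masked view is its
    positive; the masked views of the other sentences in the batch act as
    in-batch negatives."""
    K = len(H)
    total = 0.0
    for i in range(K):
        sims = [math.exp(cos(H[i], H_masked[j]) / tau) for j in range(K)]
        total += -math.log(sims[i] / sum(sims))
    return total / K

def pretraining_loss(l_sscl, l_mlm, lam=0.03):
    """Combined objective: contrastive loss plus weighted MLM loss."""
    return l_sscl + lam * l_mlm
```

With orthogonal representations and a perfect positive match, the contrastive loss is close to zero; a lower temperature sharpens the penalty on negatives.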

Corpus-specialized Representation Learning
The pretraining step uses unlabeled sentences from publicly available intent datasets, which should ideally expose the pre-trained language model to utterances in the domain. Now, we consider contrastive representation learning using the target dataset on which the task needs to be performed. Consider a dataset D with a total of N unlabeled input utterances. Assuming D to be completely unlabeled, we first assign pseudo-labels to each of the utterances in D. Using the pseudo-labels, we learn a corpus-level contrastive representation with the supervised contrastive loss (Khosla et al., 2020). The pseudo-labels are assigned by first finding clusters of utterances using the Louvain community detection algorithm (Blondel et al., 2008). Community detection requires the construction of a graph. We form a connected weighted directed graph G_D(V_D, E, W), where the input utterances in D form the nodes, and we identify lexical features in the form of word-level n-grams to define the edges.
We identify keyphrases that are representative of the target corpus on which the representation learning is performed. The keyphrases are obtained by finding word-level n-grams that have a high association with the target corpus, compared to the likelihood of finding them in other arbitrary corpora. Specifically, we compute the pointwise mutual information (PMI) of each n-gram with the target corpus, based on the likelihood of the n-gram occurring in the corpus, compared to the set of utterances formed by the union of the target corpus and the corpora used in the pretraining setup. Let P be the union of all the sentences in the corpora used in the pretraining step. The PMI of a keyphrase kp is calculated as

$$\mathrm{PMI}(kp) = \log \frac{df(kp, D)\,/\,|D|}{df(kp, P \cup D)\,/\,|P \cup D|}$$

where df(kp, D) is the count of utterances in D that contain the keyphrase kp, and df(kp, P ∪ D) is the corresponding count in the combined collection of D and P. We only consider keyphrases that are present at least five times in D. Moreover, the PMI is multiplied by the log frequency of the keyphrase to avoid high scores for rare words (Jin et al., 2022). Further, the PMI value is multiplied by the square of the number of words in the n-gram, so as to assign higher scores to n-grams with larger values of n (Banerjee and Pedersen, 2002). We validated this decision in preliminary experiments, where multiplying PMI by the square of the number of words generally worked better for the datasets considered in this work. That said, this design choice may not transfer to other datasets, and its utility should be established empirically.
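The scoring above can be sketched as follows. This is a simplified sketch: utterances are raw strings, keyphrase occurrence is approximated by substring matching, and the corpora are illustrative; the paper's exact tokenization and normalization may differ.

```python
import math

def keyphrase_score(kp, target_utts, pretrain_utts, min_df=5):
    """Score an n-gram keyphrase by its PMI association with the target
    corpus, multiplied by the log of its document frequency (to suppress
    rare phrases) and the squared n-gram length (to favor longer n-grams)."""
    df_D = sum(kp in u for u in target_utts)
    if df_D < min_df:
        return 0.0  # filter keyphrases seen fewer than five times in D
    combined = target_utts + pretrain_utts  # D union P
    df_all = sum(kp in u for u in combined)
    pmi = math.log((df_D / len(target_utts)) / (df_all / len(combined)))
    n = len(kp.split())
    return (n ** 2) * math.log(df_D) * pmi
```

On a toy corpus, a bigram specific to the target corpus outscores its unigram at equal frequency, and a phrase below the frequency cutoff scores zero.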
The keyphrases are then used to construct G_D. Two nodes have an edge between them if they share at least one keyphrase. The edge weight is the sum of the scores of the keyphrases common to the two nodes. The weight matrix W is an N × N matrix representing the edge weights in the graph, and W is row-normalized using min-max normalization, a form of feature scaling. The graph G_D is then used to perform community detection with Louvain, a modularity-based community detection algorithm. Community membership is used to form clusters of inputs: all the nodes in G_D that belong to the same community are assigned a common (pseudo-)label.
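A minimal sketch of this graph construction (keyphrase occurrence again approximated by substring matching; `kp_scores` would come from the PMI-based scoring above):

```python
def build_keyphrase_graph(utterances, kp_scores):
    """Weighted graph over utterances: edge (i, j) exists iff the two
    utterances share at least one keyphrase; its weight is the sum of the
    shared keyphrase scores. Each row of W is then min-max normalized."""
    n = len(utterances)
    kps_of = [{kp for kp in kp_scores if kp in u} for u in utterances]
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                shared = kps_of[i] & kps_of[j]
                W[i][j] = sum(kp_scores[kp] for kp in shared)
    for i in range(n):  # row-wise min-max normalization
        lo, hi = min(W[i]), max(W[i])
        if hi > lo:
            W[i] = [(w - lo) / (hi - lo) for w in W[i]]
    return W
```

Nodes with no shared keyphrase get zero weight, and every normalized weight lies in [0, 1].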
Louvain method: The Louvain method is a modularity-based graph partitioning approach for detecting hierarchical community structure (Blondel et al., 2008). Each utterance is a node in a graph, and the edge weights capture the strength of the relation between node pairs. The method iteratively maximizes a quality function, typically modularity. While the approach may be started from any arbitrary partitioning of the graph, we start with each data point in its own community (singleton communities). It then works iteratively in two phases. In the first phase, the algorithm reassigns nodes to their neighbors' communities as long as the reassignment leads to a gain in modularity. The second phase aggregates the nodes within a community into a super node, creating a new graph where each community from the first phase becomes a node. The process continues until the modularity value can no longer be improved.
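To make the first phase concrete, here is a toy implementation of modularity and one greedy sweep of node moves. Real Louvain adds the graph-aggregation second phase and incremental gain computation; this brute-force version recomputes modularity on every candidate move for clarity.

```python
def modularity(W, comm):
    """Newman modularity of the partition `comm` (node index -> community id)
    of a graph given by the symmetric weight matrix W."""
    n = len(W)
    m2 = sum(sum(row) for row in W)          # equals 2m for undirected graphs
    deg = [sum(row) for row in W]
    q = 0.0
    for i in range(n):
        for j in range(n):
            if comm[i] == comm[j]:
                q += W[i][j] - deg[i] * deg[j] / m2
    return q / m2

def louvain_first_phase(W):
    """Greedy sweeps of Louvain's first phase: start from singleton
    communities and move a node to a neighbour's community whenever that
    strictly increases modularity; stop when no move improves it."""
    n = len(W)
    comm = list(range(n))
    improved = True
    while improved:
        improved = False
        for i in range(n):
            best, best_q = comm[i], modularity(W, comm)
            for j in range(n):
                if W[i][j] > 0 and comm[j] != comm[i]:
                    old = comm[i]
                    comm[i] = comm[j]         # tentative move
                    q = modularity(W, comm)
                    if q > best_q:
                        best, best_q = comm[j], q
                    comm[i] = old             # undo tentative move
            if best != comm[i]:
                comm[i] = best
                improved = True
    return comm
```

On two tightly connected pairs joined by a weak edge, the sweep recovers the two communities.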
Until now, we assumed G_D to be completely unlabeled. Two crucial questions remain: how to incorporate labeled information for an available subset of utterances in a semi-supervised setup, and how to ensure that nodes with the same true label do not get partitioned into separate clusters. We merge the inputs with the same true label into a single node before constructing G_D, and initialize Louvain with the graph structure so obtained. Merging the utterances with a common label into a single node trivially ensures that no two utterances with the same label get partitioned into different clusters, and hence that no two nodes with the same true label are assigned different pseudo-labels. Note that at this stage the pseudo-labels are obtained purely for representation learning; they are not intended to be representative of the real intent classes, but are simply a partition based on the keyphrases in the utterances. Finally, using the pseudo-labels obtained via Louvain, we learn a corpus-level contrastive representation with the supervised contrastive loss (Khosla et al., 2020). During representation learning, each utterance is treated separately, and we do not retain the merging performed for community detection.
Keyphrase selection for constructing G_D: At this point, we have a list of n-grams along with their feature scores. We employ recursive feature elimination (RFE), a greedy feature elimination approach, as our feature selection strategy. In RFE, we start with a large set of features and greedily eliminate features one at a time. We start with the top k features and perform community detection using Louvain. We then take the least promising of the selected features and check whether removing it increases the overall modularity of the graph, compared to the modularity when the feature was included. The number of nodes in G_D remains the same, though the number of edges and the edge weights depend on the features. A single run of the Louvain algorithm has a complexity of O(n log n), where n is the number of nodes, so in the worst case the time complexity of graph construction is O(n · d · log n), where d is the number of features. We perform the feature selection for a fixed number of iterations, subject to two additional constraints. First, the graph needs to remain a single connected component; if the removal of a feature violates this, we keep the feature. Second, in all the tasks we consider, we assume knowledge of the total number of intents. Hence, a feature is removed if its presence, even while contributing positively to modularity, increases the gap between the number of true intent classes and the number of clusters Louvain produces.
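The elimination loop can be sketched as below, abstracting graph construction and clustering behind a callback. Here `build_graph` is a hypothetical helper that, for a candidate feature set, returns whether the graph stays connected, its Louvain modularity, and the resulting cluster count; the paper's procedure additionally fixes the number of RFE iterations.

```python
def rfe_select_features(features, build_graph, n_true_intents, n_iters=10):
    """Greedy recursive feature elimination: drop a keyphrase feature when the
    graph built without it (a) stays connected, (b) does not lose modularity,
    and (c) does not widen the gap between the cluster count and the known
    number of intents. Assumes `features` is sorted by score, best first."""
    selected = list(features)
    for _ in range(n_iters):
        removed_any = False
        for f in reversed(list(selected)):   # least promising feature first
            trial = [g for g in selected if g != f]
            ok, q_new, n_clusters = build_graph(trial)
            _, q_old, n_old = build_graph(selected)
            if (ok and q_new >= q_old
                    and abs(n_clusters - n_true_intents)
                        <= abs(n_old - n_true_intents)):
                selected = trial
                removed_any = True
                break
        if not removed_any:
            break
    return selected
```

With a stub `build_graph`, a feature whose removal raises modularity is dropped, while a feature whose removal disconnects the graph is kept.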

Intent Discovery
We perform intent discovery in both unsupervised and semi-supervised setups. Intent discovery is performed via clustering. We start with the same graph construction used for Louvain in §2.2, with the weight matrix W row-normalized. Additionally, we obtain a similarity matrix A based on the cosine similarity between the utterance-level encodings of two nodes, where the encodings come from the encoder learned in §2.2. We then take a weighted average of the edge weights in W and A. The weights for the average are obtained via grid search, selecting the configuration that optimizes the silhouette score, an intrinsic measure of clustering quality. The resulting graph is referred to as G_pred. With G_pred, we perform Louvain again for intent discovery. In a semi-supervised setup, the labeled nodes are merged into a single node before running Louvain. When a new set of utterances arrives, these utterances are added as nodes in G_pred. Their corresponding values in A are obtained from their representations under our encoder (§2.2), and their values in W are obtained based on the existing set of n-grams; no new feature selection is performed.
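The blending and grid search can be sketched as follows, with `cluster_and_score` standing in for the cluster-then-silhouette step (in the paper, Louvain clustering followed by the silhouette score):

```python
def blend_graphs(W, A, alpha):
    """Edge weights of G_pred: convex combination of keyphrase-based weights W
    and embedding cosine similarities A."""
    n = len(W)
    return [[alpha * W[i][j] + (1 - alpha) * A[i][j] for j in range(n)]
            for i in range(n)]

def best_blend(W, A, cluster_and_score, grid=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Grid search over the mixing weight, keeping the blend whose clustering
    maximizes an intrinsic quality score."""
    best_alpha, best_score = None, float("-inf")
    for alpha in grid:
        score = cluster_and_score(blend_graphs(W, A, alpha))
        if score > best_score:
            best_alpha, best_score = alpha, score
    return best_alpha, blend_graphs(W, A, best_alpha)
```

A toy score that simply sums edge weights selects the pure-W blend when A is empty, illustrating the mechanics.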

Intent Classification
Irrespective of whether the setup is multiclass or multilabel, our base classifier is a multilayer perceptron with a single hidden layer and a nonlinearity. It uses the utterance-level representation learned in §2.1 and §2.2 as its input feature, which remains frozen during the training of the classifier. The classifier is trained using cross-entropy loss with label smoothing (Vulić et al., 2022; Zhang et al., 2021c). The activation function at the output layer is softmax for multiclass classification and sigmoid for multilabel classification.
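For instance, the label-smoothed cross-entropy on a single example can be written as below. A minimal sketch: `probs` is the classifier's output distribution and `eps` the smoothing mass spread uniformly over all classes.

```python
import math

def label_smoothing_ce(probs, gold, eps=0.1):
    """Cross-entropy against a smoothed target: (1 - eps) on the gold class,
    plus eps distributed uniformly over all K classes."""
    K = len(probs)
    target = [eps / K] * K
    target[gold] += 1.0 - eps
    return -sum(t * math.log(p) for t, p in zip(target, probs))
```

With eps = 0 this reduces to the standard cross-entropy; any eps > 0 increases the loss on a confident correct prediction, discouraging overconfidence.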
Modified Adsorption (MAD) is a graph-based semi-supervised transductive learning approach (Talukdar and Crammer, 2009) and a variant of label propagation. While label propagation (Zhu et al., 2003) forces the unlabeled instances to agree with their neighboring labeled instances, MAD allows the predictions on labeled instances to vary and incorporates node uncertainty (Yang et al., 2016). It is expressed as an unconstrained optimization problem and solved using an iterative algorithm that is guaranteed to converge to a local optimum (Talukdar and Pereira, 2010; Sun et al., 2016). The graph typically contains a few labeled nodes, referred to as seed nodes, and a large set of unlabeled nodes; the unlabeled nodes are typically assigned a dummy label. The graph structure can be explicitly designed in MAD. Note that in MAD a node is assigned a label distribution rather than a hard label.
From a random walk perspective, MAD can be seen as a controlled random walk with three possible actions, each with predefined probabilities summing to one (Kirchhoff and Alexandrescu, 2011): a) continuing the random walk to the neighbors of a node according to the transition matrix probabilities, b) stopping and returning the seed label distribution of the node, and c) abandoning the walk and returning an all-zero distribution or a high probability on the dummy label. Each of these actions corresponds to a term in the MAD objective: the seed label loss, the smoothness loss across edges, and the label prior loss, respectively. The objective is

$$\min_{\hat{Y}} \; \sum_{l=1}^{K+1} \left[ \mu_1 \, (Y_l - \hat{Y}_l)^{\top} S \, (Y_l - \hat{Y}_l) \; + \; \mu_2 \, \hat{Y}_l^{\top} L \, \hat{Y}_l \; + \; \mu_3 \, \lVert \hat{Y}_l - R_l \rVert^2 \right]$$

where L is the Laplacian of the symmetrized weight matrix M, Y_jl is the initial weight assignment or seed weight for label l on node j, Ŷ_jl is the updated weight of label l on node j, S is a diagonal matrix indicating the seed nodes, R_jl is the regularization target for label l on node j, and μ1, μ2, μ3 are hyperparameters balancing the three losses. Here, we assume a classification task with K labels, and MAD introduces a dummy label as the initial assignment for the unlabeled nodes.
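The following toy solver minimizes a MAD-style objective by per-node coordinate updates. This is a simplified sketch under stated assumptions: a symmetric weight matrix, an explicit 0/1 seed indicator per node, and illustrative μ values; the actual MAD algorithm additionally derives per-node injection, continuation, and abandonment probabilities.

```python
def mad_iterate(W, Y, seeds, R, mu1=1.0, mu2=1.0, mu3=0.01, n_iter=200):
    """Iteratively re-estimate each node's label distribution Yhat so that it
    trades off (i) agreement with its seed labels (mu1, active only where
    seeds[v] == 1), (ii) smoothness with neighbours over weighted edges (mu2),
    and (iii) closeness to a prior R, e.g. mass on the dummy label (mu3)."""
    n, K = len(Y), len(Y[0])
    Yhat = [row[:] for row in Y]
    for _ in range(n_iter):
        new = []
        for v in range(n):
            denom = mu1 * seeds[v] + mu2 * sum(W[v]) + mu3
            new.append([
                (mu1 * seeds[v] * Y[v][l]
                 + mu2 * sum(W[v][u] * Yhat[u][l] for u in range(n))
                 + mu3 * R[v][l]) / denom
                for l in range(K)
            ])
        Yhat = new
    return Yhat
```

On a three-node chain with a single seed, the seed's label mass decays smoothly along the chain while every node keeps a full label distribution rather than a hard assignment.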
We follow Huang et al. (2021) and perform two post-processing steps. While the original approach uses label spreading (Zhou et al., 2003) for both steps, we replace it with MAD. Moreover, our graphs are constructed from a combination of embedding-based similarity and n-gram-based similarity as described in §2.3, i.e., G_pred. Both post-processing steps are applied on the same graph structure; however, the seed label initializations differ.

Propagation of Residual Errors:
We obtain the predictions from the base predictor, where each prediction is a distribution over the labels. Using these predictions, we compute the residual errors for the training nodes and propagate them through the edges of the graph. The unlabeled and validation nodes are initialized with a zero (dummy) value, and the seed nodes are initialized with their residuals; that is, Y is non-zero only for training nodes with a non-zero residual error. With this initialization of Y, we apply MAD on G_pred. The key assumption is that the errors of the base predictor are positively correlated with the similarity neighborhood in the graph, and hence the residuals should be propagated (Huang et al., 2021). At the end of the propagation, each node thus holds the smoothed errors as a distribution over the labels. To obtain the predictions after this step, the smoothed errors are added to the base predictor's predictions for each node.
Smoothing Label Distribution: The last step in our classification pipeline is a smoothing step. Here, we make the fundamental assumption of homophily: adjacent nodes tend to have similar labels. Y is initialized as follows: seed nodes are given their ground-truth labels, while the validation and unlabeled nodes are initialized with the predictions after the error propagation step. With this initialization, we perform MAD over G_pred. In multiclass classification, the label with the maximum value at each node is predicted as the final class. In multilabel classification, all labels with a score above a threshold are predicted as the final labels.
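Both steps follow the correct-and-smooth recipe of Huang et al. (2021). The sketch below substitutes a plain neighbourhood-averaging propagation for MAD to keep it self-contained; only the two seed initializations, residuals first and gold-plus-corrected second, are the point here.

```python
def propagate(W, F, n_iter=30, alpha=0.5):
    """Simple label-propagation smoother standing in for MAD: each node's
    vector is repeatedly mixed with the weighted average of its neighbours."""
    n, K = len(F), len(F[0])
    F = [row[:] for row in F]
    for _ in range(n_iter):
        new = []
        for v in range(n):
            deg = sum(W[v]) or 1.0
            nbr = [sum(W[v][u] * F[u][k] for u in range(n)) / deg
                   for k in range(K)]
            new.append([(1 - alpha) * F[v][k] + alpha * nbr[k]
                        for k in range(K)])
        F = new
    return F

def correct_and_smooth(W, base, train, onehot):
    """Two post-processing steps: (1) propagate the training residuals and add
    the smoothed errors back to the base predictions; (2) re-seed the training
    nodes with their gold labels and propagate once more."""
    n, K = len(base), len(base[0])
    # Step 1: residual propagation (zeros on non-training nodes)
    E0 = [[0.0] * K for _ in range(n)]
    for v in train:
        E0[v] = [g - p for g, p in zip(onehot[v], base[v])]
    E = propagate(W, E0)
    corrected = [[base[v][k] + E[v][k] for k in range(K)] for v in range(n)]
    # Step 2: label smoothing with gold seeds on training nodes
    S0 = [onehot[v] if v in train else corrected[v] for v in range(n)]
    return propagate(W, S0)
```

On a toy chain where the base predictor is uncertain, the gold seed pulls its neighbourhood toward the correct class.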

Experimental Setup
We perform experiments for three intent-related tasks: Intent Discovery, Multiclass Intent Detection, and Multilabel Intent Detection. Here, we provide training and modeling details common to all three tasks, and mention task-specific details such as baselines and evaluation metrics in the appropriate sections.
Training and Modeling Details. We choose RoBERTa (Liu et al., 2019b) in the base configuration as our common encoding backbone and pretrain it with the aforementioned datasets. For encoding the input utterances, we use a cross-encoder architecture as detailed by Mesgar et al. (2023). In this setup, the joint embedding for any pair of utterances (p, q), needed for contrastive learning for instance, is obtained by embedding it as "[CLS] p [SEP] q", and the [CLS] representation is used as the representation for that pair. Mesgar et al. (2023) found that a cross-encoder works much better than a bi-encoder, where the utterances of a pair are embedded independently.
We perform all of our experiments using the transformers library (Wolf et al., 2020) and the PyTorch framework (Paszke et al., 2019). We train our models using the AdamW optimizer with a learning rate of 2e-5, a warmup rate of 0.1, and a weight decay of 0.01. We pretrain our model for 15 epochs, and thereafter perform task-specific training for another 20 epochs. All experiments are performed on a machine with an NVIDIA A100 80GB GPU, and we choose the maximum batch size that fits GPU memory (= 96). We perform hyperparameter search for the temperature τ and the weight λ over the ranges τ ∈ {0.1, 0.3, 0.5} and λ ∈ {0.01, 0.03, 0.05}.

Intent Discovery
Datasets. We use three datasets for benchmarking INTENDD on Intent Discovery: BANKING77, CLINC-150, and Stack Overflow. We assess the effectiveness of our proposed intent discovery (ID) system in two practical scenarios: unsupervised ID and semi-supervised ID. For clarity, we introduce the term Known Intent Ratio (KIR), which represents the ratio of known intents in the training data: the number of known intent categories |I_k| divided by the sum of the known and unknown intent categories, |I_k| + |I_u|. In this context, |I_k| = 0 corresponds to unsupervised ID, indicating the absence of any known intent classes. For semi-supervised ID, we adopt the approach outlined in previous works (Kumar et al., 2022; Zhang et al., 2021b), conducting experiments with three KIR values: {25%, 50%, 75%}.
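For reference, KIR is simply the fraction of intent categories treated as known; the helper below is an illustrative one-liner, not part of the paper's code.

```python
def known_intent_ratio(n_known, n_unknown):
    """KIR = |I_k| / (|I_k| + |I_u|); a value of 0.0 recovers the fully
    unsupervised intent discovery setting."""
    return n_known / (n_known + n_unknown)
```

For example, 25 known and 75 unknown intent categories give a KIR of 25%.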
Baselines. We follow the recent work of Kumar et al. (2022) to select suitable baselines for the unsupervised and semi-supervised scenarios. Due to space constraints, we detail these in the appendix.
Results. We report all the intent discovery results in Table 1. To begin with, our proposed method INTENDD consistently demonstrates superior performance, surpassing all baseline techniques in both unsupervised and semi-supervised settings across all three datasets. Specifically, in the entirely unsupervised scenario, SBERT-KM emerges as the most formidable baseline, yet our approach significantly outperforms it. Note that the fundamental distinction between INTENDD and SBERT-KM lies in our graph construction strategy for clustering, which relies on a combination of semantic similarity (via embeddings) and n-gram-based similarity (via keyphrases), underscoring the importance of incorporating both similarity measures.
Furthermore, while our approach demonstrates notable enhancements across all configurations, these improvements are particularly pronounced when the amount of labeled data is limited, resulting in an average increase of nearly 3% in accuracy for KIR values of 0% and 25%.

Multiclass Intent Detection
Datasets and Evaluation Metric. Following Zhang et al. (2021c), we perform few-shot intent detection and select three challenging datasets for our experiments, namely CLINC-150, BANKING77, and HWU64. We use the same training and test splits as specified in that paper, and use detection accuracy as our evaluation metric.
Results. Table 2 shows the results of our experiments for multiclass intent detection. Our proposal, INTENDD, demonstrates superior performance across all three setups when compared to the baseline models in the 5-shot, 10-shot, and full-data scenarios. In the 5-shot setting, INTENDD exhibits an average absolute improvement of 2.47%, with the highest absolute improvement of 4.31% observed on the BANKING77 dataset. Across all the datasets, INTENDD achieves average absolute improvements of 1.31% and 0.71% in the 10-shot and full-data settings, respectively.
INTENDD currently does not incorporate any augmented data in its experimental setup. We do not compare our work with data augmentation methods, as they are orthogonal to ours. One such example is ICDA (Lin et al., 2023), where a large language model (OPT-66B) (Zhang et al., 2022) is used to augment the intent detection datasets in few-shot data settings. Nevertheless, we find that our method performs better than ICDA; we report this comparison in Appendix B.1.

Table 1: Results for Intent Discovery. The first set of results is in a completely unsupervised setting, while the others are for settings where some of the intent categories are known. KIR denotes the Known Intent Ratio. In all experiments involving known intent classes, we assume the proportion of labeled examples to be 10% (Kumar et al., 2022). Baseline results are taken from Kumar et al. (2022), and those marked with - have not been reported in the literature. The DSSCC paper does not report results for DSSCC-BERT on Stack Overflow, and we could not get access to their code to run that model independently. The best results for each dataset and setting are marked in bold. Our proposed method consistently outperforms recent baselines by a significant margin.

Table 2: We report intent detection accuracy for three data settings. We use the baseline numbers from Lin et al. (2023). The best results for each dataset and setting are marked in bold.

Is Modified Adsorption important for Intent Detection? INTENDD uses a pipeline of three classification setups: one using the MLP, and two in a transductive setting using Modified Adsorption (MAD). We perform ablation experiments with these components and report results in Table 2, progressively adding one component at a time. INTENDD-MLP denotes the results without the two Modified Adsorption steps, and INTENDD-EP denotes the results with MAD but only the residual propagation step (i.e., without label smoothing). We observe consistent performance improvements from each component of the pipeline. Notably, the label propagation step leads to the more significant improvements, and these gains are observed not only in the few-shot setups but also in the full-data scenarios.

Multilabel Intent Detection
Datasets and Evaluation Metric. Following Vulić et al. (2022), we use three datasets for multilabel intent detection: BANKING77, MixATIS, and the HOTELS subset of the NLU++ benchmark. MixATIS is a multilabel dataset synthetically obtained by concatenating single-label instances from the ATIS dataset. We do not perform experiments with the InsuranceFAQ dataset from that paper since it is internal data. We report standard evaluation metrics: F1 and exact match accuracy (Acc). We report results on all datasets in two settings, the low-data and the high-data regime, again replicating the experimental settings of Vulić et al. (2022).
Baselines. Our main baseline is the MultiConvFiT model proposed by Vulić et al. (2022), in two variants: MultiConvFiT (FT), where full fine-tuning including the encoder parameters is performed, and the more efficient MultiConvFiT (Ad), where adapters are used instead of updating all parameters. In addition, two baselines from ConvFiT (Vulić et al., 2021) are adapted: DRoB and mini-LM. Please refer to Vulić et al. (2022) for more details on these methods.
Results. The results of our experiments are shown in Table 3. First, the results demonstrate consistent gains achieved by our method across all three datasets. Notably, in low-data scenarios, we observe an average increase of approximately 1% in F-scores. As anticipated, the performance enhancements are more substantial in low-data settings. However, it is noteworthy that our model outperforms MultiConvFiT even in the high-data setup.
We find the differences between our base predictor and our final classifier to be statistically significant in all settings of multiclass and multilabel intent detection, using the t-test (p < 0.05).

Conclusion
In summary, this paper presents a novel approach, INTENDD, for intent detection and discovery in task-oriented dialogue systems. By leveraging a shared utterance encoding backbone, INTENDD unifies intent classification and novel intent discovery. Through unsupervised contrastive learning, the proposed approach learns representations by generating pseudo-labels based on lexical features of unlabeled utterances. Additionally, the paper introduces a two-step post-processing setup using modified adsorption for the classification tasks. While intent classification work typically focuses on contrastive representation learning or data augmentation, we show that a two-step post-processing setup in a transductive setting leads to statistically significant improvements over our base classifier, often rivaling or on par with data augmentation approaches. Extensive evaluations on diverse benchmark datasets demonstrate the consistent improvements achieved by our system over competitive baselines.

Limitations
While our research provides valuable insights and contributions, we acknowledge certain limitations that should be considered. In this section, we discuss the two main limitations that arise from our work.
First, a limitation of our proposed intent discovery algorithm is its reliance on prior knowledge of the number of intent clusters. This assumption may not hold in real-world scenarios where the underlying intent structure is unknown or may change dynamically. The requirement of knowing the exact number of intent clusters can be impractical and unrealistic, limiting the generalizability of our approach. However, we recognize that this limitation can be addressed through modifications to our algorithm. Future investigations should explore techniques that allow for automated or adaptive determination of the number of intent clusters, making the approach more robust and applicable to diverse real-world settings.
The second limitation of our research lies in the reliance on the construction of a graph using extracted keyphrases during the contrastive pretraining.

A Baselines

Intent Discovery

For the unsupervised case, we use K-means (MacQueen, 1967) on top of sentence embeddings from BERT (Devlin et al., 2019) and SBERT (Reimers and Gurevych, 2019) to cluster user utterances (BERT-KM and SBERT-KM, respectively). DEC (Xie et al., 2016) is a two-step deep clustering approach involving a Stacked Autoencoder (SAE) along with confidence-based cluster assignment. SAE-KM uses K-means with an SAE (Xie et al., 2016), DCN (Yang et al., 2017) performs dimensionality reduction and clustering using a joint objective function, and DAC (Chang et al., 2017) treats the clustering problem as pairwise binary classification to learn cluster centers.
For the semi-supervised case, we use CDAC+ (Lin et al., 2020), in which pairwise constraints from the labeled examples are incorporated into the clustering problem. DeepAligned (Zhang et al., 2021b) uses labeled data to generate pseudo labels and to pretrain a BERT model, followed by K-means clustering. Finally, we compare our method with the very recent DSSCC (Kumar et al., 2022), where the authors propose an end-to-end contrastive clustering algorithm that jointly learns cluster centers and utterance representations via a combination of supervised and self-supervised methods. We report results with the two backbone models used in that paper, BERT and SBERT.

Multiclass Intent Detection
In this study, we consider several baseline models for intent detection. The first baseline, RoBERTa-base, utilizes RoBERTa as its base model, supplemented with a linear classifier on top for classification. Another baseline, CONVBERT, involves fine-tuning BERT on a vast open-domain dialogue corpus consisting of 700 million conversations (Mehri et al., 2020). Furthermore, CONVBERT + Combined, an intent detection model based on CONVBERT, adopts example-driven training with similarity matching and transformer attention observers, along with task-adaptive self-supervised learning using masked language modeling on intent detection datasets. The term "Combined" refers to the optimal MLM+Example+Observers setting described in Mehri and Eric (2021). Another baseline, DNNC (Discriminative Nearest-Neighbor Classification) (Zhang et al., 2020), employs a discriminative nearest-neighbor approach, matching training examples based on similarity and employing data augmentation during training; it further enhances performance through pre-training on three natural language inference tasks. Finally, CPFT (Contrastive Pre-training and Fine-Tuning) (Zhang et al., 2021c) represents the current state-of-the-art in few-shot intent detection, employing self-supervised contrastive pre-training on multiple intent detection datasets, followed by fine-tuning with supervised contrastive learning.

B Additional Results
Variance in Few-shot Intent Detection. In the few-shot settings, we generally report lower variance than CPFT, the system with the second-best results.

Table 2 :
Results for Multiclass Intent Detection.

Table 3 :
Results for Multilabel Intent Detection. We report both F1 score and accuracy for all settings; the first number in each cell is the F1 score and the second is the accuracy. Vulić et al. (2022) proposed two variants of MultiConvFiT: one with full fine-tuning (FT) and another with adapters (Ad). The ConvFiT model proposed by Vulić et al. (2021) has been adapted for multilabel settings with DistilRoBERTa (DRoB) and mini-LM as backbones. The best results for each dataset and setting are marked in bold. To establish the statistical significance of our results, we performed the paired t-test between INTENDD and MultiConvFiT (FT) and found the p-value in all cases to be < 0.05.

Table 4 :
Dataset statistics for the three intent identification tasks explored in this work. The second column lists the number of intent classes in each dataset.

Table 5 :
Standard deviation across different runs for Few-Shot Intent Detection. We observe that, compared to CPFT, our method has lower variance across most settings.
Table 5 shows the standard deviation for INTENDD and CPFT, where CPFT has a lower variance than INTENDD in only two out of six settings.