CoCoID: Learning Contrastive Representations and Compact Clusters for Semi-Supervised Intent Discovery

Intent discovery aims to mine new intents from user utterances that are not present in the set of manually predefined intents. Previous approaches to intent discovery usually cluster novel intents automatically, exploiting prior knowledge from intent-labeled data in a semi-supervised way. In this paper, we focus on discriminative user utterance representation learning and the compactness of the learned intent clusters. We propose a novel semi-supervised intent discovery framework CoCoID with two essential components: contrastive user utterance representation learning and intra-cluster knowledge distillation. The former attempts to detect similar and dissimilar intents from a minibatch-wise perspective. The latter regularizes the predictive distribution of the model over samples in a cluster-wise way. We conduct experiments on both real-life challenging datasets (i.e., CLINC and BANKING) that are curated to emulate the true environment of commercial/production systems and traditional datasets (i.e., StackOverflow and DBPedia) to evaluate the proposed CoCoID. Experimental results demonstrate that our model substantially outperforms state-of-the-art intent discovery models (12 baselines) by over 1.4 ACC and ARI points and 1.1 NMI points across the four datasets. Further analyses suggest that CoCoID is able to learn contrastive representations and compact clusters for intent discovery.


Introduction
Intent discovery is to pinpoint novel intents from user utterances, which are not present in the set of predefined intents. Discovering novel intents for goal-oriented dialogue has recently attracted growing attention and interest (Lin et al., 2020), not only because it is difficult to manually define all potential user intents for conversational agents deployed in a wide range of real-world scenarios, but also due to the cost of curating intent-labeled data. Approaches to intent discovery usually include unsupervised clustering methods (Kathuria et al., 2010; Cheung and Li, 2012; Padmasundari and Bangalore, 2018) and semi-supervised methods (Basu et al., 2004; Hsu et al., 2018; Han et al., 2019). The former performs unsupervised clustering algorithms, e.g., K-means clustering, to group user utterances into clusters according to their underlying intents. The latter attempts to inject weakly supervised signals (Haponchyk et al., 2018; Caron et al., 2018) or external knowledge (Lin et al., 2020) into the clustering procedure. Compared with unsupervised clustering methods, models that exploit supervised signals or external knowledge are capable of discovering better user intents. In this paper, we follow the semi-supervised research philosophy for intent discovery. However, different from previous works (Lin et al., 2020), our focus lies on two aspects: user utterance representation and the compactness of clusters. For utterance representation, we conducted a simple investigation in our preliminary experiments. With the DeepAligned model, we compared two different settings for user utterance representation learning: supervised SimCSE-BERT_base vs. SimCSE-BERT_large (Gao et al., 2021). The results are shown in Table 1, which clearly demonstrate that better user representations lead to more accurate intent discovery.

Table 1: Preliminary intent discovery results with different supervised SimCSE utterance representations.
Model                           NMI ↑   ARI ↑
sup-simcse-bert-large-uncased   80.25   66.93
sup-simcse-bert-base-uncased    75.09   59.52
For the compactness of clusters, intuitively, the more compact a cluster is, the more possible user utterances in the cluster share the same intent.
To learn better user utterance representations for clustering and to improve the compactness of learned clusters, we propose CoCoID that learns Contrastive user utterance representations and Compact clusters for semi-supervised Intent Discovery. To learn contrastive representations, we define a user utterance to be clustered as the anchor utterance and feed it to the feature extractor twice with different dropout masks. In doing so, we obtain two different representations for the anchor utterance and use them as positive pairs. Other utterances in the same minibatch serve as negative samples. We then perform contrastive learning to improve utterance representations so that utterances with similar underlying intents tend to be close to each other while utterances with dissimilar intents are separated from each other.
To improve the compactness of clusters, we propose an intra-cluster knowledge distillation (ICKD) method. We randomly sample utterances from the cluster where the anchor utterance is located. The intent label of the anchor utterance is used to guide knowledge distillation from the anchor utterance to the sampled utterances. The motivation behind ICKD is to help shorten the distances between utterances in the cluster, which can also be considered a regularization strategy for the clustering model.
In summary, our contributions are twofold.
1) We propose a novel semi-supervised intent discovery framework CoCoID that learns contrastive user utterance representations and presents an intra-cluster knowledge distillation method to improve the compactness of learned intent clusters.
2) Experiments on four datasets demonstrate that CoCoID substantially outperforms 12 state-of-the-art baselines and learns contrastive representations and compact clusters.

Related Work
Our work is related to intent discovery, contrastive learning and knowledge distillation. We briefly review these topics within the constraint of space.

Intent Discovery
User intent detection is an essential component in dialogue systems. A wide range of approaches, including unsupervised (Padmasundari and Bangalore, 2018), supervised (Hakkani-Tür et al., 2013) and semi-supervised (Basu et al., 2004; Hakkani-Tür et al., 2015; Han et al., 2019) methods, have been explored for intent discovery. In the unsupervised research strand, Kathuria et al. (2010) propose to exploit K-means clustering to understand user intents. In addition to unsupervised clustering approaches, Haponchyk et al. (2018) solve intent discovery by using powerful semantic classifiers to categorize user questions into intents with structured outputs in a supervised way. Caron et al. (2018) propose to produce clustering assignments as pseudo labels and then train a pseudo classifier. Among semi-supervised methods, DeepAligned investigates the label inconsistency issue and proposes a deep alignment strategy. Other semi-supervised studies approach intent discovery by guiding the clustering process with pairwise constraints, such as KCL (Hsu et al., 2018) and CDAC+ (Lin et al., 2020). Our model is also semi-supervised. The significant differences from previous semi-supervised intent discovery lie in the contrastive learning of user utterance representations and intra-cluster knowledge distillation for improving the compactness of clusters.

Contrastive Learning
The main idea behind contrastive learning is to force semantic representations of similar objects to be close to each other and those of dissimilar objects to be far away from each other; it is widely used in unsupervised visual representation learning. Recent years have witnessed that contrastive learning has also been explored in textual representation learning. Fang and Xie (2020) propose CERT and construct positive and negative utterances through back translation. Gao et al. (2021) find that dropout can act as an efficient data augmentation method for contrastive textual representation learning. Yan et al. (2021) also explore different data augmentation strategies on contrastive learning for sentence representation modeling. In this paper, we follow Gao et al. (2021) to use different dropout masks to create positive samples.

Figure 1: The diagram of CoCoID. We first fine-tune the utterance feature extractor on intent-labeled data. After fine-tuning, we obtain representations for both labeled and unlabeled user utterances. We cluster them via the K-means clustering algorithm. Our contributions lie in the contrastive user utterance representation learning and intra-cluster knowledge distillation used in the semi-supervised clustering procedure.

Knowledge Distillation
Knowledge distillation usually refers to training a large teacher model and distilling its knowledge into a small student model (Hinton et al., 2015). Zhang et al. (2018) propose a strategy where an ensemble of students learn collaboratively and then teach each other. Yuan et al. (2019) find that knowledge distillation is a more general label smoothing regularization and present a teacher-free knowledge distillation framework where a student model learns from itself. Yun et al. (2020) propose a class-wise knowledge distillation method that distills the predictive distribution between different samples of the same label during training. This is similar to our intra-cluster knowledge distillation in the sense of distilling predictive distributions between samples from the same group. The significant difference is that we distill knowledge between user utterances in the same cluster under a semi-supervised setting, rather than under a supervised condition for images (Yun et al., 2020).

Approach
We elaborate on the proposed CoCoID in this section. The diagram of CoCoID is shown in Figure 1. It consists of three major components: a semi-supervised clustering backbone, contrastive utterance representation learning, and intra-cluster knowledge distillation.

Semi-supervised Clustering Backbone
The traditional clustering-based approach to intent discovery produces clustering assignments as pseudo labels (Caron et al., 2018), which may result in an inconsistent clustering assignment issue, as different labels could be assigned to the same utterance in different training epochs. We therefore use the DeepAligned framework as our semi-supervised clustering backbone, which is composed of three essential components: the utterance feature extractor, fine-tuning, and DeepAligned clustering (addressing the inconsistency issue). We briefly introduce the first two parts here; more details on the third component can be found in the original DeepAligned work.
Utterance Feature Extractor We use BERT (Devlin et al., 2019) to extract features from user utterances. For notational convenience, we define the training corpus as D = {x_1, x_2, ..., x_n} where x_i is the i-th utterance with or without an intent label. We assume that D is composed of D_label and D_unlabel, which denote utterances with intent labels and those without manual intent labels, respectively. We feed x_i into BERT and extract the last layer of BERT as H_i = [CLS, s_1, s_2, ..., s_m], where CLS is a special classification token and m denotes the length of the utterance x_i. We then obtain an averaged representation for the corresponding utterance as follows:

$\bar{h}_i = f(H_i) \quad (1)$

where f indicates the mean-pooling operation.
To further enhance the feature extractor, we add a dense layer g on top of the averaged representation $\bar{h}_i = f(H_i)$:

$h_i = g(\bar{h}_i) = \sigma(W_g \bar{h}_i + b_g) \quad (2)$

where σ is the ReLU activation function, and W_g and b_g are learnable parameters.
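As an illustration, the mean-pooling and dense-layer steps above can be sketched in plain NumPy. The function name and parameter shapes are our own; in the actual model, the input would be BERT's last-layer hidden states for one utterance.

```python
import numpy as np

def extract_utterance_feature(hidden_states, W_g, b_g):
    """Sketch of the utterance feature extractor described above.

    hidden_states: (m+1, d) array of last-layer BERT states
                   [CLS, s_1, ..., s_m] for one utterance.
    W_g, b_g:      parameters of the dense layer g (hypothetical shapes).
    """
    # Mean-pooling f over the last-layer states gives the averaged
    # utterance representation (Eq. 1).
    h_bar = hidden_states.mean(axis=0)
    # Dense layer g with ReLU activation sigma (Eq. 2).
    return np.maximum(0.0, W_g @ h_bar + b_g)
```

The dense layer only reshapes the pooled feature; the pooling itself is what aggregates token-level information into a single utterance vector.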
Fine-tuning To incorporate prior knowledge from limited intent-labeled data, we follow the DeepAligned approach to fine-tune the BERT-based feature extractor on the labeled data D_label. Specifically, we stack a simple intent classification layer over the BERT-based feature extractor and fine-tune the extractor with a cross-entropy loss in a supervised fashion. After fine-tuning, we remove the classification layer and keep the rest of the network as the feature extractor for later use in the clustering process.

Contrastive User Utterance Representation Learning
After fine-tuning, we obtain a full-fledged feature extractor. We learn representations of utterances in D through this feature extractor. Over these learned representations, we perform K-means clustering to categorize utterances into groups. We define the groups as $\{G_i\}_{i=1}^{K}$ where K is the number of clusters.
We define an utterance x_i in question as the anchor utterance and its group as $G_{\mu(x_i)}$, where μ is a mapping function that maps x_i to a cluster index in [1, K]. Similar to SimCSE (Gao et al., 2021), we input each anchor utterance x_i to the encoder (as shown in the red dashed box) twice with two different dropout masks d and d′. We let $h_i^d$ and $h_i^{d'}$ (calculated according to Eq. 2) denote the representations of x_i under the two different dropout masks. We consider $h_i^{d'}$ the positive representation to $h_i^d$ for the anchor utterance since they differ only in dropout masks; that is, they are semantically similar to each other. Representations of other utterances in the same minibatch are regarded as negative representations to $h_i^d$. In this way, we construct positive and negative representations for contrastive learning. The contrastive learning objective is hence optimized as follows:

$\mathcal{L}_{CL} = -\log \frac{\exp(\mathrm{sim}(h_i^d, h_i^{d'}) / \tau)}{\sum_{j=1}^{N} \exp(\mathrm{sim}(h_i^d, h_j^{d'}) / \tau)} \quad (3)$

where N is the number of utterances in the minibatch and τ is a temperature hyperparameter. sim is a similarity measurement function, which can be computed as the cosine similarity as follows:

$\mathrm{sim}(h_i^d, h_j^{d'}) = \frac{(h_i^d)^\top h_j^{d'}}{\|h_i^d\| \cdot \|h_j^{d'}\|} \quad (4)$
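A minimal NumPy sketch of this in-batch contrastive objective follows. Function names are ours; a real implementation would receive the encoder outputs of the same minibatch under the two dropout masks.

```python
import numpy as np

def cosine_sim(a, b):
    # sim(h1, h2) = h1·h2 / (||h1|| ||h2||)
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def contrastive_loss(H_d, H_dprime, tau=1.0):
    """SimCSE-style InfoNCE loss over a minibatch (a sketch).

    H_d, H_dprime: (N, dim) representations of the same N utterances
                   under two different dropout masks d and d'.
    """
    N = H_d.shape[0]
    losses = []
    for i in range(N):
        # Similarity of anchor h_i^d to every h_j^{d'} in the minibatch:
        # j == i is the positive pair, j != i are in-batch negatives.
        sims = np.array([cosine_sim(H_d[i], H_dprime[j]) / tau
                         for j in range(N)])
        # Negative log-probability of the positive pair under softmax.
        losses.append(np.log(np.exp(sims).sum()) - sims[i])
    return float(np.mean(losses))
```

When all representations coincide, the softmax is uniform and the loss equals log N, its value before any contrastive structure is learned.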

Intra-cluster Knowledge Distillation
Dark knowledge can not only be distilled from a large teacher model into a small student model, but can also be distilled among instances within the same model via self-learning. The latter, self-knowledge distillation, is feasible because dark knowledge can be regularized to produce the same prediction pattern for instances in the same class (Yun et al., 2020). This inspires us to perform self-knowledge distillation within the same cluster. Specifically, for each anchor utterance x_i, we randomly sample different utterances from $G_{\mu(x_i)}$ (denoted as $x_i^n$) to form multiple $(x_i, x_i^n)$ pairs. Let $u_i$ denote the logits from x_i and $u_i^n$ the logits from $x_i^n$. We then distill dark knowledge from $u_i$ to each sampled logit vector. Mean squared error (MSE) is used as the intra-cluster distillation objective, which is computed as follows:

$\mathcal{L}_{ICKD} = \frac{1}{k+1} \sum_{n=0}^{k} \mathrm{MSE}(u_i, u_i^n) \quad (5)$

where k is the number of sampled utterances, a hyperparameter to be tuned. Note that $u_i^0$ is the logit vector from the anchor utterance itself with a different dropout mask.
By forcing the predictive distributions (i.e., logits) of sampled utterances to be close to that of the anchor utterance in the same cluster, we encourage utterances in the same cluster to be similar to each other. In other words, their distances can be shortened by the proposed intra-cluster knowledge distillation so that clusters become more compact.
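The intra-cluster distillation objective can be sketched as follows (a simplified NumPy version; the function name and array shapes are our own):

```python
import numpy as np

def ickd_loss(anchor_logits, sampled_logits):
    """Intra-cluster knowledge distillation loss (a sketch).

    anchor_logits:  (C,) logits u_i of the anchor utterance.
    sampled_logits: (k+1, C) logits [u_i^0, u_i^1, ..., u_i^k], where
                    u_i^0 comes from the anchor itself under a different
                    dropout mask and the rest from utterances randomly
                    sampled from the anchor's cluster.
    """
    # Mean squared error between the anchor's logits and each sampled
    # utterance's logits, averaged over the k+1 samples and C classes.
    diffs = sampled_logits - anchor_logits[None, :]
    return float(np.mean(diffs ** 2))
```

Minimizing this pulls every sampled utterance's predictive distribution toward the anchor's, which is what regularizes the cluster toward compactness.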

Training Objective
Contrastive utterance representation learning attempts to pull utterances with similar semantic intents together and push apart utterances with different intents in a minibatch-wise manner. By contrast, intra-cluster knowledge distillation shortens the distances between utterances in the same cluster by regularizing predictive distributions in a cluster-wise way. These two approaches are complementary to each other and can therefore be combined in a unified framework, CoCoID. The final joint loss of CoCoID with the two components can be formulated as:

$\mathcal{L} = \mathcal{L}_{CE} + \mathcal{L}_{CL} + \mathcal{L}_{ICKD} \quad (6)$

where $\mathcal{L}_{CE}$ is the cross-entropy loss used in the DeepAligned model.

Experiments
We conducted a series of experiments on four widely-used datasets to evaluate the proposed CoCoID. More details about the datasets and baselines can be found in Appendices A and B.

Evaluation Metrics
We used three metrics to evaluate performance in our experiments. Normalized Mutual Information (NMI) is commonly used to measure the quality of clustering by estimating the similarity of clustering results to the ground-truth assignments. Adjusted Rand Index (ARI) treats the analysis of clustering as a series of decisions, one for each of the N(N−1)/2 pairs of samples. The Rand index measures the percentage of decisions that are correct, while ARI is its corrected-for-chance version, ensuring that the value for random clustering tends to be 0. In addition to these two metrics, we follow previous work in using the Hungarian algorithm to obtain the mapping between predicted classes and ground-truth classes to estimate clustering Accuracy (ACC). For all metrics, the higher the score, the better the performance.
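The ACC computation with the Hungarian algorithm can be sketched as follows, assuming SciPy's `linear_sum_assignment` is available (the function name is ours):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """Clustering accuracy (ACC) via the Hungarian algorithm (a sketch).

    Builds a confusion matrix between predicted clusters and ground-truth
    classes, then finds the one-to-one cluster-to-class mapping that
    maximizes the number of correctly assigned samples.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    n = max(y_true.max(), y_pred.max()) + 1
    cost = np.zeros((n, n), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cost[p, t] += 1
    # linear_sum_assignment minimizes, so negate to maximize matches.
    row_ind, col_ind = linear_sum_assignment(-cost)
    return cost[row_ind, col_ind].sum() / len(y_true)
```

Because cluster indices are arbitrary, a perfect clustering with permuted labels still scores ACC = 1.0 under this mapping, which is why plain label accuracy cannot be used directly.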

Settings
To make fair comparisons, we used the same settings as previous works, randomly selecting 10% of the data as labeled data and 75% of the intents as known intents. The division of the datasets also follows Lin et al. (2020). We first fine-tuned our model on the labeled data and then performed contrastive learning and intra-cluster knowledge distillation during clustering. We set the waiting patience to 20 to avoid overfitting, and monitored clustering performance with the Silhouette Coefficient, an unsupervised metric for clustering quality. After training, we evaluated performance on the test sets and averaged the results of 10 random seeds as our final result. Note that we set the number of intents to the ground-truth number during clustering, consistent with previous work. We employed the pre-trained BERT model (base-uncased, a 12-layer Transformer) to build the feature extractor. We set the training batch size to 128 and the learning rate to 5e-5. The temperature for contrastive utterance representation learning was set to 1.0 for the CLINC, StackOverflow and DBPedia datasets; for the BANKING dataset, we set it to 0.05. We also froze all BERT parameters except the last layer to avoid overfitting and speed up training, following previous work. For each anchor utterance, we sampled 3 utterances.

Main Results
Table 2 shows the results on the four datasets. Our reproduced CDAC+ and DeepAligned models are better than the originally reported ones in most cases, indicating strong baselines for comparison. Our proposed CoCoID outperforms the 12 state-of-the-art baselines in the majority of cases on the four datasets. In particular, it achieves an average improvement of 1.43 ACC, 1.47 ARI and 1.12 NMI points over the reproduced DeepAligned (which is stronger than the original).
Interestingly, the length of utterances in the datasets seems to have an impact on the performance improvements. The average utterance length in both the CLINC and StackOverflow datasets is relatively short (fewer than 10 tokens), hence the improvement obtained by our proposed model is limited. In contrast, utterances in the BANKING and DBPedia datasets are longer, and the improvements in terms of the three metrics are substantially higher than those on the other two datasets.

Ablation Study
To examine the effectiveness of the two proposed methods, we further conducted an ablation study. In the last two rows of Table 2, we show the results of our model without contrastive learning and without intra-cluster distillation. We find that removing contrastive learning reduces the ACC of the model to some extent, while removing ICKD negatively affects the NMI and ARI metrics. The absence of contrastive learning causes ACC to decrease by 1.14 points on average, suggesting that contrastive learning is able to boost intent discovery accuracy. The absence of intra-cluster knowledge distillation results in larger performance drops on CLINC and BANKING than on StackOverflow and DBPedia. This indicates that ICKD is beneficial to intent discovery with a larger number of novel intents (37/19 vs. 5/4).

Analysis
We carried out in-depth analyses to investigate how the proposed methods improve intent discovery.

Analysis on the Impact of the Percentage of Labeled Data Used
As our CoCoID is a semi-supervised intent discovery approach that utilizes intent-labeled data to fine-tune the feature extractor, we would like to know how the percentage of intent-labeled data among all data affects performance. For this, we conducted experiments on the BANKING dataset by gradually increasing the percentage of labeled data used for fine-tuning from 0.05 to 0.2. Results of CoCoID against the two state-of-the-art models DeepAligned and CDAC+ are illustrated in Figure 2. Under all settings, all three models benefit from the growing amount of intent-labeled data. But our CoCoID is always the best among the three models, indicating its strong capacity in exploiting intent-labeled data.

Analysis on the Impact of the Ratio of Unknown Intent Classes
Our approach is able to detect both known (predefined) and unknown (novel) intent classes. Therefore, we further analyzed the impact of the unknown intent class ratio on performance. Figure 3 shows the results of CoCoID in comparison to CDAC+ and DeepAligned. The performance of all three models drops when the ratio of unknown intent classes increases, which is reasonable as it is more challenging to detect novel intents without labeled data than predefined intents with labeled data. However, our model suffers smaller performance drops in all three evaluation metrics along growing unknown intent class ratios, suggesting that it is more capable of detecting novel intents than both DeepAligned and CDAC+.
Figure 4 visualizes the distribution of the 14 intents of the DBPedia dataset in the semantic space. We merged the training data and test data to generate utterance representations and visualized them through t-SNE. To obtain a better global structure, we set the perplexity of t-SNE to 500. Figure 4(a) visualizes the intent distribution yielded by DeepAligned, while Figure 4(b) displays the results yielded by our CoCoID. Note that the same color across the two sub-figures represents the same intent cluster. From the visualization, we can easily see that the area of the intent clusters produced by CoCoID is smaller than that by DeepAligned, and some clusters are even compacted into strips (e.g., intents 3 and 9 in Figure 4(b)). To explicitly quantify the compactness of intent clusters, we also calculated the average distance from each utterance to its cluster centroid. The average distance for DeepAligned is 0.34, while it is only 0.14 for our CoCoID, which strongly suggests that our proposed method does make clusters more compact.
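The average-distance-to-centroid compactness measure used above can be computed as in the following sketch (the function name and inputs are ours):

```python
import numpy as np

def avg_centroid_distance(reps, labels):
    """Average Euclidean distance from each utterance representation to
    its cluster centroid, a simple compactness measure (a sketch).

    reps:   (n, dim) utterance representations.
    labels: (n,) cluster assignment for each utterance.
    """
    reps = np.asarray(reps, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for c in np.unique(labels):
        members = reps[labels == c]
        centroid = members.mean(axis=0)
        # Sum of distances of all cluster members to their centroid.
        total += np.linalg.norm(members - centroid, axis=1).sum()
    return total / len(reps)
```

A lower value means utterances sit closer to their cluster centers, i.e., more compact clusters.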

Conclusions
In this paper, we have presented a semi-supervised intent discovery framework, CoCoID. It consists of two essential components: contrastive user utterance representation learning and intra-cluster knowledge distillation. Extensive experiments and analyses on real-life challenging datasets demonstrate that CoCoID outperforms previous state-of-the-art intent discovery models (12 baselines) and is able to learn contrastive representations and compact clusters.

Table 3: Statistics on the used datasets. Intents are divided into known (predefined) and unknown (novel) intents.

A Datasets
Detailed statistics of the four datasets are shown in Table 3.
To compare with previous works, we used the same data division as in prior studies.
BANKING is a single-domain intent classification dataset (Casanueva et al., 2020), which provides a fine-grained set of intents in the banking domain. It contains 13,083 utterances labeled with 77 intents.
In our experiments, the split of training, validation and test sets follows previous work.
StackOverflow is an intent classification dataset collected and processed by Xu et al. (2015) from Kaggle.com. Xu et al. (2015) randomly selected 20,000 question titles from 20 different labels. In our experiments, the division of training, validation and test sets follows Lin et al. (2020).
DBPedia is a DBpedia ontology dataset which contains 14 non-overlapping classes (Zhang and LeCun, 2015). We follow Lin et al. (2020) on data division to compare with their results fairly.

B Baselines
We compared with 12 different baselines, listed as follows.
K-means (KM): an iterative algorithm of clustering, which first randomly selects K objects as the initial cluster centroids and then assigns each object to the nearest cluster.
Agglomerative Clustering (AG): a bottom-up hierarchical clustering method which calculates the distance/similarity between classes.
SAE-KM: a method similar to K-means. The difference is that the feature extractor is a stacked autoencoder (SAE).