ClusterLLM: Large Language Models as a Guide for Text Clustering

We introduce ClusterLLM, a novel text clustering framework that leverages feedback from an instruction-tuned large language model, such as ChatGPT. Compared with traditional unsupervised methods that build upon "small" embedders, ClusterLLM exhibits two intriguing advantages: (1) it enjoys the emergent capability of LLMs even when their embeddings are inaccessible; and (2) it understands the user's preference on clustering through textual instructions and/or a few annotated data. First, we prompt ChatGPT for insights on clustering perspective by constructing hard triplet questions that ask which of B and C better corresponds to A, where A, B and C are similar data points that belong to different clusters according to a small embedder. We empirically show that this strategy is both effective for fine-tuning the small embedder and cost-efficient for querying ChatGPT. Second, we prompt ChatGPT for help on clustering granularity with carefully designed pairwise questions, and tune the granularity to the level of the cluster hierarchy that is most consistent with ChatGPT's answers. Extensive experiments on 14 datasets show that ClusterLLM consistently improves clustering quality, at an average cost of ~$0.6 per dataset.


Introduction
Text clustering, as a fundamental task in natural language processing (NLP), has a wide spectrum of applications, such as identifying public perception from social media (Park et al., 2022), analyzing causes of accidents (Xu et al., 2022), and detecting emerging research topics (Martínez et al., 2022). A common practice for text clustering is to apply clustering algorithms (MacQueen, 1967; Zhang et al., 2021a) on top of pre-trained embedders (Muennighoff et al., 2022; Wang et al., 2022; Su et al., 2022), which achieve higher performance with better pre-training quality. State-of-the-art large language models (LLMs) such as the recent GPT series (Brown et al., 2020; Ouyang et al., 2022; OpenAI, 2023) have demonstrated extraordinary language capabilities for various NLP applications. However, these GPT models can only be utilized through APIs that do not expose embedding vectors for clustering. Hence, LLMs cannot be directly applied to text clustering tasks.
In this paper, we provide insights on the question: can we leverage API-based LLMs to guide text clustering efficiently? We attack this challenging question by drawing inspiration from the observation that humans represent an instance by comparing it with others (Nosofsky, 2011). For instance, people often classify a new piece of music into a specific genre by relating it to familiar ones. In fact, pairwise relationships have been utilized in spectral clustering (Donath and Hoffman, 1972; Cheeger, 1970) before. Nonetheless, naively traversing all pairs within a dataset is obviously intractable and too expensive for querying LLMs.
We propose CLUSTERLLM, a framework that utilizes an LLM to guide a small embedder in finding text clusters at a low cost, as shown in Figure 1. It comprises two stages that are specially designed for two aspects of clustering: (1) perspective, i.e., the grouping criterion such as topic, intent or emotion, and (2) granularity, i.e., the scope of clusters.
In Stage 1, we prompt LLMs with a triplet task that predicts which of two candidate choices is closer to an anchor instance, in order to understand the user-preferred perspective. We choose this triplet task because (a) it is independent of cluster granularity and (b) the produced triplets can fine-tune the small embedder towards the right perspective. To improve sample efficiency, we further propose entropy-based triplet sampling to find the most informative triplets. Specifically, we first calculate entropy for each instance based on its cluster assignment probabilities, and then identify those with the highest entropy. Two candidate choices are then sampled from the anchor's nearest clusters to guarantee they are close enough to it.
In Stage 2, we first obtain a cluster hierarchy that starts from instance-level clusters and iteratively merges the two closest clusters until the entire dataset forms a single cluster. We then prompt LLMs to determine cluster granularity, with a few annotated data pairs as demonstrations. We construct the data pairs for prompting by sampling from the two clusters that are merged at each step of hierarchical clustering, so that they cover a wide range of granularities. The final decision is made by measuring the consistency between each level of the hierarchy and the LLM predictions.
We extensively evaluate CLUSTERLLM on 14 datasets that cover diverse tasks, including intent discovery, topic mining, type discovery, domain discovery, and emotion detection. Furthermore, these datasets span a wide range of granularities, with 10 to 150 clusters. We show that CLUSTERLLM is effective overall at improving clustering quality, outperforming both a deep clustering baseline and a self-supervised baseline. Moreover, an ablation study shows that our sampling strategy is effective compared to a random sampling baseline. Finally, CLUSTERLLM also outperforms clustering-error-based methods at determining cluster granularity.
In summary, our contributions are three-fold: (i) We propose CLUSTERLLM, a framework that utilizes sentence relations predicted by API-based LLMs to guide clustering; furthermore, it allows users to provide textual instructions and/or few-shot annotations to specify their preferences on clustering. (ii) To reduce API queries, we propose a novel entropy-based sampling strategy to find the most informative triplets; additionally, we utilize pairwise data sampled from hierarchical clustering to determine cluster granularity. (iii) Extensive experiments show that our proposed method can improve clustering performance at ~$0.2 for perspective and ~$0.4 for granularity with GPT-3.5.

Preliminary
Text clustering takes an unlabeled corpus $\mathcal{D} = \{x_i\}_{i=1}^N$ as input, and outputs a clustering assignment $\mathcal{Y} = \{y_i\}_{i=1}^N$ that maps each input text to a cluster index. To specify the user's needs, CLUSTERLLM integrates an additional textual instruction (e.g., "Select the example that better corresponds with the Query in terms of entity type.") to understand the perspective, and few-shot annotations (e.g., "Sentence1 and Sentence2 have the same entity type ...") to determine cluster granularity.

Our CLUSTERLLM
CLUSTERLLM is based on a pre-trained small embedder (Wang et al., 2022; Su et al., 2022), denoted as $f$, which usually represents sentences individually. In contrast, inspired by human cognitive ability (Nosofsky, 2011), CLUSTERLLM considers a pair or a triplet of sentences by prompting LLMs that are trained to follow human instructions (Ouyang et al., 2022; OpenAI, 2023). Specifically, CLUSTERLLM is a two-stage framework (see Figure 2). In Section 3.1 we introduce Stage 1, which utilizes a triplet task to improve clustering quality with respect to user-specified perspectives, along with a sampling strategy that reduces the number of API queries. In Section 3.2, we introduce Stage 2, which leverages a pairwise task to determine cluster granularity based on predictions from LLMs.

Triplet Task for Perspective
In this section, we explore how to harness a triplet task to refine the cluster structures for a user-specified perspective. A triplet task takes as input a tuple of three sentences $t = (a, c_1, c_2)$, where $a$ is the anchor and $(c_1, c_2)$ are two choices. We then prompt the LLM to select whichever of $(c_1, c_2)$ better corresponds with $a$, using a prompt $P_T$. Moreover, in order to specify the user's perspective, $P_T$ also takes a task instruction $I_T$ as input. The LLM then makes a choice $c_j = \mathrm{LLM}(P_T(I_T, t))$, where $c_j \in \{c_1, c_2\}$ indicates the choice that the LLM selects as positive; we denote the other (negative) one as $c_{\setminus j}$.
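As a concrete illustration, the following is a minimal sketch of how a triplet prompt $P_T$ could be assembled. The template mirrors the Bank77 example in Table 12, but the exact string formatting and the function name are our assumptions, not the paper's released implementation.

def build_triplet_prompt(instruction: str, anchor: str, choice1: str, choice2: str) -> str:
    """Assemble the triplet prompt P_T from the task instruction I_T and a
    triplet t = (a, c1, c2); template assumed from the Table 12 example."""
    return (
        f"{instruction}\n"
        f"Query: {anchor}\n"
        f"Choice 1: {choice1}\n"
        f"Choice 2: {choice2}\n"
        "Please respond with 'Choice 1' or 'Choice 2' without explanation."
    )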

Entropy-based Triplet Sampling
While one can randomly sample triplets to query the LLM, we demonstrate in experiments that this is inefficient. In this section, we address the question of mining informative triplets, both to save the cost of querying LLMs and to optimally improve the clustering. To achieve this, we resort to the current clustering results on the extracted embeddings $Z = \{z_i\}_{i=1}^N$, where $z_i = f(x_i)$. In summary, our algorithm contains two steps. Step 1: We find the most ambiguous instances as anchors based on entropy.
Step 2: For each anchor instance, we sample two choices from two of its closest clusters. Refer to Algorithm 1 for the entire process.
In Step 1, since the granularity is unknown at this stage, we perform clustering on top of $Z$, where the clustering hyperparameters are consistent across datasets and only specific to the embedder model $f$. A cluster center $\mu_k$ is thereafter calculated for each cluster $k$ by averaging the embeddings assigned to it. Following (Xie et al., 2016; Van der Maaten and Hinton, 2008), we calculate instance-wise soft assignments with Student's t-distribution,
$$p_{ik} = \frac{\left(1 + \|z_i - \mu_k\|^2/\alpha\right)^{-\frac{\alpha+1}{2}}}{\sum_{k'} \left(1 + \|z_i - \mu_{k'}\|^2/\alpha\right)^{-\frac{\alpha+1}{2}}},$$
where $\alpha = 1$ is the degree of freedom. We then define the closest clusters for instance $i$ as the $K_{\text{closest}}$ clusters with the largest soft assignments $p_{ik}$. Here, $K_{\text{closest}}$ is proportional to the total number of clusters $K$, i.e., $K_{\text{closest}} = \lceil \epsilon K \rceil$, where we fix $\epsilon$ to a small value, such as 2%.
We then compute the entropy over these closest clusters with renormalized probabilities $p'_{ik}$:
$$h_i = -\sum_{k \in \mathcal{C}_i} p'_{ik} \log p'_{ik}, \quad \text{where } p'_{ik} = \frac{p_{ik}}{\sum_{k' \in \mathcal{C}_i} p_{ik'}}$$
and $\mathcal{C}_i$ denotes the set of closest clusters for instance $i$. We sort the entire dataset in descending order of the entropies $H = \{h_i\}_{i=1}^N$. We introduce two hyperparameters $\gamma_{high}$ and $\gamma_{low}$ that control the proportion interval of the sorted dataset from which anchors are drawn. Our hypothesis, verified in Section 4.6, is that higher-entropy anchors (smaller $\gamma_{high}$ and $\gamma_{low}$) form more informative triplets. In Step 2, we randomly sample two clusters $C_1, C_2$ from the $K_{\text{closest}}$ closest clusters, and then sample one sentence from each of them as the choices $c_1, c_2$ (see lines 11 and 12 of Algorithm 1). In other words, each choice should be either a positive or a hard negative for the anchor. Finally, we remove triplets that are repeated or that have a choice identical to the anchor, and continue to sample triplets until reaching the budget $Q$. Remarks. (1) Since $Q$ is defined by the user and is independent of the dataset size, our sampling is cost-efficient. For example, in our experiments, 1,024 queries improve performance on datasets of both ~3,000 and ~50,000 instances. (2) From the view of ground truth, the sampled triplets might be "both are correct" or "none of the above". However, we argue that even these triplets provide soft alignment information, i.e., a ranking of closeness between the choices. (3) Our sampling method may also be utilized in active learning to acquire human annotations when no prior knowledge about the categories is available.
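The procedure can be summarized in a short NumPy sketch, shown below under a few assumptions of ours: hard labels come from the Stage-1 clustering, centers are recomputed as cluster means, and the floor of 2 on $K_{\text{closest}}$ is added so that two distinct clusters always exist per anchor.

import numpy as np

def entropy_based_triplets(Z, labels, Q=1024, eps=0.02,
                           gamma_high=0.0, gamma_low=0.2, alpha=1.0, seed=0):
    """Sketch of entropy-based triplet sampling (Section 3.1.1)."""
    rng = np.random.default_rng(seed)
    K = int(labels.max()) + 1
    mu = np.stack([Z[labels == k].mean(0) for k in range(K)])   # cluster centers

    # Student's t soft assignments p_ik with alpha = 1 (Xie et al., 2016)
    d2 = ((Z[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
    q = (1.0 + d2 / alpha) ** (-(alpha + 1) / 2)
    p = q / q.sum(1, keepdims=True)

    # entropy over the K_closest nearest clusters with renormalized probabilities
    k_closest = max(int(np.ceil(eps * K)), 2)                   # floor of 2 is our assumption
    top = np.argsort(-p, 1)[:, :k_closest]                      # closest clusters per instance
    p_top = np.take_along_axis(p, top, 1)
    p_top /= p_top.sum(1, keepdims=True)
    h = -(p_top * np.log(p_top)).sum(1)

    # anchors: the [gamma_high, gamma_low] proportion interval, highest entropy first
    order = np.argsort(-h)
    anchors = order[int(gamma_high * len(Z)):int(gamma_low * len(Z))]

    triplets, seen = [], set()
    for i in anchors:
        c1, c2 = rng.choice(top[i], size=2, replace=False)      # two of the closest clusters
        a1 = int(rng.choice(np.where(labels == c1)[0]))         # one choice from each cluster
        a2 = int(rng.choice(np.where(labels == c2)[0]))
        if i in (a1, a2) or (i, a1, a2) in seen:                # drop degenerate or repeated triplets
            continue
        seen.add((i, a1, a2))
        triplets.append((int(i), a1, a2))
        if len(triplets) == Q:                                  # stop at the query budget
            break
    return triplets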

Finetuning Embedder
Now that we have the triplet predictions, it is still not obvious how to utilize them for clustering. Previous research relies on deep constrained clustering (Zhang et al., 2020; Manduchi et al., 2021), which is often sensitive to noisy labels (Basu et al., 2008). In this paper, we instead focus on fine-tuning the base embedder $f$ towards an embedding space that better reflects the user's perspective. We exploit both hard and in-batch negatives. Following (Su et al., 2022; Ni et al., 2022b), for a triplet $t = (a, c_j, c_{\setminus j})$ with positive $c_j$ and hard negative $c_{\setminus j}$, we optimize the objective
$$\ell = -\log \frac{\exp\left(\mathrm{sim}(a, c_j)/\tau\right)}{\sum_{c \in \mathcal{B}} \exp\left(\mathrm{sim}(a, c)/\tau\right)},$$
where $\mathcal{B}$ combines $c_j$, $c_{\setminus j}$ and other in-batch negatives, and $\tau$ is a temperature parameter. Following the original implementation, we also compute the loss with $a$ and $c_j$ swapped. The fine-tuned embedder can then be applied to find even more informative triplets with our sampling method, which further improves performance in an iterative manner. We acquire clustering assignments by running clustering algorithms on top of the extracted embeddings.
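A minimal PyTorch sketch of this objective follows. The cosine-similarity scoring and the temperature value are our assumptions (the paper does not state them here), and the three inputs are assumed to be row-aligned batches of embeddings.

import torch
import torch.nn.functional as F

def triplet_infonce(a, pos, neg, tau=0.01):
    """InfoNCE over a batch of triplets: for each anchor a[i], the positive is
    pos[i]; negatives are the hard negative neg[i] plus all in-batch pos/neg.
    Cosine similarity and tau=0.01 are assumptions, not values from the paper."""
    a, pos, neg = (F.normalize(x, dim=-1) for x in (a, pos, neg))
    candidates = torch.cat([pos, neg], dim=0)                  # (2B, d)
    logits = a @ candidates.T / tau                            # (B, 2B)
    labels = torch.arange(a.size(0), device=a.device)          # positive sits at column i
    loss = F.cross_entropy(logits, labels)
    # as in the original implementation, also compute the loss with a and c_j swapped
    logits_swapped = pos @ torch.cat([a, neg], dim=0).T / tau
    return (loss + F.cross_entropy(logits_swapped, labels)) / 2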

Pairwise Task for Granularity
In this section, we build upon the refined embedding space from Section 3.1 to determine cluster granularity. We convert the problem of determining granularity into finding the best step in a cluster hierarchy (see Figure 2, right), where each step denotes a unique granularity (or, equivalently, a number of clusters). This is non-trivial, since different granularities can be applied to the same dataset (such as domains or topics). To tackle this challenge, we query the LLM with a pairwise task that predicts whether a pair of data $p$ belongs to the same cluster, using a prompt $P_P$: $w = \mathrm{LLM}(P_P(I_P, \{p_d\}_{d=1}^D, p))$, where $w \in \{\text{same}, \text{different}\}$ is the binary decision, $I_P$ is the task instruction and $\{p_d\}_{d=1}^D$ are few-shot demonstration pairs used for in-context learning (typically $D = 4$). We assume these demonstration pairs are annotated by users who have a desired cluster granularity in mind. We also attach a brief justification to each demonstration pair (see Table 12, bottom, for an example).

Determine Granularity with Pairwise Hierarchical Sampling
We then introduce how to sample pairs from the cluster hierarchy to query LLMs and determine granularity. Following Pelleg et al. (2000), we assume a maximum and a minimum number of clusters (denoted $k_{max}$ and $k_{min}$), which depend on the user's expectation of the granularity. We then randomly sample $\lambda$ (1 or 3 in our experiments) pairs of data from the two clusters merged at each step to form candidate pairs $\{p_i\}_{i=1}^{N_p}$, where $N_p = \lambda(k_{max} - k_{min})$. These pairs cover the entire range of granularities between $k_{max}$ and $k_{min}$, and are used to query LLMs. After that, each level of granularity is examined against the LLM predictions to choose the one with the highest consistency measure $\mathcal{M}$:
$$k^* = \arg\max_{k} \mathcal{M}(W_p, W_k),$$
where $W_p$ denotes the LLM predictions from the pairwise task and $W_k$ represents a set of binary values indicating whether each pair of data is in the same cluster at granularity $k$. Empirically, we found the following to perform better in our framework: use the F-beta score, a weighted harmonic mean of precision and recall, as the measurement $\mathcal{M}$, with $W_p$ as labels and $W_k$ as predictions. Finally, for large-scale datasets, we address the high time complexity of hierarchical clustering by applying it on top of mini-batch K-means. See details in Appendix A. Remarks. Similar to Section 3.1.1, pairwise hierarchical sampling can also be used to acquire human annotations. Nonetheless, the reliability of the algorithm still depends on the quality of the clusters. In an extreme case where the clusters are completely random, it is impossible to find the right granularity even if all the pairwise predictions are correct.
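Below is a sketch of the granularity selection loop. Treating the LLM decisions $W_p$ as labels and using $\beta = 0.92$ follows Section 3.2 and Appendix B; the function name and input layout are illustrative assumptions.

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.metrics import fbeta_score

def choose_granularity(Z, pairs, w_p, k_min=2, k_max=200, beta=0.92):
    """Pick the number of clusters whose assignments agree best with the LLM's
    pairwise decisions. `pairs` is a list of (i, j) index pairs sampled from the
    hierarchy; `w_p` holds the LLM's same-cluster predictions for those pairs."""
    link = linkage(Z, method="ward")                     # cluster hierarchy
    best_k, best_score = k_min, -1.0
    for k in range(k_min, k_max + 1):
        assign = fcluster(link, t=k, criterion="maxclust")
        w_k = np.array([assign[i] == assign[j] for i, j in pairs])
        # F-beta with W_p as labels and W_k as predictions (Section 3.2)
        score = fbeta_score(w_p, w_k, beta=beta)
        if score > best_score:
            best_k, best_score = k, score
    return best_k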

Experiments
We first evaluate CLUSTERLLM on clustering quality with the ground truth number of clusters in Section 4.4. Then we conduct ablation studies in Section 4.6 to further analyze the effectiveness of CLUSTERLLM. Finally, we show results on determining cluster granularity in Section 4.7.

Datasets
We provide a high-level summary of the evaluated datasets in this section; see Appendix E for more descriptions. In this paper, we evaluate on a broad range of clustering datasets with various perspectives and granularities. Furthermore, to better analyze the effect of scale, each dataset has both a small-scale and a large-scale version. The two versions differ in the number of data points while keeping the same number of clusters. A summary of dataset statistics is shown in Table 1. Note that there are no data splits in clustering. Intent (Domain) Discovery.
Intent discovery (Zhang et al., 2021b, 2022) discovers unknown intents in unlabeled customer utterances. For CLINC, Massive and MTOP, we also use domains as labels to convert them into domain discovery datasets. Type Discovery. Type discovery (Li et al., 2022) resolves the closed-world set-up of traditional information extraction. In this work, we focus on three tasks: entity, relation and event type discovery. To indicate specific mentions (entities or event triggers), we directly append them after the sentences in natural language formats, such as "The relation between [ENTITY1] and [ENTITY2]". Topic Mining. We adapt three topic mining datasets from MTEB (Muennighoff et al., 2022). Emotion. We adapt GoEmo (Demszky et al., 2020), a fine-grained emotion detection dataset, by removing multi-label and neutral instances.

Experiment Details
Query LLMs. The prompt only contains a task-specific instruction (see Table 11). We set the generation temperature to 0.5. Explanations are suppressed by adding the postfix "Please respond with 'Choice 1' or 'Choice 2' without explanation" and setting a maximum of 10 tokens. We then map responses to binary choices by directly checking whether one of the texts "Choice 1" or "Choice 2" appears in the response. We find that a very small number of responses contain neither choice, and we discard them during fine-tuning. We use the Python API tool provided by OpenAI.
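The query-and-parse step can be sketched as follows. The openai-python API surface has changed across versions; this sketch assumes the legacy ChatCompletion interface (openai<1.0) that was current at the time, and the helper name is ours.

import openai  # legacy (<1.0) openai-python interface assumed

def query_triplet_choice(prompt):
    """Send one triplet prompt and map the response to a binary choice;
    returns 'Choice 1', 'Choice 2', or None (discarded during fine-tuning)."""
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.5,   # generation temperature from Section 4.2
        max_tokens=10,     # answers are restricted to 'Choice 1'/'Choice 2'
    )
    text = response["choices"][0]["message"]["content"]
    for choice in ("Choice 1", "Choice 2"):
        if choice in text:
            return choice
    return None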
Triplet Sampling. For both small- and large-scale experiments, we set a budget of $Q = 1{,}024$ triplets. We set $\gamma_{low} = 20\%$ and $\gamma_{high} = 0$. For the clustering in Stage 1, we fix the algorithms' hyperparameters across datasets. We choose agglomerative clustering with a fixed distance threshold of 67 for small-scale experiments on Instructor, and 77 on E5 (the embeddings are preprocessed with a standard scaler). For large-scale datasets, we choose mini-batch K-means with a fixed number of 100 clusters due to its lower latency. Clustering algorithms are implemented with scikit-learn (Pedregosa et al., 2011).
Fine-tune Embedders. In this work, we focus on two state-of-the-art pre-trained embedders: Instructor (Su et al., 2022) and E5 (Wang et al., 2022). We only use the large versions. Refer to Appendix D for details.
Evaluation. To reduce cost, we run CLUSTERLLM once for each dataset. We then run (mini-batch) K-means on (large) small-scale datasets for 5 seeds with the ground truth K. We report two metrics: clustering accuracy and NMI (see Table 2).

Compared Methods
E5 and Instructor. We directly apply (mini-batch) K-means on embeddings extracted from instructor-large and e5-large. self-supervise-I(E). To verify that the performance improvement of CLUSTERLLM comes from more accurate triplet predictions rather than merely from domain-specific fine-tuning, we propose a self-supervised fine-tuning baseline that uses exactly the same triplets as CLUSTERLLM but replaces the LLM predictions with self-supervised ones that select the closest choice in the embedding space. SCCL-I(E). We also combine Instructor and E5 with SCCL (Zhang et al., 2021a), an unsupervised deep clustering algorithm that utilizes the entire dataset for training. Note that our method uses less data for training. See Appendix D for details.

Main Results
We show main results on small-scale datasets in Table 2, with several variants of our method: CLUSTERLLM-I(E) adopts Instructor or E5 as the embedder. CLUSTERLLM-I(E)-iter applies the entire framework twice in an iterative manner, using the previously fine-tuned model as initialization and the 1,024 triplets inferred from the new embeddings for fine-tuning. All of these use GPT-3.5 for prediction. We make the following observations: (1) CLUSTERLLM consistently improves upon both embedders. For example, CLUSTERLLM-I increases performance by 6.71% on FewRel, and CLUSTERLLM-E increases performance by 9.19 on Bank77. However, we observe no improvements on Massive(D) and CLINC(D).
(2) CLUSTERLLM outperforms the deep clustering and self-supervised baselines. For instance, CLUSTERLLM-I surpasses self-supervise-I on all but two datasets, and it is also better than SCCL-I on 11 of 14 datasets. Furthermore, these improvements are consistent across both reported metrics.
(3) Combined with the results in Appendix F, applying CLUSTERLLM iteratively is beneficial, underscoring the potential for further improvements.

Analysis on Triplet Prediction Accuracy
We attribute the improvements in clustering quality to more accurate triplet predictions. In Table 3, we show the accuracy on predicted triplets that have a ground truth (exactly one positive and one negative choice according to the ground-truth labels) under two different sampling methods. Random triplet sampling uniformly samples three random instances as the query and two choices, where we guarantee by filtering that the two choices differ from the anchor. Furthermore, we also show the selection accuracy obtained with Euclidean distances between embeddings as a comparison. We observe that GPT-3.5/4 consistently improve upon Instructor on high-entropy examples, supporting our hypothesis. In contrast, with random sampling, there are significantly fewer ground-truth triplets and the accuracy gap is much smaller, or performance even decreases.

Ablation Study
Clustering Quality. We show ablation studies of CLUSTERLLM based on Instructor in Table 4. Specifically, we present results with three kinds of predictions on the same set of triplets for fine-tuning: GPT-3.5, GPT-4, and GT&GPT-3.5, which replaces the triplet predictions of GPT-3.5 with the ground truth on those triplets that have one. We observe that GPT-4 improves only marginally upon GPT-3.5 given its much higher cost. When provided with human labels, CLUSTERLLM-GT&GPT3.5 achieves the highest performance, which indicates the possibility of further improvement with more accurate predictions. We make similar observations for large-scale datasets in Table 6. Sampling Strategy. In this section, we show an ablation study on entropy-based sampling. In Figure 3, we observe that clustering accuracy increases with higher entropies (or, equivalently, a smaller mean of the interval) except on GoEmo. We make two hypotheses: (1) LLMs are much better than small embedders on harder instances; (2) ... Moreover, fine-tuning with randomly selected triplets even decreases performance, which demonstrates the cruciality of triplet sampling.

Determining Cluster Granularity
In this section, we show the results for determining cluster granularity. We evaluate on a subset of 8 datasets covering various cluster granularities, with $k_{max} = 200$ and $k_{min} = 2$. We compare with several methods that rely on clustering errors. For our methods, we show results with $\lambda \in \{1, 3\}$ (except for GPT-4, to reduce costs), which involve 198 and 594 pairs in total, respectively. To simulate experts providing demonstrations, we directly sample 16 pairs from the small-scale datasets when $\lambda = 3$ and then choose 2 positive and 2 negative pairs as demonstrations. Note that we use the same demonstrations for the large-scale experiments. See more details in Appendix B. We make several observations from Table 5 and Table 7: (1) Our methods rank higher. Most baseline methods predict similar numbers of clusters for domain and intent, while our methods can effectively distinguish between the two. For instance, on MTOP(I)/(D) in Table 5, BIC predicts 69/64 clusters while our method (GPT-3.5, λ = 3) predicts 92/18. (2) Increasing λ generally helps (e.g., MTOP(D) in Table 5) but does not always make a large difference. (3) GPT-4 significantly improves upon GPT-3.5, probably due to its better understanding of the demonstrations.
Related Work

Pre-trained Embedding Models. Generic pre-trained text embedding models (Reimers and Gurevych, 2019; Gao et al., 2021; Ni et al., 2022a,b) are widely applied in text similarity, classification, clustering and information retrieval. Recently, two embedding models, E5 (Wang et al., 2022) and Instructor (Su et al., 2022), have shown superior performance on a popular benchmark (Muennighoff et al., 2022). Specifically, E5 is pre-trained on web-scraped data pairs with a contrastive objective, and Instructor is pre-trained on 330 tasks with instructions. CLUSTERLLM aims at improving these models with LLMs.

Conclusion
In this paper, we study how to leverage API-based LLMs to guide small embedders for text clustering, in order to benefit from the high-level language capabilities of LLMs and from the user's instructions on clustering. We propose to prompt LLMs with two kinds of sentence relationship tasks: a triplet task and a pairwise task. The triplet task chooses the sentence that is most similar to an anchor, combined with a perspective instruction from the user; the predicted triplets are used for fine-tuning small embedders.
The pairwise task judges whether a pair of sentences belongs to the same category, hinted by few-shot demonstrations, and the predictions are then used to determine cluster granularity with a consistency measure. Extensive experiments show that our proposed framework CLUSTERLLM can improve clustering quality and propose reasonable cluster granularities at a negligible cost. However, CLUSTERLLM still relies on fine-tuning the embedding model itself, which is inefficient and inapplicable to black-box embedding models. We encourage future work to explore the potential of model-free training, such as constrained clustering.

A Details of Scaling up Hierarchical Clustering
A major drawback of hierarchical clustering is its $O(N^3)$ time complexity, which makes the algorithm hard to deploy on larger datasets. However, since we are only interested in a specific range of granularities in our scenario, hierarchical clustering can start from an intermediate step. We address this issue by first running mini-batch K-means with $k_{max}$ clusters and then running hierarchical clustering with the current assignments as inputs. Specifically, we use agglomerative clustering with Ward's method (Ward Jr, 1963). We first calculate distances between each pair of clusters following Murtagh and Contreras (2011) and then provide them as inputs to the nearest-neighbor chain algorithm. Finally, the returned hierarchy is combined with the K-means assignments to infer clusters.
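A simplified sketch of this two-step procedure is shown below. Computing the initial inter-cluster distances exactly as in Murtagh and Contreras (2011) requires size-weighted Ward updates, which this sketch approximates by building the hierarchy directly over the centroids; the function name is ours.

from scipy.cluster.hierarchy import linkage
from sklearn.cluster import MiniBatchKMeans

def two_step_hierarchy(Z, k_max=200, seed=0):
    """Step 1: mini-batch K-means with k_max clusters. Step 2: Ward hierarchy
    over the resulting centroids, so agglomeration starts from an intermediate
    step instead of the O(N^3) full hierarchy. Size weighting is omitted."""
    km = MiniBatchKMeans(n_clusters=k_max, random_state=seed).fit(Z)
    link = linkage(km.cluster_centers_, method="ward")
    # the returned hierarchy is combined with km.labels_ to infer clusters
    return km.labels_, link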

B More Details about Determining Cluster Granularity
Previous methods often employ clustering errors as a metric, ignoring the user's needs on granularity. The silhouette coefficient (Rousseeuw, 1987) indicates clustering quality without ground truths by exploiting the inter-cluster distance to the nearest cluster and the intra-cluster distance; we find the granularity by choosing the one with the best silhouette coefficient. The elbow method (Thorndike, 1953) is a heuristic that plots the clustering error with respect to different levels of granularity in the hierarchy; the best granularity is then determined by the largest elbow length. X-means (Pelleg et al., 2000) is a variation of K-means that starts with the lowest number of clusters, then repeatedly attempts to split clusters by running 2-means on them, evaluating with the Bayesian Information Criterion (BIC) (Goutte et al., 2001). BIC (Goutte et al., 2001) calculates the Bayesian Information Criterion for each granularity. Cluster-Size (Zhang et al., 2021b) uses a confidence threshold to filter small clusters, starting from the maximum number of clusters. For all methods, we use the same fine-tuned embeddings (CLUSTERLLM-I in Table 2). The same cluster hierarchy is used (except for X-means, which relies on K-means), acquired either from hierarchical clustering for small-scale datasets or from our proposed two-step method in Section 3.2 for large-scale ones. For our methods, the weight in the F-beta score is set to 0.92 through empirical selection on Bank77. Because of their high latency, results for Silhouette and X-means are not shown on large-scale datasets. After sampling 16 data pairs, we tend to choose positives with finer granularity and negatives with coarser granularity. We also consider sentence length to minimize cost. We use label names as justifications, and we always put the 2 positives before the 2 negatives (see Table 12, bottom).
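For concreteness, a sketch of the silhouette baseline follows; it scans the same range of granularities as our method but scores each level with the silhouette coefficient instead of LLM consistency. The function name and input layout are illustrative assumptions.

from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

def silhouette_granularity(Z, k_min=2, k_max=200):
    """Pick the number of clusters with the best silhouette coefficient;
    no ground truth or LLM feedback is used."""
    best_k, best_s = k_min, -1.0
    for k in range(k_min, k_max + 1):
        labels = AgglomerativeClustering(n_clusters=k).fit_predict(Z)
        s = silhouette_score(Z, labels)
        if s > best_s:
            best_k, best_s = k, s
    return best_k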

C Analysis for Determining Granularity
Prompt Design. We show the analysis of prompt design for determining granularity in Table 8. We experiment with two settings: (1) remove the justifications for all demonstrations and only keep the "Yes" or "No" answer;
(2) remove all demonstrations and any granularity-related words (such as "domain"). We observe that demonstrations are necessary and that adding justifications has a positive impact.
Visualization of Consistency Score. We visualize the consistency score with respect to the number of clusters. The consistency scores exhibit continuous variation and peak at the best number of clusters.

D Details of Embedders and Fine-tuning
For all the experiments (including those with or without fine-tuning), we use the large versions of both Instructor and E5 (i.e., hkunlp/instructor-large & intfloat/e5-large).
For Instructor, we use the same or similar prompts as in the original paper; see Table 10.
For fine-tuning, we adopt the same hyperparameters as in (Su et al., 2022), but modify the learning rate to 2e-6, the maximum number of gradient steps to 3,840 for Instructor (~15 epochs) and 1,280 for E5, and the batch size to 4. We chose this number of gradient steps at the beginning of our experiments by observing no performance increase beyond it on several datasets. Training is conducted on a single NVIDIA Quadro RTX 8000 GPU.
For SCCL-I(E), we reduce the maximum token length to 128 due to limited compute resources. We use the same learning rate of 2e-6 as before for Instructor and 2e-7 for E5, since we found the performance unstable with a larger learning rate. The batch size is set to 16 and we evaluate representations with K-means after 200 iterations. Also note that we do not alter the prompts in Instructor during data augmentation.

E Description of Datasets
Bank77 (Casanueva et al., 2020) is a popular dataset in intent discovery that focuses on fine-grained intent categories in a single domain, "banking". CLINC(I) (Larson et al., 2019) was originally created for detecting utterances that fall outside of the supported intents. The dataset also contains multiple domains, such as "travel", "utility" and "work". In this experiment, we discard all out-of-scope utterances and only focus on in-domain ones. Moreover, we create a domain discovery dataset CLINC(D) that uses the domains as labels. Massive(I)/(D) (FitzGerald et al., 2022) and MTOP(I)/(D) (Li et al., 2021) are both from MTEB (Muennighoff et al., 2022). Here "I" denotes intent and "D" denotes domain (or scenario). These datasets were originally used for classification but are adapted here for clustering. We also remove intents with only a few instances and keep English-only data. For all these datasets, we use the train and test sets as the large- and small-scale versions, respectively. For FewRel (Gao et al., 2019) and FewEvent (Deng et al., 2020), we first randomly split the datasets into train and test sets, and then sample from the train set for the large-scale version and from the test set for the small-scale version. For FewNerd (Ding et al., 2021), we use the original train and test splits. For StackEx, Reddit (Geigle et al., 2021) and ArxivS2S, we combine all splits into a single dataset and remove topics with only a few instances; the datasets are then randomly split into large- and small-scale versions. To show the dataset balance, we report the entropy of the class distribution in Table 9.

F Results of More Iterations
We show the results over 4 iterations of CLUSTERLLM in Figure 5. At each iteration, we sample triplets from the previously fine-tuned embedding space and continue to fine-tune the model from the previous checkpoint. We also show the self-supervised results using the same checkpoint fine-tuned with GPT-3.5 predictions as initialization at each iteration. We observe that using GPT-3.5 predictions is almost always beneficial. The performance generally increases and saturates at the fourth iteration, with the exception of GoEmo.

FewRel: Select the example that better corresponds with the Query in terms of relation type.
FewNerd: Select the example that better corresponds with the Query in terms of entity type.
FewEvent: Select the example that better corresponds with the Query in terms of event type.
StackEx: Select the StackExchange question that better corresponds with the Query in terms of topic.
ArxivS2S: Select the Arxiv paper title that better corresponds with the Query in terms of domain.
GoEmo: Select the sentence that better corresponds with the Query in terms of emotion expressed.
Massive(I): Select the user utterance that better corresponds with the Query in terms of intent.
MTOP(I): Select the user utterance that better corresponds with the Query in terms of intent.
Reddit: Select the Reddit question that better corresponds with the Query in terms of topic.
Massive(D): Select the user utterance that better corresponds with the Query in terms of scenario.
MTOP(D): Select the user utterance that better corresponds with the Query in terms of domain.
CLINC(D): Select the customer utterance that better corresponds with the Query in terms of domain.

Table 11: Prefixes of prompts for the triplet task. Note that while we use different prompts for domain and intent (such as CLINC(I) and CLINC(D)) in our experiments, they might be used interchangeably.

Triplet Task. Select the banking customer utterance that better corresponds with the Query in terms of intent.
Query: Should i reinstall the payment app?
Choice 1: I've received my card so now I need to know how to sync it to the app.
Choice 2: Can I still use the app if I switched phones?

Pairwise Task. [Example1] Sentence 1: I would like to see the source of my money. Sentence 2: My source of funds need verified. Yes. Because both intents are verify source of funds.
[Example2] Sentence 1: Is there a fee for topping up Sentence 2: What are the top up charges for US cards? Yes. Because both intents are top up by card charge.
[Example3] Sentence 1: Can I reactivate my lost card that I found this morning in my jacket pocket? Sentence 2: how to activate card? No. Because Sentence 1 has intent card linking and Sentence 2 has intent activate my card.
[Example4] Sentence 1: What will I be charged for a physical card? Sentence 2: My card is about to expire and I need to know how much it costs and how long ... No. Because Sentence 1 has intent order physical card and Sentence 2 has intent card ...
Determine whether the intents of the two banking customer utterances below belong to the same intent category using the above examples.
Sentence 1: $1 extra has been charged on my statement, why is that?
Sentence 2: Will it automatically top-up if there isn't much money left?
Please respond with 'Yes' or 'No' without explanation.

Figure 1: LLMs like ChatGPT are not directly applicable to text clustering because their embeddings are inaccessible. CLUSTERLLM resolves this dilemma by leveraging the LLM as a guide for text clustering.

Figure 2: An overview of CLUSTERLLM. It utilizes an LLM to guide an embedder for text clustering at a low cost.

Figure 3: Relative clustering accuracy (divided by the maximum for better alignment across datasets) of CLUSTERLLM-GPT3.5 with different ranges of entropy selected. The x-axis shows the mean of the interval, where the interval length is set to 20%. For example, "mean of interval = 50%" means γ_high = 40% and γ_low = 60% (see Section 3.1.1). ♦ marks the setting used for the main experiments.
... studies a similar problem by assigning instances to different explanations proposed by LLMs. Another recent work, IDAS (Raedt et al., 2023), directly encodes the concatenation of each sentence with abstractive summarizations from LLMs for clustering.
Figure 6: Scatter plots of t-SNE embeddings. We select 10 classes from each dataset, denoted by colors.

Table 2: Comparison of clustering accuracy and NMI with known granularity for evaluation. Averages over all 14 datasets are shown in the last two columns. Best results are bolded.

Table 3: Analysis of triplet prediction accuracy († is used to produce the results of CLUSTERLLM-I in Table 2). Red and green mean decreased and increased performance, respectively. "#GT Triplets" denotes triplets that have a ground truth (see Section 4.5 for details).

Table 4: Ablation study on clustering quality with Instructor as the backbone and known granularity for evaluation. See more results with large-scale datasets in Table 6.

Table 5: Inferred granularity on small-scale datasets. The maximum & minimum numbers of clusters are set to 200 & 2. The results are shown in the format "[#clusters] (errors)". The "Rank" column is computed with 1-level ranking (Colombo et al., 2022) with inverse errors. "GT" is ground truth. See results for large-scale datasets in Table 7.

Table 6: Ablation study on clustering quality for large-scale datasets.

Table 7: Inferred granularity on large-scale datasets. The setting is the same as in Table 5.

Table 8: Prompt designs for determining granularity. We use the Instructor embedding with prompts, and report results of GPT-4 with λ = 1.

Table 12: One example from Bank77 for both the triplet task and the pairwise task.