Improved Training of Deep Text Clustering



Introduction
Cluster analysis plays an important role in machine learning and data mining, as it can classify data into different groups in an unsupervised way. Over the past decades, a large number of clustering methods with shallow models have been proposed (Ren et al., 2019; Comaniciu and Meer, 2002; Ren et al., 2018; Bishop and Nasrabadi, 2006; Ren et al., 2017; Cai et al., 2013; Huang et al., 2021). Their performance on complex data is limited by the weak feature-learning power of shallow models.
Recently, deep learning based clustering approaches (referred to as deep clustering) aim to extract more clustering-friendly features from data and to perform clustering with the learned features simultaneously (Chang et al., 2017; Albert et al., 2022; Ronen et al., 2022).
After in-depth research, we find that most deep clustering models are optimized through specific supervision information. We call this generalized supervision in clustering; it is implemented through generalized labels in most cases. The generalized labels are generated differently by different deep clustering algorithms; we elaborate on them in Section 2. Here, we argue that the generalized labels are full of noise due to the performance limitations of the clustering model and the estimation methods, especially at the beginning of the clustering process, which disturbs the recognition of cluster groups and degrades the clustering result. To this end, inspired by Positive and Unlabeled learning (PU learning), we propose a novel method to reduce the impact of noise in generalized labels, namely Deep Clustering optimization from Generalized Labeled and Unlabeled data (DCGLU for short).
The proposed DCGLU divides the samples into high-confidence and low-confidence samples according to the confidence that each sample has been correctly clustered. Obviously, the generalized labels of low-confidence samples tend to contain more errors, as decision confidence is usually related to accuracy (Hu et al., 2020). We therefore propose to regard the low-confidence samples as unlabeled samples and the high-confidence samples as labeled samples, and then learn the model in the same fashion as PU learning, which reduces the errors in the generalized supervision and thus improves the clustering performance.
Overall, our contributions can be summarized as follows: (1) we identify a problem that exists in most deep clustering approaches and formulate it as generalized supervision with generalized labels, which will facilitate the research of deep clustering; (2) we propose a general deep clustering optimization method, which can be leveraged to reduce the impact of noise in clustering; (3) experiments show that the proposed DCGLU can improve the clustering features and clustering results of strong baselines.

Problem Formulation
Given an unlabeled text dataset D = {x_i, i = 1, ..., N}, where x_i represents the i-th text in the corpus, our goal is to cluster the N samples into k different classes with an unsupervised method. Each sample in D is represented by a feature vector; we use the pre-trained language model BERT (Devlin et al., 2018) to obtain the representation e_i = BERT(x_i) of each text.
As discussed in Section 1, the deep model in deep clustering methods is usually optimized with specific supervision information. We refer to this specific supervision as generalized supervision, which is basically realized through generalized labels. The way generalized labels are generated varies among clustering methods: 1) the cluster centers used in K-means-based methods (Yang et al., 2017; Xie et al., 2016) can be considered generalized labels; 2) spectral clustering methods (Chang et al., 2017; Yang et al., 2019) use a similarity matrix between samples to minimize the similarity distance between samples, so the similarity between samples can be regarded as the generalized label; 3) Gaussian mixture model methods (Ronen et al., 2022) assume that the samples in each class obey an independent Gaussian distribution, so the center of the Gaussian distribution can be regarded as the generalized label; 4) (Caron et al., 2018) use the clustering assignment results as pseudo-labels to optimize feature generation, so the clustering assignments are the generalized labels.
Formally, a deep clustering algorithm generates a target distribution P, which indicates the probability distribution of samples over categories, and optimizes the clustering model through stochastic gradient descent. Hence, the generated target distribution P can be considered the distribution of the generalized labels Y′ of the samples. The process of fine-tuning the deep clustering model is the process of generalized label exploitation, which can be described formally by the empirical risk R_δ(f):

R_δ(f) = E_{(x, y′) ∼ P}[L(f(x), y′)] ≈ (1/N) Σ_{i=1}^{N} L(f(x_i), y′_i),    (1)

where L is the distance metric between f(x) and the generalized label y′, and (x_i, y′_i) ∼ P.
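As an illustration, the empirical risk R_δ(f) over generalized labels can be computed directly from the model's soft cluster assignments and the target distribution. The following is a minimal sketch (not the authors' implementation), taking the distance metric L to be cross-entropy:

```python
import numpy as np

def generalized_risk(probs, target, eps=1e-12):
    # probs: (N, k) model cluster-assignment probabilities f(x_i)
    # target: (N, k) target distribution P, i.e. soft generalized labels y'_i
    # L is taken to be cross-entropy here; the paper only requires a distance metric
    return float(-np.mean(np.sum(target * np.log(probs + eps), axis=1)))
```

When the predictions match the target distribution exactly, the risk approaches zero; noisy generalized labels keep it high, which is exactly the signal the fine-tuning step exploits.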

Method
The noise in the generalized labels Y′ results in an error propagation problem. A commonly used remedy is instance weighting (Lison and Bibauw, 2017):

R_w(f) = (1/N) Σ_{i=1}^{N} w_i L(f(x_i), y′_i),

where w_i is the estimated importance weight of each sample. Clearly, instance weighting faces two main challenges: (1) it reduces the impact of noise while discarding a large amount of information in the data with small w_i; (2) the estimation of w_i is very tricky due to the accuracy limitations of unsupervised clustering.
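A minimal sketch of the instance-weighting estimator, assuming per-sample losses and weights have already been computed (the function name and normalized-weight convention are our assumptions):

```python
import numpy as np

def weighted_risk(losses, weights):
    # losses: per-sample loss L(f(x_i), y'_i); weights: importance estimates w_i
    # samples with small w_i contribute little -- this is challenge (1) in the text
    w = np.asarray(weights, dtype=float)
    return float(np.sum(w * np.asarray(losses, dtype=float)) / np.sum(w))
```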
For challenge (2), we propose to use the confidence that a sample has been correctly clustered (which is easy to estimate; see Appendix A.1) to measure the quality of its generalized label. Note that confidence cannot replace noise estimation: generalized labels with high confidence tend to have few errors, but the contrary is not necessarily true in most cases. To address this issue together with challenge (1), we regard the samples with low confidence as unlabeled data, since we cannot ensure the correctness of their generalized labels. The samples are thus divided into high-confidence labeled samples and unlabeled (low-confidence) samples according to a confidence threshold t. Then we adapt PU learning (Liu et al., 2002) to this setting (for the detailed derivation, please refer to Appendix A.2):

R_pu(f) = π_p R_p^+(f) + max{0, R_u^-(f) − π_p R_p^-(f)},    (2)

where y′p_+ is the multi-class generalized label corresponding to a high-confidence sample; in general, y′p_- can be expressed as y′p_- = 1 − y′p_+; and π_p is the prior probability of positive samples.
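The confidence-based split described above can be sketched as follows (a simplified illustration; the function name and array-based interface are our assumptions):

```python
import numpy as np

def split_by_confidence(confidences, t):
    # high-confidence samples become "positive" (labeled) samples,
    # the rest become "unlabeled" samples, as in PU learning
    conf = np.asarray(confidences, dtype=float)
    pos_idx = np.where(conf > t)[0]
    unl_idx = np.where(conf <= t)[0]
    return pos_idx, unl_idx
```

The threshold t is the hyper-parameter mentioned in the text; the two index sets then feed the positive and unlabeled risk terms respectively.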
Overall, we present the clustering optimization procedure in Algorithm 1 in Appendix A.2 and the architecture of DCGLU in Figure 1. DCGLU is decoupled from the underlying clustering model, so it can be applied to most clustering algorithms.

Experiment Setup
We apply the DCGLU method to two strong deep clustering methods, DAC (Chang et al., 2017) and DeepDPM (Ronen et al., 2022), and evaluate them on two publicly available text datasets: SNIPS (Coucke et al., 2018) (7 classes) and DBPedia (Lin et al., 2020) (14 classes). We further include K-means (KM) (MacQueen, 1967), agglomerative clustering (AG) (Gowda and Krishna, 1978), SAE-KM, BERT-KM, deep embedding clustering (DEC) (Xie et al., 2016), deep clustering network (DCN) (Yang et al., 2017), etc., as baselines to show the competitiveness of DCGLU. We keep our primary experimental setup consistent with what DeepDPM and DAC reported in their original papers. For more details on the datasets, baselines, and implementation settings, please refer to Appendix B.

Compared with strong baselines
Table 1 shows the clustering results of the strong baselines and the effectiveness of DCGLU. We can draw the following observations. Firstly, the proposed DCGLU yields significant improvements on two different kinds of deep clustering methods, DAC and DeepDPM (with p-value < 0.01 on a paired t-test). The consistent improvement across multiple algorithms, evaluation metrics, and datasets indicates that DCGLU is both effective and general.
Secondly, we can see that 1) on the SNIPS dataset, with the help of DCGLU, DeepDPM significantly outperforms the strongest baseline system (DEC) in the comparison (with p-value < 0.01 on a paired t-test); 2) on the DBPedia dataset, DCGLU improves both DAC and DeepDPM, the top two systems. This indicates that the error propagation problem in generalized supervision described in this paper has a large impact on the performance of deep models and is prevalent (even in very strong clustering systems), while DCGLU can address the problem and obtain better results.
In Section 3, we discussed that instance weighting can mitigate the noise in generalized labels, but the noise is difficult to estimate accurately in a clustering algorithm. This paper instead uses confidence to select high-confidence labeled samples and unlabeled samples. In that case, can we simply use confidence as the weight to mitigate the noise? We can, but it does not yield better clustering results in our experiments. Taking DAC as an example, compared with the original algorithm, instance weighting (with the confidence level as the instance weight) decreases NMI, ARI, and ACC by 3.31, 3.79, and 2.63 percentage points on SNIPS, and by 7.19, 6.59, and 6.16 percentage points on DBPedia, respectively. This is because generalized labels with low confidence are not necessarily noise, so instance weighting by confidence ignores a lot of relevant information, which degrades the clustering performance to a certain degree.

More evaluations
We evaluate the representation ability of DCGLU, as learning better clustering features is very important for deep clustering algorithms (Caron et al., 2018). Good clustering representations indicate that the clustering results can be further improved by fine-tuning, distance-based clustering methods, etc. In the experiments, we use K-means and DEC as post-clustering methods on the features learned by DCGLU and obtain consistent and significant improvements on the two datasets (with p-value < 0.01 on a paired t-test), which indicates that DCGLU learns better representations. For more details, please see Appendix C.1.
The π_p in Equation 2 is the key hyper-parameter of DCGLU, and we conduct a detailed analysis to guide its setting in specific applications. Fortunately, we find that a constant π_p (π_p = 0.5) already achieves good results, which greatly relaxes the conditions for applying DCGLU. For more details, please see Appendix C.2.

Related Work
Deep clustering. Researchers have proposed many clustering methods, including centroid-based clustering (MacQueen, 1967), density-based clustering (Ester et al., 1996; Comaniciu and Meer, 2002), agglomerative clustering (AG) (Gowda and Krishna, 1978), and so on, but these traditional methods use fixed sample features, and clustering performance tends to be poor when the extracted features are poor. In recent years, researchers (Xie et al., 2016; Yang et al., 2017) have used deep neural networks to perform feature learning and clustering jointly. (Chang et al., 2017) proposed Deep Adaptive Clustering (DAC), which transforms the clustering problem into a binary pairwise-classification framework that determines whether two samples belong to the same class. (Hadifar et al., 2019) proposed to learn discriminative features from both an autoencoder and a sentence embedding, and then update the weights of the encoder network using the assignments of the clustering algorithm as supervision. (Zhang et al., 2021) proposed Supporting Clustering with Contrastive Learning (SCCL), which leverages contrastive learning to promote better separation. (Ronen et al., 2022) proposed the Deep Dirichlet Process Mixture (DeepDPM), which adapts to changes in the number of clusters k by dynamically performing split-and-merge operations. In this paper, we focus on the influence of noise in clustering, which has not been explicitly investigated before.
PU learning. Positive-unlabeled (PU) learning has been studied for a long time (De Comité et al., 1999; Du Plessis et al., 2014, 2015; Kiryo et al., 2017). PU learning (Liu et al., 2002) is defined as follows: given a set of positive samples P and a set of unlabeled samples U, which contains hidden positive and negative samples, construct a binary classifier to classify the samples. PU learning is used in many natural language processing applications: (Xia et al., 2013) combined instance selection and instance weighting to apply PU learning to cross-domain sentiment classification; (Peng et al., 2019) explored performing named entity recognition (NER) with only unlabeled data and named entity dictionaries; (He et al., 2020) presented a method to improve distant supervision relation extraction with PU learning. In this paper, inspired by PU learning, we propose a novel method (DCGLU) to improve the performance of text clustering, which 1) transforms the multi-class classification problem in clustering into binary classification; 2) establishes a binary classification selector; 3) from the perspective of empirical risk minimization, exploits the correlation between generalized labeled and unlabeled samples. Clearly, there are major differences between standard PU learning and DCGLU.

Conclusion
In this paper, we propose the concept of generalized supervision and generalized labels in clustering, which helps to study the impact of noise in clustering and thus improve clustering performance. Based on this formulation, we propose a deep clustering optimization method, namely Deep Clustering optimization from Generalized Labeled and Unlabeled data (DCGLU), which can be applied to most deep text clustering methods. As discussed in the related work, DCGLU differs from existing clustering and clustering-optimization approaches. The experimental results demonstrate the necessity and effectiveness of the method.

A.1 Method of Confidence Estimation
Compared with the estimation of generalized labels, obtaining a confidence value is relatively simple. For example, in cluster-center-based methods, the closer a sample is to its cluster center, the more likely it belongs to that center, so the confidence can be derived from the distance. In similarity-based methods, the higher the similarity between two samples, the more likely the pair belongs to the same category, so the confidence can be derived from the similarity between samples.
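For the cluster-center case, distance-based confidence can be sketched as follows (a softmax over negative distances is our illustrative choice; the paper does not prescribe a specific transformation):

```python
import numpy as np

def center_confidence(X, centers):
    # X: (N, d) samples; centers: (k, d) cluster centers
    # distance of each sample to every center; closer -> higher confidence
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    p = np.exp(-d)
    p /= p.sum(axis=1, keepdims=True)  # softmax over negative distances
    return p.max(axis=1)  # confidence = probability mass on the nearest center
```

A sample sitting on a center gets confidence near 1; a sample equidistant from two centers gets 0.5, i.e. maximal ambiguity.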

A.2 Adapting PU Learning to Deep Clustering
Following the discussion of learning from unlabeled data in Section 3, we note that PU learning has been studied in depth and can promote the identification of unlabeled samples that share characteristics with positive data. In the clustering scenario, high-confidence labeled samples and unlabeled (low-confidence) samples can be regarded as the positive and unlabeled samples in PU learning, respectively. Classical PU learning (Du Plessis et al., 2014) relies on a binary unbiased risk estimator to mine the correlation between unlabeled and positive samples:

R_pu(f) = π_p R_p^+(f) + R_u^-(f) − π_p R_p^-(f),    (3)

where π_p is the prior probability of positive samples. In Eq. (3),

R_p^+(f) = E_{x ∼ p_p}[ℓ(f(x), +1)],  R_p^-(f) = E_{x ∼ p_p}[ℓ(f(x), −1)],  R_u^-(f) = E_{x ∼ p_u}[ℓ(f(x), −1)],

where {+1, −1} correspond to the labels of positive and unlabeled samples, respectively.
Here we introduce PU learning into deep clustering to help exploit the generalized labels, and extend it to multiple classes to fit the clustering scenario, in which high-confidence labeled samples are treated as positive samples and unlabeled (low-confidence) samples as unlabeled samples, thus:

R_p^+(f) = (1/n_p) Σ_{i=1}^{n_p} L(f(x_i^p), y′p_+),    (7)

R_p^-(f) = (1/n_p) Σ_{i=1}^{n_p} L(f(x_i^p), y′p_-),    (8)

where y′p_+ is the multi-class generalized label corresponding to a high-confidence sample. Eq. (8) measures the loss of a high-confidence sample not belonging to its currently assigned generalized label; in general, y′p_- can be expressed as y′p_- = 1 − y′p_+. Following the same development, R_u^-(f) can be extended to:

R_u^-(f) = (1/n_u) Σ_{i=1}^{n_u} L(f(x_i^u), y′p_-),    (9)

However, there is no reliable multi-class generalized label for low-confidence samples. Based on the maximum entropy principle, we assume that the generalized label y′p of each unlabeled (low-confidence) sample obeys a uniform distribution, i.e., the expected loss of an unlabeled sample is minimal under all category labels, so for unlabeled samples we transform Eq. (9) into:

R_u^-(f) = (1/n_u) Σ_{i=1}^{n_u} L(f̄(x_i^u), y′p_-),    (10)

where f̄(x_i^u) is the mean of the predicted probabilities of the unlabeled (low-confidence) sample over all categories. The extension method, including Eq. (7), Eq. (8), and Eq. (10), can be approximated as completing the learning of high-confidence categorized samples and uncategorized (low-confidence) samples by splitting the multi-category task into multiple binary classifications. Further, to prevent overfitting, we use the non-negative risk estimator (Kiryo et al., 2017), which transforms Eq. (3) into:

R_pu(f) = π_p R_p^+(f) + max{0, R_u^-(f) − π_p R_p^-(f)}.    (11)

By substituting Eq. (7), Eq. (8), and Eq. (10) into Eq. (11), we obtain the loss function of the proposed optimized deep clustering model, called Deep Clustering optimization from Generalized Labeled and Unlabeled data (DCGLU). It can be added as a regularization term to commonly used deep clustering algorithms, with good generalization ability:

L = L_ori + λ R_pu(f),    (12)

where L_ori is the original loss of the deep clustering algorithm, λ is the blend factor, and R_pu(f) is the loss function of the deep clustering optimization model proposed in this paper.
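The non-negative PU risk described above can be sketched from precomputed per-sample losses (a minimal illustration; the function name and interface are our assumptions):

```python
import numpy as np

def nn_pu_risk(loss_pos_plus, loss_pos_minus, loss_unl_minus, pi_p):
    # loss_pos_plus / loss_pos_minus: per-sample losses of high-confidence
    # (positive) samples against y'p_+ and y'p_-, respectively
    # loss_unl_minus: per-sample losses of unlabeled (low-confidence) samples
    r_p_plus = float(np.mean(loss_pos_plus))
    r_p_minus = float(np.mean(loss_pos_minus))
    r_u_minus = float(np.mean(loss_unl_minus))
    # non-negative estimator of Kiryo et al. (2017): clamp the negative-risk
    # term at zero to prevent overfitting
    return pi_p * r_p_plus + max(0.0, r_u_minus - pi_p * r_p_minus)
```

The max(0, ·) clamp is exactly what distinguishes the non-negative estimator from the classical unbiased one, which can go negative on flexible deep models.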
Algorithm 1 DCGLU
Input: Unlabeled samples x_i in X.
Parameter: Clustering model parameters θ, hyper-parameters confidence threshold t (0 ≤ t ≤ 1) and positive class-prior probability π_p.
Output: Cluster label c_i for each x_i ∈ X.
1: Initialization: initialize the cluster network parameters;
2: for number of training steps do
3: Sample batch X_k from X, where k indicates the batch id;
4: Input X_k to the clustering model f and get the probability p of sample pairs or samples;
5: if p > t then
6: Get high-confidence samples X_k^h from X_k as the positive samples X_k^p;
7: Calculate the empirical risks of the positive samples, R_p^+(f) and R_p^-(f); // Eq. (7) and Eq. (8)
8: else
9: Get low-confidence samples X_k^l from X_k as the unlabeled samples X_k^u;
10: Calculate the empirical risk of the unlabeled samples, R_u^-(f); // Eq. (10)
11: Update θ by an external SGD-like stochastic optimization algorithm;
18: end for
19: for all x_i ∈ X do
20: l_i := f(x_i);
21: c_i := arg max_h (l_i,h);
22: end for

B Experiment Setup

B.1 Datasets
We conduct experiments on two publicly available text datasets. The detailed statistics are shown in Table 2.
SNIPS. It derives from (Coucke et al., 2018) and is a personal voice assistant dataset containing 14,484 utterances, divided into seven categories altogether.

B.2 Baseline and Evaluation Metrics
We follow (Lin et al., 2020) and choose the unsupervised baselines for comparison.
KM and AG. K-means (KM) (MacQueen, 1967) and agglomerative clustering (AG) (Gowda and Krishna, 1978) are classical clustering algorithms; here we represent the sentences with averaged pre-trained 300-dimensional GloVe word embeddings (Pennington et al., 2014).
SAE-KM. We encode the sentences with a stacked autoencoder (SAE) and then run k-means.
BERT-KM. We encode the sentences with BERT (Devlin et al., 2018) and then run k-means.
DEC. Deep embedding clustering (DEC) (Xie et al., 2016) learns a mapping from the data space to a low-dimensional feature space and uses t-distribution-based iterations to fine-tune the clustering model.

DCN. Deep clustering network (DCN) (Yang et al., 2017) follows the idea of DEC and adds a regularization term to the optimization process.
As DCGLU is an optimization strategy, we select two advanced deep clustering algorithms of different kinds as the main methods to test it: Deep Dirichlet Process Mixture (DeepDPM) (Ronen et al., 2022) and Deep Adaptive Clustering (DAC) (Chang et al., 2017). DeepDPM is a deep nonparametric clustering method that adapts to changes in k by dynamically performing split and merge operations. DAC transforms the clustering problem into a binary pairwise-classification framework that judges whether two samples belong to the same category.
To evaluate the experimental results, we choose three common clustering measures: normalized mutual information (NMI), adjusted Rand index (ARI), and clustering accuracy (ACC). To calculate clustering accuracy, we use the Hungarian algorithm (Kuhn, 1955) to find the best alignment between the predicted cluster labels and the ground-truth labels. The higher the scores of all metrics, the better the clustering performance.
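The Hungarian-aligned clustering accuracy is commonly implemented with scipy's linear_sum_assignment; a sketch of the standard recipe (not the authors' exact code):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    # build the contingency matrix between predicted clusters and true labels
    k = max(y_true.max(), y_pred.max()) + 1
    cost = np.zeros((k, k), dtype=int)
    for t, p in zip(y_true, y_pred):
        cost[p, t] += 1
    # Hungarian algorithm finds the best one-to-one cluster-to-label alignment
    # (negate because linear_sum_assignment minimizes)
    row, col = linear_sum_assignment(-cost)
    return cost[row, col].sum() / len(y_true)
```

The alignment step is needed because cluster indices are arbitrary: a perfect clustering whose labels are a permutation of the ground truth still scores 1.0.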

B.3 Implement Details
For both DeepDPM and DAC, we use the same BERT model to obtain feature vectors for the text data. As DeepDPM handles high-dimensional features poorly, in accordance with its authors' suggestions we use BERT-whitening (Su et al., 2021) to reduce the dimensionality of the text features H from 768 to 64. For a fairer comparison, we maintain the original dataset settings of (Ronen et al., 2022) and (Lin et al., 2020). Finally, we report the average results of each algorithm over ten runs.
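The whitening-based dimensionality reduction can be sketched as follows (a simplified version following the general recipe of Su et al. (2021); variable names are ours):

```python
import numpy as np

def whitening(H, out_dim=64):
    # learn a kernel W and bias -mu such that (H - mu) @ W has identity
    # covariance; keeping the top out_dim components reduces dimensionality
    mu = H.mean(axis=0, keepdims=True)
    cov = np.cov((H - mu).T)
    u, s, _ = np.linalg.svd(cov)          # cov = u @ diag(s) @ u.T
    W = u @ np.diag(1.0 / np.sqrt(s))     # whitening kernel
    return (H - mu) @ W[:, :out_dim]      # e.g. 768 -> 64 dimensions
```

Because the SVD orders components by variance, truncating W to the first out_dim columns keeps the most informative directions while decorrelating the features.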
DeepDPM is divided into a clustering module and a subclustering module. The model is optimized with a loss derived from Expectation-Maximization (EM) (Dempster et al., 1977) in a Bayesian GMM, where every E-step is followed by a standard M-step. However, we believe the probabilities produced by the E-step are unreliable, and many samples are misassigned. Therefore, we sharpen the E-step probabilities, select high-confidence labeled samples, and treat the remaining samples as unlabeled to optimize the clustering. DAC transforms the clustering problem into a binary pairwise-classification framework that determines whether two samples belong to the same class, so we treat sample pairs with high similarity as high-confidence labeled pairs and those with low similarity as unlabeled pairs. Our experiments are conducted on PyTorch (https://pytorch.org) versions 1.11.0 and 1.0.1, respectively. All experiments were performed on an NVIDIA GeForce RTX 2080Ti graphics card with 10G of graphics memory.

C.1 Effect on clustering features
Learning better clustering features is one of the important directions of deep clustering research (Caron et al., 2018); with better features, the clustering results can be further improved by fine-tuning, distance-based clustering methods, etc. To explore the effect of DCGLU on clustering features more deeply, we examine different variants of the DAC algorithm: DAC-KM clusters the embedded features learned by DAC with the K-means model, and DAC-DEC uses the idea of the DEC (Xie et al., 2016) model to fine-tune and improve the cluster assignments through expectation-maximization iterations.
As can be seen from the experimental results in Table 3, each method improves to a certain degree on both datasets after combining the DCGLU optimization method, and the improvement is more obvious on the DBPedia dataset. This indicates that DCGLU not only optimizes the clustering effect of the DAC algorithm itself but also yields better clustering features. Note that DCGLU brings limited improvement on the SNIPS dataset; we believe this is because the gain in the clustering representation cannot be efficiently transferred to the final clustering performance after post-processing, so even the limited improvement indicates that DCGLU improves the clustering features.

C.2 Effect of π p
In the PU learning domain, π_p denotes the prior of positive samples and can be estimated by the ratio of positive samples in the data. In DCGLU, however, the precision and recall of the high-confidence labeled sample estimates in the generalized labels keep changing as the clustering algorithm learns (generally, they improve first and then remain stable). As a result, π_p changes dynamically throughout training and is thus difficult to estimate. Fortunately, we find that a constant π_p already achieves good results, which greatly relaxes the conditions for applying DCGLU.
The effects of different π_p on the experimental results are shown in Figure 2, where π_p ∈ {0.1, 0.3, 0.5, 0.7, 0.9}. It can be seen that for different π_p, the three evaluation metrics on the two datasets show the same trend, and the optimal effect is reached at π_p = 0.5. When π_p is overestimated, there is a significant decrease in the experimental results, while the results tend to be stable when π_p is small, so we suggest not using a larger π_p. This is because, during clustering, there are still a small number of errors among the high-confidence labeled samples. Hence, under the condition of ensuring performance, a smaller π_p should be used to reduce the probability of introducing errors into DCGLU, so as to improve model stability and clustering effect.

C.3 Effect of time efficiency
DCGLU does not increase the complexity of the primary clustering algorithm in terms of efficiency.
We conduct a runtime complexity experiment to measure the cost of DCGLU.

Table 2 :
Statistics of the SNIPS and DBPedia datasets. # indicates the total number of sentences.

Table 3 :
The clustering results for variants of DAC.

Table 4 :
Experimental results of the efficiency cost of DCGLU.

Table 4 shows the run time. "Proportion (DCGLU Time / Base Time)" indicates the proportion of DCGLU's computation time within the entire program running time. As can be seen from the table, our method does not add much time to the overall run. This is because DCGLU has a time complexity of O(N); the additional space complexity introduced by DCGLU is also O(N).