Word Clustering Based on Un-LP Algorithm

Word clustering, which generalizes specific features, groups words of the same syntactic or semantic category into clusters. It is an effective approach to reducing feature dimensionality and feature sparseness, which is clearly useful for many NLP applications. This paper proposes an unsupervised label propagation algorithm (Un-LP) for word clustering that uses multiple exemplars to represent each cluster. Experiments on a synthetic 2D dataset show the strong self-correcting ability of the proposed algorithm, and experimental results on 20NG demonstrate that our algorithm outperforms conventional clustering algorithms.


Introduction
Word clustering is the task of dividing words into a certain number of clusters (groups or categories). Each cluster should consist of words that are similar to one another syntactically or semantically and dissimilar to words in other groups. Word clustering generalizes specific features by considering the common characteristics and ignoring the individual characteristics of single features. It is an effective approach to reducing feature dimensionality and feature sparseness (Han et al., 2005).
Word clustering offers great potential for various NLP applications. Several studies have applied it to dependency parsing (Koo et al., 2008; Sagae and Gordon, 2009). Momtazi and Klakow (2009) propose a word clustering approach to improve the performance of sentence retrieval in Question Answering (QA) systems. Wu et al. (2010) present an approach that integrates word clustering information into unsupervised feature selection. Sun et al. (2011) use large-scale word clustering in a semi-supervised relation extraction system. Word clustering also contributes to word sense disambiguation (Jin et al., 2007), named entity recognition (Turian et al., 2010), part-of-speech tagging (Candito and Seddah, 2010) and machine translation (Uszkoreit and Brants, 2008; Jeff et al., 2011). This paper presents an unsupervised algorithm for word clustering based on a probabilistic transition matrix. Given a text document dataset, a word list is generated by removing stop words and very infrequent words. Each word is represented by the documents in which it occurs, which yields a co-occurrence matrix. By computing the similarity between words, a word similarity graph whose edges are weighted with transition (propagation) probabilities is created. Then a new word clustering algorithm based on label propagation is applied.
The remainder of this paper is organized as follows: Section 2 formulates the word clustering problem in the context of unsupervised learning. We then describe the word clustering algorithm in Section 3 and present our experiments in Section 4. Finally, we conclude our work in Section 5.

Algorithm 1: Semi-supervised LP. Input: the labeled and unlabeled words; Output: Y_U, the label distributions of the unlabeled words. The labeled data are clamped at every propagation step until convergence, and Y_U is returned.

Algorithm 2: Unsupervised LP word clustering. Input: the word graph; Output: the word clusters Λ = {Λ_1, Λ_2, ..., Λ_L}. Each round runs Semi-LP, partitions the words, and updates the exemplars via {V_L^{t+1}, T_ul^{t+1}} = update(Λ^{t+1}); at convergence, Λ^{t+1} is returned.

Suppose the dataset consists of N documents and a vocabulary of V words. We define the vector of word v_i in the vocabulary to be v_i = <v_{i,d_1}, v_{i,d_2}, ..., v_{i,d_N}>. This allows us to define a V × N word-document matrix WD for the vocabulary, where WD_ij equals 1 if v_i ∈ d_j and 0 otherwise. We then take the words as the vertices of a connected graph, defining the edge weight ω_ij as the co-occurrence frequency between v_i and v_j; larger edge weights allow labels to travel through the graph more easily. From these weights we define a probabilistic transition matrix, and each word carries an L-dimensional label distribution y whose largest component indicates its cluster label. For example, if L = 3 and a word v has the label distribution y = <0.1, 0.8, 0.1>, then v belongs to the second class.
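As a sketch of this construction, the code below builds the binary word-document matrix WD, the co-occurrence edge weights ω, and a column-normalized transition matrix. This is an illustrative implementation under our own assumptions (the function name, the document format as token lists, and the column normalization are not specified in the paper):

```python
import numpy as np

def build_transition_matrix(docs, vocab):
    """Build the V x N word-document matrix WD, co-occurrence edge
    weights omega, and a column-normalized transition matrix T."""
    V, N = len(vocab), len(docs)
    index = {w: i for i, w in enumerate(vocab)}
    # WD[i, j] = 1 if word v_i occurs in document d_j, else 0
    WD = np.zeros((V, N))
    for j, doc in enumerate(docs):
        for w in set(doc):
            if w in index:
                WD[index[w], j] = 1.0
    # omega[i, j] = number of documents in which v_i and v_j co-occur
    omega = WD @ WD.T
    np.fill_diagonal(omega, 0.0)          # no self-loops
    # T[i, j]: probability that a label travels from word j to word i,
    # obtained by normalizing each column of the weight matrix
    col_sums = omega.sum(axis=0)
    col_sums[col_sums == 0] = 1.0         # guard isolated words
    T = omega / col_sums
    return WD, omega, T
```

Because each column of T sums to one, repeated multiplication by T spreads label mass along heavy co-occurrence edges first, which is exactly the behavior the weighting is meant to encourage.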
In clustering problems, the goal is to select a set of exemplars that are representative of the dataset, with each cluster represented by one and only one exemplar (Krause and Gomes, 2010). Such exemplars are exactly the labeled data that Semi-LP needs for clustering; LP lacks labeled data when it is used for unsupervised learning. In this paper, we are interested in partitioning words into several clusters without any prior labels using an unsupervised LP (Un-LP) algorithm. First, we randomly select K words (K ≥ L; usually K is a multiple of L) to construct an exemplar set E = {E_i}, i = 1, ..., K, so that, unlike in conventional exemplar-based clustering algorithms, a cluster may be represented by several exemplars. We assign class labels to these words and construct the corresponding probabilistic transition matrix T^0_ul (initialization). The exemplars are treated as labeled words and the rest, U = W − E, as unlabeled words; T_ul holds the transition probabilities from unlabeled words to labeled ones. At this step we must ensure that each class is represented by at least one exemplar and that each exemplar is assigned exactly one class label. The connected weighted graph now consists of two parts, G = (E ∪ U, T_ul ∪ T_uu), where T_uu holds the transition probabilities between unlabeled words. Next, our algorithm iterates over three steps. First, given the current set of LP parameters, we propagate labels to the unlabeled words from the initial label distributions and obtain the corresponding labels (Semi-LP). Then, the derived label distributions guide the partitioning of the unlabeled data into L clusters (partition). Finally, we use the residual sum of squares (RSS) to choose the most centrally located words and replace the old exemplars of each cluster: for each word v_i in a cluster c = {v_1, ..., v_n}, RSS_i = Σ_{j=1}^{n} ω_ij; we sort the RSS_i (1 ≤ i ≤ n) and update the exemplars with the words having the largest RSS in that cluster (update).
All of the above steps, summarized in Algorithm 2, are iterated until convergence.
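The three-step loop can be sketched as follows. This is a minimal illustrative implementation rather than the authors' code: the fixed inner propagation count, the use of T in place of ω for the RSS-style centrality, the round-robin initial labeling, and the handling of empty clusters are all our assumptions.

```python
import numpy as np

def un_lp(T, L, K, n_iter=10, exemplars=None, seed=0):
    """Un-LP sketch: Semi-LP propagation, partition, exemplar update.
    T is a V x V column-normalized transition matrix; K >= L exemplars."""
    rng = np.random.default_rng(seed)
    V = T.shape[0]
    if exemplars is None:
        exemplars = rng.choice(V, size=K, replace=False)
    exemplars = np.asarray(exemplars)
    # round-robin labels guarantee every class has at least one exemplar
    labels = np.arange(len(exemplars)) % L
    assign = np.zeros(V, dtype=int)
    for _ in range(n_iter):
        # --- Semi-LP: propagate labels from the clamped exemplars ---
        Y = np.zeros((V, L))
        Y[exemplars, labels] = 1.0
        for _ in range(50):
            Y = T @ Y
            Y[exemplars] = 0.0
            Y[exemplars, labels] = 1.0        # clamp the labeled words
            norm = Y.sum(axis=1, keepdims=True)
            norm[norm == 0] = 1.0
            Y = Y / norm
        # --- partition: each word joins its highest-scoring cluster ---
        assign = Y.argmax(axis=1)
        # --- update: the most central words of each cluster become the
        #     new exemplars (largest within-cluster weight sum) ---
        new_ex, new_lab = [], []
        for c in range(L):
            members = np.flatnonzero(assign == c)
            if members.size == 0:
                continue                       # empty cluster: drop it
            rss = T[np.ix_(members, members)].sum(axis=1)
            order = np.argsort(rss)[::-1][: max(1, K // L)]
            new_ex.extend(members[order])
            new_lab.extend([c] * len(order))
        exemplars = np.array(new_ex)
        labels = np.array(new_lab)
    return assign
```

On a graph with two densely connected blocks, seeding one exemplar in each block and iterating recovers the block structure; the update step is what lets poorly placed exemplars drift toward cluster centers over successive rounds.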

Experimental Setup
To demonstrate the properties of our proposed algorithm, we investigate both a synthetic dataset and a real-world dataset. Figure 1(a) shows the synthetic dataset. For the real-world example we test Un-LP on a subset of the 20 Newsgroups (20NG) dataset, preprocessed by removing common stop-words and stemming. We use the classes atheism, hardware, hockey and space, and randomly select 300 samples from each class as the test dataset in this section. However, 20NG is not directly suited to word clustering evaluation, so we first reconstruct it by pair-wise testing, a specification-based testing criterion, obtaining six (C(4,2) = 6) pairwise subsets denoted {D_1, ..., D_6}. To facilitate the evaluation, we use only those words that occur in a single class for clustering.
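The subset construction can be illustrated directly. The helper below, including its input format (a mapping from class name to lists of tokenized documents), is hypothetical, but it captures the two steps: enumerating the C(4,2) class pairs and keeping only class-exclusive words.

```python
from itertools import combinations

CLASSES = ["atheism", "hardware", "hockey", "space"]
# C(4, 2) = 6 pairwise subsets D_1 .. D_6
PAIRWISE_SUBSETS = list(combinations(CLASSES, 2))

def class_exclusive_words(docs_by_class):
    """Return the words occurring in exactly one class, so that each
    word has an unambiguous gold cluster for evaluation."""
    classes_of = {}
    for cls, docs in docs_by_class.items():
        for doc in docs:
            for w in set(doc):
                classes_of.setdefault(w, set()).add(cls)
    return {w for w, cs in classes_of.items() if len(cs) == 1}
```

Restricting the vocabulary this way means every clustered word has a single gold class, so precision and recall can be computed without resolving ambiguous words.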

Exemplar Self-correction
This multi-step iterative method is simple to implement and surprisingly effective even with wrongly initialized labeled data. To illustrate the point, we use a simulated dataset with a two-moon pattern. Obviously, the points in one moon should be more similar to each other than to the points in the other moon, as shown in Figure 1(b). During the initialization phase, four points in the lower moon are selected and assigned different labels, so the exemplars of the upper-moon cluster are mis-labeled, as shown in Figure 1(c). Over the next five iteration steps, the exemplars gradually move to the center of the upper moon. Finally, for t ≥ 5, Un-LP converges to a fixed assignment, which achieves an ideal clustering result.

Figure 1: (a) Two-moon pattern dataset without any labeled points; (b) the ideal clustering result. The convergence process of unsupervised LP with t from 1 to 6 is shown in (c) to (h). Solid points are the labeled data selected to represent the clusters.
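A two-moon dataset like the one in Figure 1 can be generated as follows. The paper does not give its exact generator, so the geometry and noise level here are illustrative.

```python
import numpy as np

def two_moons(n_per_moon=100, noise=0.05, seed=0):
    """Generate a two-moon pattern: two interleaved half-circles with
    Gaussian noise. Returns points X and moon indices y."""
    rng = np.random.default_rng(seed)
    t = rng.uniform(0.0, np.pi, size=(2, n_per_moon))
    upper = np.stack([np.cos(t[0]), np.sin(t[0])], axis=1)
    lower = np.stack([1.0 - np.cos(t[1]), 0.5 - np.sin(t[1])], axis=1)
    X = np.concatenate([upper, lower])
    X += rng.normal(0.0, noise, size=X.shape)
    y = np.repeat([0, 1], n_per_moon)
    return X, y
```

Because points along one half-circle are chained by short gaps while the two moons are separated by a wider margin, label propagation over a similarity graph follows each moon's curve, which is what makes this a useful stress test for exemplar self-correction.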

Word Clustering Performance
This section provides empirical evidence that the proposed algorithm performs well in the problem of word clustering. Figure 2 shows the mean precisions and recalls over 10 runs of the baseline algorithms as well as Un-LP.
From Figure 2, it can be clearly observed that Un-LP (K/L = 5) yields the best performance, followed by Semi-LP with 20 labeled words. In general, the recalls of k-means and k-medoids are higher, while their precisions are much lower. Figure 2 also shows the results of two other semi-supervised word clustering algorithms, PCK-means and MPCK-means, each with 200 must-link and cannot-link constraints. Comparing these unsupervised and semi-supervised approaches, we find that our unsupervised algorithm consistently achieves significantly better results. Therefore, unsupervised LP appears to be the more suitable design for word clustering.

Table 1: Top-20 words extracted by the unsupervised LP word clustering algorithm.

Effect of exemplar number e
We now investigate how the number of exemplars (e) per cluster affects the clustering. In particular, we are interested in performance as the number of exemplars grows, which is the motivation for using more than one exemplar to represent a cluster. From Figure 3, we observe that when more words are labeled, Semi-LP shows further improvement in F-value, whereas the changes for PCK-means and MPCK-means are not obvious. Interestingly, even as the amount of labeled data grows, Semi-LP still performs worse than Un-LP. As shown in Figure 3, Un-LP benefits greatly from multiple exemplars (e ≥ 2). For D4, Un-LP achieves an F-value of 99.58% when e = 7, a 21.32% improvement over the baseline (e = 1). This indicates that our algorithm leverages the additional exemplars effectively.

Case Study
In this subsection we conduct an experiment to illustrate the characteristics of the proposed algorithm. We cluster the words in all four domain datasets and select the most representative words for each cluster by sorting y_i. In the experiment, we set L = 4 to match the number of classes in the dataset. Table 1 shows the top-20 representative words for each cluster, where the bold words are those whose cluster label is correct judging from their literal meaning. The apparent accuracy of word clustering on 20NG is very low (28.75%), which is at variance with the preceding conclusions. One reason is that the words in 20NG are stemmed; as Table 1 clearly shows, some non-English strings (e.g., "mcl", "hfd", "stl") carry no actual meaning. To gain further insight, we computed statistics on the distributions of these seemingly incorrect words; partial results are shown in Table 2. From the distributions, we find that many of the words marked in italics in Table 1 have in fact been clustered correctly, although their literal meanings have nothing to do with the corresponding class. Taking these words into account, the accuracy reaches 81.25%, which once again demonstrates the effectiveness of the Un-LP word clustering algorithm.

Conclusion
In this paper, we propose an unsupervised label propagation algorithm to tackle the problem of word clustering. The proposed algorithm uses a similarity graph built from co-occurrence information to encourage similar words to have similar cluster labels. One advantage of this algorithm is that it uses multiple exemplars to represent each cluster, which significantly improves the clustering results.