Joint Learning of Chinese Words, Terms and Keywords

Previous work often used a pipelined framework in which Chinese word segmentation is followed by term extraction and keyword extraction. Such a framework suffers from error propagation and cannot use information from later modules to improve earlier ones. In this paper, we propose a four-level Dirichlet Process based model (DP-4) to jointly learn the word distributions at the corpus, domain and document levels. Based on the DP-4 model, a sentence-wise Gibbs sampler is adopted to obtain segmentation results, and terms and keywords are acquired during the sampling process. Experimental results show the effectiveness of our method.


Introduction
Chinese text does not contain explicitly marked word boundaries, so word segmentation (WS) is usually the first step for many Natural Language Processing (NLP) tasks, including term extraction (TE) and keyword extraction (KE). Generally, Chinese terms and keywords can be regarded as words that are representative of one domain or one document, respectively. Previous work on TE and KE normally used pipelined approaches, which first conduct WS and then extract important word sequences as terms or keywords.
Pipelined approaches are prone to error propagation and cannot pass information from later stages back to word segmentation. Here, we provide one example from the disease domain to demonstrate the common problems of current pipelined approaches and to motivate the basic idea of our joint learning of words, terms and keywords.
Consider a correctly segmented Chinese sentence from this domain. The document containing the example sentence mainly discusses the properties of heparinoid, and the Chinese word for "heparinoid" can be regarded as one keyword of the document. At the same time, the Chinese word for "thrombocytopenia" appears frequently in the disease domain and can be treated as a domain-specific term.
However, even for such a simple sentence, current segmentation tools perform poorly. In the segmentation result of the state-of-the-art Conditional Random Fields (CRFs) approach (Zhao et al., 2006), the word for "thrombocytopenia" is split into three common Chinese words and the word for "heparinoid" is merged with its neighbors.
In a text processing pipeline of WS, TE and KE, imprecise WS results degrade the performance of the whole system. At the same time, the domain-level and document-level information collected in TE and KE can hardly be used to improve WS. This raises a question: can words, terms and keywords be learned jointly, using all the information from the corpus, domain and document levels?
Recently, the hierarchical Dirichlet process (HDP) model has been used as a smoothed bigram model to conduct word segmentation (Goldwater et al., 2006; Goldwater et al., 2009). Meanwhile, one strength of HDP based models is that they can model the diversity and commonality in multiple correlated corpora (Ren et al., 2008; Xu et al., 2008; Zhang et al., 2010; Li et al., 2012; Chang et al., 2014). Inspired by such existing work, we propose a four-level DP based model (DP-4).

DP-4 Model
Goldwater et al. (2006) applied the HDP model to the word segmentation task. In essence, Goldwater's model can be viewed as a bigram language model with a unigram back-off. With the language model, word segmentation is implemented by a character-based Gibbs sampler which repeatedly samples the possible word boundary positions between two neighboring words, conditioned on the current values of all other words. However, Goldwater's model only models the whole corpus and does not distinguish between domains and documents. To jointly learn the word information from the corpus, domain and document levels, we extend Goldwater's model by adding two levels of DPs (a domain level and a document level), as illustrated in Figure 1.

Model Description
$M$ DPs ($H_w^m$; $1 \le m \le M$) are designed specifically for word $w$ to model the bigram distributions in each domain. These DPs share an overall base measure $H_w$, which is drawn from $\mathrm{DP}(\alpha_0, G_1)$ and gives the bigram distribution for the whole corpus. Assuming the $m$-th domain includes $N_m$ documents, we use $H_w^{m_j}$ ($1 \le j \le N_m$) to model the bigram distribution of the $j$-th document in the domain. Usually, given a domain, the bigram distributions of different documents are not conditionally independent, and similar documents exhibit similar bigram distributions. Thus, the bigram distribution of one document is generated according to both the bigram distribution of the domain and the bigram distributions of the other documents in the same domain, that is, all documents in the $m$-th domain except the $j$-th document. Assuming the $j$-th document in the $m$-th domain contains $N_j^m$ words, each word is drawn according to $H_w^{m_j}$. Together, these four levels (corpus unigram, corpus bigram, domain bigram and document bigram) make up our DP model.

Here, we describe our model with the Chinese Restaurant Process (CRP) metaphor, which creates a partition of items into groups. In our model, the word type of the previous word $w_{i-1}$ corresponds to a restaurant and the current word $w_i$ corresponds to a customer. Each domain is analogous to a floor in a restaurant and each room denotes a document. Thus, there are $|V|$ restaurants and each restaurant consists of $M$ floors. The $m$-th floor contains $N_m$ rooms and each room has an infinite number of tables with infinite seating capacity. Customers enter a specific room on a specific floor of one restaurant and seat themselves at a table labeled with a word type.
Different from the standard HDP, each customer sits at an occupied table with probability proportional to both the number of customers already seated there and the number of customers with the same word type seated in the neighboring rooms, and at an unoccupied table with probability proportional to both the constant $\alpha_3$ and the probability that customers with the same word type are seated on the same floor.
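For reference, the standard CRP seating rule that the description above departs from can be written as follows. This is the textbook form, not the DP-4 rule: $\alpha$ is a generic concentration parameter, $n_k$ is the number of customers already at table $k$, and $n$ is the total number of customers in the restaurant (all three symbols are introduced here for illustration only).

\[
P(\text{occupied table } k) = \frac{n_k}{n + \alpha}, \qquad
P(\text{new table}) = \frac{\alpha}{n + \alpha}
\]

In DP-4, the numerator for an occupied table additionally depends on same-type customers in neighboring rooms, and the new-table term depends on $\alpha_3$ and the floor-level probability, as described above.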

Model Inference
It is important to build an accurate $G_0$, which determines the prior word distribution $p_0(w)$. Similar to the work of Mochihashi et al. (2009), we consider the dependence between characters and calculate the prior distribution of a word $w_i$ using string frequency statistics (Krug, 1998), where $n_s(w_i)$ counts the occurrences of the character string that composes $w_i$ and the symbol "$\cdot$" represents any word in the vocabulary $V$. Then, with the CRP metaphor, we can obtain the expected word unigram and bigram distributions on the corpus level according to $G_1$ and $H_w$, where the subscript numbers indicate the corresponding DP levels, $n(w_i)$ denotes the number of occurrences of $w_i$, and $n_w(w_i)$ denotes the number of occurrences of the bigram $\langle w, w_i \rangle$ in the corpus.
Next, we can easily get the bigram distribution on the domain level by extending to the third DP, where $n_w^m(w_i)$ is the number of occurrences of the bigram $\langle w, w_i \rangle$ in the $m$-th domain.
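The back-off chain described above can be sketched in a few lines. The paper's own equations are not reproduced in this text, so this is only a minimal sketch under stated assumptions: each level uses the common CRP predictive form (count plus concentration times parent probability, divided by total plus concentration), raw customer counts stand in for table counts, and the names `crp_predictive`, `backoff_bigram` and the `levels` layout are hypothetical.

```python
from collections import Counter


def crp_predictive(count, total, alpha, parent_prob):
    """Expected probability at one DP level: observed count smoothed by the parent level."""
    return (count + alpha * parent_prob) / (total + alpha)


def backoff_bigram(prev, word, levels, alphas, p0):
    """Chain three levels: corpus unigram -> corpus bigram -> domain bigram.

    levels: dict with a 'unigram' Counter over words and 'corpus'/'domain'
            Counters over (prev, word) pairs (hypothetical layout).
    alphas: one concentration value per level (assumed assignment).
    p0:     callable giving the prior probability of a word.
    """
    uni = levels["unigram"]
    prob = crp_predictive(uni[word], sum(uni.values()), alphas[0], p0(word))
    for name, alpha in zip(("corpus", "domain"), alphas[1:]):
        bigrams = levels[name]
        # Total number of bigrams whose first word is `prev` at this level.
        context_total = sum(c for (u, _), c in bigrams.items() if u == prev)
        prob = crp_predictive(bigrams[(prev, word)], context_total, alpha, prob)
    return prob
```

With no observations at any level, the estimate falls back to the prior `p0(word)`; as counts accumulate at a level, that level's relative frequency dominates. The document level would add one more step of the same form, using the lengthened counts.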
To model the bigram distributions on the document level, it is beneficial to consider the influence of related documents in the same domain (Wan and Xiao, 2008). Here, we only consider the influence of the $K$ most similar documents, using a simple similarity metric $s(d_1, d_2)$ which calculates the Chinese character overlap ratio of two documents $d_1$ and $d_2$. Let $d_j^m$ denote the $j$-th document in the $m$-th domain and $d_j^m[k]$ ($1 \le k \le K$) its $K$ most similar documents. $d_j^m$ can be regarded as being "lengthened" by $d_j^m[k]$ ($1 \le k \le K$). Therefore, we estimate the count of $w_i$ in $d_j^m$ from this lengthened document, where $n_w^{d_j^m[k]}(w_i)$ denotes the count of the bigram $\langle w, w_i \rangle$ occurring in $d_j^m[k]$. Next, we model the bigram distribution in $d_j^m$ as a DP with the base measure $H_w^m$ (Eq. 6). With the CRP, we can also easily estimate the unigram probabilities $p_3^m(w_i)$ and $p_4^{d_j^m}(w_i)$ on the domain and document levels respectively, by combining all the restaurants.
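The "lengthening" step can be illustrated with a small sketch. Two points are assumptions rather than the paper's definitions: the character overlap ratio is taken to be the Jaccard overlap of the two documents' character sets, and neighbor counts are added with a weight equal to the similarity. The function names are hypothetical.

```python
from collections import Counter


def char_overlap(d1, d2):
    """Assumed metric: Jaccard overlap of the two documents' character sets."""
    a, b = set(d1), set(d2)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def lengthened_counts(doc_idx, texts, bigram_counts, k=3):
    """Add similarity-weighted bigram counts from the k most similar documents.

    texts:         raw document strings of one domain.
    bigram_counts: one Counter of (prev, word) pairs per document.
    """
    sims = [(char_overlap(texts[doc_idx], text), i)
            for i, text in enumerate(texts) if i != doc_idx]
    neighbors = sorted(sims, reverse=True)[:k]
    merged = Counter(bigram_counts[doc_idx])  # values become floats after weighting
    for sim, i in neighbors:
        for bigram, count in bigram_counts[i].items():
            merged[bigram] += sim * count
    return merged
```

The default `k=3` matches the value of $K$ reported in the experimental settings; any other weighting scheme could be substituted in the inner loop.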
To measure whether a word is eligible to be a term, the score function $TH^m(\cdot)$ is defined in Eq. 7. This equation is inspired by the work of Nazar (2011), which extracts terms by considering both the frequency in the domain corpus and the frequency in the general reference corpus. Similar to Eq. 7, we define the function $KH^{d_j^m}(\cdot)$ in Eq. 8 to judge whether $w_i$ is an appropriate keyword.
During each sampling iteration, we use Eqs. (7) and (8) to identify the most probable terms and keywords. Once a word is identified as a term or keyword, it drops out of the sampling process in the following iterations. In CRP terms, some customers (terms and keywords) find their proper tables and stay seated there afterwards.

Sentence-wise Gibbs Sampler
The character-based Gibbs sampler for word segmentation (Goldwater et al., 2006) is extremely slow to converge, since neighboring words are highly correlated. Here, we adopt the sentence-wise Gibbs sampling technique together with the efficient dynamic programming strategy proposed by Mochihashi et al. (2009). The basic idea is that we randomly select a sentence in each sampling step and use the Viterbi algorithm (Viterbi, 1967) to find the optimal segmentation according to the word distributions derived from the other sentences. Different from Mochihashi's work, once terms or keywords are identified, we do not consider them in the segmentation process. Due to space limitations, the algorithm is not detailed here; see Mochihashi et al. (2009).
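The dynamic programming step can be sketched as below. This is a simplified illustration, not the sampler used in the paper: it scores segmentations with a unigram word probability instead of the DP-4 bigram distributions, and `word_prob` and `max_len` are hypothetical parameters. In a sentence-wise sampler, `word_prob` would be computed from counts that exclude the sentence being resegmented.

```python
import math


def viterbi_segment(sentence, word_prob, max_len=4):
    """Return the highest-probability segmentation under a unigram word model."""
    n = len(sentence)
    best = [-math.inf] * (n + 1)  # best[i]: best log-probability of sentence[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last word ending at i
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            p = word_prob(sentence[start:end])
            if p <= 0.0 or best[start] == -math.inf:
                continue  # skip impossible words and unreachable prefixes
            score = best[start] + math.log(p)
            if score > best[end]:
                best[end], back[end] = score, start
    if best[n] == -math.inf:
        raise ValueError("no segmentation has non-zero probability")
    # Follow the back pointers from the end of the sentence.
    words, end = [], n
    while end > 0:
        words.append(sentence[back[end]:end])
        end = back[end]
    return words[::-1]
```

Extending the sketch to bigrams means indexing `best` by both the end position and the length of the last word, so that the previous word is known when a new one is scored.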

Data and Setting
It is difficult to find a standard evaluation corpus for our joint tasks, especially one covering different domains. We therefore collected and annotated a new corpus composed of ten domains (Physics, Computer, Agriculture, Sports, Disease, Environment, History, Art, Politics and Economy), each containing 200 documents. On average, each document consists of about 4800 Chinese characters. For these 2000 documents, three annotators manually checked the segmented words, terms and keywords, which serve as the gold standard for evaluation. A large amount of manually checked segmented text exists for the general domain and can be used as training data for segmentation. Like other nonparametric Bayesian models (Goldwater et al., 2006; Mochihashi et al., 2009), our DP-4 model is easily amenable to semi-supervised learning by imposing the word distributions of the segmented text on the corpus level. The news texts provided by Peking University (the PKU corpus) are used as the training data. This corpus contains about 1,870,000 Chinese characters and has been manually segmented into words.
In our experiments, the concentration coefficient $\alpha_0$ is set to 20 and the other three ($\alpha_1$ to $\alpha_3$) are set to 15. The parameter $K$, which controls the number of similar documents, is set to 3.

Performance Evaluation
The following baselines are implemented for comparison of segmentation results: (1) the forward maximum matching (FMM) algorithm with a vocabulary compiled from the PKU corpus; (2) the reverse maximum matching (RMM) algorithm with the same vocabulary; (3) a supervised Conditional Random Fields (CRFs) based algorithm trained on the PKU corpus; (4) the HDP based semi-supervised algorithm (Goldwater et al., 2006) using the PKU corpus. The strength of the NPYLM based segmentation model of Mochihashi et al. (2009) is its speed, due to the sentence-wise sampling technique, while its performance is similar to that of the model of Goldwater et al. (2006). Thus, we do not include the NPYLM based model in the comparison. The segmentation results of the FMM, RMM, CRF and HDP methods are then used for extracting terms and keywords. We use mutual information to identify the candidate terms or keywords composed of more than two segmented words. For DP-4, this recognition process is done implicitly during sampling. To score the candidate terms or keywords, we use the metric in Nazar (2011) to calculate their importance in a specific domain or document.
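The two dictionary baselines are simple enough to sketch directly. The code below is a generic illustration of maximum matching, not the exact baseline implementation used in the experiments; `max_len` (the longest dictionary word to try) is an assumed parameter, and characters not covered by the vocabulary are emitted as single-character words.

```python
def forward_max_match(sentence, vocab, max_len=4):
    """Greedy left-to-right segmentation: take the longest vocabulary word at each position."""
    words, i = [], 0
    while i < len(sentence):
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + size]
            if size == 1 or piece in vocab:  # fall back to a single character
                words.append(piece)
                i += size
                break
    return words


def reverse_max_match(sentence, vocab, max_len=4):
    """Greedy right-to-left segmentation: take the longest vocabulary word ending at each position."""
    words, j = [], len(sentence)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = sentence[j - size:j]
            if size == 1 or piece in vocab:  # fall back to a single character
                words.append(piece)
                j -= size
                break
    return words[::-1]
```

For the classic ambiguous string "研究生命的起源" with the vocabulary {"研究", "研究生", "生命", "起源"}, the forward pass yields 研究生 / 命 / 的 / 起源 while the reverse pass yields 研究 / 生命 / 的 / 起源, which shows how strongly both methods depend on the compiled vocabulary and the matching direction.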
The $F_1$ score and the out-of-vocabulary recall (OOV-R) are used to evaluate the segmentation results against the gold standard. The second and third columns of Table 1 show the $F_1$ and OOV-R scores averaged over the 10 domains for all the compared methods. Our method significantly outperforms FMM, RMM and HDP according to a t-test (p-value ≤ 0.05). From the segmentation results, we can see that the FMM and RMM methods are highly dependent on the compiled vocabulary, and the OOV words they identify are mainly those composed of a single Chinese character. The HDP method is heavily influenced by the segmented text, but it also exhibits the ability to learn new words. Our method shows only a slight advantage over the CRF approach. Checking our segmentation results, we find that the measured performance of the DP-4 model is lowered by identified terms and keywords that are composed of two or more words in the gold standard, because the DP-4 model always treats a term or keyword as a single word. For example, in the gold standard, "岭南文化 (Lingnan Culture)" is segmented into the two words "岭南" and "文化", and "数据接口 (data interface)" is segmented into "数据" and "接口". In fact, our segmentation results reasonably treat "岭南文化" and "数据接口" as single words.
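The two segmentation metrics can be computed by comparing word spans, as in the sketch below. It assumes the usual definitions (a predicted word is correct when its start and end offsets both match a gold word, and OOV-R is the recall restricted to gold words absent from the training vocabulary) and scores a single sentence; the function names are hypothetical, and a corpus-level score would accumulate the counts over all sentences before dividing.

```python
def to_spans(words):
    """Convert a word list into (start, end) character offsets."""
    spans, pos = [], 0
    for w in words:
        spans.append((pos, pos + len(w)))
        pos += len(w)
    return spans


def segmentation_scores(gold, pred, train_vocab):
    """Return (F1, OOV recall) for one sentence given gold and predicted word lists."""
    gold_spans, pred_spans = to_spans(gold), set(to_spans(pred))
    correct = sum(1 for s in gold_spans if s in pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    # Gold words that never appeared in the training vocabulary.
    oov = [s for s, w in zip(gold_spans, gold) if w not in train_vocab]
    oov_recall = sum(1 for s in oov if s in pred_spans) / len(oov) if oov else 0.0
    return f1, oov_recall
```

Under this span-based scoring, a term that the model keeps as one word but the gold standard splits in two counts as an error on both gold words, which is the effect discussed above.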
To evaluate the TE and KE performance, we measure the accuracy of the top 50 (TE-50) and top 100 (TE-100) terms identified for each domain, and the accuracy of the top 5 (KE-5) and top 10 (KE-10) keywords identified for each document. The results are shown in the right four columns of Table 1. DP-4 performs significantly better than all the other methods on both TE and KE.
Across the ten domains, our approach performs much better than the other approaches on three domains: Disease, Physics and Computer. This is because the language of these three domains differs greatly from that of the general domain (PKU corpus), while the remaining domains are more similar to the general domain.
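A top-N accuracy of this kind can be computed as in the following sketch, which assumes it means the fraction of the N highest-ranked candidates that appear in the gold set (precision at N). The function name and the choice to divide by the number of candidates actually returned are assumptions.

```python
def top_n_accuracy(ranked, gold, n):
    """Fraction of the n highest-ranked candidates that appear in the gold set."""
    top = ranked[:n]
    return sum(1 for item in top if item in gold) / len(top) if top else 0.0
```

With this definition, TE-50 would be `top_n_accuracy(ranked_terms, gold_terms, 50)` for one domain and KE-5 would be `top_n_accuracy(ranked_keywords, gold_keywords, 5)` for one document, averaged over domains or documents.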

Conclusion
This paper proposes a four-level DP based model to construct the word distributions from the corpus, domain and document levels simultaneously, through which Chinese words, terms and keywords can be learned jointly and effectively. In the future, we plan to explore how to combine more features such as part-of-speech tags into our model.