Word Segmenter for Chinese Micro-blogging Text Segmentation – Report for CIPS-SIGHAN’2014 Bakeoff

This paper presents our system for the CIPS-SIGHAN-2014 bakeoff task of Chinese word segmentation. The system adopts a character-based joint approach, which combines a character-based generative model and a character-based discriminative model. To improve cross-domain performance, an external dictionary is employed. In addition, pre-processing and post-processing rules are used to further improve performance. The final results on the test corpus show that our system achieves results comparable with those of other state-of-the-art systems.


Introduction
Because Chinese text is written without natural delimiters, word segmentation is a prerequisite and fundamental task in Chinese natural language processing, and many approaches have been proposed for it. Among these, the character-based tagging approach (Xue, 2003) has become the prevailing technique for Chinese word segmentation (CWS) due to its good performance. In recent years, much effort (Tseng et al., 2005; Zhang et al., 2006; Jiang et al., 2008) has been made within the character-based framework to further improve segmentation performance.
The character-based joint model (Wang et al., 2012) achieves a good balance between in-vocabulary (IV) word recognition and out-of-vocabulary (OOV) word identification. In this evaluation task we therefore follow their work and adopt the character-based joint model as our basic system, which combines a character-based discriminative model and a character-based generative model. The generative module is robust on IV words, while the discriminative module can easily incorporate extra features and improves the segmentation of OOV words.
Because the 2014 SIGHAN bakeoff task of Chinese word segmentation is an open evaluation task and no training set is provided, the OOV problem is more serious. Although the discriminative module can handle some OOV cases, performance suffers if no further technique is used. To improve the basic system and reduce the number of OOV words, we employ an external dictionary containing a large set of unknown words from different domains. Another notable problem is Microblog text segmentation, because Microblog text has become a new Internet genre that differs from the genres of common text. To make our system more robust on Microblog text, we propose several simple but novel pre-processing and post-processing approaches.
The final results show that our system performs well on the test set and achieves segmentation results comparable with those of other participants.

Character-Based Joint Model
The character-based joint model in our system consists of two basic components:
- The character-based discriminative model.
- The character-based generative model.
The character-based discriminative model (Xue, 2003) is based on a Maximum Entropy (ME) framework (Ratnaparkhi, 1998) and can be formulated as follows:

$$P(t_1^n \mid c_1^n) = \prod_{k=1}^{n} P(t_k \mid t_1^{k-1}, c_1^n) \approx \prod_{k=1}^{n} P(t_k \mid c_{k-2}^{k+2})$$

where $t_k$ is a member of {B, M, E, S}. B, M and E denote that character $c_k$ is at the Beginning, in the Middle or at the End of its associated word respectively, and S denotes that it is a Single-character word. For example, the word "北京市 (Beijing City)" will be assigned the corresponding tags as: "北/B (North) 京/M (Capital) 市/E (City)". This discriminative model can incorporate extra features easily, and the Maximum Entropy Modeling Toolkit 1 given by Zhang Le is used to implement the module. In our experiments, this model is trained with a Gaussian prior of 1.0 and 600 iterations.
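The B/M/E/S tagging scheme described above can be illustrated with a minimal sketch. The function names `words_to_tags` and `tags_to_words` are hypothetical helpers introduced here for illustration; they are not part of the toolkit used in the system.

```python
def words_to_tags(words):
    """Convert a list of segmented words into (character, tag) pairs."""
    pairs = []
    for word in words:
        if len(word) == 1:
            pairs.append((word, "S"))  # single-character word
            continue
        for i, ch in enumerate(word):
            if i == 0:
                tag = "B"              # beginning of a word
            elif i == len(word) - 1:
                tag = "E"              # end of a word
            else:
                tag = "M"              # middle of a word
            pairs.append((ch, tag))
    return pairs


def tags_to_words(pairs):
    """Rebuild the word sequence from (character, tag) pairs."""
    words, current = [], ""
    for ch, tag in pairs:
        current += ch
        if tag in ("E", "S"):          # a word closes on E or S
            words.append(current)
            current = ""
    if current:                        # tolerate a dangling B/M at the end
        words.append(current)
    return words


print(words_to_tags(["北京市"]))
```

Segmentation then reduces to predicting one tag per character and reading the words back off the tag sequence.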
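The character-based generative module is a character-tag-pair-based trigram model (Wang et al., 2009) and can be expressed as below:

$$P([c,t]_1^n) \approx \prod_{k=1}^{n} P([c,t]_k \mid [c,t]_{k-2}^{k-1})$$

where $[c,t]_k$ denotes the pair formed by the character $c_k$ and its tag $t_k$. The SRI Language Modeling Toolkit 2 (Stolcke, 2002) is used to train the generative trigram model with modified Kneser-Ney smoothing (Chen and Goodman, 1998) in our experiments.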
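The character-based joint model combines the above discriminative module and the generative module with log-linear interpolation as follows:

$$Score(t_k) = \alpha \times \log P([c,t]_k \mid [c,t]_{k-2}^{k-1}) + (1 - \alpha) \times \log P(t_k \mid c_{k-2}^{k+2})$$

Here the first term is the generative trigram probability of the character-tag pair and the second term is the discriminative probability of the tag. The parameter $\alpha$ is the weight of the generative model and can be obtained from a development set. $Score(t_k)$ is used directly to search for the best tag sequence. We set $\alpha$ to an empirical value of 0.4, as there is no development set for the various domains.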
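The interpolation can be sketched in a few lines. This is a minimal illustration that assumes the score is a weighted sum of the two log-probabilities, with weight 0.4 on the generative model as stated above. The probabilities in the usage example are invented, and `best_tag` picks a tag for one position only, whereas the actual system searches over whole tag sequences.

```python
import math

ALPHA = 0.4  # weight of the generative model (empirical value from the text)


def joint_score(p_generative, p_discriminative, alpha=ALPHA):
    """Log-linear interpolation of the two model probabilities for one tag."""
    return alpha * math.log(p_generative) + (1.0 - alpha) * math.log(p_discriminative)


def best_tag(candidates, alpha=ALPHA):
    """Pick the highest-scoring tag at a single position.

    candidates maps a tag to a (p_generative, p_discriminative) pair.
    """
    return max(candidates, key=lambda t: joint_score(*candidates[t], alpha=alpha))


# Invented probabilities, for illustration only.
example = {"B": (0.5, 0.2), "S": (0.1, 0.6)}
print(best_tag(example))
```

With a larger `alpha` the generative model dominates, so the preferred tag in the example would change from the discriminative model's choice to the generative model's choice.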

Features
The feature templates used in the character-based discriminative model are listed below:
(a) $C_n$ ($n = -2, -1, 0, 1, 2$)
(b) $C_n C_{n+1}$ ($n = -2, -1, 0, 1$)
(c) $C_{-1} C_1$
(d) $T(C_{-2}) T(C_{-1}) T(C_0) T(C_1) T(C_2)$
In the above templates, $C_n$ represents a Chinese character and the index $n$ indicates its position relative to the current character $C_0$. For example, when we consider the third character "奥" in the sequence "北京奥运会", template (a) results in the features $C_{-2}$=北, $C_{-1}$=京, $C_0$=奥, $C_1$=运 and $C_2$=会; template (b) results in $C_{-2}C_{-1}$=北京, $C_{-1}C_0$=京奥, $C_0C_1$=奥运 and $C_1C_2$=运会; and template (c) results in $C_{-1}C_1$=京运. Template (d) is the character type feature, for which five type classes are defined: dates ("年", "月", "日", the Chinese characters for "year", "month" and "day" respectively) are class 0; foreign alphabets are class 1; Arabic and Chinese numbers are class 2; punctuation is class 3; and other characters are class 4. For example, when considering the character "，" in the sequence "八月，阿Q", the feature will be set to "20341".
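The following sketch shows how such features could be extracted for one position. It assumes a five-character window (two characters on each side), which is consistent with the five-digit type feature "20341" in the example above. The numeral list, the padding symbol and the feature name strings are illustrative assumptions, not the system's actual resources.

```python
import unicodedata

PAD = "<pad>"                      # hypothetical symbol for positions outside the sentence
DATE_CHARS = set("年月日")
CHINESE_NUMERALS = set("零〇一二三四五六七八九十百千万亿两")  # illustrative list


def char_type(ch):
    """Return the type class (0-4) of a character as a string."""
    if ch == PAD:
        return "4"                 # assumption: padding counts as "other"
    if ch in DATE_CHARS:
        return "0"
    if ch.isascii() and ch.isalpha():
        return "1"                 # foreign alphabet (ASCII letters only in this sketch)
    if ch.isdigit() or ch in CHINESE_NUMERALS:
        return "2"
    if unicodedata.category(ch).startswith("P"):
        return "3"                 # Unicode punctuation categories
    return "4"


def extract_features(sentence, i):
    """Build character and type features for the character at index i."""
    def at(offset):
        j = i + offset
        return sentence[j] if 0 <= j < len(sentence) else PAD

    c = {n: at(n) for n in range(-2, 3)}
    feats = [f"C{n}={c[n]}" for n in range(-2, 3)]                   # (a) unigrams
    feats += [f"C{n}C{n + 1}={c[n]}{c[n + 1]}" for n in range(-2, 2)]  # (b) bigrams
    feats.append(f"C-1C1={c[-1]}{c[1]}")                             # (c) skip bigram
    feats.append("T=" + "".join(char_type(c[n]) for n in range(-2, 3)))  # (d) types
    return feats


print(extract_features("北京奥运会", 2))
print(extract_features("八月，阿Q", 2)[-1])
```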

External Dictionary
OOV words are a main problem faced by a Chinese word segmenter, and accuracy drops if the sentence to be segmented contains many of them. To address this problem, we use an external dictionary containing a large set of predefined words. We follow the method presented in Low et al. (2005) to use the dictionary. In this method, sequences of neighboring characters around $C_0$ are looked up in the dictionary using a maximum matching strategy, and the longest matching word W is chosen. Let $t_0$ be the boundary tag of $C_0$ in W, L the number of characters in W, and $C_1$ ($C_{-1}$) be the character immediately following (preceding) $C_0$ in the sentence. We then add the following features derived from the dictionary:
(e) $L t_0$
(f) $C_n t_0$ ($n = -1, 0, 1$)
For example, consider the sentence "北京奥运会...". When processing the current character $C_0$ "京", we try to match the candidates "京", "北京", "京奥", "北京奥", "京奥运", "北京奥运" and "京奥运会" against the words in the external dictionary. Assuming that both "京奥" and "京奥运" are found in the dictionary, the longest matching word "京奥运" is chosen, and the values of W, $t_0$, L, $C_{-1}$ and $C_1$ are "京奥运", B, 3, "北" and "奥" respectively.
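The lookup in the example above can be sketched as follows. The sketch assumes candidates are all substrings of at most four characters that contain the current character, which reproduces the seven candidates listed for "京". The `max_len` parameter, the tie-breaking between equally long matches (first found wins) and the feature name strings are assumptions made for illustration.

```python
def longest_match(sentence, i, dictionary, max_len=4):
    """Return (word, start) for the longest dictionary word covering index i."""
    best = None
    for start in range(max(0, i - max_len + 1), i + 1):
        for end in range(i + 1, min(len(sentence), start + max_len) + 1):
            word = sentence[start:end]
            if word in dictionary and (best is None or len(word) > len(best[0])):
                best = (word, start)
    return best


def boundary_tag(position, length):
    """Tag of the character at `position` inside a word of `length` characters."""
    if length == 1:
        return "S"
    if position == 0:
        return "B"
    if position == length - 1:
        return "E"
    return "M"


def dictionary_features(sentence, i, dictionary, max_len=4):
    """Features (e) and (f) for the character at index i, or [] if nothing matches."""
    match = longest_match(sentence, i, dictionary, max_len)
    if match is None:
        return []
    word, start = match
    t0 = boundary_tag(i - start, len(word))
    feats = [f"L_t0={len(word)}{t0}"]                      # (e)
    for n in (-1, 0, 1):                                   # (f)
        j = i + n
        c = sentence[j] if 0 <= j < len(sentence) else "<pad>"
        feats.append(f"C{n}_t0={c}{t0}")
    return feats


toy_dictionary = {"京奥", "京奥运"}   # toy dictionary from the example in the text
print(dictionary_features("北京奥运会", 1, toy_dictionary))
```

Storing the dictionary as a hash set keeps each lookup constant-time on average, which matters for a dictionary with millions of entries.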
In this work, we collect dictionaries from the Internet, including the titles of Wikipedia 3, the titles of Hudong Baike 4, the Sogou word bank 5 and some other Internet dictionaries. The final dictionary used in our system contains 5,893,038 words.

Restrictions in Constructing Lattice
When considering a character in the sequence, we take the type information of both the previous and the next character into consideration and use some restrictions to obtain a better tag lattice. The restrictions are listed as follows:
- If the previous, the current and the next characters are all English letters or numbers, we fix the current tag to be "M";
- If the previous and the next characters are both English letters or numbers, while the current character is a connective symbol such as "-", "/", "_", "\" etc., we also fix the current tag to be "M";
- Otherwise, all four tags {B, E, M, S} are given to the current character.
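A minimal sketch of these restrictions is given below. It assumes that "English or numbers" means ASCII letters and digits and that the connective symbols are exactly the four listed; both are simplifications, since the original list ends with "etc.".

```python
ALL_TAGS = {"B", "M", "E", "S"}
CONNECTIVES = set("-/_\\")          # the four connective symbols named in the text


def is_english_or_number(ch):
    """Assumption: ASCII letters and digits stand for 'English or numbers'."""
    return ch.isascii() and ch.isalnum()


def allowed_tags(sentence, i):
    """Return the set of tags the lattice keeps for the character at index i."""
    if 0 < i < len(sentence) - 1:
        prev_ok = is_english_or_number(sentence[i - 1])
        next_ok = is_english_or_number(sentence[i + 1])
        current = sentence[i]
        if prev_ok and next_ok and (is_english_or_number(current) or current in CONNECTIVES):
            return {"M"}            # inside an alphanumeric run: force "M"
    return set(ALL_TAGS)            # otherwise keep all four tags


print([sorted(allowed_tags("用wi-fi上网", i)) for i in range(len("用wi-fi上网"))])
```

Pruning the lattice this way keeps strings such as product names or hyphenated tokens from being split in the middle, while leaving their first and last characters free to take any tag.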

Rule-based Adaptation
State-of-the-art Chinese word segmentation systems achieve high performance on well-formed text, but their performance on Microblog text is unsatisfactory because of the particular characteristics of such text. For example, Microblog text contains many emoticons, URLs, abbreviations, runs of consecutive identical punctuation marks and special characters. To make our system more robust when segmenting Microblog data, we propose some heuristic pre-processing and post-processing rules that avoid certain segmentation errors.

Pre-processing
As mentioned above, Microblog texts contain much noise, such as specially formatted words and characters, and this noise affects segmentation performance. To remove it, we pre-process the text before segmentation. Since URLs, email addresses and consecutive punctuation marks should each be treated as one word, and these content types can be easily recognized using regular expressions, we first replace all such content with special characters before segmentation and then restore the original content after segmentation. Table 1 shows the content types handled in the pre-processing stage.

5 http://pinyin.sogou.com/dict/
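The replace-and-restore step of the pre-processing stage can be sketched as follows. The regular expressions and the use of Unicode Private Use Area characters as placeholders are assumptions made for illustration; the content types actually handled by the system are those of Table 1.

```python
import re

# Hypothetical patterns: URL, email address, one punctuation mark repeated.
PATTERN = re.compile(
    r"https?://[A-Za-z0-9./?=&%_#~:+-]+"
    r"|[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"
    r"|([!?。！？~～.…])\1+"
)
BASE = 0xE000  # start of the Private Use Area, used for placeholder characters


def mask(text):
    """Replace each match with a single placeholder character."""
    originals = []

    def repl(match):
        originals.append(match.group(0))
        return chr(BASE + len(originals) - 1)

    return PATTERN.sub(repl, text), originals


def unmask(words, originals):
    """Restore the original content inside the segmented words."""
    def restore(ch):
        k = ord(ch) - BASE
        return originals[k] if 0 <= k < len(originals) else ch

    return ["".join(restore(ch) for ch in word) for word in words]


masked, originals = mask("看这里http://t.cn/abc！！！好")
print(originals)
```

Because each masked span becomes a single character, the segmenter cannot split it, and `unmask` puts the original string back as one word after segmentation.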

Post-processing
We use some heuristic rules to post-process the results generated by the segmenter. The rules are described below:
1) Numeral and quantifier: In our results, some numeral-quantifier sequences such as "两个" and "三张" are segmented as one unit. In fact, the numeral and the quantifier should be segmented as two words, except for a few words such as "一个". We therefore use a simple rule to split these units, in which the first part is a numeral and the second part is a quantifier.
2) Continuous mimetic words: There are many continuous mimetic words in Microblog text, such as "哈哈哈哈哈" and "呵呵呵". Such words should be treated as one unit, but our system splits each character into a separate word. Hence, we apply a rule to group continuous mimetic words together.
3) Emoticons: Some consecutive punctuation marks such as ":-)" represent an emoticon and carry a certain meaning. These emoticons should be grouped together. We have collected a list of emoticons from the web. For any run of consecutive punctuation marks, we join them into a single word if they appear in the emoticon list.
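The three post-processing rules can be sketched as below. All word lists here (`NUMERALS`, `QUANTIFIERS`, `KEEP_WHOLE`, `MIMETIC`, `EMOTICONS`) are small illustrative stand-ins for the system's actual resources, and the function names are hypothetical.

```python
NUMERALS = set("一二三四五六七八九十百千万两几")
QUANTIFIERS = set("个张只条本次件位")
KEEP_WHOLE = {"一个"}               # exceptions that stay as one word
MIMETIC = set("哈呵嘿嘻")
EMOTICONS = {":-)", ":)", ":-("}


def split_numeral_quantifier(words):
    """Rule 1: split a numeral+quantifier unit into two words."""
    out = []
    for w in words:
        is_unit = (
            w not in KEEP_WHOLE and len(w) >= 2 and w[-1] in QUANTIFIERS
            and all(ch in NUMERALS or ch.isdigit() for ch in w[:-1])
        )
        out.extend([w[:-1], w[-1]] if is_unit else [w])
    return out


def is_mimetic(word):
    """True for one mimetic character repeated one or more times."""
    return len(set(word)) == 1 and word[0] in MIMETIC


def merge_mimetic(words):
    """Rule 2: group adjacent tokens of the same mimetic character."""
    out = []
    for w in words:
        if out and is_mimetic(w) and is_mimetic(out[-1]) and out[-1][0] == w[0]:
            out[-1] += w
        else:
            out.append(w)
    return out


def join_emoticons(words, max_parts=4):
    """Rule 3: join adjacent tokens whose concatenation is a listed emoticon."""
    out, i = [], 0
    while i < len(words):
        for n in range(min(max_parts, len(words) - i), 1, -1):
            candidate = "".join(words[i:i + n])
            if candidate in EMOTICONS:
                out.append(candidate)
                i += n
                break
        else:                       # no emoticon starts at position i
            out.append(words[i])
            i += 1
    return out


print(join_emoticons(merge_mimetic(split_numeral_quantifier(["两个", "哈", "哈", ":", "-", ")"]))))
```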

Data sets
Since the Chinese word segmentation task focuses on multi-domain performance, we use five datasets as our test data. Four of the five are the test data of the SIGHAN10 closed track, and the remaining one is the set of 500 Microblog messages released by SIGHAN12. Hence, our test data covers five domains: Literature (Testing-A, containing 671 sentences), Computer (Testing-B, containing 1,330 sentences), Medicine (Testing-C, containing 1,309 sentences), Finance (Testing-D, containing 561 sentences) and Microblog (Testing-E, containing 500 sentences). The training data of our segmenter consists of two parts: the Peking University Corpora (PKU) from January to June, and manually annotated Microblog data containing nearly 7,000 sentences.

Experimental Results
We first evaluate our approach on the five test datasets using different strategies. The results are shown in Table 2 and the evaluation criterion is F-score. The strategies we used are:
- Joint: the result of our model without the dictionary.
- +Dic: the result of our model using the external dictionary.
- +Rule: the result of our model using the external dictionary and the pre-processing and post-processing rules.
As Table 2 shows, our joint model performs well on all five datasets, even though the domain of the training data, which is mainly composed of news data, differs from that of the test sets. This shows that our character-based joint model is robust and achieves a good balance between in-vocabulary (IV) word recognition and OOV word identification. After the external dictionary is added, performance increases considerably, which shows that the external dictionary is very useful and helps alleviate the OOV problem. Finally, when we adopt the pre-processing and post-processing rules, performance is further improved on all test sets except Testing-C.
Since the final test data will be multi-domain, we add all five datasets to the training data and retrain the segmentation model. We then apply the retrained model to the final test data (containing 1,665 sentences); the performance is shown in Table 3. Table 3 shows that our system achieves an F-score of 0.9578.
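For reference, a word-level F-score of the kind conventionally used in segmentation evaluation can be computed as in the sketch below. It assumes that a predicted word counts as correct only when both of its boundaries match the gold segmentation; the official bakeoff scorer may differ in details.

```python
def to_spans(words):
    """Convert a word sequence into a set of (start, end) character spans."""
    spans, start = set(), 0
    for w in words:
        spans.add((start, start + len(w)))
        start += len(w)
    return spans


def f_score(gold_words, predicted_words):
    """Harmonic mean of word-level precision and recall."""
    gold, pred = to_spans(gold_words), to_spans(predicted_words)
    correct = len(gold & pred)      # words whose both boundaries agree
    if correct == 0:
        return 0.0
    precision = correct / len(pred)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy example: one of three predicted words matches one of two gold words.
print(f_score(["北京", "奥运会"], ["北京", "奥运", "会"]))
```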

Conclusion
Our system is based on a character-based joint model, which combines a generative module and a discriminative module. In addition, we employ an external dictionary and propose several pre-processing and post-processing rules to further improve performance. Our system achieves performance comparable with that of other participants.