The CIPS-SIGHAN CLP 2014 Chinese Word Segmentation Bake-off

This paper summarizes the SIGHAN 2014 Chinese word segmentation bake-off, covering the dataset and the evaluation results. In addition, we analyze segmentation errors instance by instance and make suggestions for improving segmentation systems.


Goal of the Chinese word segmentation bake-off
Chinese word segmentation is the preliminary step of Chinese information processing; it is extremely important and should never be neglected. Owing to the properties of Chinese, segmentation quality affects all subsequent analysis of Chinese text. As the organizers of the bake-off, we not only report the performance of all participating systems but also try to identify their weak points. In this way, participants can learn the strengths of their systems, become aware of problems they had overlooked, and improve their systems according to our feedback, which in turn promotes the study of Chinese word segmentation.

Size of dataset
The dataset used in the SIGHAN 2014 Chinese word segmentation bake-off was built by sampling instances that are difficult to segment from a Chinese corpus of approximately 1.3T. This was a huge challenge for us. While sampling, we found that the distribution of hard-to-segment sentences does not depend on domain; in other words, such sentences appear in every domain.

Domains of dataset
Unlike the SIGHAN 2012 Chinese word segmentation bake-off, which focused only on the microblog domain, the SIGHAN 2014 shared task uses a dataset sampled from a variety of domains. The dataset covers many subjects in both the social sciences and the natural sciences, and the genres of the sampled texts are also taken into consideration. In this way, we can evaluate more clearly whether current segmentation techniques perform well across a wide range of domains.

Makeup of dataset
The SIGHAN 2014 Chinese word segmentation bake-off mainly uses single sentences and paragraphs for evaluation; whole discourses are also included.
As is well known, there are two kinds of ambiguity in Chinese word segmentation, overlapping ambiguity and combinatorial ambiguity, both of which are difficult to deal with. In addition, out-of-vocabulary (OOV) words, which include neologisms, abbreviations and uncommon terminology, are also a challenge for Chinese word segmentation.
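One common way to flag candidate overlapping ambiguities is to compare forward and backward maximum matching and report the strings on which the two disagree. The sketch below illustrates that idea only; the lexicon, the function names and the test string 研究生命起源 are illustrative assumptions and are not taken from the bake-off data or from any participating system.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy left-to-right longest match; unknown characters become single-character words."""
    words, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            cand = text[i:i + n]
            if n == 1 or cand in lexicon:
                words.append(cand)
                i += n
                break
    return words


def backward_max_match(text, lexicon, max_len=4):
    """Greedy right-to-left longest match; unknown characters become single-character words."""
    words, j = [], len(text)
    while j > 0:
        for n in range(min(max_len, j), 0, -1):
            cand = text[j - n:j]
            if n == 1 or cand in lexicon:
                words.append(cand)
                j -= n
                break
    words.reverse()
    return words


def has_overlap_candidate(text, lexicon):
    """Heuristic: a disagreement between the two scans suggests a possible overlapping ambiguity."""
    return forward_max_match(text, lexicon) != backward_max_match(text, lexicon)


# Hypothetical toy lexicon, for illustration only.
lexicon = {"研究", "研究生", "生命", "起源"}
print(forward_max_match("研究生命起源", lexicon))   # ['研究生', '命', '起源']
print(backward_max_match("研究生命起源", lexicon))  # ['研究', '生命', '起源']
print(has_overlap_candidate("研究生命起源", lexicon))  # True
```

The heuristic only detects disagreement; it does not decide which reading is right, and it says nothing about combinatorial ambiguity or OOV words, which need other evidence.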

Evaluation Results
Precision, recall and F-measure are used to evaluate participants' systems, as in previous bake-offs. Since the number of participants is not large (6 institutes and 7 systems), we can analyze the systems in detail to find their weak points, which should promote the study of Chinese word segmentation.
Table 1: Distribution of P, R, F of systems participating in this bake-off

Automatic Evaluation
For automatic evaluation, precision, recall and F-measure are used to evaluate participants' systems.
The performance of the 7 systems from the 6 institutes participating in the bake-off is shown in Table 1.
Table 4: Differences between the best system and the worst system in 2012 and 2014
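For readers unfamiliar with how these metrics are usually computed for segmentation, the following is a minimal sketch of word-level precision, recall and F-measure, where a predicted word counts as correct only if both of its boundaries match the gold segmentation. It is not the official scoring script of the bake-off; the function names and the gold segmentation in the example are assumptions made for illustration.

```python
def to_spans(words):
    """Map a word list to the set of (start, end) character spans it covers."""
    spans, start = set(), 0
    for w in words:
        end = start + len(w)
        spans.add((start, end))
        start = end
    return spans


def prf(gold_words, pred_words):
    """Word-level precision, recall and F-measure for one segmented sequence."""
    gold, pred = to_spans(gold_words), to_spans(pred_words)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Assumed gold standard and a hypothetical system output that splits 对方.
gold = ["包围", "对方", "的", "全部", "军队"]
pred = ["包围", "对", "方", "的", "全部", "军队"]
p, r, f = prf(gold, pred)
print(f"P={p:.3f} R={r:.3f} F={f:.3f}")  # P=0.667 R=0.800 F=0.727
```

A single wrong split costs one gold word in recall and two predicted words in precision, which is why a handful of systematic errors can separate otherwise similar systems.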

Why manual inspection
In previous SIGHAN segmentation shared tasks, precision, recall and F-measure were the only metrics for evaluating systems. Although these metrics reflect a system's performance to some extent, they cannot clearly show its specific weak points. A system that achieves high P, R and F may still handle some details poorly and make silly mistakes, whereas a system with lower scores may handle certain specific segmentation problems well. Of course, other factors, such as the size of the dictionary, might also affect the results.
Since the SIGHAN 2012 Chinese word segmentation bake-off, we have attempted to introduce evaluations for specific cases, which inform participants of the approximate accuracy range for each case and allow them to learn the weak points of their systems.
By manual inspection, we found some typical mistakes that should have been avoided but were still made by most systems.

Methods of manual inspection
We use different types of lines (a single line, a double line or a dashed line) to indicate how to segment a sequence of Chinese characters. As shown in Table 5, only one system segments the sequence without any mistake. In contrast, one of the systems makes many mistakes when segmenting simple terms, which may stem from problems in word-collection or from deeper problems.

Excessive word-collection may have an adverse effect
In Table 6, only one system segments '对方'. Table 7 confirms that this system did not include '对方' in its dictionary.
As shown in Table 6 and Table 7, a system that includes '对方' in its dictionary segments '对方' correctly, while the others make a mistake here. We hope that the system actually attends to this detail rather than merely happening to segment it well. There are many similar cases, such as '平等' and '杜鹃'.
Example 6：公司派张世平等一批技术骨干和管理人员到国外学习。
"杜鹃" in Example 7 is a noun, while it is a person's name in Example 2. Therefore, 杜鹃 should be segmented in Example 2. To address these problems, an effective personal name recognition method is necessary.
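The effect of an over-large dictionary on Example 6 can be illustrated with a purely dictionary-driven segmenter. The sketch below uses greedy forward longest matching with two toy lexicons that differ only in the entry 平等; both lexicons and the function name are hypothetical, and no participating system is claimed to work this way. We assume the intended reading is the personal name 张世平 followed by 等.

```python
def greedy_segment(text, lexicon, max_len=4):
    """Greedy left-to-right longest match; unknown characters become single-character words."""
    words, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            cand = text[i:i + n]
            if n == 1 or cand in lexicon:
                words.append(cand)
                i += n
                break
    return words


fragment = "公司派张世平等一批技术骨干"   # prefix of Example 6
small_lexicon = {"公司", "一批", "技术", "骨干"}   # hypothetical
large_lexicon = small_lexicon | {"平等"}           # same lexicon plus one more entry

print(greedy_segment(fragment, small_lexicon))
# ['公司', '派', '张', '世', '平', '等', '一批', '技术', '骨干']
print(greedy_segment(fragment, large_lexicon))
# ['公司', '派', '张', '世', '平等', '一批', '技术', '骨干']
```

With the smaller lexicon the boundary between 平 and 等 survives, so a later name recognizer could still assemble 张世平; with the extra entry, 平等 is matched first and that boundary is lost.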

A lack of attention to details
Example 8：进攻者比防御者更容易包围对方的全部军队以及切断它们的退路,因为防御者处于驻止状态,而进攻者是针对防御者的这种状态进行运动的。
Example 8 is an instance from the test set. In this sentence, 进攻者 appears twice and 防御者 appears three times. Nonetheless, some systems do not segment these terms consistently. The cause is that these systems do not exploit the context well.
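A consistency check of this kind is easy to automate. The sketch below finds every occurrence of a term in the text underlying a segmentation and reports whether each occurrence was kept as exactly one word; the function names are our own, and the segmented output is a hypothetical one constructed for illustration rather than the output of any participating system.

```python
def term_treatments(words, term):
    """For each occurrence of `term` in the joined text, return (position, kept_as_one_word)."""
    text = "".join(words)
    spans, start = set(), 0
    for w in words:
        spans.add((start, start + len(w)))
        start += len(w)
    results = []
    pos = text.find(term)
    while pos != -1:
        results.append((pos, (pos, pos + len(term)) in spans))
        pos = text.find(term, pos + 1)
    return results


def is_consistent(words, term):
    """True if every occurrence of `term` is treated the same way."""
    return len({kept for _, kept in term_treatments(words, term)}) <= 1


# Hypothetical output for the first part of Example 8: 防御者 is split once and kept once.
output = ["进攻者", "比", "防御", "者", "更", "容易", "包围", "对方", "的", "全部", "军队",
          "以及", "切断", "它们", "的", "退路", ",", "因为", "防御者", "处于", "驻止", "状态"]
print(is_consistent(output, "进攻者"))  # True
print(is_consistent(output, "防御者"))  # False
```

Such a check only reports inconsistency; it cannot say which treatment is correct, so it is best used to select instances for manual inspection.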

Example 9：于廿七号晚上出发，
In Example 9, 廿七号 is a form that has seldom been used in written language in recent years. Nevertheless, a good system is expected to handle such cases. Incorrect segmentations are shown as follows.

Conclusion
Although languages have many properties in common, the unique characteristics of each language prevent researchers from directly applying techniques developed for other languages to Chinese.
In addition, anyone who studies the language closely finds that Chinese is highly distinctive and flexible, and these properties deserve close attention. Only by carefully analyzing the unique properties of Chinese can researchers come up with better ways to improve their systems. Even though Chinese is so flexible that no single rule can describe the problems of Chinese word segmentation, researchers can try multiple rules to optimize their systems in multiple aspects and at multiple levels, which requires them to be mindful of details.
As the organizers of this Chinese word segmentation bake-off, we may also need to scrutinize details and draw up a standard that is detailed and easy to apply. For the bake-off, we are going to explore a better evaluation method that presents the results of systems more reasonably and objectively.