An Improved Graph Model for Chinese Spell Checking

In this paper, we propose an improved graph model for Chinese spell checking. The model is based on a graph model for generic errors and two independently-trained models for specific errors. First, a graph model represents a Chinese sentence, and a modified single source shortest path algorithm is performed on the graph to detect and correct generic spelling errors. Then, we utilize conditional random fields to solve two specific kinds of common errors: the confusion of “ 在 ” (at) (pinyin is ‘zai’ in Chinese) with “ 再 ” (again, more, then) (pinyin: zai), and of “ 的 ” (of) (pinyin: de), “ 地 ” (-ly, adverb-forming particle) (pinyin: de), “ 得 ” (so that, have to) (pinyin: de). Finally, a rule based system is exploited to solve the pronoun usage confusion of “ 她 ” (she) (pinyin: ta) with “ 他 ” (he) (pinyin: ta), and some other fixed collocation errors. The proposed model is evaluated on the standard data set released by the SIGHAN Bake-off 2014 shared task, and gives competitive results.


Introduction
Spell checking is a routine processing task for every written language: an automatic mechanism to detect and correct human spelling errors. Given input sentences, the goal of the task is to return the locations of incorrect words and suggest the correct words. However, Chinese spell checking (CSC) is very different from spell checking in English or other alphabetical languages, in the following ways.
Usually, the object of spell checking is words, but "word" is not a natural concept in Chinese, since there are no delimiters between words in Chinese writing. An English "word" consists of Latin letters, while a Chinese "word" consists of characters, also known as "漢字" (Chinese characters) (pinyin 1 is 'han zi' in Chinese). Thus, essentially, the objects of CSC are misused characters in a sentence. Meanwhile, the sentences in a CSC task are assumed to be computer-typed, not handwritten. Handwritten Chinese exhibits various spelling errors, including non-character errors that are probably caused by stroke errors. In computer-typed Chinese, by contrast, a non-character spelling error is impossible, because any illegal Chinese character is filtered out by the Chinese input method engine, so CSC never encounters the "out-of-character" (OOC) problem. Thus, Chinese spelling errors come from the misuse of legal characters, not from malformed characters themselves.
Spelling errors in alphabetical languages, such as English, typically fall into two categories:
• The misspelled word is a non-word, for example "come" is misspelled as "cmoe";
• The misspelled word is still a legal word, for example "come" is misspelled as "cone".
In Chinese, however, if the misspelled word is a non-word, the word segmenter will not recognize it as a word, but will split it into two or more shorter words. For example, if "你好世界" in Example 1 of Table 1 is misspelled as "你好世節", the word segmenter will segment it into "你好/世/節" instead of "你好/世節". For a non-word spelling error, the misspelled word will thus be mis-segmented. Consequently, the edit-distance-based methods commonly used for alphabetical languages cannot be directly applied to CSC. A CSC system has to deal with the word segmentation problem first, since a misspelled sentence cannot be segmented properly by the word segmenter.
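This mis-segmentation effect can be illustrated with a minimal sketch: a toy forward maximum matching segmenter over a hypothetical dictionary (not the segmenter used in our system), showing how the non-word error "你好世節" splits apart instead of surviving as a word.

```python
# A toy forward maximum matching segmenter. TOY_DICT and MAX_WORD_LEN are
# illustrative assumptions, not resources from the actual system.

TOY_DICT = {"你好", "世界", "世", "節", "你", "好"}
MAX_WORD_LEN = 4

def fmm_segment(sentence, dictionary=TOY_DICT):
    """Greedy forward maximum matching word segmentation."""
    segments, i = [], 0
    while i < len(sentence):
        # Try the longest dictionary match starting at position i;
        # a single character is always accepted as a fallback segment.
        for length in range(min(MAX_WORD_LEN, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + length]
            if length == 1 or candidate in dictionary:
                segments.append(candidate)
                i += length
                break
    return segments

print(fmm_segment("你好世界"))  # ['你好', '世界']
print(fmm_segment("你好世節"))  # ['你好', '世', '節'] -- the non-word splits apart
```

The misspelled character "節" leaves no two-character dictionary word covering it, so the segmenter falls back to single characters, which is exactly the signal a CSC system must then interpret.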
There also exist Chinese spelling errors that are unrelated to word segmentation. For example, "好好地出去玩" in Example 2 of Table 1 is misspelled as "好好的出去玩", but both sentences have the same segmentation. Such errors therefore require further specific processing.
In this paper, based on our previous work (Jia et al., 2013b) in SIGHAN Bake-off 2013, we describe an improved graph model to handle the CSC task. The improved model includes a graph model for generic spelling errors, conditional random fields (CRF) for two special errors and a rule based system for some collocation errors.

Related Work
Over the past few years, many methods have been proposed for the CSC task.  developed a phrase-based spelling error model from clickthrough data by measuring the edit distance between an input query and the optimal spelling correction.  explored a ranker-based approach which included visual similarity, phonological similarity, dictionary, and frequency features for large scale web search. (Ahmad and Kondrak, 2005) proposed a spelling error model learned from search query logs to improve query quality. (Han and Chang, 2013) employed maximum entropy models for CSC: they trained a maximum entropy model for each Chinese character on a large raw corpus and used the models to detect spelling errors.
Two key techniques, word segmentation (Zhao et al., 2006a; Zhao and Kit, 2008b; Zhao et al., 2006b; Zhao and Kit, 2008a; Zhao and Kit, 2007; Zhao and Kit, 2011; Zhao et al., 2010) and language model (LM), are also popularly used for CSC. Most of those approaches fall into four categories. The first category consists of methods in which all the characters in a sentence are assumed to be errors and an LM is used for correction (Chang, 1995; Yu et al., 2013). (Chang, 1995) proposed a method that replaced each character in the sentence based on a confusion set and computed the probability of the original sentence and all modified sentences according to a bigram language model generated from a newspaper corpus. The method was motivated by the observation that typos are caused by either visual similarity or phonological similarity, so they manually built a confusion set as a key component of their system. Although the method detected misspelled words well, it was very time consuming, generated too many false positives, and was unable to refer to an entire paragraph. ) developed a joint error detection and correction system. The method assumed that all characters in the sentence may be errors and replaced every character using a confusion set. Then they segmented all newly generated sentences and scored each segmentation with an LM. In fact, this method did not always perform well according to .
The second category includes methods in which all single-character words are assumed to be errors and an LM is used for correction, for example (Lin and Chu, 2013). They developed a system which supposed that all single-character words may be typos. They replaced all single-character words with similar characters using a confusion set and segmented the newly created sentences again. If a new sentence resulted in a better word segmentation, a spelling error was reported. Their system gave good detection recall and a low false-alarm rate.
The third category utilizes more than one approach for detection and an LM for correction.  used two different systems for error detection: the first detected error characters based on unknown word detection and LM verification, and the second performed error detection based on a suggestion dictionary generated from a confusion set. Finally, the two systems were combined to obtain the final detection result. (He and Fu, 2013) divided typos into three categories, character-level errors (CLEs), word-level errors (WLEs) and context-level errors (CLEs), and used three different methods to detect the different errors respectively. In addition to using the result of word segmentation for detection, (Yeh et al., 2013) also proposed a dictionary-based method to detect spelling errors; the dictionary contained similar pronunciation and shape information for each Chinese character. (Yang et al., 2013) proposed another method to improve candidate detection, employing high confidence pattern matching to strengthen the candidate errors after word segmentation.
The last category is formed by methods which use word segmentation for detection and different models for correction (Chiu et al., 2013).  used a support vector machine (SVM) to select the most probable sentence from multiple candidates. They used word segmentation and a machine translation model to generate the candidates respectively, and the SVM was used to rerank the candidates. ) not only applied an LM, but also used various topic models to make up for the shortcomings of the LM. (Chiu et al., 2013) explored a statistical machine translation model to translate sentences containing typos into correct ones. In their model, the sentence with the highest translation probability, which indicated how likely a typo was to be translated into its candidate correct word, was chosen as the final corrected sentence.

The Revised Graph Model
The graph model (Jia et al., 2013b) of SIGHAN Bake-off 2013 is inspired by the shortest path word segmentation algorithm, which is based on the following assumption: a reasonable segmentation should maximize the lengths of all segments, or equivalently minimize the total number of segments (Casey and Lecolinet, 1996). A directed acyclic graph (DAG) is built from the input sentence in a similar way, and the spelling error detection and correction problem is transformed into a single source shortest path (SSSP) problem on the DAG. Given a dictionary D and a set of similar characters C, for a sentence S of m characters {c1, c2, . . . , cm}, the original vertices V of the DAG in (Jia et al., 2013b) are the candidate words spanning substrings of S, together with w−,0 = "<S>" and wm+1,− = "</S>", two special vertices representing the start and end of the sentence.
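The core idea can be sketched as follows, under simplifying assumptions (a toy dictionary and confusion set, and unit edge weights instead of the weighted edge function): candidate words are generated over each span, allowing at most one similar-character substitution, and the path from the start of the sentence to the end with the fewest segments is selected.

```python
# A minimal sketch of the DAG + shortest-path idea. DICT, CONFUSION and
# MAX_LEN are toy assumptions; the real model uses weighted edges.

DICT = {"你好", "世界", "你", "好"}
CONFUSION = {"節": {"界"}}  # characters that look or sound alike
MAX_LEN = 4

def candidate_words(sent, i, j):
    """Dictionary words spanning sent[i:j], allowing one substitution.
    Single characters are always allowed as fallback segments."""
    span = sent[i:j]
    if span in DICT or j - i == 1:
        yield span
    for k, ch in enumerate(span):
        for sub in CONFUSION.get(ch, ()):
            cand = span[:k] + sub + span[k + 1:]
            if cand in DICT:
                yield cand

def correct(sent):
    n = len(sent)
    # dist[i] = fewest segments covering sent[:i]; back[i] = (prev, word).
    # Positions form a topological order, so simple DP solves the SSSP.
    dist = [float("inf")] * (n + 1)
    back = [None] * (n + 1)
    dist[0] = 0
    for i in range(n):
        if dist[i] == float("inf"):
            continue
        for j in range(i + 1, min(i + MAX_LEN, n) + 1):
            for word in candidate_words(sent, i, j):
                if dist[i] + 1 < dist[j]:
                    dist[j], back[j] = dist[i] + 1, (i, word)
    # Recover the corrected segmentation along the shortest path.
    words, i = [], n
    while i > 0:
        prev, word = back[i]
        words.append(word)
        i = prev
    return list(reversed(words))

print(correct("你好世節"))  # ['你好', '世界']: '節' corrected to '界'
```

The substituted word "世界" yields a two-segment path, which beats the four single-character segments of the literal sentence, so the correction falls out of the shortest path.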
However, the graph model cannot handle continuous-character errors. Take the following sentence as an example: "健康" (health) (pinyin: jian kang) is misspelled as "建缸" (pinyin: jian gang). The model fails because its substitution strategy does not substitute two continuous characters simultaneously.

The Improved Graph Model
The graph model based on word segmentation in (Jia et al., 2013b), including the revised graph model in Section 3, still has its limitations. In the graph construction stage, substitution is only applied when the number of words after segmentation decreases, i.e., when a new, longer word appears after segmentation. In addition, if a segment is a single character, the graph model does not work, because a single character will not be substituted. For example, in the following two sentences, the "他" (he) (pinyin: ta) in the first sentence should be corrected to "她" (she) (pinyin: ta), and the "的" (of) (pinyin: de) in the second sentence should be corrected to "地" (-ly, adverb-forming particle) (pinyin: de); however, the graph model does not work for these cases.
• 雖然我不在我的國家，不能見到媽媽，可是我要給'他' (him) (pinyin: ta)打電話！ Translation after correction: Though I'm not in my country and cannot see my mum, I would like to call her!
• 我們也不要想太多；我們來好好'的' (of) (pinyin: de)出去玩吧！ Translation after correction: We should not worry too much; let's just enjoy ourselves outside now!
The graph model is also powerless in the situation where the wrong character is segmented into a legal word. Take the following sentence as an example: the word "心裡" (in mind, at heart) (pinyin: xin li) will not be separated after building the graph, so "裡" (pinyin: li) cannot be corrected to "理" (pinyin: li).
To alleviate the above limitations of the graph model, we utilize a CRF model to deal with two kinds of errors, and establish a rule based system to cope with the pronoun errors "她" (she) (pinyin: ta) and "他" (he) (pinyin: ta), as well as collocation errors.

CRF Model
Two classifiers using the CRF model are trained respectively to tackle the common character usage confusions: "在" (at) (pinyin: zai) vs. "再" (again, more, then) (pinyin: zai), and "的" (of) (pinyin: de) vs. "地" (-ly, adverb-forming particle) (pinyin: de) vs. "得" (so that, have to) (pinyin: de). We assume that the correct character selection is related to its two neighboring words on each side and their part-of-speech (POS) tags. The classifiers are trained on a large five-gram token set extracted from a large POS tagged corpus. The feature selection follows (Zhao et al., 2013; Wang et al., 2014; Jia et al., 2013a). The feature set for the CRF model is: w j,−2, pos j,−2, w j,−1, pos j,−1, w j,0, pos j,0, w j,1, pos j,1, w j,2, pos j,2, where j is the token index indicating the position, w j,0 is the current candidate character and pos j,0 is its POS tag. ICTCLAS (Zhang et al., 2003) is adopted for POS tagging.
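The five-gram feature window described above can be sketched as follows (this only builds the feature dictionary for a candidate token; a real system would feed such features to a CRF toolkit, and the tagged sentence and tag labels here are hypothetical).

```python
# Build the word/POS window features for a candidate token at position j:
# the token itself plus two neighbours on each side, padded at boundaries.

PAD = ("<S>", "<POS>")  # assumed padding for tokens outside the sentence

def window_features(tagged, j):
    """tagged: list of (word, pos) pairs; j: index of the candidate token."""
    feats = {}
    for off in (-2, -1, 0, 1, 2):
        k = j + off
        word, pos = tagged[k] if 0 <= k < len(tagged) else PAD
        feats[f"w[{off}]"] = word
        feats[f"pos[{off}]"] = pos
    return feats

# Hypothetical tagged sentence: 我們 / 好好 / 地 / 出去 / 玩
tagged = [("我們", "r"), ("好好", "d"), ("地", "u"), ("出去", "v"), ("玩", "v")]
print(window_features(tagged, 2)["w[0]"])   # 地
print(window_features(tagged, 2)["w[-1]"])  # 好好
```

The classifier then learns, for instance, that "地" rather than "的" tends to follow a reduplicated adverb like "好好" before a verb.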

The Rule Based System
To effectively handle the pronoun usage errors for "她" (she) (pinyin: ta) and "他" (he) (pinyin: ta), as well as other collocation errors, we design a rule based system whose rules are extracted from the development set.
Table 3 lists the rules we set for solving the pronoun usage errors, where prefix[i] is the prefix of the current word w[i] in a sentence. The other rules are divided into five categories, presented in Table 4 - Table 8. In Table 4, we only present several typical rules of Rule 3. The negation symbol "¬" in Table 6 and Table 7 means that the word in the corresponding position is not the one in the brackets. Each rule in the tables is verified by the Baidu 2 search engine: if the supposed error pattern legitimately appears in the search results, we do not correct it.
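The flavor of such a rule can be sketched as follows (the trigger words and the rule itself are illustrative stand-ins, not the actual contents of Table 3).

```python
# A toy prefix rule for the 他/她 confusion: if the word immediately
# before the pronoun clearly marks a female referent, suggest "她".
# FEMALE_PREFIXES is a hypothetical trigger list, not from the paper.

FEMALE_PREFIXES = {"媽媽", "姐姐", "妹妹", "奶奶"}

def check_pronoun(words, i):
    """Return a suggested correction for words[i], or None if no rule fires."""
    if words[i] == "他" and i > 0 and words[i - 1] in FEMALE_PREFIXES:
        return "她"
    return None

print(check_pronoun(["媽媽", "他"], 1))  # 她
print(check_pronoun(["爸爸", "他"], 1))  # None
```

In the actual system, a rule that fires would additionally be vetoed if the original pattern is attested by the search-engine check described above.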

Data Sets and Resources
The proposed method is evaluated on the data sets of the SIGHAN Bake-off shared tasks in 2013 and 2014. In Bake-off 2013, the sentences were collected from 13 to 14-year-old students' essays in formal written tests. In Bake-off 2014, the sentences were collected from Chinese as a foreign language (CFL) learners' essays selected from the National Taiwan Normal University (NTNU) learner corpus 3. All the data sets are in traditional Chinese. In Bake-off 2013, the essays were manually annotated with different labels (see Figure 1), and there is at most one error in each sentence. The development set in Bake-off 2014, however, is enlarged and the error types (see Figure 2) are more diverse.

The Improved Graph Model
We treat the graph model without filters in Bake-off 2013 as our baseline in Bake-off 2014. The edge weight function ω L is a linear combination of the similarity weight and the negative log conditional probability, ω L = ω 0 + ω s − β log P, where ω 0 ≡ 0 is omitted from the equation, and the values of ω s for different kinds of characters are shown in Table 11. The LM is set to bigram according to (Yang et al., 2012). The modified Kneser-Ney method is used for LM smoothing (Chen and Goodman, 1999).
Type                                    ω s
same pronunciation, same tone           1
same pronunciation, different tone      1
similar pronunciation, same tone        2
similar pronunciation, different tone   2
similar shape                           2
Table 11: ω s used in ω L.
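As an illustration only (the exact functional form is defined in our earlier work; here we assume a simple ω s − β·log P combination with ω 0 ≡ 0 omitted), an edge weight can be computed from the Table 11 similarity penalty and a bigram probability as follows.

```python
# Toy edge-weight computation. The "identical" entry (no substitution,
# zero penalty) is an assumption added for completeness.

import math

OMEGA_S = {
    "same_pron_same_tone": 1,
    "same_pron_diff_tone": 1,
    "similar_pron_same_tone": 2,
    "similar_pron_diff_tone": 2,
    "similar_shape": 2,
    "identical": 0,  # assumed: no penalty when the character is unchanged
}

def edge_weight(sim_type, bigram_prob, beta):
    """Assumed form: omega_L = omega_s - beta * log P(w | prev)."""
    return OMEGA_S[sim_type] - beta * math.log(bigram_prob)

w = edge_weight("same_pron_same_tone", 0.01, beta=6)
print(round(w, 3))  # 28.631: small similarity penalty plus scaled LM cost
```

Larger β makes the LM term dominate the similarity penalty, which is why the choice of β is tuned on the development sets below.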
We utilize the correction precision (P), correction recall (R) and F1 score (F) as the metrics, computed as follows:
• Correction precision: P = (number of correctly corrected characters) / (number of all corrected characters);
• Correction recall: R = (number of correctly corrected characters) / (number of wrong characters in the gold data);
• F1 score: F = 2PR / (P + R).
We first use the revised graph model in Section 3 to tackle the continuous-character errors. The results achieved by the graph model and its revision on Dev14B with different β are shown in Figure 3 respectively. We can see that the result with the revised graph model is not improved, and is even worse than the baseline. Therefore, for the improved graph model in Bake-off 2014, we retain the graph model of Bake-off 2013 without any modification. To observe the performance of the improved graph model in detail, on the three development sets Dev13, Dev14C and Dev14B, we report the results of the following settings:
1. CRF. We use the CRF model to process the common character usage confusions: "在" (at) (pinyin: zai), "再" (again, more, then) (pinyin: zai) and "的" (of) (pinyin: de), "地" (pinyin: de), "得" (pinyin: de).
3. Graph+CRF. In this setting, the graph model with different β in ω L is performed on the CRF results. For each development set, an optimal β can be found that obtains the best performance.

4. CRF+Graph+Rule_Post. Based on the results of the Graph+CRF model, we add the rule based system. Similarly, an optimal β can be found.

5. CRF+Rule_Pre+Graph. Different from the third setting, we first utilize the rule based system on the development sets, and then use the graph model with different β in ω L.
6. CRF+Rule_Pre+Graph+Rule_Post. Based on the results of the CRF+Rule_Pre+Graph model, we apply the rule based system at the end.
In Table 14, we compare the different improved graph models on the development sets, with β set to 6 in ω L. Although the results of the improved graph model on Dev13 decline slightly, the results on both Dev14C and Dev14B are improved. The results in Table 14 show that the CRF model and the rule based system effectively compensate for the shortcomings of the graph model.
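The correction-level metrics defined above can be computed as in this minimal sketch, where corrections are represented as (sentence id, position, character) triples (a representation chosen here for illustration).

```python
# Correction precision, recall and F1 over sets of correction triples.

def prf1(corrected, gold):
    """corrected, gold: sets of (sentence_id, position, char) corrections."""
    n_correct = len(corrected & gold)          # correctly corrected characters
    p = n_correct / len(corrected) if corrected else 0.0
    r = n_correct / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0  # harmonic mean of P and R
    return p, r, f1

gold = {(1, 4, "界"), (2, 3, "地")}
corrected = {(1, 4, "界"), (2, 3, "的")}  # one right, one wrong suggestion
p, r, f1 = prf1(corrected, gold)
print(p, r, round(f1, 3))  # 0.5 0.5 0.5
```

Note that a wrong suggestion at a genuinely erroneous position still counts against both precision and recall, since the triple must match the gold correction exactly.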

Results
In Bake-off 2014, we submit 3 runs using the CRF+Rule_Pre+Graph model and the weight function ω L, with β set to 0, 6, and 10, respectively. The results on Test14 are listed in

Conclusion
In this paper we present an improved graph model to deal with the Chinese spell checking problem. The model includes a graph model and two independently-trained models. To begin with, the graph model is utilized to solve the generic spell checking problem, with an SSSP algorithm adopted as the model implementation. Furthermore, a CRF model and a rule based system are used to compensate for the shortcomings of the graph model. The effectiveness of the proposed model is verified on the data released by the SIGHAN Bake-off 2014 shared task, and our system gives competitive results according to the official evaluation.