Overview of SIGHAN 2014 Bake-off for Chinese Spelling Check

This paper introduces the Chinese Spelling Check campaign organized for the SIGHAN 2014 bake-off, including the task description, data preparation, performance metrics, and evaluation results based on essays written by learners of Chinese as a foreign language. We hope such evaluations will help produce more advanced Chinese spelling check techniques.


Introduction
Chinese spelling errors frequently arise from confusion between Chinese characters that are phonologically or visually similar but semantically distinct (Liu et al., 2011). The SIGHAN 2013 Chinese Spelling Check Bake-off was the first campaign to provide data sets as benchmarks for the objective performance evaluation of Chinese spelling checkers (Wu et al., 2013). The collected data set is publicly available at http://ir.itc.ntnu.edu.tw/lre/sighan7csc.htm. The competition resulted in the integration of effective NLP techniques into the development of Chinese spelling checkers. Language modeling was used to glean extra semantic clues, and web resources were collected to identify and correct spelling errors (Chen et al., 2013). A hybrid model was proposed to combine language models and statistical machine translation for spelling error correction (Liu et al., 2013). A linear regression model was trained using phonological and orthographic similarities to correct misspelled characters (Chang et al., 2013). Web-based measures were adopted to score candidates for Chinese spelling error correction (Yu et al., 2013). A graph model was used to represent the sentence, with the single-source shortest path algorithm applied to correct spelling errors (Jia et al., 2013).

The SIGHAN 2014 Bake-off again features a Chinese Spelling Check task, providing an evaluation platform for the development and implementation of automatic Chinese spelling checkers. Given a passage composed of several sentences, the checker should identify all possible spelling errors, highlight their locations, and suggest possible corrections. While previous tasks were based on essays written by native Chinese speakers, the current task is based on essays written by learners of Chinese as a Foreign Language (CFL), which should present a greater challenge.

The rest of this article is organized as follows. Section 2 provides an overview of the SIGHAN 2014 Bake-off Chinese Spelling Check task. Section 3 introduces the data sets used for evaluation. Section 4 proposes the evaluation metrics. Section 5 compares the results of the various contestants. Finally, Section 6 concludes with findings and future research directions.

Task Description
This task evaluates the performance of Chinese spelling checkers on Chinese text passages consisting of several sentences, with and without spelling errors. The checker should identify incorrect characters in the passage and suggest corrections. Each character or punctuation mark occupies one spot when counting locations. Each input instance is given a unique passage number PID. If the passage contains no spelling errors, the checker should return "PID, 0". If an input passage contains at least one spelling error, the output format is "PID [, location, correction]+", where the symbol "+" indicates one or more instances of the preceding element "[, location, correction]". "Location" and "correction" respectively denote the location of an incorrect character and its correct version. Table 1 presents some examples. In Ex. 1, the 15th character "無" is wrong and should be "舞". There are 3 wrong characters in Ex. 2, and the correct characters "生," "直," and "關" should be used in locations 3, 26, and 35, respectively. A location of "0" denotes that the passage contains no spelling errors.
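To make the output convention concrete, the following sketch formats one result line. The helper name and the sample PIDs are our own illustrative choices, not part of the task release.

```python
def format_result(pid, errors):
    """Format one checker result line.

    pid    -- passage identifier string (sample values below are invented)
    errors -- list of (location, correction) pairs; locations count every
              character and punctuation mark, starting at 1
    """
    if not errors:                  # no spelling errors detected: "PID, 0"
        return f"{pid}, 0"
    fields = [pid]
    for location, correction in errors:
        fields.append(str(location))
        fields.append(correction)
    return ", ".join(fields)

# Mirroring Ex. 1: the 15th character should be 舞
print(format_result("A2-0001-1", [(15, "舞")]))   # → A2-0001-1, 15, 舞
# An error-free passage
print(format_result("A2-0002-1", []))             # → A2-0002-1, 0
```

One result line per passage keeps the evaluation script simple: the first field identifies the passage, and the remaining fields alternate between locations and corrections.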

Data Preparation
The learner corpus used in our task was collected from the essay section of the computer-based Test of Chinese as a Foreign Language (TOCFL), administered in Taiwan. The writing test is designed according to the six proficiency levels of the Common European Framework of Reference (CEFR). A total of 1,714 essays were typed online (i.e., not hand-written), and spelling errors were then manually annotated by trained native Chinese speakers, who also provided a correction for each error. The essays were then split into three sets as follows:

• Training Set
This set includes 1,301 selected essays with a total of 5,284 spelling errors. Each essay is represented in the SGML format shown in Fig. 1. The title attribute describes the essay topic. Each passage is composed of several sentences, and each passage contains at least one spelling error; the data indicate both each error's location and its corresponding correction. All essays in this set may be used to train the developed spelling checker.

[Fig. 1. A training essay in SGML format; only the opening tag survives in this copy: <ESSAY title="寫給即將初次見面的筆友的…">]
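Training data in this SGML style can be read with a standard XML parser. A minimal sketch follows; the element names (PASSAGE, MISTAKE, WRONG, CORRECTION), the attribute layout, and the sample sentence are assumptions for illustration only, not the exact released schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical training sample; tag/attribute names and content are assumptions.
sample = """
<ESSAY title="example topic">
  <PASSAGE id="0001-1">我在宿舍的時候常常聽着音樂</PASSAGE>
  <MISTAKE id="0001-1" location="11">
    <WRONG>着</WRONG>
    <CORRECTION>著</CORRECTION>
  </MISTAKE>
</ESSAY>
"""

root = ET.fromstring(sample)
for mistake in root.iter("MISTAKE"):
    location = int(mistake.get("location"))      # 1-based character position
    wrong = mistake.find("WRONG").text           # the misspelled character
    correction = mistake.find("CORRECTION").text # the annotated correction
    print(mistake.get("id"), location, wrong, correction)
```

Pairing each MISTAKE with its PASSAGE by id yields (location, wrong, correction) triples in exactly the shape the output format requires.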
• Dryrun Set
A total of 20 passages were given to participants to familiarize themselves with the final testing process. Each participant could submit several runs generated using different models with different parameter settings. In addition, to make sure the submitted results could be correctly evaluated, participants could fine-tune their models during the dryrun phase. The purpose of the dryrun was output format validation only, and no dryrun outcomes were considered in the official evaluation.

• Test Set
Table 2 shows a statistical summary of the prepared test set. The set consists of 1,062 testing passages, each with an average of 50 characters. Half of these passages contain no spelling errors, while the other half include at least one spelling error each, for a total of 792 spelling errors used to evaluate the spelling checkers. The evaluation was conducted as an open test: in addition to the data sets provided, registered research teams were allowed to employ any linguistic and computational resources to detect and correct spelling errors. Note that passages written by CFL learners may also suffer from grammatical errors, missing or redundant words, poor word selection, or word-ordering problems; the task in question focuses exclusively on spelling error correction.

Test set statistics (see Table 2):
Number of essays: 413
Number of passages: 1,062
Performance Metrics
Table 3 shows the confusion matrix used for performance evaluation. In the matrix, TP (True Positive) is the number of passages with spelling errors that are correctly identified by the spelling checker; FP (False Positive) is the number of passages in which non-existent errors are identified; TN (True Negative) is the number of passages without spelling errors that are correctly identified as such; and FN (False Negative) is the number of passages with spelling errors for which no errors are detected.
Correctness is determined at two levels.
(1) Detection level: all locations of incorrect characters in a given passage must be completely identical to those in the gold standard. (2) Correction level: all locations and corresponding corrections of incorrect characters must be completely identical to those in the gold standard. The following metrics are measured at both levels using the confusion matrix: false positive rate = FP / (FP + TN); accuracy = (TP + TN) / (TP + FP + TN + FN); precision = TP / (TP + FP); recall = TP / (TP + FN); and F1 = 2 * precision * recall / (precision + recall).

                          Errors detected    No errors detected
With spelling errors          TP                   FN
Without spelling errors       FP                   TN
Table 3. Confusion matrix for evaluation.

Evaluation Results
Table 4 summarizes the submission statistics for 19 participating teams, including 10 from universities and research institutions in China (BIT, CAS, CAU, LYFYU, NJUPT, PKU, SCAU, SJTU, SUDA, and ZJOU), 8 from Taiwan (ITRI, KUAS, NCTU & NTUT, NCYU, NTHU, NTOU, SinicaCKIP, and SinicaSLMP), and one private firm (Lingage). Among the 19 registered teams, 13 submitted testing results. In the formal testing phase, each participant could submit at most three runs adopting different models or parameter settings. In total, we received 34 runs. Table 5 summarizes the participants' approaches and their usage of linguistic resources for this bake-off evaluation. Among the 13 teams that participated in the official testing, KUAS and PKU did not submit reports on their developed models. We can observe that most participants adopted statistical approaches such as n-gram models, language models, and machine-learning models. In addition to the Bake-off 2013 CSC data sets, several linguistic resources were widely used for this evaluation, such as the Sinica Corpus, Web as Corpus, Google Web 1T N-gram, and the Chinese Gigaword Corpus.

Table 6 shows the testing results. In addition to accurate error detection and correction, another key performance criterion is a low rate of false positives, i.e., the mistaken identification of errors where none exist. The teams KUAS, NCTU&NTUT, NCYU, and SUDA achieved very low false positive rates, i.e., less than 0.05.
Detection-level evaluations are designed to identify spelling errors and highlight their locations in the input passages. Accuracy is a key performance criterion, but accuracy can be affected by the distribution of testing instances. A neutral baseline can easily be achieved by always reporting every testing passage as error-free; given the test data distribution, such a baseline system achieves an accuracy of 0.5. Some systems (i.e., CAS, KUAS, and NCYU) achieved promising results, exceeding 0.6. Each participating team was allowed to submit up to three runs based on the same input, and several teams sent different runs aimed at optimizing either recall or precision. We thus use the F1 score to reflect the tradeoff between precision and recall. In the testing results, KUAS provided the best error detection, with a high F1 score of 0.633.
For correction-level evaluations, the systems need to locate errors in the passages and indicate the corresponding correct characters. The correction accuracy of the KUAS submission (0.7081) significantly outperformed the other teams. In terms of correction precision, the spelling checkers developed by KUAS and NCYU outperformed the others at 0.8. Most systems were unable to effectively correct spelling errors, with the better systems (CAS and KUAS) achieving a correction recall slightly above 0.3. The system developed by KUAS provided the highest F1 score, 0.6125, for spelling error correction. It is difficult to correct all spelling errors found in the input passages, since some passages contain multiple errors, and correcting only some of them is regarded as a wrong case. In summary, none of the submitted systems performed best on all metrics, though those submitted by KUAS, NCYU, and CAS provided the best overall performance.

Conclusions and Future Work
This paper provides an overview of the SIGHAN 2014 Bake-off Chinese Spelling Check task, including task design, data preparation, evaluation metrics, and performance evaluation results. The task also encourages the proposal of unorthodox and innovative approaches that could lead to a breakthrough. Regardless of actual performance, all submissions contribute to the common effort to produce an effective Chinese spelling checker, and the individual reports in the Bake-off proceedings provide useful insight into Chinese language processing. We hope the data sets collected for this Bake-off will facilitate and expedite the development of effective Chinese spelling checkers. All data sets with gold standards and the evaluation tool are publicly available for research purposes at http://ir.itc.ntnu.edu.tw/lre/clp14csc.htm.
Based on the results of this Bake-off, we plan to build new language resources to improve existing techniques and to develop new ones for computer-aided Chinese language learning. In addition, new data sets obtained from CFL learners will be investigated for the future enrichment of this research topic.

Table 2. Descriptive statistics of the test set.

Table 4. Submission statistics for all participants.

Table 6. Testing results of our Chinese spelling check task.