System Combination for Grammatical Error Correction

Different approaches to high-quality grammatical error correction have been proposed recently, many of which have their own strengths and weaknesses. Most of these approaches are based on classification or statistical machine translation (SMT). In this paper, we propose to combine the output from a classification-based system and an SMT-based system to improve the correction quality. We adopt the system combination technique of Heafield and Lavie (2010). We achieve an F0.5 score of 39.39% on the test set of the CoNLL-2014 shared task, outperforming the best system in the shared task.


Introduction
Grammatical error correction (GEC) refers to the task of detecting and correcting grammatical errors present in a text written by a second language learner. For example, a GEC system to correct English promises to benefit millions of learners around the world, since it functions as a learning aid by providing instantaneous feedback on ESL writing.
Research in this area has attracted much interest recently, with four shared tasks organized in the past several years: Helping Our Own (HOO) 2011 and 2012 (Dale and Kilgarriff, 2010; Dale et al., 2012), and the CoNLL 2013 and 2014 shared tasks (Ng et al., 2014). Each shared task comes with an annotated corpus of learner texts and a benchmark test set, facilitating further research in GEC.
Many approaches have been proposed to detect and correct grammatical errors. The most dominant approaches are based on classification (a set of classifier modules where each module addresses a specific error type) and statistical machine translation (SMT) (formulated as a translation task from "bad" to "good" English). Other approaches combine the classification and SMT approaches, and often have some rule-based components.
Each approach has its own strengths and weaknesses. Since the classification approach is able to focus on each individual error type using a separate classifier, it may perform better on an error type where it can build a custom-made classifier tailored to the error type, such as subject-verb agreement errors. The drawback of the classification approach is that one classifier must be built for each error type, so a comprehensive GEC system needs many classifiers, which complicates its design. Furthermore, the classification approach does not address multiple error types that may interact.
The SMT approach, on the other hand, naturally takes care of interaction among words in a sentence as it attempts to find the best overall corrected sentence. It usually has a better coverage of different error types. The drawback of this approach is its reliance on error-annotated learner data, which is expensive to produce. It is not possible to build a competitive SMT system without a sufficiently large parallel training corpus, consisting of texts written by ESL learners and the corresponding corrected texts.
In this work, we aim to take advantage of both the classification and the SMT approaches. By combining the outputs of both systems, we hope that the strengths of one approach will offset the weaknesses of the other approach. We adopt the system combination technique of Heafield and Lavie (2010), which starts by creating word-level alignments among multiple outputs. By performing beam search over these alignments, it tries to find the best corrected sentence that combines parts of multiple system outputs.
The main contributions of this paper are as follows:
• It is the first work that makes use of a system combination strategy to improve grammatical error correction;
• It gives a detailed description of methods and experimental setup for building component systems using two state-of-the-art approaches; and
• It provides a detailed analysis of how one approach can benefit from the other approach through system combination.
We evaluate our system combination approach on the CoNLL-2014 shared task. The approach achieves an F0.5 score of 39.39%, outperforming the best participating team in the shared task.
The remainder of this paper is organized as follows. Section 2 gives the related work. Section 3 describes the individual systems. Section 4 explains the system combination method. Section 5 presents experimental setup and results. Section 6 provides a discussion and analysis of the results. Section 7 describes further experiments on system combination. Finally, Section 8 concludes the paper.

Grammatical Error Correction
Early research in grammatical error correction focused on a single error type in isolation. For example, Knight and Chander (1994) built an article correction system for post-editing machine translation output.
The classification approach has been used to deal with the most common grammatical mistakes made by ESL learners, such as article and preposition errors (Han et al., 2006; Chodorow et al., 2007; Tetreault and Chodorow, 2008; Gamon, 2010; Dahlmeier and Ng, 2011; Rozovskaya and Roth, 2011), and more recently, verb errors (Rozovskaya et al., 2014b). Statistical classifiers are trained either from learner or non-learner texts. Features are extracted from the sentence context. Typically, these are shallow features, such as surrounding n-grams, part-of-speech (POS) tags, chunks, etc. Different sets of features are employed depending on the error type addressed.
The statistical machine translation (SMT) approach has gained more interest recently. Earlier work was done by Brockett et al. (2006), where they used SMT to correct mass noun errors. The major impediment in using the SMT approach for GEC is the lack of error-annotated learner ("parallel") corpora. Mizumoto et al. (2011) mined a learner corpus from the social learning platform Lang-8 and built an SMT system for correcting grammatical errors in Japanese. They further tried their method for English (Mizumoto et al., 2012).
Other approaches combine the advantages of classification and SMT (Dahlmeier and Ng, 2012a) and sometimes also include rule-based components. Note that in the hybrid approaches proposed previously, the output of each component system might be only partially corrected for some subset of error types. This is different from our system combination approach, where the output of each component system is a complete correction of the input sentence where all error types are dealt with.

System Combination
System combination is the task of combining the outputs of multiple systems to produce an output better than each of its individual component systems. In machine translation (MT), combining multiple MT outputs has been attempted in the Workshop on Statistical Machine Translation (Callison-Burch et al., 2009; Bojar et al., 2011).
One of the common approaches in system combination is the confusion network approach (Rosti et al., 2007b). In this approach, a confusion network is created by aligning the outputs of multiple systems. The combined output is generated by choosing the output of one single system as the "backbone", and aligning the outputs of all other systems to this backbone. The word order of the combined output will then follow the word order of the backbone. The alignment step is critical in system combination. If there is an alignment error, the resulting combined output sentence may be ungrammatical.
Rosti et al. (2007a) evaluated three system combination methods in their work:
• Sentence level This method looks at the combined N-best list of the systems and selects the best output.
• Phrase level This method creates new hypotheses using a new phrase translation table, built according to the phrase alignments of the systems.
• Word level This method creates a graph by aligning the hypotheses of the systems. The confidence score of each aligned word is then calculated according to the votes from the hypotheses.
Combining different component sub-systems was attempted by CUUI (Rozovskaya et al., 2014a) and CAMB (Felice et al., 2014) in the CoNLL-2014 shared task. The CUUI system employs different classifiers to correct various error types and then merges the results. The CAMB system uses a pipeline of systems to combine the outputs of their rule-based system and their SMT system. The combination methods used in those systems are different from our approach, because they combine individual sub-system components by piping the output from one sub-system to another, whereas we combine the outputs of whole systems. Moreover, our approach is able to combine the advantages of both the classification and SMT approaches. In the field of grammatical error correction, our work is novel as it is the first that uses system combination to improve grammatical error correction.

The Component Systems
We build four individual error correction systems. Two systems are pipeline systems based on the classification approach, whereas the other two are phrase-based SMT systems. In this section, we describe how we build each system.

Pipeline
We build two different pipeline systems. Each system consists of a sequence of classifier-based correction steps. We use two different sequences of correction steps as shown in Table 1. As shown by the table, the only difference between the two pipeline systems is that we swap the noun number and the article correction step. We do this because there is an interaction between noun number and article correction. Swapping them generates system outputs that are quite different.
Table 1: Correction steps of the two pipeline systems.

Step  Pipeline 1 (P1)   Pipeline 2 (P2)
1     Spelling          Spelling
2     Noun number       Article
3     Preposition       Preposition
4     Punctuation       Punctuation
5     Article           Noun number
6     Verb form, SVA    Verb form, SVA

We model each of the article, preposition, and noun number correction tasks as a multi-class classification problem. A separate multi-class confidence weighted classifier (Crammer et al., 2009) is used for correcting each of these error types. A correction is only made if the difference between the scores of the original class and the proposed class is larger than a threshold tuned on the development set. The features of the article and preposition classifiers follow the features used by the NUS system from HOO 2012 (Dahlmeier et al., 2012). For the noun number error type, we use lexical n-grams, n-gram counts, dependency relations, noun lemma, and countability features.
For article correction, the classes are the articles a, the, and the null article. The article an is considered to be the same class as a. A subsequent post-processing step chooses between a and an based on the following word. For preposition correction, we choose 36 common English prepositions as used by Dahlmeier et al. (2012). We deal only with preposition replacement, not preposition insertion or deletion. For noun number correction, the classes are singular and plural.
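The thresholded decision and the a/an post-processing step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the dictionary of class scores, and the first-letter heuristic for choosing between a and an are all assumptions made for the example (the paper only says the choice depends on the following word).

```python
def choose_class(scores, original, threshold):
    """Keep the original class unless the best class beats it by more than threshold.

    `scores` maps each class label to a classifier score (hypothetical format).
    """
    best = max(scores, key=scores.get)
    if best != original and scores[best] - scores[original] > threshold:
        return best
    return original


def surface_article(article_class, next_word):
    """Turn the class 'a' into 'a' or 'an'; other classes pass through unchanged.

    Assumption: a crude spelling-based heuristic (vowel letter -> 'an').
    A real system would need pronunciation-aware rules ("an hour", "a university").
    """
    if article_class != "a":
        return article_class
    return "an" if next_word[:1].lower() in {"a", "e", "i", "o", "u"} else "a"
```

Raising the threshold makes the classifier more conservative: fewer corrections are proposed, which typically trades recall for precision.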
Punctuation, subject-verb agreement (SVA), and verb form errors are corrected using rule-based classifiers. For SVA errors, we assume that noun number errors have already been corrected by classifiers earlier in the pipeline. Hence, only the verb is corrected when an SVA error is detected. For verb form errors, we change a verb into its base form if it is preceded by a modal verb, and we change it into the past participle form if it is preceded by has, have, or had.
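The verb form rule above can be sketched in a few lines. The modal list and the tiny verb lexicon are hypothetical stand-ins chosen for illustration; the paper does not specify which resources its rule-based corrector uses to detect verbs or generate their forms.

```python
# Hypothetical resources for illustration only.
MODALS = {"can", "could", "may", "might", "must", "shall", "should", "will", "would"}
PERFECT_AUX = {"has", "have", "had"}
# Maps an inflected verb to (base form, past participle).
VERB_FORMS = {
    "goes": ("go", "gone"),
    "went": ("go", "gone"),
    "eats": ("eat", "eaten"),
    "ate": ("eat", "eaten"),
}


def fix_verb_forms(tokens):
    """Apply the two rules: base form after a modal, past participle after has/have/had."""
    out = list(tokens)
    for i in range(1, len(out)):
        forms = VERB_FORMS.get(out[i].lower())
        if forms is None:
            continue  # not a verb we know how to inflect
        base, participle = forms
        previous = out[i - 1].lower()
        if previous in MODALS:
            out[i] = base
        elif previous in PERFECT_AUX:
            out[i] = participle
    return out
```

For instance, `fix_verb_forms("He can goes home".split())` rewrites goes to go, while a sentence with no modal or perfect auxiliary before the verb is left unchanged.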
The spelling corrector uses Jazzy, an open source Java spell-checker. We filter the suggestions given by Jazzy using a language model. We accept a suggestion from Jazzy only if the suggestion increases the language model score of the sentence.
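The language model filter on spelling suggestions can be sketched as below. The sketch is written in Python rather than against Jazzy's Java API, and it assumes a toy unigram scorer with a fixed pseudo-count for unknown words; the helper names and the scorer are illustrative assumptions, and any sentence-level language model score could be plugged in instead.

```python
import math


def make_unigram_scorer(counts, unknown_count=0.5):
    """Build a toy sentence scorer: sum of unigram log probabilities (assumption)."""
    total = sum(counts.values())

    def score(tokens):
        return sum(math.log(counts.get(t.lower(), unknown_count) / total) for t in tokens)

    return score


def filter_suggestion(tokens, index, suggestion, lm_score):
    """Accept a spelling suggestion only if it strictly increases the sentence score."""
    candidate = tokens[:index] + [suggestion] + tokens[index + 1:]
    return candidate if lm_score(candidate) > lm_score(tokens) else tokens
```

With counts in which social is frequent and sociall is unseen, replacing sociall by social raises the score and the suggestion is accepted; a suggestion that is itself unseen leaves the score unchanged and is rejected.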

Statistical Machine Translation
The other two component systems are based on phrase-based statistical machine translation (Koehn et al., 2003).
It follows the well-known log-linear model formulation (Och and Ney, 2002):

\[ \hat{e} = \arg\max_{e} P(e \mid f) = \arg\max_{e} \exp\left( \sum_{m=1}^{M} \lambda_m h_m(e, f) \right) \tag{1} \]

where $f$ is the input sentence, $e$ is the corrected output sentence, $h_m$ is a feature function, $\lambda_m$ is its weight, and $M$ is the number of feature functions. The feature functions include a translation model learned from a sentence-aligned parallel corpus and a language model learned from a large English corpus. More feature functions can be integrated into the log-linear model. A decoder finds the best correction $\hat{e}$ that maximizes Equation 1 above.
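The log-linear decision rule can be illustrated with a short sketch. It assumes the candidate corrections and their feature values are already given, whereas a real phrase-based decoder searches over a very large hypothesis space; the feature names `tm` and `lm` and the data layout are hypothetical. Because the exponential is monotonic, maximizing the weighted feature sum selects the same candidate as maximizing its exponential.

```python
def loglinear_score(features, weights):
    """Weighted sum of feature values, i.e. the exponent of the log-linear model."""
    return sum(weights[name] * value for name, value in features.items())


def best_correction(candidates, weights):
    """Return the candidate sentence with the highest log-linear score.

    `candidates` is a list of (sentence, feature_dict) pairs (hypothetical format).
    """
    sentence, _ = max(candidates, key=lambda c: loglinear_score(c[1], weights))
    return sentence
```

Changing the weights changes which candidate wins, which is why the weights are tuned on a development set rather than fixed by hand.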
The parallel corpora that we use to train the translation model come from two different sources. The first corpus is NUCLE (Dahlmeier et al., 2013), containing essays written by students at the National University of Singapore (NUS) which have been manually corrected by English instructors at NUS. The other corpus is collected from the language exchange social networking website Lang-8. We develop two versions of SMT systems: one with two phrase tables trained on NUCLE and Lang-8 separately (S1), and the other with a single phrase table trained on the concatenation of NUCLE and Lang-8 data (S2). Multiple phrase tables are used with alternative decoding paths. We add a word-level Levenshtein distance feature in the phrase table used by S2, similar to Felice et al. (2014) and Junczys-Dowmunt and Grundkiewicz (2014). This feature is not included in S1.

System Combination
We use MEMT (Heafield and Lavie, 2010) to combine the outputs of our systems. MEMT uses METEOR (Banerjee and Lavie, 2005) to perform alignment of each pair of outputs from the component systems. The METEOR matcher can identify exact matches, words with identical stems, synonyms, and unigram paraphrases.
MEMT uses an approach similar to the confusion network approach in SMT system combination. The difference is that it performs alignment on the outputs of every pair of component systems, so it does not need to choose a single backbone. As MEMT does not choose any single system output as its backbone, it can consider the output of each component system in a symmetrical manner. This increases word order flexibility, as choosing a single hypothesis as the backbone will limit the number of possible word order permutations.
After creating pairwise alignments using METEOR, the alignments form a confusion network. MEMT will then perform a beam search over this graph to find the one-best hypothesis. The search is carried out from left to right, one word at a time, creating a partial hypothesis. During beam search, it can freely switch among the component systems, combining the outputs together into a sentence. When it adds a word to its hypothesis, all the words aligned to it in the other systems are also marked as "used". If it switches to another input sentence, it has to use the first "unused" word in that sentence. This is done to make sure that every aligned word in the sentences is used. In some cases, a heuristic could be used to allow skipping over some words (Heafield et al., 2009).
During beam search, MEMT uses a few features to score the hypotheses (both partial hypotheses and full hypotheses):
• Length The number of tokens in a hypothesis. It is useful to normalize the impact of sentence length.
• Language model Log probability from a language model. It is especially useful in maintaining sentence fluency.
• Backoff The average n-gram length found in the language model.
• Match The number of n-gram matches between the outputs of the component systems and the hypothesis, counted for small order n-grams.
The weights of these features are tuned using Z-MERT (Zaidan, 2009) on a development set. This system combination approach has a few advantages in grammatical error correction. METEOR can match not only identical words, but also words with identical stems, synonyms, and unigram paraphrases. This means that it can deal with word form, noun number, and verb form corrections that share identical stems, as well as word choice corrections (with synonyms and unigram paraphrases). Also, MEMT uses a language model feature to maintain sentence fluency, favoring grammatical output sentences.
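As an illustration of the match feature listed above, the following sketch counts n-gram matches between a hypothesis and one component system output. It is a simplification under stated assumptions: only exact token matches are counted (MEMT also credits stem, synonym, and paraphrase matches through the METEOR alignment), and the clipping of repeated n-grams is a choice made for this example rather than a documented detail of MEMT.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def match_count(hypothesis, system_output, n):
    """Number of hypothesis n-grams also found in the system output (clipped counts)."""
    hyp_counts = Counter(ngrams(hypothesis, n))
    sys_counts = Counter(ngrams(system_output, n))
    return sum(min(count, sys_counts[gram]) for gram, count in hyp_counts.items())
```

Computing this count per component system and per n-gram order gives one feature value each, so the tuned weights can express how much the combination should trust each system.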
In this paper, we combine the pipeline system P1 (Table 1) with the SMT system S1, and also combine P2 with S2. The two component systems in each pair have comparable performance. For our final system, we also combine all four systems together.

Experiments
Our approach is evaluated in the context of the CoNLL-2014 shared task on grammatical error correction. Specific details of the shared task can be found in the overview paper (Ng et al., 2014), but we summarize the most important details relevant to our study here.

Data
We use NUCLE version 3.2 (Dahlmeier et al., 2013), the official training data of the CoNLL-2014 shared task, to train our component systems. The grammatical errors in this corpus are categorized into 28 different error types. We also use the "Lang-8 Corpus of Learner English v1.0" (Tajiri et al., 2012) to obtain additional learner data. English Wikipedia is used for language modeling and collecting n-gram counts. All systems are tuned on the CoNLL-2013 test data (which serves as the development data set) and tested on the CoNLL-2014 test data. The statistics of the data sets can be found in Table 2.

Evaluation
System performance is evaluated based on precision, recall, and F0.5 (which weights precision twice as much as recall). Given a set of $n$ sentences, where $g_i$ is the set of gold-standard edits for sentence $i$, and $e_i$ is the set of system edits for sentence $i$, precision, recall, and F0.5 are defined as follows:

\[ P = \frac{\sum_{i=1}^{n} |g_i \cap e_i|}{\sum_{i=1}^{n} |e_i|} \]

\[ R = \frac{\sum_{i=1}^{n} |g_i \cap e_i|}{\sum_{i=1}^{n} |g_i|} \]

\[ F_{0.5} = \frac{(1 + 0.5^2) \times R \times P}{R + 0.5^2 \times P} \]

where the intersection between $g_i$ and $e_i$ for sentence $i$ is defined as

\[ g_i \cap e_i = \{ e \in e_i \mid \exists g \in g_i, \ \mathrm{match}(g, e) \} \]

The official scorer for the shared task was the MaxMatch (M2) scorer (Dahlmeier and Ng, 2012b). The scorer computes the sequence of system edits between a source sentence and a system hypothesis that achieves the maximal overlap with the gold-standard edits. As in CoNLL-2014, F0.5 is used instead of F1 to emphasize precision. For statistical significance testing, we use the sign test with bootstrap re-sampling on 100 samples.
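The corpus-level metrics can be computed as in the sketch below, which sums matched, proposed, and gold edits over all sentences before dividing. It is not the M2 scorer: it assumes the system edits have already been extracted, represents each edit as a hashable tuple, and treats two edits as matching only when they are identical. Returning 1.0 for an empty denominator is likewise an assumption of this example.

```python
def precision_recall_f(gold_edits, system_edits, beta=0.5):
    """Corpus-level precision, recall, and F_beta over per-sentence edit sets.

    `gold_edits` and `system_edits` are parallel lists of sets, one set per sentence.
    """
    matched = sum(len(g & e) for g, e in zip(gold_edits, system_edits))
    proposed = sum(len(e) for e in system_edits)
    gold = sum(len(g) for g in gold_edits)
    precision = matched / proposed if proposed else 1.0
    recall = matched / gold if gold else 1.0
    denominator = beta ** 2 * precision + recall
    f_score = (1 + beta ** 2) * precision * recall / denominator if denominator else 0.0
    return precision, recall, f_score
```

With beta = 0.5, a system with precision 1.0 and recall 0.5 scores about 0.83, whereas swapping the two values gives about 0.56, which shows how the metric rewards precision over recall.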

Pipeline System
We use ClearNLP for POS tagging and dependency parsing, and OpenNLP for chunking. We use the WordNet (Fellbaum, 1998) morphology software to generate singular and plural word surface forms.
The article, preposition, and noun number correctors use the classifier approach to correct errors. Each classifier is trained using multi-class confidence weighted learning on the NUCLE and Lang-8 corpora. The classifier threshold is tuned using a simple grid search on the development data set for each class of a classifier.

SMT System
The system is trained using Moses, with Giza++ (Och and Ney, 2003) for word alignment. The translation table is trained using the "parallel" corpora of NUCLE and Lang-8. The table contains phrase pairs of maximum length seven. We include five standard parameters in the translation table: forward and reverse phrase translations, forward and reverse lexical translations, and phrase penalty. We further add a word-level Levenshtein distance feature for S2.
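The word-level Levenshtein distance between the two sides of a phrase pair can be computed with the standard dynamic program, sketched below. The function name and the whitespace tokenization are assumptions of this example, and the sketch does not show how the value is encoded as a phrase table score, which the text above does not specify.

```python
def word_levenshtein(source_phrase, target_phrase):
    """Minimum number of word insertions, deletions, and substitutions."""
    source = source_phrase.split()
    target = target_phrase.split()
    # previous[j] holds the distance between the processed source prefix and target[:j].
    previous = list(range(len(target) + 1))
    for i, source_word in enumerate(source, 1):
        current = [i]
        for j, target_word in enumerate(target, 1):
            current.append(min(
                previous[j] + 1,                                # delete source_word
                current[j - 1] + 1,                             # insert target_word
                previous[j - 1] + (source_word != target_word), # substitute or keep
            ))
        previous = current
    return previous[-1]
```

A phrase pair such as "a houses" and "a house" has distance 1, so a feature built on this value lets the decoder tell small edits apart from phrase pairs that rewrite most of the source.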
We do not use any reordering model in our system. The intuition is that most error types do not involve long-range reordering and local reordering can be easily captured in the phrase translation table. The distortion limit is set to 0 to prohibit reordering during hypothesis generation.
We build two 5-gram language models using the corrected side of NUCLE and English Wikipedia. The language models are estimated using the KenLM toolkit (Heafield et al., 2013) with modified Kneser-Ney smoothing. These two language models are used as separate feature functions in the log-linear model. Finally, they are binarized into a probing data structure (Heafield, 2011). Tuning is done on the development data set with MERT (Och, 2003). We use BLEU (Papineni et al., 2002) as the tuning metric, which turns out to work well in our experiment.

Combined System
We use an open source MEMT implementation by Heafield and Lavie (2010) to combine the outputs of our systems. Parameters are set to the values recommended by Heafield and Lavie (2010): a beam size of 500, word skipping using length heuristic with radius 5, and with the length normalization option turned off. We use five matching features for each system: the number of exact unigram and bigram matches between hypotheses and the number of matches in terms of stems, synonyms, or paraphrases for unigrams, bigrams, and trigrams. We use the Wikipedia 5-gram language model in this experiment.
We tune the combined system on the development data set. The test data is input into both the pipeline and SMT system respectively and the output from each system is then matched using METEOR (Banerjee and Lavie, 2005). Feature weights, based on BLEU, are then tuned using Z-MERT (Zaidan, 2009). We repeat this process five times and use the weights that achieve the best score on the development data set in our final combined system.

Results
Our experimental results using the CoNLL-2014 test data as the test set are shown in Table 3. Each system is evaluated against the same gold standard human annotations. As recommended in Ng et al. (2014), we evaluate without using alternative answers to ensure a fairer evaluation. First, we can see that both the pipeline and SMT systems individually achieve relatively good results that are comparable with the third highest ranking participant in the CoNLL-2014 shared task. It is worth noting that the pipeline systems only target the seven most common error types, yet still perform well in an all-error-type setting. In general, the pipeline systems have higher recall but lower precision than the SMT systems.
The pipeline system is also sensitive to the order in which corrections are applied; for example, applying noun number corrections before article corrections results in a better score. This shows that grammatical errors interact: for instance, the phrase "a houses" can be corrected to "a house" or "houses" depending on the order of correction.
We noticed that the performance of the SMT system could be improved by using multiple translation models. This is most likely due to domain differences between the NUCLE and Lang-8 corpora, e.g., text genres, writing style, topics, etc. Note also that the Lang-8 corpus is more than 10 times larger than the NUCLE corpus, so there is some benefit from training and weighting two translation tables separately.
The performance of the pipeline system P1 is comparable to that of the SMT system S1, and likewise the performance of P2 is comparable to that of S2. The differences between them are not statistically significant, making it appropriate to combine their respective outputs.
Every combined system achieves a better result than its component systems. In every combination, there is some improvement in precision over the pipeline systems, and some improvement in recall over the SMT systems. The combination of the better component systems (P1+S1) is also statistically significantly better than the combination of the other component systems (P2+S2). Combining all four component systems yields an even better result of 39.39% F0.5, which is even better than the CoNLL-2014 shared task winner. This is significant because the individual component systems barely reached the score of the third highest ranking participant before they were combined.

Discussion
In this section, we discuss the strengths and weaknesses of the pipeline and SMT systems, and show how system output combination improves performance. Specifically, we compare P1, S1, and P1+S1, although the discussion also applies to P2, S2, and P2+S2.
Type performance. We start by computing the recall for each of the 28 error types achieved by each system. This computation is straightforward as each gold standard edit is also annotated with error type. On the other hand, precision, as mentioned in the overview paper (Ng et al., 2014), is much harder to compute because systems typically do not categorize their corrections by error type. Although it may be possible to compute the precision for each error type in the pipeline system (since we know which correction was proposed by which classifier), this is more difficult to do in the SMT and combined system, where we would need to rely on heuristics which are more prone to errors. As a result, we decided to analyze a sample of 200 sentences by hand for a comparatively more robust comparison. The results can be seen in Table 4.
We observe that the pipeline system has a higher recall than the SMT system for the following error types: ArtOrDet, Mec, Nn, Prep, SVA, Vform, and Vt. Conversely, the SMT system generally has a higher precision than the pipeline system. The combined system usually has slightly lower precision than the SMT system, but higher than the pipeline system, and slightly higher recall than the SMT system but lower than the pipeline system. In some cases however, like for Vform correction, both precision and recall increase.
The combined system can also make use of corrections which are only corrected in one of the systems. For example, it corrects both Wform and Pform errors, which are only corrected by the SMT system, and SVA errors, which are only corrected by the pipeline system.
Error analysis. For illustration on how system combination helps, we provide example output from the pipeline system P1, SMT system S1, and the combined system P1+S1 in Table 5. We illustrate three common scenarios where system combination helps: the first is when P1 performs better than S1, and the combined system chooses the corrections made by P1, the second is the opposite where S1 performs better than P1 and the combined system chooses S1, and the last is when the combined system combines the corrections made by P1 and S1 to produce output better than both P1 and S1.

Additional System Combination Experiments
We further evaluate our system combination approach by making use of the corrected system outputs of 12 participating teams in the CoNLL-2014 shared task, which are publicly available on the shared task website. Specifically, we combined the system outputs of the top 2, 3, ..., 12 CoNLL-2014 shared task teams and computed the results. In our earlier experiments, the CoNLL-2013 test data was used as the development set. However, the participants' outputs for this 2013 data are not available. Therefore, we split the CoNLL-2014 test data into two parts: the first 500 sentences for the development set and the remaining 812 sentences for the test set. We then tried combining the n best performing systems, for n = 2, 3, ..., 12. Other than the data, the experimental setup is the same as that described in Section 5.5.
Table 6 shows the ranking of the participants on the 812 test sentences (without alternative answers). Note that since we use a subset of the original CoNLL-2014 test data for testing, the ranking is different from the official CoNLL-2014 ranking. Table 7 shows the results of system combination in terms of increasing numbers of top systems. We observe consistent improvements in F0.5 when we combine more system outputs, up to 5 best performing systems. When combining 6 or more systems, the performance starts to fluctuate and degrade. An important observation is that when we perform system combination, it is more effective, in terms of F0.5, to combine a handful of high-quality system outputs than many outputs of variable quality. Precision tends to increase as more systems are combined although recall tends to decrease. This indicates that combining multiple systems can produce a grammatical error correction system with high precision, which is useful in a practical application setting where high precision is desirable. Figure 1 shows how the performance varies as the number of combined systems increases.

Table 4: True positives (TP), false negatives (FN), false positives (FP), precision (P), recall (R), and F0.5 (in %) for each error type without alternative answers, indicating how well each system performs against a particular error type.

Table 5: Example output from P1, S1, and P1+S1.

System  Example sentence
Source  Nowadays , the use of the sociall media platforms is a commonplace in our lives .
P1      Nowadays , the use of social media platforms is a commonplace in our lives .
S1      Nowadays , the use of the sociall media platforms is a commonplace in our lives .
P1+S1   Nowadays , the use of social media platforms is a commonplace in our lives .
Gold    Nowadays , the use of social media platforms is commonplace in our lives .

Source  Human has their own rights and privacy .
P1      Human has their own rights and privacy .
S1      Humans have their own rights and privacy .
P1+S1   Humans have their own rights and privacy .
Gold    Humans have their own rights and privacy .

Source  People that living in the modern world really can not live without the social media sites .
P1      People that living in the modern world really can not live without social media sites .
S1      People living in the modern world really can not live without the social media sites .
P1+S1   People living in the modern world really can not live without social media sites .
Gold    People living in the modern world really can not live without social media sites .

Conclusion
We have presented a system combination approach for grammatical error correction using MEMT. Our approach combines the outputs from two of the most common paradigms in GEC: the pipeline and statistical machine translation approaches. We created two variants of the pipeline and statistical machine translation approaches and showed that system combination can be used to combine their outputs together to yield a superior system. Our best combined system achieves an F0.5 score of 39.39% on the official CoNLL-2014 test set without alternative answers, higher than the top participating team in CoNLL-2014 on this data set. We achieved this by using component systems which were individually weaker than the top three systems that participated in the shared task.