CLEME: Debiasing Multi-reference Evaluation for Grammatical Error Correction

Evaluating the performance of Grammatical Error Correction (GEC) systems is a challenging task due to its subjectivity. Designing an evaluation metric that is as objective as possible is crucial to the development of the GEC task. However, mainstream evaluation metrics, i.e., reference-based metrics, introduce bias into multi-reference evaluation by extracting edits without considering the presence of multiple references. To overcome this issue, we propose Chunk-LEvel Multi-reference Evaluation (CLEME), designed to evaluate GEC systems in the multi-reference setting. CLEME builds chunk sequences with consistent boundaries for the source, the hypothesis and the references, thus eliminating the bias caused by inconsistent edit boundaries. Furthermore, we observe that the consistent boundaries can also act as the boundaries of grammatical errors, based on which the F$_{0.5}$ score is then computed following the correction independence assumption. We conduct experiments on six English reference sets based on the CoNLL-2014 shared task. Extensive experiments and detailed analyses demonstrate the correctness of our observation and the effectiveness of CLEME. Further analysis reveals that CLEME is robust when evaluating GEC systems across reference sets with varying numbers of references and annotation styles.


Introduction
Grammatical Error Correction (GEC) is a local substitution task that aims to correct all grammatical errors in a given ungrammatical text (Bryant et al., 2022; Ma et al., 2022; Ye et al., 2022). The GEC task has attracted more and more attention due to its practical value in daily life (Kaneko et al., 2022; Li et al., 2022a,b; Liang et al., 2023). However, it is intractable to evaluate GEC systems since GEC is a highly subjective task with very low inter-annotator agreement (IAA) (Choshen and Abend, 2018). Most datasets mitigate this subjectivity by incorporating multiple references to guarantee a more realistic evaluation of model performance.
Automatically evaluating the quality of corrected texts is essential for the development of GEC systems. Common GEC metrics can be classified into two broad categories: reference-based and reference-less. Reference-based metrics evaluate GEC systems by comparing hypotheses against human-annotated references in terms of edits (Dahlmeier and Ng, 2012; Bryant et al., 2017) or n-grams (Napoles et al., 2015). Reference-less metrics evaluate GEC systems without references. However, Deutsch et al. (2022) demonstrate that reference-less metrics are inherently biased and limited in their ability to evaluate generated text. Reference-based metrics evaluate GEC systems in an interpretable way, providing useful insights for model construction. We therefore focus on reference-based metrics.
As shown in Figure 1, existing reference-based metrics such as ERRANT first extract edits and then compute F$_{0.5}$ scores by comparing the edits of hypotheses and references. However, these edits are extracted separately without considering the presence of multiple references. We argue that such edits bias GEC multi-reference evaluation since equally good corrections may be rewarded unfairly. Specifically, the ungrammatical segment the technologies were is equally well corrected by Ref. 1 and Ref. 2. However, if a hypothesis takes the same corrections as Ref. 1 (i.e., [the → ] and [were → have], TP=2), it will be rewarded less than the corrections of Ref. 2 (i.e., [the → ], [technologies → technology] and [were → has], TP=3).
In this paper, we propose Chunk-LEvel Multi-reference Evaluation (CLEME), which provides unbiased F$_{0.5}$ scores for GEC multi-reference evaluation. Inspired by Gotou et al. (2020), CLEME transforms the source, the hypothesis and all the references into chunk sequences with consistent boundaries via a chunk partition process, thus debiasing GEC multi-reference evaluation.
Existing metrics assume that the corrections of grammatical errors are dependent: whenever there is more than one reference for a source, the metrics try each reference in turn and take the highest score as the final score. In contrast, we observe that corrections of grammatical errors in terms of chunks can be considered approximately independent. For example, the ungrammatical segments the technologies were and for shown in Figure 1 can be corrected independently, i.e., the correction of the segment the technologies were has no effect on the correction of for. Based on this observation, F$_{0.5}$ scores can be computed in another way. Specifically, we iterate over the chunks of a hypothesis, and a chunk is considered correct as long as it matches one of the corresponding chunks of the references. In this case, the hypothesis in Figure 1 will be rewarded 2 TP rather than 1 TP and 1 FP as in the traditional case. Experimental results show that CLEME achieves the highest or comparable correlations with human judgments at both the corpus and sentence levels.
In summary, our contributions are threefold: (1) We propose CLEME, a reference-based metric that evaluates GEC systems at the chunk level, providing unbiased F$_{0.5}$ scores for multi-reference evaluation.
(2) We observe that the corrections of grammatical errors in terms of chunks are approximately independent. Based on this observation, we compute F$_{0.5}$ scores in another way.
(3) Extensive experiments show that our proposed CLEME achieves a new state of the art on several reference sets of the CoNLL-2014 evaluation task. Human evaluation experiments are also conducted to further confirm the effectiveness of our approach.
Preliminary Study

Consistent Boundaries
We determine consistent chunk-level boundaries via a chunk partition process to debias multi-reference evaluation, as shown in Figure 2. We first extract the edit sets of the hypothesis and the references, and then merge edits with overlapping intervals into a chunk. Note that the source, the hypothesis and all the references are segmented into chunk sequences with the same number of chunks regardless of the number of their tokens. The chunk partition process is intuitive since we can locate and examine all possible corrections of an erroneous chunk. For example, the chunk by the can be corrected in two ways, i.e., with in Ref. 1 and through in Ref. 2. The resulting chunks can be classified into three categories: 1) unchanged chunks contain the same text segments as the source sentence, 2) corrected chunks contain different non-empty text segments compared with the source sentence, and 3) dummy chunks are empty chunks.
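To make the partition process concrete, here is a minimal Python sketch. The edit representation (a source-token span plus replacement tokens), the function names, and the merging rule (spans that overlap or touch are merged) are illustrative assumptions rather than the released CLEME implementation.

```python
from typing import List, Tuple

# Simplified edit: a source-token span [start, end) plus its replacement tokens.
Edit = Tuple[int, int, List[str]]

def merge_boundaries(edit_sets: List[List[Edit]]) -> List[Tuple[int, int]]:
    """Merge the source-side spans of all edits (hypothesis and every reference)
    so that overlapping or touching edits fall into one interval. The resulting
    intervals act as the shared corrected/dummy chunk boundaries."""
    spans = sorted((s, e) for edits in edit_sets for (s, e, _) in edits)
    merged: List[Tuple[int, int]] = []
    for s, e in spans:
        if merged and s <= merged[-1][1]:            # overlap or adjacency
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged

def classify_chunks(edits: List[Edit], boundaries: List[Tuple[int, int]]) -> List[str]:
    """Roughly label each boundary interval for one corrected version:
    'unchanged' if no edit falls inside, 'corrected' if some replacement text
    remains, 'dummy' if everything inside is deleted."""
    labels = []
    for s, e in boundaries:
        inside = [rep for (es, ee, rep) in edits if s <= es and ee <= e]
        if not inside:
            labels.append("unchanged")
        elif any(rep for rep in inside):
            labels.append("corrected")
        else:
            labels.append("dummy")
    return labels

# Using the Figure 1 example ("Nowadays the technologies were improved a lot
# compared for the last century ."), with token indices starting at 0:
ref1 = [(1, 2, []), (3, 4, ["have"]), (8, 9, ["to"])]                    # Ref. 1 edits
ref2 = [(1, 2, []), (2, 3, ["technology"]), (3, 4, ["has"]), (8, 9, ["with"])]  # Ref. 2 edits
print(merge_boundaries([ref1, ref2]))
# -> [(1, 4), (8, 9)]: the chunks "the technologies were" and "for"
```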

Boundaries of Grammatical Errors
In Figure 2, overlapping edits are merged into corrected/dummy chunks, which are separated by unchanged chunks. Therefore, are chunk boundaries the boundaries of grammatical errors?
Dataset To answer this question, we conduct experiments on the BN-10GEC dataset (Bryant and Ng, 2015). The dataset consists of 1,312 source sentences, which are the same as the CoNLL-2014 shared task test data (Ng et al., 2014). Each source sentence corresponds to 10 references given by ten native British English speakers, including two official annotators of CoNLL-2014, the first author of that paper, and seven freelancers recruited via an online recruitment website.
Figure 2: Overview of our approach CLEME. CLEME 1) first extracts edits of the hypothesis and the references, 2) merges the overlapping edits into chunks, and then 3) computes the F$_{0.5}$ scores based on two different assumptions (correction dependence vs. correction independence).

Experiment Setup For each source sentence, we sample 9 references and run the chunk partition introduced in Section 2.1. The resulting chunk sequences are determined jointly by all 9 references. The edits of the remaining reference $\{e_1, \ldots, e_M\}$ are used to calculate the following three statistics:
1) The ratio of In-Corrected-Chunk (ICC) gives the proportion of edits included by corrected/dummy chunks of the other references. An edit is included by a chunk if the interval of the edit is included by that of the chunk.
2) The ratio of In-Unchanged-Chunk (IUC) gives the proportion of edits included by unchanged chunks of the other references.
3) The Cross-Chunk (CC) ratio computes the proportion of edits extending beyond the chunk boundaries. These statistics are calculated as follows:

$$\mathrm{ICC} = \frac{1}{M}\sum_{i=1}^{M} f_1(e_i), \quad \mathrm{IUC} = \frac{1}{M}\sum_{i=1}^{M} f_2(e_i), \quad \mathrm{CC} = 1 - \mathrm{ICC} - \mathrm{IUC},$$

where $M$ is the number of edits of the remaining reference. If the edit $e_i$ is included by a corrected/dummy chunk, the function $f_1(e_i)$ returns 1, otherwise 0. Similarly, if the edit $e_i$ is included by an unchanged chunk, the function $f_2(e_i)$ returns 1, otherwise 0. In each run, we sample a different set of 9 references for chunk partition and repeatedly calculate the statistics using the remaining reference.
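The three ratios follow directly from the definitions above. Below is a minimal sketch, where the data structures and function name are illustrative assumptions and each edit or chunk is represented by its source-token span:

```python
from typing import List, Tuple

Span = Tuple[int, int]

def boundary_stats(edits: List[Span],
                   changed_chunks: List[Span],
                   unchanged_chunks: List[Span]) -> Tuple[float, float, float]:
    """Compute ICC, IUC and CC for the held-out reference.

    edits: source spans of the held-out reference's edits.
    changed_chunks / unchanged_chunks: corrected-or-dummy and unchanged chunk
    spans induced by the other nine references."""
    def included(span: Span, chunks: List[Span]) -> bool:
        s, e = span
        return any(cs <= s and e <= ce for cs, ce in chunks)

    m = len(edits)
    if m == 0:
        return 0.0, 0.0, 0.0
    icc = sum(included(ed, changed_chunks) for ed in edits) / m
    iuc = sum(included(ed, unchanged_chunks) for ed in edits) / m
    cc = 1.0 - icc - iuc   # edits whose span crosses a chunk boundary
    return icc, iuc, cc
```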
Results Table 1 reports the results on BN-10GEC.
The number of corrected and dummy chunks is smaller than the number of edits since overlapping edits are merged into a single chunk. 90.66% of edits are included by corrected/dummy chunks, which means the grammatical errors they correct have also been considered by the other references. The proportion of edits included by unchanged chunks is 7.74%; these edits may be over-corrections, since the other references consider that there is no grammatical error to correct there. Only 1.61% of edits cross the chunk boundaries. Therefore, the chunk boundaries are stable enough to serve as the boundaries of grammatical errors to a certain degree. Based on this, we make the following assumption.
Correction independence assumption: different grammatical error corrections are independent. That is, the correction of a grammatical error does not affect the corrections of others in an ungrammatical sentence. Based on this assumption, F$_{0.5}$ scores can be computed in another way, which will be introduced in Section 3.

Chunk Evaluation
As shown in Figure 2, each chunk is composed of edit operation(s), a start index, an end index, and correct tokens. Conventional reference-based metrics like M$^2$ and ERRANT compute F$_{0.5}$ scores based on the correction dependence assumption: they compute an F$_{0.5}$ score for each reference in turn and choose the one that leads to the best performance for the source sentence. CLEME-dependent also computes F$_{0.5}$ scores in this way by treating corrected/dummy chunks as edits.
CLEME-independent instead computes F$_{0.5}$ scores based on the correction independence assumption: a corrected/dummy chunk of the hypothesis is correct if it matches one of the corresponding chunks of the references.
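A minimal sketch of the two scoring schemes follows. The chunk representation (position index plus corrected text, with the chunk types of Section 2.1) and the handling of unchanged hypothesis chunks are illustrative assumptions, not the released implementation.

```python
def f_beta(tp: float, fp: float, fn: float, beta: float = 0.5) -> float:
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r) if p + r else 0.0

def score_dependent(hyp_changed: set, ref_changed_sets: list) -> float:
    """Correction dependence: treat each reference's corrected/dummy chunks as
    gold edits, score the hypothesis against each reference in turn, and keep
    the best F0.5. Chunks are sets of (position, corrected_text) pairs."""
    best = 0.0
    for ref_changed in ref_changed_sets:
        tp = len(hyp_changed & ref_changed)
        fp = len(hyp_changed - ref_changed)
        fn = len(ref_changed - hyp_changed)
        best = max(best, f_beta(tp, fp, fn))
    return best

def score_independent(hyp_chunks: list, ref_chunk_sets: list) -> float:
    """Correction independence: a hypothesis corrected/dummy chunk is a TP if
    it matches ANY reference's chunk at the same position. hyp_chunks[i] and
    ref_chunk_sets[r][i] are (type, text) pairs over the shared chunk positions."""
    tp = fp = fn = 0
    for i, (h_type, h_text) in enumerate(hyp_chunks):
        ref_here = [refs[i] for refs in ref_chunk_sets]
        if h_type in ("corrected", "dummy"):
            if any(h_text == r_text for r_type, r_text in ref_here
                   if r_type in ("corrected", "dummy")):
                tp += 1
            else:
                fp += 1
        else:
            # Hypothesis keeps the source chunk; count a miss only if every
            # reference corrected it (otherwise leaving it is also acceptable).
            if all(r_type in ("corrected", "dummy") for r_type, _ in ref_here):
                fn += 1
    return f_beta(tp, fp, fn)
```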

Length Weighting
As shown in Table 1, the average length of chunks is much longer than that of edits, which makes chunk evaluation unfair if a long chunk is rewarded equally with a short one. Therefore, we add length weighting to the chunk evaluation. The intuition of length weighting is to compensate for matching long chunks. The weights of True Positives (TPs), False Positives (FPs), and False Negatives (FNs) are functions of the chunk length $x$ clipped into $[c_{min}, c_{max}]$ by the function $\mathrm{clip}(v, a, b)$, where $\alpha_1$, $\alpha_2$ and $\alpha_3$ are scale factors for TPs, FPs and FNs respectively, and $\bar{\ell}$ is the average length of chunks. The curves of length weighting are shown in Figure 3. Formally, given a system corrected/dummy chunk set $C_H$ and a gold corrected/dummy chunk set $C_R$, we apply length weighting to each chunk to compute precision, recall and F$_{0.5}$ as follows:

$$P = \frac{\sum_{c \in C_H \cap C_R} w_{tp}(c)}{\sum_{c \in C_H \cap C_R} w_{tp}(c) + \sum_{c \in C_H \setminus C_R} w_{fp}(c)}, \quad R = \frac{\sum_{c \in C_H \cap C_R} w_{tp}(c)}{\sum_{c \in C_H \cap C_R} w_{tp}(c) + \sum_{c \in C_R \setminus C_H} w_{fn}(c)}, \quad F_{0.5} = \frac{(1+\beta^2) \cdot P \cdot R}{\beta^2 \cdot P + R},$$

where $\beta = 0.5$, which weighs precision twice as much as recall.
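As a stand-in for the exact weighting curve, the sketch below uses a clipped linear function with the properties described here and in Figure 3 (it equals 1.0 at the average chunk length and its slope grows with the scale factor), and plugs the weights into weighted precision, recall, and F$_{0.5}$. The particular functional form and the default values are assumptions, not the formula used by CLEME.

```python
def length_weight(x: int, avg_len: float, alpha: float,
                  c_min: float = 0.5, c_max: float = 2.0) -> float:
    """Illustrative length weight: 1.0 at x == avg_len, increasing with chunk
    length at a rate controlled by alpha, clipped into [c_min, c_max]."""
    return max(c_min, min(c_max, 1.0 + alpha * (x - avg_len) / avg_len))

def weighted_prf(tp_lens, fp_lens, fn_lens, avg_len,
                 a_tp=1.0, a_fp=1.0, a_fn=1.0, beta=0.5):
    """Weighted precision/recall/F0.5 over the lengths of matched (TP),
    spurious (FP) and missed (FN) corrected/dummy chunks."""
    tp = sum(length_weight(x, avg_len, a_tp) for x in tp_lens)
    fp = sum(length_weight(x, avg_len, a_fp) for x in fp_lens)
    fn = sum(length_weight(x, avg_len, a_fn) for x in fn_lens)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = (1 + beta**2) * p * r / (beta**2 * p + r) if p + r else 0.0
    return p, r, f

# Example: two matched chunks of length 3 and 1, one spurious chunk of length 2.
print(weighted_prf([3, 1], [2], [], avg_len=2.0))
```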

Corpus-level vs. Sentence-level
We compute F$_{0.5}$ scores at the corpus and sentence levels following Gong et al. (2022). Corpus-level metrics compute an F$_{0.5}$ score over the entire dataset. Sentence-level metrics compute an F$_{0.5}$ score over each sentence of the dataset and evaluate GEC systems using the average F$_{0.5}$ score. CLEME-dependent and CLEME-independent are corpus-level metrics, and their sentence-level variants are SentCLEME-dependent and SentCLEME-independent, respectively.
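A short sketch of the difference between the two aggregation levels, assuming each sentence already has TP/FP/FN counts from some metric (the counts below are made up for illustration):

```python
def f05(tp, fp, fn, beta=0.5):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r) if p + r else 0.0

def corpus_level(counts):
    """Pool TP/FP/FN over the whole dataset, then compute a single F0.5."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    return f05(tp, fp, fn)

def sentence_level(counts):
    """Compute an F0.5 per sentence and report the average."""
    return sum(f05(*c) for c in counts) / len(counts) if counts else 0.0

counts = [(2, 0, 1), (0, 1, 2), (1, 0, 0)]   # illustrative per-sentence counts
print(corpus_level(counts), sentence_level(counts))
```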
Additionally, CLEME can evaluate GEC systems with accuracy scores, which conventional reference-based metrics usually do not implement. Please refer to Appendix A.4 for the introduction and analysis of accuracy scores.

Correlations with Human Judgments
Dataset To demonstrate the effectiveness and robustness of our approach, we measure correlations between reference-based metrics and human judgments on multiple reference sets, including CoNLL-2014 (Grundkiewicz et al., 2015), BN-10GEC (Bryant and Ng, 2015) and SN-8GEC (Sakaguchi et al., 2016). All the reference sets are based on the CoNLL-2014 shared task (Ng et al., 2014), consisting of 1,312 source sentences. SN-8GEC collected 8 reference sets of annotations from both experts and non-experts, including 4 sets of minimal edits (2 by experts and 2 by non-experts) designed to make the original sentences technically grammatical, and 4 sets of fluency edits (2 by experts and 2 by non-experts) designed to elicit native-sounding, fluent text. Statistics of all the reference sets are reported in Appendix A.2.
Human judgments for the outputs of 13 GEC systems (including the unchanged source text) are provided by Grundkiewicz et al. (2015), where eight native English-speaking judges were asked to rank the outputs of the 13 systems from best to worst. Two system ranking lists are generated by Expected Wins (EW) (Macháček and Bojar, 2013) and TrueSkill (TS) (Sakaguchi et al., 2014), respectively.
Experiment Settings Following Gong et al. (2022) and Chollampatt and Ng (2018), we compute the Pearson correlation coefficient γ and the Spearman correlation coefficient ρ between reference-based metrics and human judgments based on corpus-level rankings. We tune the hyperparameters on CoNLL-2014 and keep them fixed on the other reference sets to show the ability of CLEME to adapt to different reference sets. The hyperparameters of our approach are reported in Appendix A.3.
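For reference, a minimal sketch of the corpus-level correlation computation using SciPy; the scores below are made-up placeholders, not actual system results:

```python
from scipy.stats import pearsonr, spearmanr

# Made-up metric scores and human (e.g., TrueSkill) scores for 13 systems.
metric_scores = [41.2, 38.7, 35.1, 33.9, 30.5, 28.8, 27.4,
                 25.0, 23.3, 20.1, 18.9, 15.2, 10.0]
human_scores = [0.62, 0.55, 0.40, 0.47, 0.28, 0.30, 0.22,
                0.18, 0.10, 0.05, -0.02, -0.15, -0.30]

gamma, _ = pearsonr(metric_scores, human_scores)    # linear correlation
rho, _ = spearmanr(metric_scores, human_scores)     # rank correlation
print(f"Pearson = {gamma:.3f}, Spearman = {rho:.3f}")
```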

Evaluation Metrics
We compare our approach with the following reference-based metrics:
• GLEU and its sentence-level variant SentGLEU are n-gram based metrics, which reward hypothesis n-grams that overlap with the reference but not the source, and penalize hypothesis n-grams that overlap with the source but not the reference (Napoles et al., 2015).
• M$^2$ and SentM$^2$ dynamically extract the hypothesis edits with the maximum overlap with the gold annotations (Dahlmeier and Ng, 2012).
Results Table 2 reports the correlations between reference-based metrics and human judgments. Among the corpus-level metrics, GLEU achieves the highest correlations on the BN-10GEC and NE-Fluency reference sets. However, GLEU suffers from negative correlations on NE-Minimal, which is caused by low-quality annotations of NE-Minimal, indicating that GLEU may not be a robust metric (Sakaguchi et al., 2016). ERRANT performs slightly better than M$^2$. Our proposed CLEME-dependent and CLEME-independent make better and fuller use of consistent chunk boundaries, thus performing slightly better than ERRANT on most reference sets. It should be noted that CLEME-independent achieves performance comparable to CLEME-dependent, showing the effectiveness of computing F$_{0.5}$ scores based on the correction independence assumption.
All the sentence-level metrics perform better than their corpus-level versions, which is consistent with previous findings (Gong et al., 2022; Napoles et al., 2016). Our approach aligns better with human judgments than existing reference-based metrics on most reference sets. SentCLEME-dependent performs best on BN-10GEC and NE-Fluency, and performs on a par with the best metric on E-Fluency, which suggests that SentCLEME-dependent is more suitable for fluency-oriented reference sets. This is intuitive since fluency editing is more likely to follow the correction dependence assumption. On the other hand, SentCLEME-independent achieves higher correlations on E-Minimal and NE-Minimal, which also conforms to our intuition that minimal editing is more likely to follow the correction independence assumption. These results suggest that each reference set tends to favor one of the two correction assumptions.
Our approach achieves higher correlations on (N)E-Fluency than on (N)E-Minimal, while SentM$^2$ and SentERRANT perform worse on E-Fluency than on E-Minimal. This is because our approach evaluates GEC systems using longer chunks rather than fragmented edits, which better reflects whether a grammatical error is fluently corrected.

Human Evaluation
Experiments show that evaluating GEC systems based on the correction independence assumption works effectively in most situations. In this section, we examine whether the correction independence assumption makes sense to humans. We define a pair of chunks as correction-independent if the correction of one chunk is irrelevant to the correction of the other. A simple case is shown in Appendix A.5. We conduct human evaluation experiments on 1,000 sentences from BN-10GEC (Bryant and Ng, 2015), with each source corresponding to 10 references. Three annotators were required to judge whether a pair of chunks is correction-independent.
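Agreement between the three annotators' binary judgments can be measured with pairwise Cohen's κ, as reported below in Table 3. Here is a minimal sketch with made-up labels, where 1 means the chunk pair was judged correction-independent:

```python
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

# Made-up binary judgments from three annotators over the same chunk pairs.
annotations = {
    "A1": [1, 1, 0, 1, 1, 1, 0, 1],
    "A2": [1, 1, 0, 1, 0, 1, 0, 1],
    "A3": [1, 1, 1, 1, 1, 1, 0, 1],
}

for (n1, y1), (n2, y2) in combinations(annotations.items(), 2):
    print(f"kappa({n1}, {n2}) = {cohen_kappa_score(y1, y2):.3f}")
```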
Table 3 reports the ratio of correction independence and the inter-annotator agreement (IAA) across the three annotators. We calculate Cohen's κ (Cohen, 1960) as our IAA statistic. More than 90% of chunk pairs are correction-independent for all the annotators, indicating that it is reasonable to evaluate GEC systems based on the correction independence assumption. Considering the subjectivity of the GEC task, the IAA statistics show that it is relatively easy to judge whether a pair of chunks is correction-independent (Bryant and Ng, 2015).
-dependent vs. -independent The precision and recall of (Sent)CLEME-independent are slightly larger than those of (Sent)CLEME-dependent, since the former could overestimate the performance of GEC systems while the latter could underestimate it. They respectively provide an upper bound and a lower bound for the performance of GEC systems.
System-level vs. Sentence-level The precision, recall, and F$_{0.5}$ scores of sentence-level metrics are significantly larger than those of system-level metrics.
A possible reason is that the precision and recall are weighed down by a small number of difficult samples with many corrected/dummy chunks.

Ablation Study
We report detailed ablation analyses of our approaches on BN-10GEC; we have similar findings on the other reference sets.

Number of References
Since CLEME is designed for multi-reference evaluation, it degenerates to conventional reference-based metrics such as M$^2$ and ERRANT when only one reference is available. Here we demonstrate how the correlations change as the number of references increases. The results reported in Figure 4a show that the correlations of CLEME-dependent and CLEME-independent do not change significantly. However, the Pearson correlations of SentCLEME-dependent and SentCLEME-independent are consistently higher than those of the corpus-level metrics and steadily increase with the number of references. Therefore, we recommend evaluating GEC systems using sentence-level metrics rather than corpus-level metrics.
Parameter Sensitivity Analysis The scale factors introduced in Section 3.2 determine how strongly the weights change with the length of chunks. We report correlations between our proposed metrics and human judgments for different scale factors in Figure 4b. The results show that our proposed metrics are robust to the selection of hyperparameters.
Related Work

Reference-based Metrics
Reference-based metrics score GEC systems under the guidance of manually written references. The M$^2$ scorer (Dahlmeier and Ng, 2012) calculates an optimal edit sequence between a source sentence and a system hypothesis that achieves the highest overlap with the gold-standard annotation.
The optimal edit sequence is subsequently scored with the F$_{0.5}$ measure to represent the performance of each system. However, optimality in terms of overlap does not imply optimality for GEC evaluation. Therefore, Bryant et al. (2017) proposed ERRANT, which improves edit extraction using a linguistically-enhanced alignment algorithm and merging rules, making it more likely that tokens with similar linguistic properties are aligned. However, ERRANT is language-dependent, and bias still exists in multi-reference evaluation settings. Inspired by BLEU (Papineni et al., 2002) in NMT (Dong et al., 2023; Cheng et al., 2023a), Napoles et al. (2015) proposed GLEU, an n-gram based metric for GEC evaluation. A comparison of GEC metrics is shown in Table 6.
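To illustrate what edit extraction looks like in general, here is a toy sketch based on a generic token alignment (Python's difflib). It is not the algorithm of either metric: the M$^2$ scorer searches for the edit sequence with maximum overlap with the gold annotation, and ERRANT uses a linguistically-enhanced alignment with merging rules.

```python
from difflib import SequenceMatcher

def extract_edits(src_tokens, tgt_tokens):
    """Toy edit extraction: align source and target tokens and emit
    (src_start, src_end, replacement_tokens) for every non-matching region."""
    matcher = SequenceMatcher(a=src_tokens, b=tgt_tokens, autojunk=False)
    return [(i1, i2, tgt_tokens[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes() if tag != "equal"]

src = "Nowadays the technologies were improved a lot compared for the last century .".split()
ref = "Nowadays technology has improved a lot compared with the last century .".split()
print(extract_edits(src, ref))
# -> [(1, 4, ['technology', 'has']), (8, 9, ['with'])]
```

Note how the generic alignment lumps [the → ], [technologies → technology] and [were → has] into one edit, while a finer extractor would produce three separate edits; this kind of boundary inconsistency across extractors and references is exactly what CLEME's chunk partition normalizes away.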
Although reference-less metrics (Napoles et al., 2016) can achieve very high agreement with human judgments, they lack interpretability as metrics for GEC evaluation. Essentially, reference-less metrics are equivalent to evaluating GEC systems with other trained GEC systems.

Meta Evaluation Methods
It is intractable to determine which GEC metric is best. A reasonable GEC metric should consider multiple aspects, including correlation with human judgments, interpretability and efficiency. Inspired by WMT human evaluation campaigns (Callison-Burch et al., 2008; Cheng et al., 2023b), Grundkiewicz et al. (2015) collected human rankings for the 13 system outputs (including the unchanged source text) from the CoNLL-2014 shared task (Ng et al., 2014). Two ranking methods, Expected Wins (EW) and TrueSkill (TS), were applied to rank systems from best to worst. Sakaguchi et al. (2016) collected 8 (2 × 2 × 2) sets of annotations from both experts and non-experts (minimal and fluency edits, expert and non-expert annotators, two corrections each) and showed that GEC metrics behave differently on different reference sets. Napoles et al. (2019) explored how GEC metrics work in new domains.

Conclusion
This paper proposes CLEME, a novel reference-based metric for multi-reference GEC evaluation. Compared with conventional reference-based metrics, CLEME evaluates GEC systems at the chunk level, providing unbiased F$_{0.5}$ scores for multi-reference evaluation. We explore the feasibility and effectiveness of evaluating GEC systems based on either the correction dependence assumption or the correction independence assumption. The results of extensive experiments demonstrate that our approach achieves higher correlations than existing reference-based metrics on most reference sets. We further conduct a human evaluation experiment to verify the rationality of evaluating GEC systems based on the correction independence assumption. In the future, we would like to evaluate GEC systems by combining the dependence and independence assumptions. It is also worthwhile to improve the robustness of accuracy-based metrics.

Limitations
We do not test the effectiveness of CLEME in other languages, though CLEME can be easily extended to them. In addition, all the reference sets used in our experiments are based on the CoNLL-2014 shared task, a second-language dataset. More experiments on different datasets are needed to show the robustness of our approaches. We believe that the perspective of the correction independence assumption could also be introduced to GEC datasets of other languages and domains for more in-depth analysis and exploration.
Recent reference-less metrics have already outperformed reference-based metrics, including ours, in terms of correlation with human judgments. However, our approach evaluates GEC systems in an interpretable way, which is a significant advantage over reference-less metrics.

Ethics Statement
In this paper, we verify the effectiveness of our proposed approach using the CoNLL-2014, BN-10GEC, and SN-8GEC reference sets, all of which are from publicly available datasets or resources on legitimate websites without sensitive data involved. All the baselines used in our experiments are also publicly available metrics, and we have cited the corresponding authors. All the datasets and baselines involved are consistent with their intended use.
Additionally, we conduct the human evaluation experiments on the correction independence assumption by employing three postgraduate students in foreign linguistics and applied linguistics as part-time annotators. Two authors of this paper compiled the annotation guidelines, and each annotator received comprehensive training and understood the intent of the annotation before starting. The whole annotation process for an annotator could be finished in about six working hours in total. All annotators were paid for their work at an average rate of about $5 per hour.

A.1 Statistics of GEC Benchmarks
Table 7 reports the statistics of popular GEC benchmarks. Mainstream benchmarks usually contain multiple references for a given source sentence to guarantee a more realistic evaluation of model performance. We refer readers to the survey by Bryant et al. (2022) for more details.

A.2 Statistics of Reference Sets
Table 8 reports the statistics, in-chunk rates, and cross-chunk rates of all the reference sets used in our experiments.

A.3 Hyperparameters
The hyperparameters of our proposed CLEME consist of the scale factors α and the thresholds. We tune the hyperparameters on CoNLL-2014 and keep them fixed on the other reference sets. The hyperparameters of CLEME are reported in Table 9.

A.4 Accuracy
Conventional reference-based metrics such as M$^2$ and ERRANT are unable to compute accuracy scores since they do not define True Negatives (TNs); an exception is I-measure (Felice and Briscoe, 2015), which adopts an extended version of the Writer-Annotation-System evaluation scheme (Chodorow et al., 2012). Our approach defines TNs and implements the computation of accuracy scores. We define the TNs of CLEME as hypothesis unchanged chunks that match the chunks of the references. Accuracy can likewise be computed based on the dependence or independence assumption in both corpus-level and sentence-level settings, i.e., 1) CLEME-dependent-acc, 2) CLEME-independent-acc, 3) SentCLEME-dependent-acc, and 4) SentCLEME-independent-acc.
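A minimal sketch of the accuracy computation under the independence assumption, reusing the illustrative chunk representation from Section 3; the exact counting rules are assumptions rather than the released implementation:

```python
def chunk_confusion(hyp_chunks, ref_chunk_sets):
    """Count TP/FP/FN/TN per chunk position. hyp_chunks[i] and
    ref_chunk_sets[r][i] are (type, text) pairs; an unchanged hypothesis chunk
    that at least one reference also leaves unchanged counts as a TN."""
    tp = fp = fn = tn = 0
    for i, (h_type, h_text) in enumerate(hyp_chunks):
        ref_here = [refs[i] for refs in ref_chunk_sets]
        if h_type == "unchanged":
            if any(r_type == "unchanged" for r_type, _ in ref_here):
                tn += 1
            else:
                fn += 1
        else:
            if any(h_text == r_text for r_type, r_text in ref_here
                   if r_type != "unchanged"):
                tp += 1
            else:
                fp += 1
    return tp, fp, fn, tn

def accuracy(tp, fp, fn, tn):
    total = tp + fp + fn + tn
    return (tp + tn) / total if total else 0.0
```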
The results of the correlations are reported in Table 10. Accuracy-based metrics perform differently at the corpus and sentence levels, which is consistent with the findings of Napoles et al. (2016, 2019). The two accuracy-based corpus-level metrics, i.e., CLEME-dependent-acc and CLEME-independent-acc, result in negative correlations. However, their sentence-level counterparts, i.e., SentCLEME-dependent-acc and SentCLEME-independent-acc, perform well, achieving the highest correlations on some reference sets. One notable difference between accuracy-based metrics and F$_{0.5}$-based metrics at the sentence level is their stability or robustness across reference sets. F$_{0.5}$-based metrics are more robust to different reference sets, where SentCLEME-(in)dependent achieves correlations comparable with the best metric on all the reference sets. However, the performance of accuracy-based metrics lags far behind other metrics on some reference sets. A deeper investigation into this phenomenon is needed to understand the instability of accuracy-based metrics.

A.5 Correction Independence
We define a pair of chunks as correction-independent if the correction of one chunk is irrelevant to the correction of the other, as shown in Table 11. Chunk 2 and Chunk 4 are correction-dependent since the correction of Chunk 2 (family do from Ref. 9) must be matched with the correction of Chunk 4 (help then from Ref. 9) in this case. However, Chunk 6 is correction-independent of Chunk 2 (or 4) since the correction of Chunk 6 does not influence the correction of Chunk 2 (or 4).

A.6 Detailed Evaluation Results
Table 12 and Table 13 report the detailed evaluation results of ERRANT and our proposed metrics on CoNLL-2014 and BN-10GEC reference sets.

A.7 Effect of the Number of References
We report the detailed results of correlations against an increasing number of references in Figure 5.

A.8 Effect of Scale Factors
We report the detailed results of correlations for different scale factors in Figure 6, covering True Positives (TPs), False Positives (FPs), False Negatives (FNs) and True Negatives (TNs) with or without length weighting (LW).
Figure 1: A comparison of edits automatically extracted by ERRANT and CLEME. An orange block is an edit. Source: Nowadays the technologies were improved a lot compared for the last century. Ref. 1: Nowadays technologies have improved a lot compared to the last century. Ref. 2: Nowadays technology has improved a lot compared with the last century. Hyp.: Nowadays technologies have improved a lot compared with the last century.

Figure 3: Curves of length weighting with different scale factors α for $\bar{\ell} = 2$. All the curves pass through the point $(\bar{\ell}, 1.0)$. A curve with a larger scale factor has a greater slope.

Figure 5: Effect of the number of references on BN-10GEC. We report both Pearson and Spearman correlations using Expected Wins and TrueSkill ranking methods with an increasing number of references.

Figure 6: Correlations for different scale factors on BN-10GEC.

Table 1: Statistics of the BN-10GEC dataset.

Table 3: A comparison of correction independence annotations across three annotators.

Table 6: A comparison of GEC metrics. GLEU is nondeterministic since it involves a sampling operation.

Table 11: A case of correction independence. We apply the chunk partition to the source and all the references.

Table 12: A comparison of detailed evaluation results across 13 GEC systems on the CoNLL-2014 shared task.