How Good (really) are Grammatical Error Correction Systems?

Standard evaluations of Grammatical Error Correction (GEC) systems make use of a fixed reference text generated relative to the original text; they show, even when using multiple references, that we have a long way to go. This analysis paper studies the performance of GEC systems relative to closest-gold – a gold reference text created relative to the output of a system. Surprisingly, we show that the real performance is 20-40 points better than standard evaluations show. Moreover, the performance remains high even when considering any of the top-10 hypotheses produced by a system. Importantly, the type of mistakes corrected by lower-ranked hypotheses differs in interesting ways from the top one, providing an opportunity to focus on a range of errors – local spelling and grammar edits vs. more complex lexical improvements. Our study shows these results in English and Russian, and thus provides a preliminary proposal for a more realistic evaluation of GEC systems.


Introduction
Grammatical Error Correction (GEC) systems are typically evaluated using reference-based evaluation measures. This is common in language generation tasks, where the system output is compared against a set of gold references, such as the set of correct translations in Machine Translation or the set of valid corrections for a source sentence in GEC. Importantly, the references are generated relative to the original text and are independent of the system outputs. In GEC, the space of valid outputs for a given source sentence is very large, making it extremely difficult to evaluate. Specifically, reference-based evaluations (most GEC datasets contain one reference correction) are known to underestimate system performance (Chodorow et al., 2012; Felice and Briscoe, 2015; Bryant and Ng, 2015). Bryant and Ng (2015) showed that using two references is better than one, and that results improve further with more references; however, their references were also generated relative to the original text. Choshen and Abend (2018b) further demonstrated that the issue can be only slightly alleviated, but not completely solved, by increasing the number of references. This is because many errors have a long-tailed distribution of valid corrections. One can expect that as GEC systems mature and manage to address more complex errors, the underestimation of their performance will be further exacerbated.

Figure 1: Evaluation against the Reference Gold (RG) vs. the Closest Golds (CGs) generated for each hypothesis. Observe the dramatic drop in performance between the top hypothesis and the rest in the RG evaluation, vs. stability in the CG evaluation, and the large gaps between scores even for the top hypotheses in the RG vs. CG evaluations.
Choshen and Abend (2018b) discuss another consequence of having a large space of valid corrections, pertaining to training. They show that GEC systems have a strong tendency to undercorrect, due to using the one-reference-gold approach for tuning (and training) the systems. Essentially, due to the low likelihood of a system's proposed change being matched to gold, GEC systems are discouraged from proposing corrections, and propose far fewer corrections compared to humans. The under-correction phenomenon is more pronounced for errors with a large number of correction candidates. For example, mistakes on closed-class words, e.g. errors in determiners, where the number of valid corrections is small, suffer from under-correction to a lesser extent than mistakes in word choice. As a result, current systems generally prefer to make small targeted changes on closed-class errors.
We further study the effects of having a large space of valid corrections on GEC system development and automatic evaluation. Given a potentially erroneous sentence, we assume there is a space of gold references corresponding to it. Evaluation is typically done by drawing one gold from this set. We refer to this as reference gold (RG). We generate a new gold that is as close as possible to the system output (hypothesis), by correcting the hypothesis itself, instead of the original text. We call it closest gold (CG). We show by how much performance is really underestimated when a reference gold is used instead of the closest one, and claim that the latter should reflect the true performance of the system. We use a ranked 10-best list of hypotheses for a given source sentence, produced by state-of-the-art GEC systems on two English and two Russian GEC datasets. We generate CGs for hypotheses at different ranks.
Our findings are as follows. First, evaluation against RGs shows a large performance gap between the top hypothesis and the rest. We show that the reason for this is that lower-ranked hypotheses propose more diverse changes, including lexical changes, that have a lower chance to match RGs. In contrast, evaluation against CGs reveals that qualitatively, there is very little degradation in the hypotheses, when considering the top-10 list. While RG evaluation reveals severe drops in F-score and, in particular, precision, we find that relative to CGs quality does not substantially degrade. This is illustrated in Figure 1 for one of the datasets; we show more results in Section 4.1.
Second, contrary to the observation made by Choshen and Abend (2018a) that GEC systems are disincentivized to propose corrections, we find that it only applies to the top-ranked hypothesis. 1 Moreover, the number of proposed edits increases steadily with the hypothesis rank.
We further evaluate the output by computing the post-editing effort, i.e. the number of edits needed to correct the output hypothesis. We show that the post-editing effort is very similar for the top hypothesis and the lower-ranked hypotheses, reinforcing the claim that lower-ranked hypotheses do not degrade in quality. Finally, we evaluate the types of edits by hypothesis rank and show that lower-ranked hypotheses propose more diverse lexical changes, in contrast to the top hypotheses, which mostly attempt local spelling and grammar fixes.
Our analysis should provide insight into better training and evaluation practices for GEC. A better understanding of the under-correction phenomenon and the diversity and quality of the lower-ranked hypotheses can help improve the current training and tuning framework that relies on texts with single RGs and, arguably, hinders the development of GEC systems that can potentially address more complex linguistic phenomena.
Next, we discuss reference-based evaluation. Section 3 presents the definitions and experimental setup. Section 4 presents the evaluation using closest golds. Section 5 analyzes the edits proposed by top and lower-ranked hypotheses.

Reference-Based Evaluation
The standard approach to evaluating GEC systems is to use reference-based measures, that is, comparing the system output to a reference that has been generated by a human annotator who corrected the mistakes in the original source sentence. We refer to these as reference golds (RGs). It is common to instruct annotators to follow the principle of "minimal edits", that is, making the smallest number of edits needed to render the sentence grammatical and well-formed. We follow a similar principle with our annotators; the key distinction of our approach is that standard evaluations use golds that are independent of the system outputs, whereas we create golds by directly correcting the hypotheses output by the system. We note that it has been argued that this principle still does not make the output fluent, leading to proposals to generate references based on fluency (Sakaguchi et al., 2016). As suggested by Choshen and Abend (2018b), correcting for fluency further increases the space of valid corrections for a sentence, and we do not attempt to do this in this work.
Reference-based evaluations include several measures, such as the MaxMatch scorer M 2 (Dahlmeier and Ng, 2012), GLEU (Napoles et al., 2015), ERRANT (Bryant et al., 2017), and I-measure (Felice and Briscoe, 2015). These metrics have some commonalities, e.g. both MaxMatch and ERRANT measure precision, recall, and F-score. M 2 has been used with different beta parameter values; the default is beta = 0.5, weighting precision twice as high as recall, which is more common than assigning equal weights and has been shown to have a stronger correlation with human ratings (Grundkiewicz et al., 2015). GLEU focuses on the fluency aspect: it is an extension of the BLEU metric from Machine Translation (Papineni et al., 2002). I-measure emphasizes accuracy and calculates the weighted accuracy of correction and detection. Napoles et al. (2019) proposed GMEG-Metric, an ensemble of existing metrics, and showed that its correlation with human judgments is higher on several GEC datasets. The MaxMatch metric has been widely used for evaluating GEC systems in many published works and in several shared tasks (Ng et al., 2013, 2014), and we adopt it in this work. We use the default beta value of 0.5 and refer to the result as F-score.
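As an illustration of the metric adopted here, the F-score with beta = 0.5 can be computed directly from edit counts. The following is a minimal sketch of the scoring arithmetic, not the official M 2 scorer implementation:

```python
# Minimal sketch of F-beta scoring over edit counts (not the official
# M^2 scorer); beta = 0.5 weights precision twice as high as recall.

def f_beta(correct: int, proposed: int, gold: int, beta: float = 0.5) -> float:
    """F-beta from the numbers of correct, proposed, and gold edits."""
    if proposed == 0 or gold == 0:
        return 0.0
    p = correct / proposed   # precision
    r = correct / gold       # recall
    if p == 0.0 and r == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * p * r / (b2 * p + r)

# With beta = 0.5, a precise-but-conservative system outscores a
# recall-heavy one that found the same number of correct edits:
print(round(f_beta(80, 100, 200), 3))   # high precision -> 0.667
print(round(f_beta(80, 200, 100), 3))   # high recall    -> 0.444
```

This asymmetry is one reason systems tuned against this metric are rewarded for proposing fewer, safer edits.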

Definitions and Experimental Setup
We start with some definitions and then describe the experimental setup.

Definitions
Given the original learner sentence (a source sentence), a state-of-the-art (neural) GEC system generates a ranked list of outputs, referred to as hypotheses. We refer to the top hypothesis as H 1 and, similarly, to other hypotheses by the rank that they occupy. A system is evaluated using reference-based metrics, where for each source sentence there is at least one corresponding corrected version generated by a human expert. We refer to this corrected version as the Reference Gold (RG). The set of possible correct versions for a given source sentence is very large – possibly infinite – and any single reference gold is just a single point in that space. Most GEC evaluation sets contain one RG for each source sentence, although some (English) datasets contain more (CoNLL-test has 2, with an additional set of 8 generated later, and JFLEG (Napoles et al., 2017) has 4 fluency-based references). System performance is computed by scoring the top-ranked hypothesis H 1 for each sentence against the corresponding RG.
In addition to RGs, we create for each pair of (source, H i ), where H i is the system hypothesis, another gold, which is generated by an expert by correcting the hypothesis itself. We refer to this gold as closest gold (CG i ) relative to H i . The annotators who generated CGs were instructed to apply the minimal edit principle -i.e. correct the output to ensure it is grammatical and also preserves the meaning of the original source sentence. We thus assume that CG is as close as possible to the system output.
Given a pair of sentences, the edit distance is the minimum number of edits (deletions, replacements, insertions, not necessarily single-token) needed to make the sentences match. A gold edit is an edit between a source sentence and an RG or CG. A proposed edit is an edit between a source sentence and a hypothesis. A correct edit is an edit in the intersection of gold and proposed edits. We define Dist (S, RG) to be the number of edits between the source and the reference gold, and Dist (H i , CG i ) to be the number of edits between a hypothesis H i and the CG i relative to this hypothesis. The latter is of practical interest, since it is the post-editing effort required to completely correct the text. These are summarized below.
• S - original text
• H i - hypothesis at rank i
• RG - reference gold
• CG i - closest gold to hypothesis H i
• Gold edit - an edit between a source sentence and an RG or CG
• Proposed edit - an edit between a source sentence and a system hypothesis
• Correct edit - a proposed edit that is also a gold edit relative to a system hypothesis and a specific reference
• Dist (S, RG) - edit distance between source and reference gold
• Dist (H i , RG) - edit distance between hypothesis at rank i and reference gold
• Dist (H i , CG i ) - edit distance between hypothesis at rank i and its closest gold

Table 1 shows a sample source sentence, two system hypotheses, the RG, and two CGs, one for each hypothesis. Dist (S, H 1 ) is 2 and includes 2 proposed edits ("reallistic" → "realistic" and "a" → ∅). Dist (S, RG) is 2 and includes 2 gold edits ("reallistic" → "realistic" and "had" → "gave"). The number of correct edits relative to RG and H 1 is 1 ("reallistic" → "realistic"). Dist (H 10 , RG) is 4 (3 word replacement edits and one insertion edit), while Dist (H 10 , CG 10 ) is 0. The three golds – the two CGs and the RG – illustrate the notion of semantic equivalence (multiple ways of correcting the same source sentence while preserving its meaning), which is not reflected in the standard evaluation.

S: In addition , I think that the settings are very reallistic and the actors had a great performance .
H1: In addition , I think that the settings are very realistic and the actors had great performance .
H10: In addition , I think that the settings are very realistic and the actors performed very well .
RG: In addition , I think that the settings are very realistic and the actors gave a great performance .
CG1: In addition , I think that the settings are very realistic and the actors had great performances .
CG10: In addition , I think that the settings are very realistic and the actors performed very well .

Table 1: Example of an original sentence (source); the system output (hypotheses at ranks 1 and 10, H 1 and H 10 ); the reference gold (RG); and two additional golds generated on top of each of the hypotheses (CG 1 and CG 10 ).
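The edit bookkeeping in Table 1 can be sketched with a token-level alignment. The following uses Python's difflib as a simplified stand-in for ERRANT's edit extraction (difflib merges adjacent changes into a single edit, so counts can differ from ERRANT's when edits are adjacent; in this example they are not):

```python
# Sketch of the proposed/gold/correct edit counts for Table 1, using
# difflib token alignment as a simplified stand-in for ERRANT.
from difflib import SequenceMatcher

def edits(src: str, tgt: str):
    """Set of (start, end, replacement) edits turning src tokens into tgt."""
    a, b = src.split(), tgt.split()
    ops = SequenceMatcher(a=a, b=b, autojunk=False).get_opcodes()
    return {(i1, i2, " ".join(b[j1:j2]))
            for tag, i1, i2, j1, j2 in ops if tag != "equal"}

S  = "In addition , I think that the settings are very reallistic and the actors had a great performance ."
H1 = "In addition , I think that the settings are very realistic and the actors had great performance ."
RG = "In addition , I think that the settings are very realistic and the actors gave a great performance ."

proposed = edits(S, H1)     # edits made by the system
gold     = edits(S, RG)     # edits made by the annotator
correct  = proposed & gold  # edits on which they agree

print(len(proposed), len(gold), len(correct))  # → 2 2 1
```

Because both edit sets are computed against the same source, the span indices are directly comparable, which is what makes the set intersection a valid stand-in for "correct edits".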

Experimental Setup
We perform experiments on 2 English and 2 Russian datasets, using diverse NMT GEC model frameworks. The English datasets include the commonly used benchmarks CoNLL-14 (Ng et al., 2014; Dahlmeier et al., 2013) and the BEA corpus (Bryant et al., 2019). The Russian datasets include the RULEC-GEC corpus (Rozovskaya and Roth, 2019) (henceforth RULEC) and another dataset of Russian learner writing that was recently collected from the online language learning platform Lang-8 (Mizumoto et al., 2011) and annotated by native speakers. 2 We refer to this dataset as Lang8. CoNLL-14 contains two primary RGs against which systems are standardly evaluated, while the other datasets include one RG per sentence. For uniformity, we report results using one RG for each dataset, and note that the results for the second CoNLL RG are very similar. These datasets were selected with the goal of evaluating on diverse data, both in terms of genre and target language.
For the English datasets, we apply the state-of-the-art BERT-Fuse NMT system of Kaneko et al. (2020), which incorporates BERT into an encoder-decoder Transformer model. We obtained a ranked hypothesis list from the authors.
For RULEC, we use the outputs of a state-of-the-art Transformer model that is pre-trained on synthetic data and fine-tuned on the RULEC development data (Naplava and Straka, 2019). For the Lang8 corpus, we use a different state-of-the-art architecture, a Convolutional Neural Network model proposed for English in Chollampatt and Ng (2018b), which we re-implement for Russian. The model is trained on the RULEC training data and synthetic data, and uses language model re-ranking; it is also tuned on the RULEC development data. Our evaluation shows that the models are competitive: the Transformer model performs better by 4 points on the RULEC corpus than the CNN model, while the CNN model outperforms the Transformer on Lang8 by 2 points. However, we stress that our goal is not to compare these models; we selected several model architectures and datasets to provide a more comprehensive analysis and evaluation that spans diverse models and datasets.
Generating Closest Golds We consider the top-10 ranked hypothesis list for each dataset and study four hypotheses, at ranks 1, 2, 5, and 10, to evaluate the quality of the hypotheses at various ranks and to determine how much quality degrades from the top hypothesis downwards. For each of the 4 hypotheses H i , i ∈ {1, 2, 5, 10}, a closest gold CG i relative to this hypothesis is generated by post-editing the hypothesis for grammatical errors and other misuse.
Annotation 100 source sentences from each dataset were selected uniformly at random, and the 4 hypotheses at different ranks were annotated for each sentence. The English outputs were annotated by two annotators, one native English speaker and a fluent non-native speaker. Each annotator corrected all hypotheses for one of the datasets; this was done to ensure consistency across different hypotheses. The Russian outputs were corrected by one native Russian speaker. All of the annotators have a Master's degree and previous annotation experience. The annotators followed the standard annotation protocol in grammar correction: they were instructed to follow the minimal-edits principle in correcting the sentences, while also ensuring the output is well-formed and adequate (i.e. the meaning of the original source sentence is preserved), for which they also consulted the source sentence.

Evaluating True System Performance
We start by evaluating each hypothesis output H i for each dataset against the reference gold RG and its corresponding closest gold CG i . We show that evaluation relative to RG is always pessimistic: given a hypothesis generated by a GEC system, there is always a much better gold. Table 2 shows the results of evaluating each system hypothesis against reference golds and closest golds for two datasets, BEA (English) and RULEC (Russian). Results for all datasets are in the Appendix (Table 7). The CG result is significantly higher than the performance relative to RG in all cases. For the top hypothesis, the F-scores increase by 19 points on BEA and 17 points on RULEC. Improvements are greater for lower-ranked hypotheses: the improvements for BEA are 34, 36, and 37 points for ranks 2, 5, and 10, respectively, and for RULEC, 34, 28, and 31 points.

Reference Gold vs. Closest Gold
The most substantial changes occur in precision: between 23 and 41 points on BEA and between 24 and 40 on RULEC (with similar changes for the other datasets). It should be emphasized that precision improvements relative to RG are greater for lower-ranked hypotheses. This is interesting and suggests that while lower-ranked hypotheses propose significantly more changes than the top-ranked one (see column "Proposed edits"), many of those edits are valid corrections, even though they are not recognized in RGs: observe that despite the fact that more edits are proposed at lower hypothesis ranks, the number of correct edits (shown in the table) relative to RG goes down. For instance, 84 out of 125 proposed edits are correct in BEA H 1 , while only 75 out of 200 proposed edits are valid in H 5 . This is consistent across the datasets and indicates that changes proposed in lower-ranked hypotheses are less likely to be included in the RGs.
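The precision drop under RG evaluation can be checked directly from the counts quoted above (84 of 125 proposed edits correct for BEA H 1 against RG, 75 of 200 for H 5 ):

```python
# Precision of BEA hypotheses against the RG, using the edit counts
# quoted in the text above.
def precision(correct: int, proposed: int) -> float:
    return correct / proposed

p_h1 = precision(84, 125)   # top-ranked hypothesis
p_h5 = precision(75, 200)   # rank-5 hypothesis
print(f"H1: {p_h1:.1%}  H5: {p_h5:.1%}")  # → H1: 67.2%  H5: 37.5%
```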
Recall is also improved in CGs relative to RGs, although not as dramatically. Recall increases by 10-25 points on CoNLL, 12-34 points for BEA, 7-22 points for RULEC, and 2-12 points for Lang8.
The results in the table strongly indicate that the n-best list does not produce hypotheses of degrading quality. On the contrary, the precision of the proposed corrections remains impressively high (in most cases well above 50, and often above 70 or 80), which is not reflected in the reference-based evaluation scoring against a reference gold. We further illustrate this finding in Figure 2, where for each dataset we show the F-scores of the 4 hypotheses against RGs, and the scores against their corresponding CGs. The first observation is that performance in the first group is much lower than in the second group for each corpus. But it is also clear that the first group shows strong degradation relative to the top-ranked hypothesis in the RG evaluation – performance goes down from H 1 to H 10 – while in the second group the performance remains almost the same across the four hypotheses.
Further, as shown in Table 2, the number of correct edits is significantly higher when evaluated against CGs. For instance, the number of correct edits increases from 75 to 163 for BEA H 5 , and from 48 to 105 for RULEC H 2 . Additionally, the number of gold edits in CGs is much higher than in RGs, and is also greater for lower-ranked hypotheses. For instance, there are 202 gold edits in the BEA RG, 217 edits in BEA CG 1 , and 282 edits in BEA CG 10 . This is consistent across the datasets and suggests that the edits proposed by the models are not necessarily the minimal edits that most GEC annotations adopt. This may be why most of the proposed edits in the lower-ranked hypotheses are not found in the RGs. Table 8 in the Appendix also evaluates each hypothesis against the CGs relative to the other hypotheses, showing that evaluation against CG always produces superior results.

Quality Estimation with Edit Rate
We have shown above that the quality of the hypotheses does not degrade with rank; in some cases, hypotheses at lower ranks even achieve a higher F-score than the top-ranked hypothesis when scored against CGs, while the evaluation against RGs is strongly biased against lower-ranked hypotheses. We now wish to evaluate hypothesis quality using the edit rate, i.e. the number of edits needed to fix the output hypothesis so that it matches its corresponding CG. This quality estimation approach, which considers the number of edits required to "fix" the hypothesis, is used in Machine Translation (Snover et al., 2006), where the quality of a system output is measured as the minimum number of edits needed to transform the system output so that it matches a reference. To this end, a "targeted" reference is created for a translated sentence by editing the hypothesis until it is both fluent and has the same meaning as the (original) reference(s). The reason for this is that estimating quality against a gold "non-targeted" reference ignores notions of semantic equivalence (see also Table 1), thereby underestimating output quality. Thus, a targeted reference provides a more accurate measure of translation quality. Chollampatt and Ng (2018a) proposed a quality estimation model for GEC that builds on this idea of measuring output quality as the number of edits required to fix the hypothesis. However, they make the strong assumption that, unlike in MT, in GEC targeted references need not be created and RGs can be substituted for CGs, because both human annotators and automatic GEC systems are trained to make minimal changes. As we showed in the previous section, this is not the case: using RGs severely underestimates system performance and, as a consequence, post-editing effort.
We now use CGs to estimate hypothesis quality in terms of post-editing effort in Table 3. We show the number of proposed edits, the number of correct edits relative to CG, and the number of gold edits in the corresponding CG. (The numbers of proposed, gold, and correct edits also appear in Tables 2 and 7 but are shown in Table 3 for convenience.) The post-editing effort is shown in the last column, estimated as the number of edits required to make the hypothesis output fluent and grammatical, i.e. the edit distance between a hypothesis and its corresponding CG. The number of edits was computed automatically using the ERRANT tool (Bryant et al., 2017), which, given a pair of sentences (source, hypothesis), produces the set of edits needed to transform the source into the target. The post-editing effort is not necessarily smallest for the top hypothesis. In fact, on BEA, the smallest value is obtained for H 2 (55 edits), while H 1 and H 10 are similar (86 and 87). On the other datasets, there is no significant difference across the 4 hypotheses for English, while the Russian datasets show slight degradation for hypotheses 5 and 10, with H 1 and H 2 close. This supports our finding above that lower-ranked hypotheses are of high quality. As a side note, our post-edit estimation assumes that no error has more impact than any other. In Section 5, we show that the top-ranked hypotheses mostly contain changes on "simpler" errors and, arguably, the lower-ranked hypotheses might even involve less post-editing effort given the more complex errors they manage to fix.

Table 3: The number of proposed edits, the number of correct edits relative to CG, the number of gold edits in CG, and the post-editing effort required to make the hypothesis fluent and grammatical, estimated as the number of edits between the hypothesis and its CG.
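The post-editing effort can be approximated as the number of edit operations between a hypothesis and its CG. A minimal sketch follows (the paper uses ERRANT; difflib, used here, merges adjacent changes into a single edit, so counts may differ from ERRANT's):

```python
# Approximate post-editing effort: the number of edit operations needed
# to turn a hypothesis into its closest gold (difflib stand-in for
# ERRANT; difflib merges adjacent changes into one edit).
from difflib import SequenceMatcher

def post_edit_effort(hypothesis: str, closest_gold: str) -> int:
    a, b = hypothesis.split(), closest_gold.split()
    ops = SequenceMatcher(a=a, b=b, autojunk=False).get_opcodes()
    return sum(1 for tag, *_ in ops if tag != "equal")

# H10 from Table 1 already matches its closest gold exactly:
H10  = "In addition , I think that the settings are very realistic and the actors performed very well ."
CG10 = "In addition , I think that the settings are very realistic and the actors performed very well ."
print(post_edit_effort(H10, CG10))  # → 0
```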

Do GEC Systems Undercorrect?
We first compare the number of edits in each hypothesis to the number of edits in the original gold (Table 4). The top-ranked hypothesis makes only a fraction of the edits made in the RGs. Generally, RGs contain 2.5-3 times as many edits as the top-1 hypothesis. This is consistent with the analysis in Choshen and Abend (2018b), which shows that GEC systems are disincentivized to make corrections due to the low-coverage bias. What is notable, however, is that the number of edits substantially increases with the hypothesis rank. In particular, the second-ranked hypothesis contains on average twice as many edits as the first one, and the number of edits continues to increase. Hypotheses 5 and 10 contain a number of edits similar to that in the RG. The under-correction issue is further studied in the next section.

Table 4: For each corpus, the total number of tokens is shown. The majority of edits are single-token replacements, deletions, or insertions. The last row shows the number of gold edits in the reference gold for each dataset.

Edit Analysis by Hypothesis Rank
We now analyze and compare the edits in the top-ranked hypothesis and in H 10 , in order to better understand how the edits differ with hypothesis rank. For the English datasets, we apply ERRANT (Bryant et al., 2017) to extract edits from pairs of parallel sentences (source, hypothesis). ERRANT then uses English-specific rules based on part-of-speech and linguistic knowledge to assign each edit its linguistic type, such as preposition, noun number, etc. We further group the edits into one of two categories: spelling/grammar changes and lexical changes. The first category includes punctuation, spelling, orthography, and grammatical corrections that typically require only local context and small changes, and are also limited in the number of candidate corrections. These include determiner errors, verb agreement and form, noun number, punctuation, and morphological changes. Lexical changes comprise the categories denoted by ERRANT as "Other", "Verb", "Noun", "Pronoun", and "Adverb", which include mostly lexical errors, e.g. changing "get" to "earn", and verb tense errors that require wider context and are thus trickier to correct. The number of edits by type is shown in Table 5. Lexical changes are marked with a (*).
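The two-way grouping described above can be sketched as a simple lookup over ERRANT type labels; the exact mapping below is illustrative and may not match the paper's grouping in every detail:

```python
# Illustrative grouping of ERRANT edit types into the two categories
# described above; the exact mapping is an assumption, not the paper's.
SPELL_GRAMMAR = {"PUNCT", "SPELL", "ORTH", "DET", "VERB:SVA",
                 "VERB:FORM", "NOUN:NUM", "MORPH"}
LEXICAL = {"OTHER", "VERB", "NOUN", "PRON", "ADV", "VERB:TENSE"}

def category(errant_type: str) -> str:
    # ERRANT labels look like "R:DET" or "M:PUNCT"; strip the operation
    # prefix (R = replacement, M = missing, U = unnecessary).
    t = errant_type[2:] if errant_type[:2] in ("R:", "M:", "U:") else errant_type
    if t in SPELL_GRAMMAR:
        return "spelling/grammar"
    if t in LEXICAL:
        return "lexical"
    return "unknown"

print(category("R:DET"), category("U:PUNCT"), category("R:OTHER"))
# → spelling/grammar spelling/grammar lexical
```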
In the lower part of the table, we show the distribution of edits between the two categories: in CoNLL, spelling/grammar changes account for 51.2% of all changes in the RG and for 74.3% in the top-ranked hypothesis. Lexical changes make up 48.8% in the RG but only 25.7% in the top-ranked hypothesis, although this number increases to 36.2% in H 10 . In BEA, 49.5% of RG edits are lexical, while in the top-ranked hypothesis these account for 41.7%, and that number goes up to 51.5% for H 10 .
Looking at the number of edits in each category, it can be observed that the under-correction phenomenon (for the top-ranked hypothesis) is particularly pronounced for lexical errors. In the spelling/grammar category, the number of proposed edits is very close to (or even exceeds) the number of edits of this type in the RG in both datasets (the only exception is perhaps the punctuation errors). For example, 33 determiner errors are present in the CoNLL RG, and 25 in H 1 . In contrast, 37 errors of category "Other" are in the RG for CoNLL, but only 8 changes of this type are in the top-ranked hypothesis. In fact, in both CoNLL and BEA, the majority of the changes in the top-ranked hypotheses are minimal/local changes (74.3% in the CoNLL dataset and 58.3% in the BEA dataset).
We perform a similar analysis for the Russian datasets, where the edits are classified manually by our annotator (due to the lack of an automatic tool). We find similar behavior (see Appendix A). However, in the most challenging categories (lexical and "Other"), which both comprise word changes, the situation is more severe: the top-ranked hypothesis proposes 0 changes. Overall, the under-correction phenomenon for lexical errors is more pronounced for Russian.
Overall, the under-correction phenomenon is especially pronounced for top hypotheses in the lexical error category. The percentage of lexical edits with respect to the total number of edits is much higher in the RG and CGs than in the top hypotheses. Thus, under-correction is mostly a problem for lexical errors, but it is partially rectified in the lower-ranked hypotheses, as illustrated in Figure 3, which shows the percentage of lexical edits in RG, H 1 , and H 10 for each dataset.

Discussion
We study the current evaluation and training schema in GEC, using 4 datasets in 2 languages and several state-of-the-art model architectures. We make several observations. First, the quality of the systems is significantly better than standard evaluations suggest, as seen when we evaluate relative to the closest gold rather than the reference gold. The reason is that there are many valid golds; we show that there is always a gold close to the prediction, and we argue that evaluation against this closest gold reflects the actual performance of the model. 3 Moreover, as we showed, using the CGs provides additional knowledge about the types of errors various hypotheses address, further guiding the community towards insights that can also be used to target specific models to specific users (based on their abilities, for example). Our second observation is that the top hypothesis is not actually better than the lower-ranked hypotheses in the 10-best list, even though current evaluations are strongly biased towards the top hypothesis. Third, because of the way we train, lower-ranked hypotheses are as good as, or sometimes qualitatively better than, the top hypothesis, owing to the diversity of the types of mistakes that they attempt to correct.
Recommendations based on the paper findings We view this work as an analysis paper that we hope will contribute to a better understanding of the current issues in the GEC field and give researchers an opportunity to think about directions for addressing them. That said, we believe that our results may serve as a preliminary proposal for developing better ways of evaluating GEC systems, and we outline several recommendations based on our findings. We believe the findings should also be useful for rethinking the training and tuning paradigm in GEC.
Regarding training and tuning, the current schema of using learner texts with single RGs hinders the development of GEC systems that, as we show, can potentially address more complex linguistic phenomena and language misuse. For training and tuning, it may make sense to generate multiple references by creating additional references that contain paraphrases of the original gold reference. In terms of evaluation, the findings might inspire researchers to think of better ways to evaluate GEC system outputs. For example, instead of computing exact match, we could include paraphrases so as not to penalize hypotheses that propose more liberal sentence rewrites. A different approach might be to choose lower-ranked hypotheses, since they are just as good and have other useful properties, such as the ability to correct language phenomena that the top hypothesis cannot.

A Edit Analysis for the Russian Datasets
For the Russian corpora, since there is no automated tool that classifies edits by type, we manually classify the edits in the hypotheses and RGs. We first extract all proposed edits using the ERRANT tool (ERRANT both extracts edits given a pair of sentences and classifies them by type; the edit-extraction component is language-independent, whereas the type classification is English-based). These edits are then manually classified into one of the grammar categories relevant for Russian. We use the error classification schema of Rozovskaya and Roth (2019) but combine certain types, e.g. we group together noun and adjective case errors, verb tense/aspect errors, and noun and adjective number errors. Unlike English, Russian does not have determiner errors. We similarly group the error types into two categories, spelling/grammar and lexical. However, we assign the noun/adjective number errors and morphology errors to the second group (lexical), as we expect these to display more variability, due to the number of different endings for adjective/noun number (because of declensions and gender) and the large number of morphological variants compared to English. The statistics are shown in Table 6.
First, observe that the distribution of lexical and spelling/grammar changes in the gold references, shown in the lower part of the table, is similar to that of the English datasets: 40% or more of all gold edits are lexical. In the top-ranked hypothesis, only 31.5% and 14.6% of edits in RULEC and Lang8, respectively, are of this type. This is similar to the results for the English datasets in Table 5; however, in the most challenging categories (lexical and "Other"), which both comprise word changes, the situation is more severe: the top-ranked hypothesis proposes 0 changes. Overall, the under-correction phenomenon is more pronounced for Russian.