Enhancing Grammatical Error Correction Systems with Explanations

Grammatical error correction systems improve written communication by detecting and correcting language mistakes. To help language learners better understand why a GEC system makes a certain correction, the causes of errors (evidence words) and the corresponding error types are two key factors. To enhance GEC systems with explanations, we introduce EXPECT, a large dataset annotated with evidence words and grammatical error types. We propose several baselines and analyses to understand this task. Furthermore, human evaluation verifies that our explainable GEC system's explanations can assist second-language learners in determining whether to accept a correction suggestion and in understanding the associated grammar rule.


Introduction
Grammatical Error Correction (GEC) systems aim to detect and correct grammatical errors in a given sentence and thus provide useful information for second-language learners. There are two lines of work for building GEC systems. Sequence-to-sequence methods (Rothe et al., 2021; Flachs et al., 2021; Zhang et al., 2022) take an erroneous sentence as input and generate an error-free sentence autoregressively. Sequence labeling methods (Omelianchuk et al., 2020; Tarnavskyi et al., 2022a) transform the target into a sequence of text-editing actions and use the sequence labeling scheme to predict those actions.
With advances in large pre-trained models (Devlin et al., 2018) and the availability of high-quality GEC corpora (Ng et al., 2014; Bryant et al., 2019), academic GEC systems (Omelianchuk et al., 2020; Rothe et al., 2021) have achieved promising results on benchmarks and serve as strong backbones for modern writing assistance applications (e.g., Google Docs, Grammarly, and Effidit (Shi et al., 2023)).

* Work was done during the internship at Tencent AI lab. † Corresponding authors.

Figure 1: A GEC system corrects "study" to "studying" in "As a result, I enjoy study accounting." without giving specific reasons, whereas an explainable GEC system corrects the error with an explanation: change "study" to "studying", because after "enjoy" there should follow a gerund.

Although these academic methods provide high-quality writing suggestions, they rarely offer explanations with specific clues for corrections. Providing a grammar-aware explanation and evidence words to support the correction is important in second-language education scenarios (Ellis et al., 2006), where language learners need to "know why" rather than merely "know how". As a commercial system, Grammarly does provide evidence words, but only in very limited cases, and its technical details remain a black box for the research community. Though some existing work has attempted to enhance the explainability of GEC corrections (Bryant et al., 2017; Omelianchuk et al., 2020; Kaneko et al., 2022), it does not provide intra-sentence hints (i.e., evidence words in the sentence). To fill this gap, we build a dataset named EXPlainable grammatical Error CorrecTion (EXPECT) on the standard GEC benchmark (W&I+LOCNESS (Bryant et al., 2019)), yielding 21,017 instances with explanations in total. As shown in Figure 1, given a sentence pair consisting of an erroneous sentence and its corrected counterpart, our explainable annotation includes: 1) Evidence words in the erroneous sentence.
Error tracing can be rather obscure for second-language beginners. For example, given the erroneous sentence "As a result, I enjoy study accounting.", where "study" should be corrected to "studying", a beginning learner might mistakenly attribute the correction to "accounting" because both words have an "-ing" suffix. However, the correct attribution is "enjoy". Such an incorrect judgment may lead the language learner to draw wrong conclusions (e.g., that a verb needs an "-ing" suffix if a subsequent verb has one), which significantly disturbs the learning procedure. To remedy this, EXPECT provides annotated evidence words, which enable training models to automatically assist second-language learners in finding error clues.
2) Error types of the grammatical errors, drawn from 15 well-defined categories designed by consulting the pragmatic error taxonomies of Skehan et al. (1998) and Gui (2004). Language learning consists of both abstract grammar rules and specific language-use examples. A model trained with EXPECT bridges the gap between the two parts: such a model can produce proper error types, automatically facilitating language learners to infer abstract grammar rules from specific errors in an inductive reasoning manner. Further, it allows learners to compare specific errors within the same category and across different categories, benefiting the learner's inductive and deductive linguistic reasoning abilities.
To establish baseline performances for explainable GEC on EXPECT, we explore generation-based, labeling-based, and interaction-based methods. Note that syntactic knowledge also plays a crucial role in the human correction of grammatical errors. For example, the evidence word of subject-verb agreement errors can be more accurately identified with the help of dependency parsing. Motivated by these observations, we further inject the syntactic knowledge produced by an external dependency parser into the explainable GEC model.
Experiments show that the interaction-based method with prior syntactic knowledge achieves the best performance (F0.5 = 70.77). We conduct a detailed analysis to provide insights into developing and evaluating an explainable GEC system. Human evaluations suggest that explainable GEC systems trained on EXPECT can help second-language learners understand corrections better. We will release EXPECT (including baseline code, models, and human annotations) at https://github.com/lorafei/Explainable_GEC.

Related Work
Some work formulates GEC as a sequence-to-sequence problem. Among them, transformer-based GEC models (Rothe et al., 2021) have attained state-of-the-art performance on several benchmark datasets (Ng et al., 2014; Bryant et al., 2019) using large PLMs (Raffel et al., 2020) and synthetic data. To avoid the low efficiency of seq2seq decoding, other work (Awasthi et al., 2019; Omelianchuk et al., 2020; Tarnavskyi et al., 2022b) formulates GEC as a sequence labeling problem and achieves competitive performance. Both lines of work focus on improving correction performance and decoding speed but cannot provide users with further suggestions.
Several methods have been proposed to provide explanations for GEC systems. ERRANT (Bryant et al., 2017) designs a rule-based framework as an external function to classify the error type of a given correction. GECToR (Omelianchuk et al., 2020) pre-defines g-transformation tags (e.g., transforming singular nouns to plurals) and uses a sequence labeling model to predict those tags directly as explanations. Example-based GEC (Kaneko et al., 2022) adopts the k-nearest-neighbor method (Khandelwal et al., 2019) for GEC, which can retrieve examples to improve interpretability. Despite their success, these explanations are restricted by pre-defined grammar rules or unsupervised retrieval, and may not generalize well to real-life scenarios due to the limited coverage of the widely varying errors made by writers. In contrast, our annotated instances are randomly sampled from real-life human-written corpora without restriction, thus providing much larger coverage. Nagata (2019), Nagata et al. (2020), Hanawa et al. (2021), and Nagata et al. (2021) propose a feedback comment generation task and release two corresponding datasets, which, to our knowledge, are the only two publicly available, large-scale datasets focusing on GEC explanations. The task aims to generate a fluent comment describing the erroneous sentence's grammatical error. While this task integrates GEC and explainable GEC into one text generation task, we focus only on explainable GEC and formulate it as a labeling task, which is easier and avoids the high computational cost of seq2seq decoding. Furthermore, the evaluation of feedback comment generation mainly relies on human annotators to check whether the error types are correctly identified and whether the grammatical error correction is proper in the generated text, which is time-consuming and susceptible to variations resulting from subjective human judgment.
In contrast, our token classification task can be easily and fairly evaluated by automatic metrics (e.g., F-score), favoring future research in this direction.

Dataset
To facilitate more explainable and instructive grammatical error correction, we propose EXPECT, an English grammatical error explanation dataset annotated with 15 grammatical error types and corresponding evidence words.

Data Source
We annotate EXPECT based on W&I+LOCNESS (Bryant et al., 2019), which comprises 3,700 essays written by international language learners and native-speaking undergraduates and corrected by English teachers. We first select all sentences with errors from the essays. For a sentence with n errors, we repeat the sentence n times and keep a single unique error in each copy. Then, we randomly sample and annotate 15,187 instances as our training set. We do the same for the entire W&I+LOCNESS dev set and split it evenly into development and test sets.
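The per-error instance construction described above can be sketched as follows (a minimal illustration; the `(start, end, replacement)` edit representation over the erroneous token sequence is an assumption, not the dataset's actual format):

```python
def single_error_instances(tokens, edits):
    """Given an erroneous sentence (token list) and its n correction
    edits [(start, end, replacement_tokens)], build n instances that
    each keep exactly one error and apply all other corrections."""
    instances = []
    for kept in range(len(edits)):
        out, prev = [], 0
        for i, (s, e, rep) in enumerate(edits):
            out.extend(tokens[prev:s])
            # keep the original (erroneous) span for the kept edit,
            # apply the correction for all the others
            out.extend(tokens[s:e] if i == kept else rep)
            prev = e
        out.extend(tokens[prev:])
        instances.append(out)
    return instances
```

A sentence with two errors thus yields two training instances, each paired with the annotation for its single remaining error.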
To better align with real-world application scenarios, we additionally annotate 1,001 samples based on the outputs of conventional GEC models. We randomly sampled the outputs of T5-large (Rothe et al., 2021) and GECToR-Roberta (Omelianchuk et al., 2020) on the W&I+LOCNESS test set. We also record whether each correction made by the GEC model was correct.

Error Type Definition
Following the cognitive model of second language acquisition (Skehan et al., 1998; Gui, 2004), we organize error types across three cognitive levels as follows:

Single-word level errors are at the first and lowest cognitive level. These mistakes usually involve misuse of spelling, contraction, and orthography, and often stem from misremembering. Since there is no clear evidence for such errors, we classify them into the type others.

Inter-word level errors are at the second cognitive level and usually stem from a wrong understanding of the target language. Most error types with clear evidence lie at this level because it represents the interaction between words. This level can be further split into two linguistic categories, syntax and morphology: (1) from the view of syntax, we have seven error types: infinitives, gerund, participles, subject-verb agreement, auxiliary verb, pronoun, and noun possessive; (2) from the view of morphology, we have five error types: collocation, preposition, word class confusion, numbers, and transitive verbs.

Discourse level errors are at the highest cognitive level and require a full understanding of the context. These errors include punctuation, determiner, verb tense, word order, and sentence structure. Since punctuation, word order, and sentence structure errors have no clear evidence words, we also classify them into the type others.
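For reference, the taxonomy above can be written out as a small mapping (the group labels are informal choices of ours, not official dataset identifiers; types without clear evidence words are folded into others as described):

```python
# 15 error types of EXPECT, grouped as described in the text.
ERROR_TAXONOMY = {
    "inter-word (syntax)": ["infinitives", "gerund", "participles",
                            "subject-verb agreement", "auxiliary verb",
                            "pronoun", "noun possessive"],
    "inter-word (morphology)": ["collocation", "preposition",
                                "word class confusion", "numbers",
                                "transitive verbs"],
    "discourse": ["determiner", "verb tense"],
    # spelling/contraction/orthography, punctuation, word order, and
    # sentence structure errors have no clear evidence words
    "no clear evidence": ["others"],
}
```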
The complete list of error types and corresponding evidence words is shown in Figure 2. The definition of each category is given in Appendix A.1.

Annotation Procedure
Our annotators are L2 speakers who hold degrees in English and linguistics, demonstrating their proficiency and expertise in English. The data are grouped into batches of 100 samples, each containing an erroneous sentence and its correction. The annotators are first trained on labeled batches until their F1 scores are comparable to those of the main author. After that, annotators are asked to classify the type of each correction and highlight evidence words that support the correction in the unlabeled batches. The evidence should be informative enough to support the underlying grammar of the correction and, at the same time, complete enough to include all possible evidence words. For each completed batch, an experienced inspector re-checks 10% of the batch to ensure annotation quality. According to the inspection results, if the F1 score of the annotation is lower than 90%, the batch is rejected and assigned to another annotator.

Data Statistics
The detailed statistics of EXPECT are listed in Table 1.

Evaluation Metrics
We consider our task a token classification task. Thus, we employ token-level (precision, recall, F1, and F0.5) and sentence-level (exact match, label accuracy) evaluation metrics. Specifically, the exact match requires identical error types and evidence words between label and prediction, and label accuracy measures the classification performance on error types. To explore which automatic metric best agrees with human evaluation, we compute the Pearson correlation (Freedman et al., 2007) between the automatic metrics and human judgment. As shown in Table 2, F0.5 achieves the highest correlation, and precision correlates better with human judgment than recall. The reason may be that finding precise evidence words is more instructive than exhaustively extracting all evidence words for explainable GEC.
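The token-level metrics and the exact-match criterion described above can be sketched as follows (a minimal illustration over evidence-word index sets; the dataset's actual evaluation script may differ in micro-averaging details):

```python
def token_f_beta(pred, gold, beta=0.5):
    """Token-level precision/recall/F_beta over predicted and gold
    evidence-word index sets for one instance."""
    pred, gold = set(pred), set(gold)
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    if p == 0.0 and r == 0.0:
        return p, r, 0.0
    b2 = beta * beta
    return p, r, (1 + b2) * p * r / (b2 * p + r)

def exact_match(pred_words, pred_type, gold_words, gold_type):
    # exact match requires identical error type AND evidence words
    return set(pred_words) == set(gold_words) and pred_type == gold_type
```

With beta=0.5, the score weights precision more heavily than recall, matching the observation that precise evidence is more instructive.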

Methods
In this section, we define the task of explainable GEC in Section 4.1, then introduce the labeling-based baseline method in Section 4.2 and the interaction-based method in Section 4.3.

Figure 3: An illustration of labeling-based methods with syntax for solving explainable GEC. On the right is the dependency parse tree of the corrected sentence, where the correction word "are" is marked in red, and the 1st- and 2nd-order nodes are marked with red circles.

Task Formulation
The task input is a pair of sentences: an erroneous sentence X = {x_1, x_2, ..., x_n} and the corresponding corrected sentence Y = {y_1, y_2, ..., y_m}. The two sentences usually share a large overlap. The difference between the two sentences is defined as a span edit (s_x, s_y). The task of explainable GEC is to find the grammar evidence span E_x within X and predict the corresponding error type class c. Take Figure 3 as an example: s_x = "is", s_y = "are", and the evidence span E_x = "Evidence words".
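As a rough illustration of the span-edit notation, the edit (s_x, s_y) between an aligned sentence pair can be extracted with a standard token-level diff (a sketch; EXPECT's actual alignment procedure is not specified here):

```python
import difflib

def span_edit(x_tokens, y_tokens):
    """Extract the span edit (s_x, s_y) between an erroneous sentence X
    and its correction Y. EXPECT instances contain a single error, so
    we return the first non-equal region."""
    sm = difflib.SequenceMatcher(a=x_tokens, b=y_tokens)
    edits = [(x_tokens[i1:i2], y_tokens[j1:j2])
             for tag, i1, i2, j1, j2 in sm.get_opcodes()
             if tag != "equal"]
    return edits[0] if edits else ([], [])

sx, sy = span_edit("Evidence words is important .".split(),
                   "Evidence words are important .".split())
# sx == ["is"], sy == ["are"]
```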

Labeling-based Method
We adopt the labeling-based method for explainable GEC.
Input. We concatenate the erroneous sentence X and the corresponding error-free sentence Y, formed as [CLS] X [SEP] Y [SEP].

Correction Embedding. To enhance the positional information of the correction, we adopt a correction embedding e_c to encode the position of the correction words in sentences X and Y. We add e_c to the embeddings of the BERT-based structure as follows: e = e_t + e_p + e_c, where e_t is the token embedding and e_p is the position embedding.
Syntactic Embedding. There is a strong relation between evidence words and syntax, as shown in Section 5.3. Hence we inject prior syntactic information into the model. Firstly, given the corrected sentence Y and its span edit (s_x, s_y), we parse Y with an off-the-shelf dependency parser from the AllenNLP library (Gardner et al., 2018). For each word in s_y, we extract its first-order and second-order dependent words in the dependency parse tree. For example, as shown in Figure 3, for the correction word s_y = "are", the first-order dependent word is "important", and the second-order dependent words are "words" and "for"; they are marked separately. By combining all correction edits' first-order and second-order words, we construct the syntactic vector d_Y ∈ R^m for sentence Y. Dependency parsing, however, is originally designed for grammatical sentences.
To acquire the syntax vector of the erroneous sentence X, we therefore use word alignment to map the syntax-order information from the corrected sentence onto the erroneous sentence, yielding d_X ∈ R^n. We then convert [d_X, d_Y] into a syntactic embedding e_s and add it to the original word embedding: e = e_t + e_p + e_c + e_s.

Encoder. We adopt a pre-trained language model (e.g., BERT) as an encoder to encode the input e, yielding a sequence of hidden representations H.
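The first- and second-order dependent extraction described above can be sketched as a short breadth-first walk over the dependency tree (a minimal illustration; the head-index encoding of the tree is an assumption, not the AllenNLP output format):

```python
def k_order_neighbors(heads, idx, k=2):
    """Collect 1st- through k-th-order neighbors of token `idx` in a
    dependency tree given as a head-index array (heads[i] is the parent
    of token i; the root points to -1). Returns {token index: order},
    mirroring how the syntactic vector marks correction-adjacent words."""
    n = len(heads)
    adj = [set() for _ in range(n)]
    for child, head in enumerate(heads):
        if head >= 0:  # treat the tree as an undirected graph
            adj[child].add(head)
            adj[head].add(child)
    orders, frontier, seen = {}, {idx}, {idx}
    for order in range(1, k + 1):
        frontier = {nb for t in frontier for nb in adj[t]} - seen
        for t in frontier:
            orders[t] = order
        seen |= frontier
    return orders
```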
Label Classifier. The hidden representation H is fed into a classifier to predict the label of each word. The classifier is a linear layer with a softmax activation: l̂_i = softmax(W h_i + b), where l̂_i is the predicted label for the i-th word, and W and b are the parameters of the softmax layer.
Training. The model is optimized with a log-likelihood loss. For each sentence, the training objective is to minimize the cross-entropy between l_i and l̂_i for a labeled gold-standard sentence.
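The classifier and loss above amount to a per-token softmax with cross-entropy; a minimal numpy sketch (illustrative shapes only, not the actual model code):

```python
import numpy as np

def token_classifier(H, W, b, gold):
    """Per-token softmax classifier with cross-entropy loss, matching
    l̂_i = softmax(W h_i + b).
    H: (n, d) hidden states, W: (L, d), b: (L,), gold: (n,) label ids."""
    logits = H @ W.T + b                          # (n, L)
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    preds = probs.argmax(axis=1)                  # predicted label per token
    loss = -np.log(probs[np.arange(len(gold)), gold]).mean()
    return preds, loss
```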

Interaction-based Method
Although labeling-based methods model the paired sentences with a joint encoder, they still predict two separate outputs independently; the dependencies between the erroneous sentence and the corrected sentence are not explicitly modeled. Intuitively, the alignment between the erroneous sentence and the corrected sentence can be highly informative. We propose an interactive matrix to jointly model the alignment and the evidence span. In particular, we adopt a bi-affine classifier to model the multiplicative interactions between the erroneous sentence and the corrected sentence. Assume that the hidden representations of the erroneous sentence and the corrected sentence are H_e and H_c, respectively. We first use two separate feed-forward networks to map them into an erroneous query representation and a corrected key representation: H_q = FFN_q(H_e) and H_k = FFN_k(H_c). Then a bi-affine attention (Dozat and Manning, 2016) models the interaction between H_q and H_k: M = H_q^T U H_k, where U ∈ R^{|H|×|H|×|L|}, and |H| and |L| indicate the hidden size and the size of the label set.
Training. Similar to the labeling-based method, the training objective is to minimize the cross-entropy between M and M̂ for a labeled gold-standard sentence.

Syntactic Interactive Matrix. Similar to the syntactic embedding, we use a syntactic interactive matrix to better merge syntactic knowledge into the model. We construct the syntactic interactive matrix D_syn in the same way as the syntactic embedding above, except using an interactive matrix rather than a flat embedding. Figure 4 shows an example of a syntactic matrix, where the row of the correction index in the erroneous sentence holds the syntactic vector of the corrected sentence, and the column of the correction index in the corrected sentence holds the erroneous sentence's syntactic vector. A two-layer MLP then maps D_syn to H_syn, which is used as an auxiliary term when calculating the interaction matrix M.
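The bi-affine scoring between H_q and H_k can be sketched in numpy as a single einsum (shapes are illustrative; the trained model additionally feeds in the syntactic term H_syn, omitted here):

```python
import numpy as np

def biaffine_scores(Hq, Hk, U):
    """Bi-affine interaction: M[i, j, l] = Hq[i] · U[:, :, l] · Hk[j],
    giving a label score for every (erroneous-token, corrected-token)
    pair. Hq: (n, d), Hk: (m, d), U: (d, d, L) -> M: (n, m, L)."""
    return np.einsum("id,del,je->ijl", Hq, U, Hk)
```

Each cell of M jointly encodes an alignment between the two sentences and a label for that pair, which is what lets the model predict evidence spans and error types in one pass.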

Baseline Methods
Human performance is reported: we employ three NLP researchers to label the test set and report the average score. The generation-based method frames the task as text generation: it uses a pre-trained generation model to predict the error type and generate a corrected sentence with evidence words highlighted by special tokens.
The labeling-based (error only) method uses only erroneous sentences as input and predicts explanations directly.
The labeling-based (correction only) method uses only corrected sentences as input and predicts explanations directly.
The labeling-based (with appendix) method uses only erroneous or corrected sentences and appends the correction words at the end of the sentence.
The labeling-based (error and correction) method concatenates erroneous and corrected sentences as described in Section 4.2.

Main Results
The model performance under different settings is shown in Table 3.
We evaluate model performance across a variety of settings: generation-based, labeling-based, and interaction-based, each with and without syntactic knowledge. First, we find that generation-based methods do not outperform labeling-based methods and suffer from poor inference efficiency due to auto-regressive decoding. In addition, interaction-based methods exhibit higher precision but lower recall compared to labeling-based methods, likely because the interaction between the two sentences helps the model identify evidence words more precisely. On top of labeling-based methods, adding syntactic information brings a marginal 0.28-point F0.5 increase, while for interaction-based methods the performance increases by 1.94 F0.5 points. This suggests that syntactic information can generally provide an indication for identifying evidence words, and that the interaction matrix incorporates syntactic information into the model more effectively. In particular, we find correction embeddings to be quite important for this task: with correction embeddings, performance increases by 2.64 F0.5 points on the dev set and 1.16 points on the test set. Finally, interaction-based methods with syntactic knowledge achieve the best performance in precision, F0.5, exact match, and accuracy.

Impact of Syntactic Knowledge
To further explore the role of syntactic knowledge in boosting explainable GEC performance, we first analyze the relation between evidence words and the correction words' adjacent nodes in the dependency parse tree. As shown in Table 4, a large proportion of instances have at least one evidence word within the correction words' first-order nodes, and 27.02% of instances have all their evidence words within second-order nodes. We can infer that syntactic knowledge can, to some extent, narrow the search space for extracting evidence words.

Model Performance across Syntactic Distance.
We compare F0.5 scores for instances whose evidence words fall inside and outside the 1st- and 2nd-order dependents in Figure 5. The overall performance decreases when evidence words lie outside the 2nd dependent order, indicating that the model has trouble handling complex syntactic structures. After injecting syntactic knowledge, however, performance increases in all sections, suggesting the effectiveness of the syntactic representation.

Figure 6: F0.5 score comparison of syntax-related error types between syntactic and non-syntactic methods. POS - POS Confusion.

Benefit of Syntactic Representation. We report F0.5 scores on specific error types before and after injecting syntactic information into the models in Figure 6. Dependency parsing is a common tool for detecting SVA errors (Sun et al., 2007), and the performance on SVA indeed increases with syntax. We also find four other error types closely associated with syntactic information, namely auxiliary verb, collocation, POS confusion, and number, whose performance increases significantly for both the labeling-based and interaction-based methods.

Impact of Sentence Length
Table 5 illustrates the model performance across different lengths of erroneous sentences. As sentence length increases, the performance of all methods decreases significantly, which is consistent with human intuition: longer sentences may contain more complex syntactic and semantic structures, which are challenging for models to capture.

Result on Real-world GEC System
We employ the gold correction as input during both the training and inference phases. In a practical scenario, however, this input would be replaced with the output of a GEC system. To evaluate the performance of the explainable system when paired with real-world GEC systems, we take the interaction-based method with syntactic knowledge trained on EXPECT and directly test it on the samples annotated from the GEC models' outputs on the W&I+LOCNESS test set. The F0.5 scores obtained are 57.43 for T5-large outputs and 60.10 for GECToR-Roberta outputs, both of which significantly underperform the 68.39 obtained with gold corrections. This may be caused by the training-inference gap mentioned above and by error propagation from the GEC system.

Human Evaluation
To assess the effectiveness of explainable GEC in helping second-language learners understand corrections, we randomly sample 500 instances with gold GEC corrections and 501 outputs decoded by an off-the-shelf GEC system, GECToR (Omelianchuk et al., 2020), and predict their evidence words and error types using the interaction-based model with syntactic knowledge. We recruit 5 second-language learners as annotators to evaluate whether the predicted explanations are helpful in understanding the GEC corrections. The results show that 84.0% and 82.4% of the model predictions (for gold GEC corrections and GECToR outputs, respectively) come with explanations, and that 87.9% and 84.5% of these explanations, respectively, are helpful for a language learner to understand the correction and correct the sentence. This shows that an explainable GEC system trained on EXPECT can serve as a post-processing module for current GEC systems.

Case Study
We identify two phenomena from our syntactic and non-syntactic labeling-based models:

Distant Words Identification. The non-syntactic model makes errors because it does not incorporate explicit syntactic modeling, particularly in long and complex sentences where distant evidence words are difficult to identify. As shown in the first case of Figure 7, the non-syntactic model fails to consider evidence words such as "apply" that are located far away from the correction, whereas the syntactic model identifies the evidence word "apply".
Dependency Parsing Errors. Some evidence word identification errors stem from misleading parsing results on long sentences (Ma et al., 2018). As shown in the second case of Figure 7, the model with syntactic knowledge uses an inaccurate parse tree (in the green box) from the off-the-shelf parser, which results in identifying the redundant word "off".
Undertaking a scholarship and admission to one of the universities I have selected above will provide me with the opportunity to apply the knowledge gained at high school [into->in] a business setting.

On the other hand, many teens who take a year off end up [to spend->spending] it in the wrong way.

Figure 7: Case study. The first case shows the identification problem for distant evidence words. The second case shows the error caused by wrong dependency parsing results.

Conclusion
We introduce EXPECT, an explainable dataset for grammatical error correction, which contains 21,017 instances with evidence words and error categorization annotations. We implement several models and perform a detailed analysis to understand the dataset better. Experiments show that injecting syntactic knowledge can help models boost their performance. Human evaluation verifies that the explanations provided by the proposed explainable GEC systems are effective in helping second-language learners understand the corrections. We hope that EXPECT facilitates future research on building explainable GEC systems.

Limitations
The limitations of our work can be viewed from two perspectives. Firstly, we have not thoroughly investigated seq2seq architectures for explainable GEC. Secondly, the current input of the explainable system is the gold correction during training, whereas, in practical applications, the input would be the output of a GEC system. We have not yet explored methods to bridge this gap.

Ethics Consideration
We annotate the proposed dataset based on W&I+LOCNESS, without copyright constraints for academic use. For human annotation (Section 3.3 and Section 5.6), we recruit our annotators from the linguistics departments of local universities through public advertisement with a specified pay rate. All of our annotators are senior undergraduate students or graduate students in linguistic majors who took this annotation as a part-time job.
We pay them 60 CNY per hour; the local minimum wage in 2022 was 25.3 CNY per hour for part-time jobs. The annotation does not involve any personally sensitive information, and annotators are only required to label factual information (i.e., evidence words inside the sentence).