Multi-Class Grammatical Error Detection for Correction: A Tale of Two Systems

In this paper, we show how a multi-class grammatical error detection (GED) system can be used to improve grammatical error correction (GEC) for English. Specifically, we first develop a new state-of-the-art binary detection system based on pre-trained ELECTRA, and then extend it to multi-class detection using different error type tagsets derived from the ERRANT framework. Output from this detection system is used as auxiliary input to fine-tune a novel encoder-decoder GEC model, and we subsequently re-rank the N-best GEC output to find the hypothesis that most agrees with the GED output. Results show that fine-tuning the GEC system using 4-class GED produces the best model, but re-ranking using 55-class GED leads to the best performance overall. This suggests that different multi-class GED systems benefit GEC in different ways. Ultimately, our system outperforms all other previous work that combines GED and GEC, and achieves a new single-model NMT-based state of the art on the BEA-test benchmark.


Introduction
Grammatical error detection (GED) is the task of automatically detecting grammatical errors in text, while grammatical error correction (GEC) is the task of also correcting these errors. Both tasks have obvious pedagogical applications that can benefit both teachers and students in online language learning. GED is typically cast as a binary sequence labelling task, where each token is classified as either correct or incorrect (Rei and Yannakoudakis, 2016; Bell et al., 2019), while GEC is often considered a sequence-to-sequence translation task, where systems learn to "translate" an ungrammatical input sentence to a grammatical output sentence (Yuan and Briscoe, 2016; Junczys-Dowmunt et al., 2018; Kiyono et al., 2019; Lichtarge et al., 2020). Recent work has also begun to treat GEC as a sequence labelling task, where tokens are classified in terms of edit operations (Awasthi et al., 2019; Omelianchuk et al., 2020). We similarly treat GED as a sequence labelling task and GEC as a sequence-to-sequence task, but additionally investigate different ways to combine and extend both approaches.
In particular, we first experiment with pre-trained language models and show that simply fine-tuning ELECTRA (Clark et al., 2020) leads to significant improvements in binary GED and achieves a new state of the art. However, given that binary detection is limited in the error type information it can provide to downstream tasks, we also extend our GED system to 4-class, 25-class and 55-class error detection using different error type tagsets derived from the ERRANT framework (Bryant et al., 2017).
To illustrate the value of multi-class GED for downstream GEC, we extend the Transformer encoder-decoder model (Vaswani et al., 2017) and employ a multi-encoder GEC model (Yuan and Bryant, 2021). We also introduce a two-step training strategy which only requires additional input information from a small dataset for fine-tuning. Specifically, we experiment with two methods that use GED to inform GEC: i) we use GED predictions as auxiliary input to fine-tune a model; and ii) we use GED predictions as a means to re-rank system output during post-processing.
To summarise, we present the first study using multi-class GED to improve GEC for English. Our main contributions are:
• We obtain a new state of the art in binary GED on three benchmark datasets.
• We empirically show that current Transformer-based language models are capable of much more fine-grained error detection, with minimal impact on overall binary F 0.5.
• We propose a novel multi-encoder GEC model and two-step training strategy, which together prove effective at incorporating an additional GED signal.
• We demonstrate how multi-class GED can improve a GEC model by i) using GED predictions during fine-tuning; and ii) using GED predictions as a basis for re-ranking.
• We report competitive performance with the state of the art using a single model without task-specific adaptation.

Previous work
Early approaches to GED focused on specific error types, in particular article and preposition errors, which are among the most frequent in non-native English learner writing (Han et al., 2004; Tetreault and Chodorow, 2008). More general open-class GED systems were later developed using parse and text-based features (Gamon, 2011). Rei and Yannakoudakis (2016) presented the first work using a neural approach, framing GED as a binary sequence labelling problem that classifies each token in a sentence as either correct or incorrect. Subsequent improvements were obtained through auxiliary objectives (Rei, 2017) and by incorporating artificial training data (Kasewa et al., 2018). Recent work takes advantage of large-scale pre-trained language models (Bell et al., 2019).

Beyond the development of machine learning classifiers for specific error types (De Felice and Pulman, 2008; Rozovskaya and Roth, 2011), GEC has been formulated as a monolingual machine translation task that corrects all error types simultaneously. Both statistical machine translation (SMT) and neural machine translation (NMT) have been successfully applied to GEC with various task-specific adaptations (Felice et al., 2014; Junczys-Dowmunt and Grundkiewicz, 2016; Yuan and Briscoe, 2016; Junczys-Dowmunt et al., 2018; Yuan et al., 2019). With recent advances in sequence-to-sequence modelling and the introduction of the Transformer architecture, state-of-the-art results have been reported (Kaneko et al., 2020; Lichtarge et al., 2020). Meanwhile, sequence-tagging approaches have been proposed for GEC, where systems learn to predict a sequence of edit operations (Awasthi et al., 2019; Omelianchuk et al., 2020). In fact, Omelianchuk et al. (2020) achieved the current state of the art by incorporating pre-trained Transformer encoders and employing iterative tagging.
Previous work has attempted to combine these two closely related tasks and explore different ways of using GED in GEC. Zhao et al. (2019) and Yuan et al. (2019) employed multi-task learning and introduced token-level and sentence-level GED as auxiliary tasks when training for GEC. Kaneko et al. (2020) fine-tuned BERT for binary GED and then incorporated their model into an encoder-decoder GEC framework. Similarly, Chen et al. (2020) fine-tuned RoBERTa for GED and reformatted the input to include error span information, which was used by their encoder-decoder model. Our approach of using additional GED input during GEC training differs from theirs in that we only use a small set of training examples with GED information for fine-tuning. GED training and GEC training are not coupled together, so GED predictions from any system can be used.
Binary GED has also been used in post-processing to re-rank GEC system output (Yuan et al., 2019) or to filter out unnecessary corrections (Kiyono et al., 2019). Similarly, Chollampatt and Ng (2018) and Liu et al. (2021) applied quality estimation approaches to re-rank GEC output. These systems often use additional GED and/or GEC features, however, and require extra tuning. In contrast, our simple re-ranking approach only uses GED predictions and does not require tuning. Our work differs most centrally in that we treat GED as a multi-class problem and investigate ways to use multi-class GED predictions to inform GEC.
The edit annotations in all these corpora were pre-processed and standardised using the ERRANT annotation framework. One advantage of this framework is that error types are modular, consisting of an "operation" tag plus a "main" type tag; e.g. R:NOUN for a replacement noun error. We exploit this modularity in our multi-class GED experiments: the 4-class tagset consists of the operation type only (i.e. missing, replacement, unnecessary and correct), the 25-class tagset of the main type only (e.g. noun, noun number, verb tense, etc.), and the 55-class tagset of the full combined tags.2 An example is shown in Table 1.
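The mapping between full ERRANT tags and the coarser tagsets can be sketched as follows. This is an illustrative helper, not the paper's actual code; tag names follow the ERRANT "operation:main" convention (e.g. R:NOUN, or R:NOUN:NUM for a multi-part main type).

```python
def coarsen_tag(errant_tag: str, granularity: int) -> str:
    """Map a full ERRANT tag to a 4-, 25- or 55-class label."""
    if errant_tag == "C":         # correct token keeps its label everywhere
        return "C"
    op, _, main = errant_tag.partition(":")
    if granularity == 4:          # operation only: M / R / U (+ C)
        return op
    if granularity == 25:         # main type only, e.g. NOUN, NOUN:NUM
        return main
    return errant_tag             # 55-class: full combined tag

print(coarsen_tag("R:NOUN", 4), coarsen_tag("R:NOUN", 25), coarsen_tag("R:NOUN", 55))
# R NOUN R:NOUN
```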

Grammatical error detection
Following Rei and Yannakoudakis (2016), we treat GED as a sequence labelling (or token classification) task and assign a label to each token in the input sentence, indicating whether it is correct or incorrect (i.e. binary classification) or which error type it belongs to at different levels of granularity (i.e. multi-class classification). We first perform binary classification to compare our models with current state-of-the-art systems. We further extend them with multi-class error classification with the aim of improving downstream GEC.

Approach
We employ three state-of-the-art pre-trained language representation models: BERT (Devlin et al., 2019), XLNet (Yang et al., 2019) and ELECTRA (Clark et al., 2020). Although their pre-training architectures differ, they are all Transformer-based models, on top of which we add a linear classification layer. We fine-tune these models on annotated GED data for a small number of epochs. BERT is one of the most popular and pioneering transfer-learning methods, consisting of a stack of self-attention encoder layers for which pre-trained model weights are available. XLNet aims to overcome some shortcomings of BERT by using an auto-regressive architecture that relies on permutation rather than masking during pre-training. ELECTRA is an extension of BERT with a different pre-training task: it is trained as a discriminator (rather than a generator) to detect replaced tokens. Intuitively, this objective of discriminating between plausible and implausible tokens makes it more closely related to GED.

1 Detailed corpus statistics are given in Appendix A. Public data is available at: https://www.cl.cam.ac.uk/research/nl/bea2019st#data
2 See Appendix A in Bryant et al. (2017) for all combinations.
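The classifier architecture described above, a pre-trained encoder with a linear token-classification layer on top, can be sketched in PyTorch. This is a minimal illustration: `nn.Identity()` stands in for a real fine-tuned BERT/XLNet/ELECTRA encoder, and the dimensions and 4-class tagset size are arbitrary.

```python
import torch
import torch.nn as nn

class GEDClassifier(nn.Module):
    """Pre-trained encoder + linear token-classification layer (sketch)."""
    def __init__(self, encoder: nn.Module, hidden: int, num_labels: int):
        super().__init__()
        self.encoder = encoder
        self.classifier = nn.Linear(hidden, num_labels)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        h = self.encoder(tokens)        # (batch, seq_len, hidden)
        return self.classifier(h)       # (batch, seq_len, num_labels)

# Toy usage with an identity "encoder" of hidden size 8 and a 4-class tagset:
model = GEDClassifier(nn.Identity(), hidden=8, num_labels=4)
logits = model(torch.randn(2, 5, 8))
print(logits.shape)  # torch.Size([2, 5, 4])
```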

Error detection experiments
All datasets contain manually annotated spans of various types of errors, together with their suggested corrections. We convert these spans into token-based labels, assigning missing-word labels to the token on the right of the span. This is consistent with previous work and necessary because missing words fall between tokens and would otherwise not be represented. For binary detection, each token is labelled as either correct 'C' or incorrect 'I'. For multi-class detection, we use ERRANT error types at different levels of granularity (4-class, 25-class and 55-class) as described in Section 3 and exemplified in Table 1. We perform fine-tuning using the Adam optimiser with a learning rate of 3 × 10⁻⁵ for all three models.

For binary GED, we follow Rei and Yannakoudakis (2016) and report token-level precision, recall and F 0.5 for detecting incorrect labels. For multi-class GED, we report 1) binarised F 0.5, which is the score for detecting any non-C labels regardless of class, and 2) macro-averaged F 0.5, which is the average F 0.5 across all classes. Results are presented in Table 2. As can be seen, all our Transformer-based GED models outperform the current state-of-the-art systems on all test sets by large margins, with fine-tuned ELECTRA performing the best overall. We believe that this is due to its intuitively more GED-relevant discriminative pre-training objective.
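The span-to-token label conversion described above can be sketched as follows. This is an illustrative simplification assuming edits are given as token-offset spans; it ignores the corner case of a missing word at the very end of a sentence.

```python
def spans_to_labels(tokens, edits):
    """Convert edit spans to per-token GED labels.

    edits: list of (start, end, tag) over token offsets; end == start
    denotes a missing word at that position, which is assigned to the
    token on the RIGHT of the span.
    """
    labels = ["C"] * len(tokens)
    for start, end, tag in edits:
        if start == end:                    # missing word
            if start < len(tokens):
                labels[start] = tag
        else:                               # replacement / unnecessary span
            for i in range(start, end):
                labels[i] = tag
    return labels

# "I am student" with a missing determiner before "student":
print(spans_to_labels(["I", "am", "student"], [(2, 2, "M:DET")]))
# ['C', 'C', 'M:DET']
```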
Multi-class GED Given that ELECTRA performed best at binary GED, we use it in our multi-class GED experiments. Table 3 shows the binarised and macro-averaged F 0.5 scores for different binary and multi-class GED systems. As expected, macro-averaged scores for multi-class classification decrease as the number of classes increases, due to the sparsity of the labels when more error types are added. It is interesting to note, however, that adding more error types does not significantly affect binarised detection performance. For example, the binarised performance of the 4-class and 55-class models is slightly higher than that of the binary model on BEA-dev (66.10 and 65.81 vs. 65.54). This may suggest that all systems are capable of detecting roughly the same number of errors regardless of the number of classes, and generally struggle only with the specific class labels themselves.
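The two multi-class metrics used here can be computed as in the following sketch. This is our own implementation, not the paper's evaluation code; in particular, whether the correct class 'C' is included in the macro average is an assumption on our part.

```python
def f05(tp, fp, fn, beta=0.5):
    """F-beta from raw counts (beta = 0.5 weights precision over recall)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r) if p + r else 0.0

def counts(gold, pred, is_hit):
    tp = sum(is_hit(g) and is_hit(p) for g, p in zip(gold, pred))
    fp = sum(not is_hit(g) and is_hit(p) for g, p in zip(gold, pred))
    fn = sum(is_hit(g) and not is_hit(p) for g, p in zip(gold, pred))
    return tp, fp, fn

def binarised_f05(gold, pred):
    # Any non-C label counts as a detection, regardless of class.
    return f05(*counts(gold, pred, lambda t: t != "C"))

def macro_f05(gold, pred):
    classes = set(gold) | set(pred)
    return sum(f05(*counts(gold, pred, lambda t, c=c: t == c))
               for c in classes) / len(classes)

gold = ["C", "R:NOUN", "C", "M:DET"]
pred = ["C", "R:VERB", "C", "M:DET"]
print(binarised_f05(gold, pred))  # 1.0  (both errors detected somewhere)
print(macro_f05(gold, pred))      # 0.5  (one error got the wrong class)
```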

Using GED to improve GEC
In this section, we investigate different ways of using GED to inform GEC. We use the Transformer sequence-to-sequence model as our baseline and employ a multi-encoder GEC system that takes additional GED predictions as input. We then experiment with two methods of using GED information: i) as auxiliary input, and ii) for re-ranking.

Baseline GEC system
The Transformer follows an encoder-decoder architecture. Each layer of the encoder contains two sub-layers: a multi-head self-attention mechanism and a feed-forward network. The decoder inserts an additional third sub-layer, which performs multi-head attention over the output of the encoder stack. See Vaswani et al. (2017) for more details.

Multi-encoder GEC system
In order to incorporate GED into GEC, we propose a new extension to the standard Transformer encoder-decoder model, which employs a second encoder to take additional GED input (Figure 1).
Encoder The original Transformer encoder reads the source sentence S_src and learns a vector representation c_src as before. An additional encoder is introduced to process any auxiliary GED input S_ged and compute a context representation c_ged.
Decoder Similar to the original Transformer decoder, each layer of the decoder in our model is composed of three sub-layers. The first sub-layer performs masked multi-head self-attention over the known outputs at positions less than i. The second sub-layer now contains two multi-head attention components: a source multi-head attention (MH_src), which performs multi-head attention over the output of the encoder stack for the source sentence c_src, and a new GED multi-head attention (MH_ged), which attends directly to the GED encoder representation c_ged. Afterwards, a linear interpolation with gating is applied:

Gating(MH) = λ · MH_src + (1 − λ) · MH_ged

The gating activation λ is given by:

λ = σ(W [MH_src; MH_ged] + b)

where σ is the logistic sigmoid function, and W and b are learnable parameters. The resulting Gating(MH) is used as input to the third sub-layer, which is a position-wise fully connected feed-forward network.
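The gated interpolation of the two attention outputs can be sketched in PyTorch as follows. This is our reading of the sub-layer described above, not the released code; layer names and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Gated interpolation of source and GED multi-head attention outputs."""
    def __init__(self, hidden: int):
        super().__init__()
        # lambda = sigmoid(W [MH_src; MH_ged] + b)
        self.gate = nn.Linear(2 * hidden, 1)

    def forward(self, mh_src: torch.Tensor, mh_ged: torch.Tensor) -> torch.Tensor:
        lam = torch.sigmoid(self.gate(torch.cat([mh_src, mh_ged], dim=-1)))
        # Gating(MH) = lambda * MH_src + (1 - lambda) * MH_ged
        return lam * mh_src + (1 - lam) * mh_ged

fusion = GatedFusion(hidden=512)
out = fusion(torch.randn(2, 10, 512), torch.randn(2, 10, 512))
print(out.shape)  # torch.Size([2, 10, 512])
```

Note that when the two attention outputs agree, the interpolation returns them unchanged, whatever value the gate takes.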
Two-step training Inspired by the idea of freezing some parameters while fine-tuning the remaining part of the model (Zhang et al., 2018; Yuan and Bryant, 2021), we apply a two-step training strategy to train our proposed model. We divide the model parameters into two subsets:

θ = {θ_src, θ_ged}

where θ_src is the set of original Transformer model parameters, and θ_ged is the set of newly introduced GED component parameters (i.e. the GED encoder, the GED multi-head attention and the gating, highlighted in blue in Figure 1). In the first step, we train a standard encoder-decoder GEC model using standard source-target parallel data:

θ̂_src = argmax_{θ_src} Σ_{n=1}^{N} log P(T_n | S_n; θ_src)

where N is the total number of training examples and (S_n, T_n) is the nth source-target sentence pair.
In the second step, we construct a new dataset by adding GED information (S_ged) to each source-target pair and estimate the GED parameters θ_ged:

θ̂_ged = argmax_{θ_ged} Σ_{n=1}^{M} log P(T_n | S_n, S_n^ged; θ̂_src, θ_ged)    (5)

where M is the total number of training examples in the new fine-tuning dataset (M < N) and (S_n, S_n^ged, T_n) is the nth triplet. Our training strategy differs from most fine-tuning approaches in GEC (e.g. Kiyono et al., 2019; Lichtarge et al., 2020) in that we only update θ_ged and keep θ_src fixed when adding the GED auxiliary input. This prevents overfitting, as the dataset used in the second step is much smaller than the one used in the first step.
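In practice, the second training step amounts to toggling `requires_grad` before constructing the optimiser, so that only the GED components receive updates. A sketch, assuming the GED components can be identified by a `ged` substring in their parameter names (an illustrative convention and toy model, not the paper's code):

```python
import torch

def freeze_src_params(model: torch.nn.Module) -> None:
    """Leave gradients enabled only for GED component parameters."""
    for name, param in model.named_parameters():
        param.requires_grad = "ged" in name   # theta_ged stays trainable

class Toy(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.src_encoder = torch.nn.Linear(4, 4)   # stands in for theta_src
        self.ged_encoder = torch.nn.Linear(4, 4)   # stands in for theta_ged

toy = Toy()
freeze_src_params(toy)
# Hand only the trainable subset to the optimiser:
optimiser = torch.optim.Adam(
    (p for p in toy.parameters() if p.requires_grad), lr=3e-5)
print(sorted(n for n, p in toy.named_parameters() if p.requires_grad))
# ['ged_encoder.bias', 'ged_encoder.weight']
```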

Using GED as auxiliary input
Experimental setup Following previous work that advocates fine-tuning GEC models on high-quality, in-domain data (Kiyono et al., 2019; Lichtarge et al., 2020; Yuan and Bryant, 2021), we pre-train two GEC systems on public Lang-8 data (constrained) and private CLC data (unconstrained) respectively, and fine-tune both on W&I + FCE + NUCLE. The constrained system thus only uses public data released in the BEA-2019 shared task. Two GED systems are similarly fine-tuned on the Lang-8 and CLC pre-training data, and are used to make predictions on the W&I + FCE + NUCLE fine-tuning data that serve as auxiliary input to the GEC system. We use byte pair encoding (BPE) (Sennrich et al., 2016) with 30k merge operations. For both the Transformer baseline and our multi-encoder GEC models, the hidden size is set to 512 and the filter size to 2,048. In training, we use the Adam optimizer (Kingma and Ba, 2015) with the default parameters. Batch size is 3k tokens. All other settings follow the 6-layer 'Transformer (base)' model of Vaswani et al. (2017). We use four Tesla P40 GPUs for training and one for decoding. Experiments are carried out with Fairseq.
Results GEC performance on BEA-dev (evaluated by ERRANT) in the unconstrained setting is reported in Table 4. For the baseline model, we use the same pre-training and fine-tuning data splits, but with no additional GED input for fine-tuning, which follows the standard encoder-decoder GEC training procedure. For our new multi-encoder GEC model, we experiment with predictions from both the binary and multi-class GED models introduced in Section 4. The results demonstrate the efficacy of the multi-encoder GEC model: adding GED predictions as auxiliary input yields a consistent statistically significant improvement in performance over the baseline. 6 Our best system uses the 4-class GED predictions, achieving 52.84 F 0.5 , followed by binary (52.20), 25-class (51.78) and 55-class (51.52). We suspect the 4-class system works best here because it represents the best compromise between label informativeness and model reliability. In contrast, binary predictions tend to be less informative but more reliable because there are only two classes, while 25-class and 55-class predictions tend to be more informative but less reliable because of the increased difficulty in predicting sparser classes. These results nevertheless show that multi-class GED provides GEC with new information, and we expect further performance gains with better multi-class GED systems. We also notice that almost all the improvements come from recall, with only a negligible influence on precision.
Finally, we also analysed the performance of each operation error type in the 4-class GEC system and found that most gains come from missing word errors (+6.66 F 0.5), which happen to be the worst performing type in the baseline system. 7

Oracle To better understand our model's capacity to use GED information, we estimate an upper bound for the GEC system by using gold multi-class detection labels as auxiliary input. This gives the maximum performance our multi-encoder GEC system can attain given a perfect GED system. In Table 4, we see that our system benefits the most when the finest-grained error type information is provided. Specifically, the maximum attainable score is achieved by the 55-class oracle GED system (70.24 F 0.5), followed by the 25-class (68.36 F 0.5), 4-class (67.86 F 0.5) and binary (63.68 F 0.5) systems. This further supports the idea that the main bottleneck in a practical system is the reliability of the GED predictions rather than the informativeness of the labels. We finally observe only a small difference between the 4-class and 25-class oracle GED labels, suggesting that the 4-class operation labels (i.e. missing, replacement, unnecessary and correct) are about as informative as the 25-class POS-based error types for our multi-encoder GEC system.

6 We perform two-tailed paired T-tests, where p < 0.001.
7 Error analysis results are included in Appendix C.

Table 4: ERRANT span-level correction results on BEA-dev in the unconstrained setting of our multi-encoder GEC system using different GED models and oracle detection. The highest GED and oracle results are in bold.

Using GED for re-ranking
The GEC decoder generates multiple hypothesis sentences, from which the sentence with the highest confidence score is predicted as the correction. Inspired by Yuan et al. (2019), we take advantage of these hypotheses and employ a re-ranking approach that uses GED output to further improve our GEC results. Specifically, we i) generate a 10-best list of candidate hypotheses for each sentence, ii) align each hypothesis with the source sentence using ERRANT to extract the edits, and iii) convert the edit spans to token-based detection labels as described in Section 4.2. This produces a list of hypotheses, where each hypothesis H_i = (h_{i,1}, h_{i,2}, ..., h_{i,l}), i ∈ {1, ..., 10}, consists of the error-type labels for each token in a source sentence of length l, as predicted by the GEC system. We then use a GED system to predict a corresponding set of labels D = (d_1, d_2, ..., d_l) for each source token, and re-rank the hypotheses based on the minimum Hamming distance between H_i and D. This selects the hypothesis with maximal overlap with the (multi-class) GED predictions, on the assumption that a hypothesis is more likely correct when more error-type labels from the two systems agree. It is finally worth mentioning that we do not use any other features in our approach, not even the original scores output by the GEC system, so our simple re-ranking method can also be applied to any number of hypotheses from multiple systems.
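The re-ranking step can be sketched as follows. This is an illustrative implementation; note that Python's `min` breaks ties in favour of the earlier, i.e. originally higher-ranked, hypothesis.

```python
def hamming(a, b):
    """Number of positions at which two label sequences disagree."""
    return sum(x != y for x, y in zip(a, b))

def rerank(hypothesis_labels, ged_labels):
    """Return the index of the hypothesis whose token-level error-type
    labels have minimum Hamming distance to the GED predictions.

    hypothesis_labels: one per-token label sequence per N-best hypothesis,
    all aligned to the same source tokens.
    """
    return min(range(len(hypothesis_labels)),
               key=lambda i: hamming(hypothesis_labels[i], ged_labels))

ged = ["C", "M:DET", "C", "R:NOUN"]          # GED predictions for the source
hyps = [
    ["C", "C", "C", "R:NOUN"],               # distance 1
    ["C", "M:DET", "C", "R:NOUN"],           # distance 0  <- selected
    ["C", "M:DET", "C", "C"],                # distance 1
]
print(rerank(hyps, ged))  # 1
```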

Results
We perform re-ranking using both GED predictions and oracle labels as before. We report results on BEA-dev using the binary, 4-class, 25-class and 55-class GED labels. As can be seen in Figure 2, using GED to re-rank GEC output improves the results consistently and significantly. Interestingly, performance increases gradually with the granularity of the GED labels; the best re-ranking model uses 55-class GED (54.35 F 0.5). We suspect this is because the 55-class labels are the most discriminative of the tagsets, even if they are potentially the noisiest. The performance boost from using GED predictions is still far from what the model can achieve with oracle labels, however. This suggests that having more accurate GED models is essential and that more fine-grained error types can be effectively incorporated into GEC.
Looking deeper into the performance of our best re-ranking GEC system, which uses 55-class GED, we see that the model's F 0.5 score increases by +9.97 for missing and +10.98 for unnecessary word errors. The improvement is much smaller for replacement errors, however (+1.13). This is not the case with oracle labels, where the improvement for replacement errors is also substantial (+17.97).

Final GEC results
In our final experiment, we combine both methods and apply the 55-class GED re-ranking strategy with our best multi-encoder GEC model which uses 4-class GED in both the constrained and unconstrained settings. We also evaluate our models on two other GEC benchmarks: BEA-test and CoNLL-2014. Results are presented in Table 5, where further performance gains are observed, suggesting that these two methods are complementary.
Comparison with other systems In terms of the constrained setting, which only uses public data released in the BEA-2019 shared task, our system outperforms the only other comparable system by a large margin. Specifically, we outperform Raheja and Alikaniotis (2020) by +9.4 F 0.5 on BEA-test and +7.3 F 0.5 on CoNLL-2014, with the largest gains coming from re-ranking.
In terms of the unconstrained setting, which includes systems trained on additional private and/or artificial data, our system outperforms all other previous work that combines GED and GEC, and furthermore achieves a new single-model NMT-based state of the art on BEA-test. Our closest NMT-based competitor is Stahlberg and Kumar (2021), who hold the current record on CoNLL-2014 (66.6 F 0.5). Although Omelianchuk et al. (2020) score higher than our approach on both test sets, we note that their sequence-tagging approach additionally relies on a carefully curated set of 5,000 language-specific edit tags. Ultimately, we believe we have demonstrated the value of incorporating multi-class GED into GEC, as well as the effectiveness of our proposed approaches.
Error type performance We also perform a detailed error analysis to better understand the performance of our final GEC system. The largest improvement over the baseline is observed in U:CONJ (+64.

Conclusion
We have shown that multi-class GED can be used to significantly improve GEC. First, we showed that fine-tuning a pre-trained Transformer-based language model leads to significant improvements in binary GED. Specifically, we found that fine-tuning ELECTRA, which has a discriminative pre-training objective conceptually similar to GED, produces a new state of the art on three benchmark datasets. We furthermore showed that our models are capable of multi-class detection, obtaining similar binarised F 0.5 performance to binary GED. Next, we employed a multi-encoder GEC model and presented two methods of integrating GED predictions into GEC systems: first during GEC fine-tuning, and second as a post-processing re-ranking step. Results show that both methods, applied independently, significantly improve over a strong NMT-based GEC baseline. When applied together, the methods complement each other, yielding further performance gains. Our best single-model GEC system outperforms all previous systems that combine GED and GEC on both test sets, and all other single-model NMT-based systems on BEA-test.
Our results ultimately demonstrate the advantages of integrating multi-class detection into correction. In particular, different multi-class GED systems benefit GEC in different ways: the 4-class GED model leads to the best performance when fine-tuning the GEC system, but re-ranking with the 55-class GED model produces the best GEC performance overall. Finally, oracle experiments reveal that our proposed GEC systems are very effective at incorporating new GED information, but that there are still significant gains to be made from more accurate GED systems.

Table 9: Precision, recall and F 0.5 for missing, replacement and unnecessary errors for the baseline and our re-ranking GEC systems using 55-class GED predictions and oracle labels on BEA-dev.

Table 11: Error type-specific performance of the baseline and our final GEC system on BEA-dev. We show results for a subset of error types that are most positively (top part) and negatively (bottom part) affected.