LET: Leveraging Error Type Information for Grammatical Error Correction

Introduction
The grammatical error correction (GEC) task aims to correct grammatical errors in natural language texts, including errors of spelling, punctuation, grammar, word choice, and more. As shown in Figure 1, a GEC model receives text containing errors and produces its corrected version.
Current GEC algorithms mainly fall into two categories: detection-based models and end-to-end generative models.
Detection-based models treat GEC as a token classification problem (Omelianchuk et al., 2020). By classifying each token in the sentence, the model can apply a corresponding transformation to each token according to its classification result to obtain the corrected sentence. This method has strong interpretability and a transparent error correction process. However, to achieve precise error correction, it is first necessary to identify and classify all possible grammatical errors. The training data must then be manually annotated with these error categories, which is labor-intensive.
To avoid manually designing error categories and labeling data, many works (Yuan and Felice, 2013a; Yuan and Briscoe, 2016) have built end-to-end generative GEC systems from the perspective of machine translation (MT), which is also the current mainstream approach. In this formulation, erroneous sentences correspond to the source language and error-free sentences correspond to the target language.
Most recent generative models (Raheja and Alikaniotis, 2020; Yuan et al., 2021) are based on the Transformer encoder-decoder architecture (Vaswani et al., 2017). They achieve results competitive with detection-based models. The most significant advantage of the end-to-end generative model is that we need neither to design complex error categories manually nor to perform labor-intensive labeling work on the data: the model can be trained on parallel corpora alone.
Recent works (Wang et al., 2020; Chen et al., 2020) have shown that if the error type predictions obtained from the grammatical error detection (GED) task are introduced into the generative model in some form, the error correction ability of the model is further improved. This is because the entire training and inference process of an end-to-end generative model can be viewed as a black-box operation, and the model can generate more accurate results if additional information guides this process (e.g., the classification result at some location is "delete"). Yuan et al. (2021) extend the Transformer encoder-decoder model by introducing error type information: they classify input tokens into different error types, transform the types into representations, and feed them into the decoder's cross-attention module. However, this method suffers from two fundamental limitations: 1) Error propagation. Each token is first mapped to a one-hot classification vector. Any misclassification in this step is passed on and negatively influences the subsequent components.
2) Mismatched cross attention. In the original Transformer decoder block, the inputs Q and K of the cross-attention module both come from the semantic space of tokens. In their method, however, these inputs come from the semantic space of the error type information and the semantic space of the original tokens, respectively. This mismatch can reduce the representational capacity of the model.
Therefore, to solve the above problems, we propose a simple yet novel generative model, termed LET (Leveraging Error Type information), to improve the performance of GEC.
First, we utilize the intermediate representation of the error type classification module as the error type vector. This representation does not discard the probabilities of the other classes, even when their values are small, which makes the guidance the type vectors provide to the generation module more convincing.
Second, to remove the mismatch in the cross-attention module, we map the input from the previous sub-layer of the decoder into the classification space. Both parts of the input thus lie in the same semantic space, making cross-attention between them more reasonable.
In summary, our contributions are as follows: 1) We propose a novel sequence-to-sequence model that realizes error type alignment for GEC. This model improves performance on the task through much more fine-grained error detection.
2) We demonstrate how GED benefits the correction task by introducing the error type information into the decoder in two ways: through its input module and through its cross-attention module.
3) Experimental results on multiple datasets show that our proposed method achieves state-of-the-art results.

Related Work
Much progress on the GEC task can be attributed to casting the problem as machine translation (Brockett et al., 2006) from an ungrammatical source sentence to a grammatical target sentence. Early GEC-MT methods leveraged phrase-based statistical machine translation (PB-SMT) (Yuan and Felice, 2013b). With the rapid development of machine translation, statistical machine translation (SMT) and neural machine translation (NMT) were successfully applied to various task-specific adaptations of GEC (Felice et al., 2014; Yuan and Briscoe, 2016; Junczys-Dowmunt et al., 2018). With the introduction of the Transformer architecture, this approach rapidly evolved into powerful Transformer-based seq2seq models (Vaswani et al., 2017).
Transformer-based models autoregressively capture the complete dependency among output tokens (Yuan et al., 2019). Grundkiewicz et al. (2019) leveraged a Transformer model pre-trained on synthetic GEC data. Several BERT-based improvement strategies were also adopted in GEC models (Kaneko et al., 2020). With the recent development of large-scale pre-trained models, Rothe et al. (2021) built their system on top of T5 (Xue et al., 2021) and reached new state-of-the-art results.

Grammatical Error Detection is usually formulated as a sequence tagging task, where each erroneous token is assigned an error type, e.g., selection errors and redundant words. Early GED methods mainly used rules to identify specific sentence error types, such as preposition errors (Tetreault and Chodorow, 2008). With the development of neural networks, Rei and Yannakoudakis (2016) presented the first work using a neural approach, framing GED as a binary sequence labeling problem that classifies each token in a sentence as correct or incorrect. Sequence labeling methods are widely used for GED, including feature-based statistical models (Chang et al., 2012) and neural models (Fu et al., 2018). Due to the effectiveness of BERT (Devlin et al., 2019) in many other NLP applications, recent studies adopt BERT as the basic architecture of GED models (Li and Shi, 2021).
Recent work has explored a different approach to using GED in GEC, which aims to use the detection results of GED to guide GEC generation. Yuan et al. (2019) introduced token-level and sentence-level GED as auxiliary tasks when training for GEC. Zhao et al. (2019) employed multitask learning to utilize the detection results of GED to guide GEC generation. Similarly, Chen et al. (2020) fine-tuned RoBERTa (Zhuang et al., 2021) for GED and improved the efficiency of GEC by dividing the task into two sub-tasks: Erroneous Span Detection and Erroneous Span Correction. Yuan et al. (2021) treated GED as a sequence labeling task and GEC as a sequence-to-sequence task, and additionally investigated ways to use multi-class GED predictions to inform GEC.

Method
In this section, we first describe the problem definition and the basic model, our baseline. Then we describe the LET (Leveraging Error Type information) model, which explicitly applies the classification information (error types) of tokens to guide the generative model to generate better-corrected sentences. The whole architecture of LET is shown in Figure 2.

Problem Definition
Given a sentence that may contain erroneous tokens, U = {u_i}_{i=1}^N, the goal of GEC is to correct the input sentence and output the corrected sentence C = {c_i}_{i=1}^M, where N and M are the input and output sequence lengths, respectively.

Backbone
We use BART (Lewis et al., 2020) as the backbone of our end-to-end GEC system. BART is a denoising autoencoder that maps noisy text to its correct form. It is implemented as a sequence-to-sequence model with a bidirectional encoder over corrupted text and a left-to-right autoregressive decoder (Vaswani et al., 2017). The word embedding layer is denoted Emb, and the encoder and decoder are denoted EC and DC, respectively. The process of encoding and decoding can be formulated as:

E_U = EC(Emb(U)),
c_t = DC(Emb(c_{<t}), E_U),

where E_U is the output of the encoder EC and c_{<t} denotes the previously decoded tokens.
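As a minimal illustration, the Emb → EC → DC pipeline above can be sketched in PyTorch. The layer counts and dimensions here are toy assumptions, not the actual BART configuration:

```python
import torch
import torch.nn as nn

vocab, d_model = 100, 32                       # toy sizes, not BART's real config
Emb = nn.Embedding(vocab, d_model)             # word embedding layer Emb
EC = nn.TransformerEncoder(                    # bidirectional encoder EC
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
DC = nn.TransformerDecoder(                    # autoregressive decoder DC
    nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)

U = torch.randint(0, vocab, (1, 10))           # source sentence, N = 10 tokens
C_prefix = torch.randint(0, vocab, (1, 7))     # already-decoded target prefix

E_U = EC(Emb(U))                               # E_U = EC(Emb(U))
H = DC(Emb(C_prefix), E_U)                     # decoder states conditioned on E_U
# (causal mask and output projection to the vocabulary omitted for brevity)
```

At each decoding step, projecting `H` onto the vocabulary would yield the next-token distribution.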

Grammatical Error Detection
We aim to obtain the error type classification of each token in the sentence via a sequence labeling task. In practice, we construct this classifier from three parts. First, a two-layer Transformer encoder block EC′ encodes the input sentence U into the long error type representation R^long_U, whose embedding dimension is the same as that of the word embeddings (e.g., 768 or 512):

R^long_U = EC′(Emb(U)).

Then, a two-layer fully connected network FF transforms the long error type representation into the short error type representation R^short_U:

R^short_U = FF(R^long_U),

where the dimension of the short representation is the number of error types (e.g., 4, 25, or 55).
Finally, the error types are obtained through a Softmax layer SM:

Y = SM(R^short_U),

where Y = {y_i}_{i=1}^N is the label sequence for the N tokens.
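The three-part detector (EC′, FF, SM) described above can be sketched as follows, again with toy dimensions in place of the 768-dim setup used in the paper:

```python
import torch
import torch.nn as nn

d_model, n_types, N = 64, 4, 10                # toy sizes; the paper uses 768-dim
EC_prime = nn.TransformerEncoder(              # two-layer encoder block EC'
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
FF = nn.Sequential(                            # two-layer fully connected network FF
    nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, n_types))

emb_U = torch.randn(1, N, d_model)             # stand-in for Emb(U)
R_long = EC_prime(emb_U)                       # long error type representation
R_short = FF(R_long)                           # short representation: one score per type
Y = R_short.softmax(dim=-1).argmax(dim=-1)     # SM: predicted error type per token
```

Note that `R_short` keeps a score for every class, which is exactly the soft information LET later feeds forward instead of a one-hot label.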

GID: Guided Input of the Decoder
Intuitively, after the generation module has autoregressively decoded the tokens up to some time step, knowing the error type of the next token in the original sentence makes it easier for the generation module to make the correct decision. For example, if the error type of the next token in the original sentence is "Delete" (the token is redundant and needs to be deleted), the generation module will delete that token with greater probability after receiving the "Delete" signal. Metaphorically speaking, we can compare the decoder to a schoolboy and the decoding process to the boy solving a complex math problem. With reference material to guide the problem-solving process, the boy will find it much easier to arrive at the correct answer. This reference material is what we refer to as additional guiding information.
Formally, at time step t we have the output of the previous time step, denoted p_{t-1}. We take two elements as the input of the GID module: 1) Emb_{t-1}, the word embedding of p_{t-1}; and 2) R^long_t, the long error type representation corresponding to the token u_t. We then obtain T_t, the output of GID and the input of the decoder DC, by a direct point-wise addition:

T_t = Emb_{t-1} + R^long_t.
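The point-wise addition above is a one-liner; the tensors here are random stand-ins for the two inputs of GID:

```python
import torch

d_model = 64
emb_prev = torch.randn(1, d_model)   # Emb_{t-1}: embedding of the token decoded at t-1
r_long_t = torch.randn(1, d_model)   # R^long_t: long error type repr. of u_t

T_t = emb_prev + r_long_t            # point-wise add: the decoder input at step t
```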

GCA: Guided Cross Attention module
In addition to the above approach, we also introduce error type information into the cross-attention module.
Cross Attention 1

In the original Transformer, the cross-attention module in the decoder layer performs attention-weighted computation over the token representations output by the encoder and the output of the preceding self-attention module:

E_CA1 = Softmax(Q K^T / sqrt(d_K)) V,   (7)

where E_CA1 is the output of the Cross Attention 1 module. Here, Q is the representation of the current decoder inputs after the preceding self-attention module, K is the representation of all input tokens after the stacked encoder, and V is a copy of K.
In practice, Q, K, and V are first mapped to different representation spaces by matrices W_q, W_k, and W_v, respectively. In Equation 7, the scaled dot-product of Q and K yields the weights for the weighted summation of V. Previous work (Lee et al., 2018; Li et al., 2020) has shown that this operation aligns the tokens input to the encoder and decoder at the semantic level, so that the decoder can generate accurate and reasonable results.
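Equation 7 can be sketched as follows. For brevity, the W_q/W_k/W_v projections are folded away, so Q, K, and V are taken as already projected:

```python
import torch
import torch.nn.functional as F

def cross_attention(Q, K, V):
    """Scaled dot-product attention: Softmax(Q K^T / sqrt(d_K)) V."""
    d_K = K.size(-1)
    weights = F.softmax(Q @ K.transpose(-2, -1) / d_K ** 0.5, dim=-1)
    return weights @ V

Q = torch.randn(1, 7, 32)            # decoder states after the self-attention sub-layer
K = torch.randn(1, 10, 32)           # stacked-encoder outputs
E_CA1 = cross_attention(Q, K, K)     # V is a copy of K
```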

Cross Attention 2
The previous subsection describes alignment at the semantic level. However, this alignment alone is not enough: what about alignment at the error type level? We use the existing detection module to classify the original Q and K into error types and then use the results to replace Q and K in Equation 7, which realizes alignment at the error type level.
Specifically, as shown in Figure 2, we utilize the classification head FF to classify Q and K and obtain their short error type representation vectors Q′ and K′, respectively:

Q′ = FF(Q),  K′ = FF(K),

where the dimension of Q′ and K′ depends on the number of classes in the detection task. These short representation vectors can be viewed as representations of error types, so applying cross attention to them aligns the tokens input to the encoder and the decoder at the error type level. The modified cross-attention equation is:

E_CA2 = Softmax(Q′ K′^T / sqrt(d_K′)) V,

where d_K′ is the dimension of K′ and E_CA2 is the output of this Cross Attention 2 module.
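Cross Attention 2 reuses the same scaled dot-product, but over the classified vectors Q′ and K′. A sketch, with a randomly initialized stand-in for the shared head FF and keeping V in the token space:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, n_types = 32, 4
FF = nn.Sequential(                      # stand-in for the shared classification head
    nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, n_types))

Q = torch.randn(1, 7, d_model)           # decoder-side input to cross attention
K = torch.randn(1, 10, d_model)          # encoder-side input to cross attention
Q_p, K_p = FF(Q), FF(K)                  # Q', K': error-type-space representations

w = F.softmax(Q_p @ K_p.transpose(-2, -1) / n_types ** 0.5, dim=-1)
E_CA2 = w @ K                            # V stays in token space (a copy of K)
```

The attention weights `w` now compare tokens by their error type profiles rather than by token semantics.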

Combination of CA1 & CA2
We then combine the outputs of the two cross-attention modules point-wise. Before doing so, we need to determine the weight of each, so we compute a dynamic weighting factor λ:

λ = σ(W [E_CA1 ; E_CA2] + b),

where σ is the logistic sigmoid function and W and b are learnable parameters.
We then obtain the combined output E_GCA as:

E_GCA = λ · E_CA1 + (1 − λ) · E_CA2.

After this sub-module, E_GCA is used as the input to the next sub-layer. The forward computation and back-propagation of the entire model then proceed as in a regular encoder-decoder model.
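The gated combination can be sketched as below. How the input to the sigmoid gate is formed is not fully specified in the text, so concatenating E_CA1 and E_CA2 is an assumption of this sketch:

```python
import torch
import torch.nn as nn

d_model = 32
gate = nn.Linear(2 * d_model, 1)               # learnable W and b (input layout assumed)
E_CA1 = torch.randn(1, 7, d_model)
E_CA2 = torch.randn(1, 7, d_model)

lam = torch.sigmoid(gate(torch.cat([E_CA1, E_CA2], dim=-1)))  # dynamic weight in (0, 1)
E_GCA = lam * E_CA1 + (1 - lam) * E_CA2        # combined output fed to the next sub-layer
```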

Loss function
The total loss contains two parts: 1) L_err, the cross-entropy between the predicted error types and the ground-truth token-level labels; and 2) L_sen, the cross-entropy between the output corrected sentences and the corresponding target sentences. The two are combined with a weight factor α (Equation 13):

L = L_sen + α · L_err.
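With α = 0.2 as reported in the experiment setup, the joint objective can be sketched with toy logits; the relative ordering of the two terms is an assumption here:

```python
import torch
import torch.nn.functional as F

alpha = 0.2                                       # weight factor from Equation 13
err_logits = torch.randn(6, 4)                    # toy detector logits (6 tokens, 4 types)
err_gold = torch.randint(0, 4, (6,))
sen_logits = torch.randn(6, 100)                  # toy decoder logits (vocab of 100)
sen_gold = torch.randint(0, 100, (6,))

L_err = F.cross_entropy(err_logits, err_gold)     # token-level error type loss
L_sen = F.cross_entropy(sen_logits, sen_gold)     # sentence correction loss
loss = L_sen + alpha * L_err                      # joint training objective
```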

Experiments
To test the performance of the LET system, we conduct evaluation experiments on two mainstream GEC benchmarks, BEA-test (Bryant et al., 2019) and CoNLL-2014 (Ng et al., 2014), and compare with previous state-of-the-art approaches.

Datasets
Following previous work, we use five datasets:
• Lang-8 Corpus (Mizumoto et al., 2011)
• Cambridge Learner Corpus (CLC) (Nicholls, 2003)
• First Certificate in English (FCE) corpus (Yannakoudakis et al., 2011)
• National University of Singapore Corpus of Learner English (NUCLE) (Dahlmeier et al., 2013)
• Cambridge English Write & Improve + LOCNESS (W&I) corpus (Bryant et al., 2019)

Following the training procedure of previous work (Kiyono et al., 2019; Lichtarge et al., 2020; Yuan et al., 2021), we pre-train two LET systems, one on the public Lang-8 Corpus (under the constrained setting) and one on the CLC dataset (under the unconstrained setting), and then fine-tune both on the same three datasets: W&I, FCE, and NUCLE. Finally, we train all modules in LET simultaneously, jointly optimizing L_err and L_sen based on Equation 13.

Error Type Annotations
We obtain error type annotations for these corpora with the ERRANT toolkit (Bryant, 2019); the resulting class sets are summarized in Table 2.

Experiment Setup
The LET model, implemented with the Transformers library (Wolf et al., 2020), consists of 6 encoder layers, 6 decoder layers, and a shared classification head.
The embedding dimension is set to 768 and the batch size to 32. The maximum sequence length is 1024, and we pad sequences to the longest length in the batch. We train the model with the Adam optimizer with a learning rate of 2e-5. The weight factor α in Equation 13 is set to 0.2. The evaluation metrics for text generation are precision, recall, and the F_0.5 score. We train the model on 4 Nvidia V100 GPUs; one training epoch takes about 4 hours.

Results analysis
We report the experimental results of the various methods in Table 1. The results demonstrate the effectiveness of LET.
Overall performance As shown in Table 1, the proposed LET network outperforms most previous state-of-the-art methods on the two mainstream GEC benchmarks, BEA-test and CoNLL-2014, under both constrained and unconstrained settings.
Constrained setting Compared with the previous SOTA seq2seq model of Yuan et al. (2021), LET improves the F_0.5 score by 1% on BEA-test and 1.2% on CoNLL-2014. Compared with other seq2seq models trained on the same experimental data, our work achieves even more pronounced improvements.
Unconstrained setting On BEA-test, LET is at least 1.3% better than the state-of-the-art model of Yuan et al. (2021) on the three key metrics. On CoNLL-2014, LET also achieves significant improvements. Notably, our method improves the recall score (+1.4%/+1.3%) more than the precision (+1.3%/+0.4%), and a similar pattern appears under the constrained setting. This indicates that, under the combined effect of our innovations, the model is better at recalling the correct editing operations.
Compared with models without a GED module, our LET underperforms Stahlberg and Kumar (2021); a possible reason is that they used more data to train their model.
As shown in the last line of Table 1, Omelianchuk et al. (2020) outperforms all the systems above. Thanks to more data, fine-grained labels, and multiple ensemble strategies, this sequence-tagging model held first place in GEC for a long time.

Discussion
In this section we discuss several details of the model in more depth. Unless stated otherwise, all experiments in this section are evaluated on the BEA-dev dataset under the constrained data setting.

Ablation study
We explore the effect of each component of the whole LET (Leveraging Error Type) system, using BART-base as the baseline (6 encoder layers, 139M parameters). As shown in Table 3, GID and GCA each achieve higher values than the baseline on the three key metrics regardless of the number of error types. Moreover, their combination yields further improvement, demonstrating the effectiveness of the two proposed modules.

Results on different error categories
In this section, we explore the performance of LET on different error categories. The baseline model uses the same pre-training and fine-tuning data splits but receives no additional GED input during fine-tuning, following the standard encoder-decoder GEC training procedure. As shown in Table 3, the results demonstrate the efficacy of our approach: adding GED predictions as auxiliary input yields a consistent, statistically significant improvement over the baseline. Our best system uses the 55-class GED predictions, achieving 55.5 F_0.5. The reason may be that the 55-class system represents the best compromise between label informativeness and model reliability.
Unlike the optimal scheme of the full LET, GID alone achieves its best result (53.4 F_0.5) with 4-class GED predictions. Notably, GID with 2-class (binary) GED predictions outperformed the same model with 55-class GED predictions. This is because 2-class predictions are less informative but more reliable (there are, after all, only two classes), whereas 25-class and 55-class predictions tend to be more informative but less reliable owing to the increased difficulty of predicting sparser classes. This also shows that GID, which lacks error type alignment, is not good at exploiting finely subdivided error type guidance, from which we can further infer that GID does not make full use of the error type information.
Notably, similar to the full LET, GCA achieves its best result (54.7 F_0.5) with 55-class GED predictions. Moreover, the experiments show that the GCA model's performance gradually improves as the number of error type categories increases. For the parameter study in Table 5, ablation study A adds a new classifier, a two-layer fully connected network with hidden sizes 768→768→55 whose parameters are initialized randomly; ablation study B instead uses a new classifier with hidden sizes 768→768→768, with all other settings following ablation study A.

Effects of dynamic weight setting
As described in Section 3.5, the guided cross-attention module contains two sub-modules: Cross Attention 1 (CA1) and Cross Attention 2 (CA2).
First, we explore the information fusion method through a controlled experiment under a static weight setting:

E_GCA = β · E_CA1 + (1 − β) · E_CA2,

where β is a fixed scalar; after a grid search, the best β is 0.37. As Table 4 shows, the guided cross-attention module with dynamic weights is significantly better than with static weights. We therefore conjecture that the model needs to adaptively change the fusion weights of the two attention modules according to the input sentence in order to handle inputs of different difficulty.
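The static-weight baseline simply replaces the learned λ with the fixed scalar β:

```python
import torch

beta = 0.37                                        # best static weight from grid search
E_CA1 = torch.randn(1, 7, 32)
E_CA2 = torch.randn(1, 7, 32)

E_GCA_static = beta * E_CA1 + (1 - beta) * E_CA2   # fixed-weight fusion, no gate network
```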

Discussion on the number of parameters
Compared with the baseline, our method introduces additional parameters, mainly from the newly added guided cross-attention module GCA. Does the improvement in model performance simply come from the increase in the number of parameters? To explore this question, we conducted a comparative experiment.
We set up two ablation studies in Table 5. Comparing the results of the GCA ablation studies A and B shows that increasing the number of parameters does improve the model under the current conditions, but the improvement is negligible compared to that brought by the GCA module itself. The experimental results show that the alignment at the error type level performed by our proposed method is necessary.

Conclusion
Grammatical error correction is significant for many downstream natural language understanding tasks. In this paper, we propose an end-to-end framework, termed LET, which effectively leverages the error type information generated by the GED task to guide the GEC task.
Our work addresses two critical problems in previous work. First, we alleviate the error propagation caused by hard-coded error types by introducing soft-encoded error types. Second, we introduce the concept of error type alignment: we transform the original semantic vectors into classification vectors to ensure that the two parts of the input to the proposed cross-attention module lie in the same semantic space. Experiments and ablation studies show that this alignment leads to better results.
Overall, LET provides a useful reference for research in the GEC field and addresses several potential issues with previous technical solutions.

Limitations
By analyzing the error cases, we find that almost all existing work (including our LET) cannot handle word-order errors well, especially when an error occurs far from its correct location. For example, consider the correct sentence 'On my way to school today, I bought a very tasty apple.' Given the erroneous form 'on my way to school apple today, I bought a very tasty.', it is hard for the model to understand that the right correction is to put 'apple' back at the end of the sentence.

Figure 1: An example of grammatical error correction and detection.

Figure 2: Architecture of LET (Leveraging Error Type information). (a) shows the overall architecture of LET, and (b) shows the detailed components of the cross-attention module.

Table 2: Error Type Annotations. N.C: Number of Classes.

Table 3: Results based on ablated modules and different numbers of error types. N.C: Number of Classes.

Table 4: Results of the two ways of weighting the combination of CA1 and CA2.

Table 5: Discussion on the number of parameters. Baseline: no new cross-attention module added.