Improving Seq2Seq Grammatical Error Correction via Decoding Interventions

The sequence-to-sequence (Seq2Seq) approach has recently been widely used in grammatical error correction (GEC) and shows promising performance. However, the Seq2Seq GEC approach still suffers from two issues. First, a Seq2Seq GEC model can only be trained on parallel data, which, in the GEC task, is often noisy and limited in quantity. Second, the decoder of a Seq2Seq GEC model lacks an explicit awareness of the correctness of the token being generated. In this paper, we propose a unified decoding intervention framework that employs an external critic to incrementally assess the appropriateness of the token to be generated, and then dynamically influence the choice of the next token. We discover and investigate two types of critics: a pre-trained left-to-right language model critic and an incremental target-side grammatical error detector critic. Through extensive experiments on English and Chinese datasets, our framework consistently outperforms strong baselines and achieves results competitive with state-of-the-art methods.


Introduction
Automatically correcting grammatical errors is an important task of practical value in the NLP field. Potential applications include document proofreading, writing assistance, language learning education, text post-processing for automatic speech recognition (Leng et al., 2021), etc. There are two mainstream approaches to grammatical error correction (GEC), namely sequence-to-sequence (Seq2Seq) (Sun et al., 2021; Rothe et al., 2021) and sequence-to-edit (Seq2Edit) (Awasthi et al., 2019; Omelianchuk et al., 2020). The Seq2Seq approach treats GEC as a monolingual text translation/transduction task, whereas the Seq2Edit approach casts GEC as a sequence labeling task.
Zhenghua Li is the corresponding author.
Input: But there had no buyers .
Reference: But there were no buyers .

Recent studies show that the Seq2Seq approach consistently outperforms the Seq2Edit approach on a variety of languages and datasets, especially in handling more complex errors such as word-ordering ones (Qorib et al., 2022; Zhang et al., 2022b). However, the Seq2Seq approach still suffers from two issues.
First, a Seq2Seq model can only utilize parallel sentence pairs as training data, in which the input sentence is potentially ungrammatical whereas the target one is considered correct. Usually, a major proportion of the training data is automatically collected from language learner websites such as Lang-8. On the one hand, the Lang-8 data contains a certain amount of noise, since the voluntary contributors may make mistakes as well. On the other hand, the data scale is quite limited compared with the non-parallel data used for training large language models. For instance, the cleaned version of the English Lang-8 corpus (CLang8) contains only 2.4M sentence pairs (Rothe et al., 2021).
Data augmentation is a popular approach for addressing the limited-scale issue, i.e., synthesizing large amounts of training data (Grundkiewicz et al., 2019; Stahlberg and Kumar, 2021). However, it can be very difficult and tricky to control the error distribution in the generated data so that it resembles the realistic scenario. Moreover, training a GEC model on very large training data incurs a heavy computational cost.
The second issue of the Seq2Seq GEC model is that the decoder lacks an explicit awareness or evaluation of whether the generated token is correct during decoding. There are indeed several works that perform grammatical error detection (GED) on the input sentence and use the results as extra features for the encoder, so that the decoder pays extra attention to the erroneous spans in an implicit manner (Chen et al., 2020; Yuan et al., 2021; Zhang et al., 2022b). However, we are not aware of any previous work that explicitly checks the correctness of generated tokens during decoding (e.g., target-side GED). As pointed out by Mita and Yanaka (2021), a Seq2Seq GEC model tends to generate wrong corrections when it encounters errors unseen in the training data.
In this work, we propose a decoding intervention framework to address both issues of the Seq2Seq GEC approach. As illustrated in Figure 1, we employ an external critic to incrementally assess the appropriateness of the token to be generated, and then dynamically influence the choice of the next token. Specifically, at each decoding step, the critic evaluates the appropriateness of the candidate tokens; if a token is inappropriate, the critic punishes its generation by reducing the token's log-probability score.
The key to our decoding intervention is to find a suitable critic. We discover and investigate two useful critics. The first critic is a pre-trained left-to-right language model (LM). Using a language model as the critic takes advantage of the knowledge it has learned from vast amounts of text: if the language model gives a low probability to a token, the token is probably wrong even if the GEC model gives it a high probability. The second critic is a GED model, which is an ideal critic for incorporating explicit awareness of correctness into the Seq2Seq GEC model during decoding. However, a conventional GED model cannot be directly used as the critic because it does not match the incremental manner of the decoding process. To address this problem, we propose an incremental target-side GED, which acts in a Seq2Seq manner, making judgments on the token to be generated y_t based on both the input sentence x and the tokens generated so far y_<t.
We conduct experiments on three English GEC datasets, including two English-as-a-second-language (ESL) datasets and a multi-domain native-English dataset, as well as a Chinese dataset. Experimental results demonstrate that our decoding intervention brings consistent and substantial improvements on all datasets. The results also show that, with the help of decoding intervention, our GEC model achieves performance comparable to state-of-the-art models on all datasets under comparable settings, without any re-training.
Our code is available at https://github.com/Jacob-Zhou/gecdi

The Basic Seq2Seq GEC Model

This work aims to improve the Seq2Seq GEC approach. In this section, we briefly describe it. Given a potentially erroneous sentence, a Seq2Seq GEC model tries to generate a correct one without changing its meaning, similar to machine translation, yet in a monolingual fashion.
We adopt the widely-used Transformer architecture (Vaswani et al., 2017) as our model backbone, which comprises an encoder and a decoder. Given an input sentence x = x_1 … x_n, the encoder first encodes it into a sequence of hidden states h = h_1 … h_n. At each timestamp t, given the input sentence representation h and the previously generated tokens y_<t, the decoder calculates a probability distribution over the vocabulary for the next-token generation:

p_θ(y_t | y_<t, x) = Decoder(y_<t, h). (1)

The score of an output sentence y is the sum of the log-probabilities of all predicted tokens:

s(x, y) = Σ_t log p_θ(y_t | y_<t, x). (2)

During training, Seq2Seq models commonly employ the teacher forcing method, aiming to maximize the log-likelihood of the ground-truth next token g_t, given the input sentence x and the previous ground-truth tokens g_<t:

L(θ) = −Σ_t log p_θ(g_t | g_<t, x). (3)

The main advantage of teacher forcing is that it allows for parallel training. The inference of Seq2Seq GEC models is to find the best output sentence y* by solving the following optimization problem:

y* = argmax_{y ∈ Y} s(x, y), (4)

where Y is the set of all possible sentences. This optimization problem is typically tackled using the beam search algorithm, in which the model predicts a token at each decoding step, appends it to the partial sentence, and subsequently selects the top-k partial sentences based on their scores for the next decoding step.
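For illustration, the beam search procedure described above can be sketched as a minimal toy implementation. The `step_logprobs` interface and the toy model below are hypothetical stand-ins for the real decoder of Eq. (1), not the paper's actual model:

```python
import math

def beam_search(step_logprobs, beam_size, max_len, bos=0, eos=1):
    """Minimal beam search sketch. `step_logprobs(prefix)` returns a dict
    mapping candidate next tokens to their log-probabilities; a hypothesis
    score is the running sum of token log-probabilities."""
    beams = [([bos], 0.0)]  # (token prefix, score so far)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix[-1] == eos:          # finished hypotheses carry over
                candidates.append((prefix, score))
                continue
            for tok, lp in step_logprobs(prefix).items():
                candidates.append((prefix + [tok], score + lp))
        # keep the top-k partial sentences for the next decoding step
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if all(p[-1] == eos for p, _ in beams):
            break
    return beams[0]

# Toy "model": strongly prefers token 2 for the first steps, then ends.
def toy_model(prefix):
    if len(prefix) < 3:
        return {2: math.log(0.7), 3: math.log(0.2), 1: math.log(0.1)}
    return {1: math.log(0.9), 2: math.log(0.1)}
```

With a beam size of 2, the search keeps the two highest-scoring prefixes at each step and stops once every surviving hypothesis has emitted the end-of-sentence token.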

Decoding Intervention
In our framework, the critic intervenes in the scoring of each candidate token at each decoding step. The score of a candidate token y_t becomes:

s(y_t) = log p_θ(y_t | y_<t, x) − λ f(y_t; y_<t, x). (5)

The first term is the original probability from the GEC model. The logarithm transform stretches the probability into a wider range and thus makes it more influential. This also gives more flexibility to the design of the critic model¹.
The second term is the penalty score from the critic model for y_t given the input sentence x and the generated prefix y_<t, and λ is a coefficient that controls the trade-off between the two model scores. Please note that λ is not a global hyper-parameter but is instead decided by the scores in a token-wise manner; we detail this in Section 3.3. From Eq. (5), we can draw two characteristics of our framework.
• Incremental. Similar to the Seq2Seq GEC model, the critic model incrementally evaluates a target sentence from left to right, token by token.
¹ Based on our early-stage trials, we find it problematic to directly integrate the probabilities of the GEC model and the critic model via weighted interpolation, since the models usually have different vocabulary spaces, and a smaller vocabulary leads to a relatively larger probability for each token.
• Dynamic. The critic model dynamically influences the choice of tokens during decoding, in contrast to re-ranking N complete sentences.
Moreover, the critic model may or may not use the input sentence x. In this work, we discover and investigate two useful critic models, i.e., a pure left-to-right pre-trained language model, which does not use x, and an incremental target-side GED model, which does.
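The incremental and dynamic properties above can be made concrete with a small sketch of one intervened decoding step. The critic callables and token strings here are hypothetical stand-ins for the LM and GED critics, not the paper's actual models:

```python
def intervene_step(gec_logprobs, critics, lambdas):
    """Score candidate tokens at one decoding step: the GEC log-probability
    minus the lambda-scaled penalty from each active critic. When both
    critics are used, their penalties are simply added."""
    scores = {}
    for tok, logprob in gec_logprobs.items():
        penalty = sum(lam * critic(tok) for critic, lam in zip(critics, lambdas))
        scores[tok] = logprob - penalty
    return max(scores, key=scores.get)

# Toy example: the GEC model slightly prefers "their", but a critic
# (e.g., a language model) assigns it a large penalty.
gec_logprobs = {"their": -0.6, "there": -0.8}
lm_critic = lambda tok: 3.0 if tok == "their" else 0.1
```

Without any critic the GEC model's top choice wins; with the critic active, the penalized token is overtaken by the alternative, which is exactly the dynamic influence Eq. (5) describes.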

Left-to-Right Pre-trained LM
A conventional pre-trained left-to-right language model, unlike masked language models (e.g., BERT (Devlin et al., 2019)) and Seq2Seq models (e.g., BART (Lewis et al., 2020)), can naturally be used to evaluate the probability of a sentence y, which is factored as the product of token probabilities in an incremental manner:

p_π(y) = Π_t p_π(y_t | y_<t),
where π denotes the parameters of the language model. The probability of a token, i.e., p_π(y_t | y_<t), can also be understood as how likely the token is to appear after the previous tokens y_<t.
The GEC task aims to produce a correct sentence that keeps the same meaning as the input sentence. We propose to use a language model to evaluate the correctness of a sentence from a purely linguistic perspective, without referring to its input sentence.
In this work, we select the GPT-2 models as our pure left-to-right language models; they are trained on a very large amount of text, far more than the parallel sentences used for training GEC models. The rationale is that if the language model gives a low probability to a token, the token is probably wrong even if the GEC model gives it a high probability. Specifically, we define the penalty from the language model critic as the token's negative log-probability under the LM, −log p_π(y_t | y_<t).
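Assuming the penalty is the candidate token's negative log-probability under the LM (our reading of the framework's score-minus-penalty description; the paper's exact equation is not reproduced here), the critic can be sketched with a toy bigram LM standing in for GPT-2:

```python
import math

def lm_penalty(lm_prob, prefix, token):
    """Language-model critic sketch: penalize a candidate token by
    -log p_pi(y_t | y_<t). Tokens the LM finds unlikely after the
    generated prefix receive large penalties."""
    return -math.log(lm_prob(prefix, token))

# Hypothetical bigram "LM": p(token | last token of the prefix),
# with a small floor probability for unseen pairs.
BIGRAMS = {("there", "were"): 0.6, ("there", "had"): 0.05}
def toy_lm(prefix, token):
    return BIGRAMS.get((prefix[-1], token), 0.01)
```

Here the ungrammatical continuation "had" after "there" draws a much larger penalty than "were", which is the signal the intervention subtracts from the GEC model's score.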

Incremental Target-side GED
As discussed in Section 1, one potential weakness of Seq2Seq GEC models is that the decoder may be unaware of the correctness of its output tokens. Several recent works try to alleviate this issue by performing GED on the input sentence and using the GED labels as extra inputs to the encoder (Chen et al., 2020; Yuan et al., 2021; Zhang et al., 2022b). To some extent, this approach can make the model more explicitly aware of the correction process. In this work, we for the first time propose to apply an incremental target-side GED to the output sentence under our framework, which we believe is a more effective intervention strategy. Given an input sentence x, a partial target sentence generated so far y_<t, and a candidate token y_t to be generated, the GED model classifies y_t into one of four labels, as shown in Table 1. Please note that the GED model must look at x instead of only accessing y_<t. The reason is that the GED critic provides an impact complementary to the language model critic: the target sentence should keep the same meaning as the input sentence, and in the absence of x, many tokens could be considered correct given only y_<t.
Formally, we design the penalty from the GED critic model as follows.
Training. Our incremental target-side GED acts in a Seq2Seq manner, much like the GEC model, and requires parallel sentence pairs for training. Yet we cannot directly use the GEC training data, because the target sentences are all correct. Another consideration is that it is clearly beneficial for the errors in the target sentences to be consistent with those generated by GEC models. Basically, we use the baseline GEC model to generate K output sentences, which may be erroneous, via beam search. Then we obtain the error labels for each token (subword, to be precise) using the edit distance algorithm, the same one used by the evaluation metrics. Section 4 gives more details.
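The label-generation step can be sketched with Python's `difflib`, which aligns the model output with the reference by edit operations. The tag names below are illustrative placeholders, not the paper's exact four-label set from Table 1:

```python
from difflib import SequenceMatcher

def ged_labels(hypothesis, reference):
    """Sketch of labeling GEC-model outputs for training the target-side
    GED critic: align the (possibly erroneous) model output with the
    reference by edit distance and tag each hypothesis token."""
    labels = ["CORRECT"] * len(hypothesis)
    sm = SequenceMatcher(a=hypothesis, b=reference, autojunk=False)
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op == "replace":
            for i in range(i1, i2):
                labels[i] = "SUBSTITUTED"   # token should have been another one
        elif op == "delete":
            for i in range(i1, i2):
                labels[i] = "REDUNDANT"     # token should not be there
        elif op == "insert" and i1 < len(hypothesis):
            labels[i1] = "MISSED_BEFORE"    # a token is missing before this one
    return labels
```

For the introduction's example pair ("But there had no buyers ." vs. "But there were no buyers ."), only the token "had" is tagged as a substitution; all others are tagged correct.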

Coefficient for the Critics
The coefficient λ in Eq. (5) is important for leveraging the power of the critic model. Instead of using a fixed value for all contexts, we find it beneficial to set the value dynamically by comparing the confidences of the two participating models. Intuitively, a model can be trusted if it has high confidence in its prediction, as a strong correlation holds between a model's confidence and the accuracy of its prediction (Guo et al., 2017; Kull et al., 2019). After several experimental trials, we find the following formula works quite well, especially when we use two critics at the same time. Here we use the language model critic for illustration:
λ = α × (β × Entropy(p_θ(·)) + 1) / (β × Entropy(p_π(·)) + 1),

where p_θ(·) and p_π(·) refer to the probability distributions of the GEC model and the language model over their own vocabulary spaces V; α > 0 is a coefficient that controls the overall scale of the penalty scores, and β ≥ 0 governs their variation, both of which aim to balance the influence of the critic models.
For the GED critic, we simply replace p_π(·) with p_φ(·). Please note that its vocabulary contains only the four GED tags².
We separately select α and β for the two critics based on the dev data. The search space of α is {0.1, 0.2, …, 1.0} and that of β is {0.01, 0.1, …, 100}. In Section 5, we study the impact of α and β. Results show that our decoding intervention is robust over a wide range of α and β.
Using both critics. When we use the two critics at the same time, we directly add the penalties from the two critic models in Eq. (5).
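The token-wise coefficient described above can be sketched as follows, assuming λ takes a ratio form in which the GEC model's entropy raises the coefficient and the critic's entropy lowers it (this form is our reconstruction of the garbled formula, consistent with the observation that a small β makes λ nearly constant across decoding steps):

```python
import math

def entropy(dist):
    """Shannon entropy of a probability distribution given as an iterable."""
    return -sum(p * math.log(p) for p in dist if p > 0)

def dynamic_lambda(gec_dist, critic_dist, alpha, beta):
    """Token-wise coefficient sketch: trust the critic more when the GEC
    model is uncertain (high entropy of p_theta) and the critic is
    confident (low entropy of p_pi). With beta = 0 the coefficient
    collapses to the fixed value alpha."""
    return alpha * (beta * entropy(gec_dist) + 1) / (beta * entropy(critic_dist) + 1)
```

For instance, an uncertain GEC distribution paired with a sharply peaked critic distribution yields λ above α, handing more influence to the critic; swapping the two distributions drives λ below α.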

Experiments
Datasets. In this paper, we conduct experiments on two languages: English and Chinese. For English, we follow the convention of using the BEA-19 dev set (Bryant et al., 2019) for method development and the BEA-19 and CoNLL-14 test sets for final evaluation. It should be noted that both the BEA-19 and CoNLL-14 test sets are collected from ESL learners. To better understand the effectiveness of our method in real-world scenarios, we also conduct experiments on the GMEG-wiki dev/test set (Napoles et al., 2019), a multi-domain dataset derived from native English speakers. For performance metrics, we use the M2 Scorer on CoNLL-14 and ERRANT v2.0.0 on the others.
For Chinese, we conduct experiments on the MuCGEC dataset (Zhang et al., 2022a), a multi-reference and multi-source dataset³, and use the official ChERRANT scorer to measure performance.
Baseline & general settings. The GEC model used in this paper is a BART model (Lewis et al., 2020) fine-tuned on GEC datasets. Detailed information on this model can be found in Appendix B.
We take "Vanilla Decoding" as our baseline, which refers to decoding using the original probability score defined in Eq. (2).
During the decoding process, we employ the commonly used beam search algorithm to find the sequence with the highest score s(x, y). For all experiments, we use a beam size of 12.
We repeat all the experiments 4 times with different random seeds and report the average results.
Language model critic. We take off-the-shelf GPT-2 models as our language model critics. For the ESL datasets, we use the gpt2 model, while for the GMEG-wiki dataset, we opt for the larger gpt2-large model. For the Chinese dataset, MuCGEC, we employ uer/gpt2-chinese-cluecorpussmall.

³ Please note that we omit experiments on NLPCC-18 (Zhao et al., 2018) since it is included in MuCGEC.
Target-side GED critic. We initialize the backbone of our target-side GED critic models with pre-trained BART models. Specifically, we use facebook/bart-base for the ESL datasets, the larger facebook/bart-large for the GMEG-wiki dataset, and fnlp/bart-large-chinese for the MuCGEC dataset.
We use the FCE, NUCLE, and W&I+LOCNESS train sets to generate the English training data, and the HSK train set (Zhang, 2009) to train the Chinese critic models⁴. Hyper-parameter details can be found in Appendix C.

Main Results
The main results are presented in Table 2. Compared to the baseline "Vanilla Decoding", our decoding intervention consistently improves F0.5 scores across all datasets, regardless of the critic used. The two critics improve the model's performance in different ways: the language model critic is better at improving recall, while the target-side GED critic is better at improving precision. Results also show that our decoding intervention can be further improved by combining the two critics ("Both"). Specifically, "Both" achieves 1.4, 0.6, 1.6, and 1.2 F0.5 improvements on the CoNLL-14, BEA-19, GMEG-wiki, and MuCGEC test sets, respectively.
We also compare our model with recent state-of-the-art models. Note that our baseline model is already competitive with the state-of-the-art models; the tricks we use to improve the baseline are listed in Appendix B.3. Results show that our decoding intervention method ("Both") achieves an absolute improvement of 2.0 F0.5 on CoNLL-14 and 2.1 on MuCGEC. It is worth noting that the best performance on the BEA-19 test set is achieved by Sun and Wang (2022) with an F0.5 score of 75.0. However, it cannot be directly compared with our results, since they use a private synthesized dataset that is hundreds of times larger than our training data (300M vs. our 2.4M sentence pairs).

Qualitative Examples
We include two qualitative examples in Table 3.
In the first example, the baseline "Vanilla Decoding" and the decoding intervention using the target-side GED as the critic both fail to correct the error "ward of" to "ward off". This is because the error pattern ("ward of" to "ward off") has not appeared in the training data of either the GEC model or the target-side GED. However, the language model critic corrects this error successfully, demonstrating that a language model, pre-trained on vast amounts of data, can help the GEC model identify and correct errors that do not appear in the GEC training data.
In the second example, the input sentence is grammatically correct. Yet the baseline "Vanilla Decoding" introduces a new error by inserting a definite article "The" before "Girls". The language model critic fails to correct this by intervening in the decoding process, since the sentence with the definite article is still grammatically correct, albeit with a different meaning.
These two examples also show that "Both", which uses the target-side GED and the language model at the same time, manages to integrate the advantages of both critics.

Ablation Studies
Impact of critic sizes. We perform experiments using four distinct sizes of language models and two different sizes of target-side GEDs.
As shown in Table 4, on BEA-19, the ESL learner dataset, a larger critic yields only a slight improvement in F0.5 (+0.07 for language models and +0.08 for target-side GEDs). However, on GMEG-wiki, a multi-domain dataset from native speakers, a larger critic leads to a large improvement in F0.5 (+0.30 for language models and +0.76 for target-side GEDs). This may be because the errors in the ESL dataset are relatively simple and can be captured by smaller critics. In contrast, errors in the multi-domain native dataset are more complex and may require domain knowledge to identify.
Because the Chinese GPT-2 models we found come in only one size, we performed size-comparison experiments only on the target-side GEDs for the Chinese dataset. The results show that a larger target-side GED is more effective on the Chinese dataset.
Effectiveness of the dynamic coefficient. As mentioned in Section 4.1, the language model and the target-side GED exhibit specific tendencies when improving the GEC model. However, we also observe that a critic tends to decrease one score while improving another. For instance, while the target-side GEDs improve precision, they also cause a decline in recall. This might be caused by misjudgments of the critics when they are unconfident. As a result, the improvement in F0.5 is potentially hindered.
To address this issue, we propose a coefficient strategy that dynamically adjusts the coefficient of the critic at each decoding step according to the confidence levels of the critic and the GEC model. Results in Table 5 show that the dynamic coefficient strategy can alleviate the decrease in either precision or recall. Furthermore, this strategy can even lead to an improvement in precision on the BEA-19 dataset when using the language model as the critic, and an improvement in recall on the MuCGEC dataset when employing the target-side GED as the critic.

Robustness of the decoding intervention
The dynamic coefficient strategy of the decoding intervention contains two hyper-parameters: α for controlling the global scale of the coefficient and β for the coefficient's variability. We use a heatmap to visualize the impact of these two hyper-parameters on the F0.5 score, as shown in Figure 2.
In general, our decoding intervention is robust to the hyper-parameters and surpasses the baseline in most cases. However, we can also observe some interesting phenomena. Compared to the target-side GEDs, the language models are more sensitive to the hyper-parameters, particularly to the variability β. Specifically, on the English dataset, when β is small, meaning that λ is almost the same at different decoding steps, the language model not only fails to gain improvement but even leads to a decrease in F0.5 when α is large. Although the target-side GEDs are robust to the hyper-parameters, they tend to perform better with a larger α and smaller β in English, and with a moderate α and larger β in Chinese.
Related Works

Grammatical Error Correction
There exist two main approaches: sequence-to-sequence (Seq2Seq) and sequence-to-edit (Seq2Edit). The Seq2Seq-based approach, which regards GEC as a monolingual machine translation task, has recently been the most widely used approach in the GEC community. Though Seq2Seq-based approaches have achieved state-of-the-art (SOTA) performance on various benchmarks (Sun et al., 2021; Rothe et al., 2021; Zhang et al., 2022b, inter alia), they typically have a slow inference speed due to their autoregressive decoding process.
To deal with the slow inference speed of the Seq2Seq-based approach, numerous recent works focus on the second approach, the Seq2Edit-based approach (Gu et al., 2019; Awasthi et al., 2019; Omelianchuk et al., 2020; Zhang et al., 2023, inter alia). This approach regards GEC as a sequence labeling task: a Seq2Edit model is trained to predict an edit operation (e.g., keep, insert, delete, replace) for each token in the input sentence to transform it into a correct one. The most representative work, GECToR (Omelianchuk et al., 2020), achieves performance comparable to the state-of-the-art Seq2Seq approach, with a 10x faster inference speed.

Decoding Intervention
The idea of decoding interventions has been widely used in many NLP tasks. Existing works can be categorized into two temporal stages: early and contemporary.
Early-stage works mainly improve the performance of Seq2Seq-based approaches by using a language model, trained on a large amount of monolingual data, to intervene in the decoding process, remedying the lack of parallel data (Gülçehre et al., 2015; Kannan et al., 2018; Zhao et al., 2019, inter alia). To the best of our knowledge, these early-stage works mainly focus on tasks like machine translation and automatic speech recognition, with no known attempts to apply them to GEC. This kind of decoding intervention has become less popular in recent years, as the advent of powerful pre-trained models has largely mitigated the lack of parallel data.
Recent works, on the other hand, mostly focus on using decoding interventions to steer pre-trained language models towards generating desired outputs, such as certain topics, sentiments, or the avoidance or inclusion of specific words (Dathathri et al., 2020;Krause et al., 2021;Liu et al., 2021;Chen et al., 2022, inter alia).
Our work shares similarities with the early-stage works in that we also use a language model to intervene in the decoding process. However, we distinguish ourselves by focusing on the GEC task and proposing the use of a target-side GED model to incorporate explicit grammaticality awareness into the decoding process. It is worth noting that there is one work that conducts a decoding intervention in GEC (Sun and Wang, 2022); however, their motivation is to adjust the precision-recall trade-off.

Conclusions
In this paper, we propose a unified decoding intervention framework for GEC models. Within this framework, we discover and investigate two useful critics: the language model critic and the target-side GED critic. Among them, the target-side GED critic represents a novel contribution. While most existing research has employed GED on the input side, this work is the first to leverage GED on the target side to assist GEC. Although the concept of a language model critic may not be entirely new, we argue that it is still worth investigating its effectiveness on the GEC task, especially in the era of pre-trained language models.
Experiments conducted on four English and Chinese GEC datasets lead to several promising findings. First, the decoding intervention framework consistently and substantially improves the performance of GEC models, regardless of whether a language model or an error detector is used as the critic. Second, the language model critic is better at improving recall, while the target-side GED critic is better at improving precision. Third, while the size of the critic has a minor impact on the ESL datasets, it becomes substantial on the multi-domain English dataset from native speakers, as well as on the Chinese dataset. Finally, aided by the decoding intervention framework, our baseline GEC model shows competitive performance compared to state-of-the-art models.

Limitations
The use of the critic introduces additional computational costs and GPU memory usage. Consequently, decoding intervention is slower than vanilla decoding, especially on the native-writing dataset, where a larger critic is required for better performance. In the future, we will explore methods to reduce the computational costs of the decoding intervention framework, for example, distilling a larger critic into a smaller one, or using a lightweight mechanism to decide when to use the critic.
Besides, this work primarily focuses on the decoding intervention framework for GEC models. It would be interesting to investigate whether the framework can be applied to other Seq2Seq-based approaches in different NLP tasks, such as machine translation and text summarization, and how to design suitable critics for these tasks. We leave these questions for future work.

Figure 1: Decoding intervention uses a critic to score the correctness of the token attached to the partially generated target sentence. The final score of a candidate token is the log-probability from the GEC model minus the critic penalty, which is scaled by λ.

Figure 2: Model performance (F0.5) of decoding intervention compared to vanilla decoding with different scale α and variability β. The x-axis is β and the y-axis is α. Red cells denote superior performance of decoding intervention compared to vanilla decoding; blue cells represent inferior performance. A deeper color indicates a larger performance difference.

Table 2: Results on GEC test datasets. :: The model of Yasunaga et al. (2021) on the GMEG-wiki dataset is trained only on synthetic data, which makes direct comparisons less meaningful.
Input: Scientists can not conclude whether this smell tactic is used to attract the Danman (evil partner in crime whom has mad guitar skillz) or to ward of predators (PhD supervisors) .
Reference 0: . . . crime who has mad guitar skillz) or to ward of predators (PhD supervisors) .
Reference 1: . . . crime who has mad guitar skills) or to ward off predators (PhD supervisors) .
Vanilla Decoding: . . . crime who has mad guitar skills) or to ward of predators (PhD supervisors) .
Decoding Intervention
├ Language Model: . . . crime who has mad guitar skills) or to ward off predators (PhD supervisors) .
├ Target-side GED: . . . crime who has mad guitar skills) or to ward of predators (PhD supervisors) .
└ Both: . . . crime who has mad guitar skills) or to ward off predators (PhD supervisors) .

Table 3: Qualitative examples of decoding intervention versus vanilla decoding. Corrections marked in blue are correct or suggested by the reference, while those in red are incorrect.

Table 4: Results on GEC dev datasets with different sizes of GPT-2 and target-side GED models.

Table 5: Results on GEC dev datasets with and without the dynamic coefficient strategy. Underline means the result is inferior to the vanilla decoding baseline.