Grammatical Error Correction as GAN-like Sequence Labeling

In Grammatical Error Correction (GEC), sequence labeling models enjoy fast inference compared to sequence-to-sequence models; however, inference in sequence labeling GEC models is an iterative process, as sentences are passed to the model for multiple rounds of correction, which exposes the model to sentences with progressively fewer errors at each round. Traditional GEC models learn from sentences with fixed error rates. Coupling this with the iterative correction process causes a mismatch between training and inference that affects final performance. In order to address this mismatch, we propose a GAN-like sequence labeling model, which consists of a grammatical error detector as a discriminator and a grammatical error labeler with Gumbel-Softmax sampling as a generator. By sampling from real error distributions, our errors are more genuine compared to traditional synthesized GEC errors, thus alleviating the aforementioned mismatch and allowing for better training. Our results on several evaluation benchmarks demonstrate that our proposed approach is effective and improves the previous state-of-the-art baseline.


Introduction
Sequence-to-sequence neural solutions (Sutskever et al., 2014; Parnow et al., 2020) have been quite successful in comparison to their statistical counterparts, but these approaches suffer from a couple of key problems, which has given rise to sequence labeling approaches for GEC (Omelianchuk et al., 2020). Such approaches task models with generating a list of labels that classify the grammatical errors in a sentence before correcting those errors.
Sequence labeling approaches have recently gained popularity in GEC and are currently state-of-the-art. One typical aspect of sequence labeling approaches is labeling and correcting sentences through an iterative process. Because successive edits may depend on how other errors in a sentence are corrected, correcting only the most salient errors in each round allows models to achieve better performance. However, because of this process, models must handle sentences with varying error rates: during each round of inference for a given sentence, the model encounters a sentence with progressively fewer errors. This causes an exposure bias problem, as the training data does not match the test data, and suggests that providing the model with training data with varying error rates will lead to better performance.
To combat this exposure bias, we propose a new approach for training a sequence labeling GEC model that draws from GANs (Goodfellow et al., 2014), which consist of a generator that generates increasingly realistic fake inputs and a discriminator that is tasked with differentiating these fake inputs from real inputs. Other GEC works, like that of Raheja and Alikaniotis (2020), directly used GANs to produce grammatically correct sentences given grammatically incorrect ones. This contrasts with our work, which uses aspects of a GAN to enhance the training process rather than using a GAN itself as the correcting model. Our model consists of three components: an encoder, a Grammatical Error Detector, and a Grammatical Error Labeler. By sampling from the error distribution in the error labeler, our model can synthesize sentences with new errors, creating new sentence pairs for further training data. As a result, our Detector continually improves its ability to detect errors and essentially acts as a discriminator of errors, and our Labeler continually improves the authenticity of its error distribution and becomes a better generator of errors. This process allows us to counter the exposure bias problem sequence labeling GEC models face: in addition to generating new errorful sentences whose errors are increasingly representative of those in real data, we can use control parameters to set the error rates of these sentences and accommodate our iterative inference process.

Figure 1: An overview of our model.

Our Approach
We formulate the GEC task as a sequence labeling problem and build a neural sequence labeling model based on a deep pre-trained Transformer encoder, inspired by the work of Omelianchuk et al. (2020). The overall architecture of our full model is shown in Figure 1. There are three main components in our basic neural GEC model: a deep pre-trained Transformer Encoder, a Grammatical Error Detector, and a Grammatical Error Labeler. To accommodate our new GAN-like training process, we add a Gumbel-Softmax sampling component to the basic GEC model.

Background and Notation
First, in training, given an incorrect input sentence X = x_1, x_2, ..., x_n and its corrected version X^c = y_1, y_2, ..., y_m, the model predicts a corrective label sequence T = t_1, t_2, ..., t_n by minimizing the token-level Levenshtein distance on the span-based alignments of X and X^c. The corrective label set is given as T = {$KEP, $DEL, $APP, $REP} ∪ {$CAS, $MRG, $SPL, $NNUM, $VFORM}, in which the first set consists of the basic text editing transformation operations and the second consists of g-transformations as defined by Omelianchuk et al. (2020) for GEC. Aligning sentences using these transformations in preprocessing reduces what would be a sequence generation task handling unequal source-target lengths to a set of label classification problems. In this formulation, the neural sequence labeling model trains to minimize the negative log-likelihood loss for an input sequence:

L = -∑_{i=1}^{n} log p(t_i | X),

where p is the conditional probability that the model outputs at each position i.
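For illustration, the label-extraction step can be sketched with a simple token-level alignment. This is a toy stand-in for the paper's span-based Levenshtein alignment, and it omits the g-transformations; `to_labels` is a hypothetical helper name, not part of the released model.

```python
from difflib import SequenceMatcher

def to_labels(src, tgt):
    """Derive per-token corrective labels for src given the corrected tgt.

    A simplified sketch: $REP for replaced tokens, $DEL for deletions,
    and $APP on the token preceding an insertion; everything else is $KEP.
    """
    labels = ["$KEP"] * len(src)
    sm = SequenceMatcher(a=src, b=tgt, autojunk=False)
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op == "replace":
            for i in range(i1, i2):
                labels[i] = "$REP"
        elif op == "delete":
            for i in range(i1, i2):
                labels[i] = "$DEL"
        elif op == "insert":
            # attach appended tokens to the preceding source token
            labels[max(i1 - 1, 0)] = "$APP"
    return labels
```

For example, aligning "he go to school" with "he goes to school" yields a $REP label on "go" and $KEP elsewhere.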

Deep Pre-trained Transformer Encoder
As in most neural sequence labeling models (Ma and Hovy, 2016), a neural encoder such as a BiLSTM (Hochreiter and Schmidhuber, 1997) or a Transformer (Vaswani et al., 2017; Li et al., 2021) is used to extract context-aware features from the input sequence. Deep pre-trained language models such as BERT (Devlin et al., 2019; Zhang et al., 2020b), RoBERTa, and XLNet (Yang et al., 2019) have recently demonstrated the efficacy of Transformer models trained on large-scale unlabeled data in various NLP tasks. We leverage these highly beneficial models by using a pre-trained language model as our encoder. We define the contextualized features captured by the neural encoder as:

h_i = [Enc(X)]_i,

where Enc represents the encoder, and [·]_i represents the output of the i-th position after encoding.

Grammatical Error Detector and Labeler
Next, we adopt a Grammatical Error Detector (GED) to detect the presence of errors and a Grammatical Error Labeler (GEL) to predict detailed error labels. With these labels, corrections are applied to sentences. This process is typically iterative, as some corrections may depend on others, and applying corrections only once may not be enough to fully correct the sentence. During iterative correction, the model needs to assess at each round whether more correction is required. To this end, we use the GED to determine the degree of error for an entire sentence and control the iterative correction process.
Specifically, we use a binarization Y^b of the corrective labels Y as the training target of the GED and use Y as the training target of the GEL. To obtain label probabilities for grammatical error detection and labeling, two linear layers with softmax layers are appended to the encoder:

P_GED = Softmax(MLP_GED(h_i)),
P_GEL = Softmax(MLP_GEL(h_i)).

The binary classification probabilities in the GED output do not, by themselves, control the inference process's iterations. Rather, in addition to using the GEL error label probabilities as thresholds at individual sentence positions, we use the sum of the error probabilities as a threshold for attempting another round of correction on the whole sentence. The model continues correcting the sentence until it either reaches a preset maximum number of iterations or no longer satisfies the following condition:

∑_{i=1}^{n} p_i(err) > γ,

where p_i(err) is the error probability at position i and γ is the minimum error probability threshold for a sentence.
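The iterative inference procedure can be sketched as follows. This is a toy illustration with hypothetical stand-ins (`detect`, `label_and_correct`) for the learned GED and GEL modules; only the control flow mirrors the method described above.

```python
GAMMA = 0.5      # minimum summed error probability to trigger another round
MAX_ITERS = 5    # preset maximum number of correction iterations

def detect(tokens):
    """Stand-in for the GED: per-token probability of being erroneous."""
    return [0.9 if t == "an" else 0.05 for t in tokens]

def label_and_correct(tokens):
    """Stand-in for the GEL: apply a $REP correction ('an' -> 'a'), $KEP otherwise."""
    return ["a" if t == "an" else t for t in tokens]

def iterative_correct(tokens):
    """Correct repeatedly until the summed error probability falls below GAMMA."""
    for _ in range(MAX_ITERS):
        err_probs = detect(tokens)
        if sum(err_probs) <= GAMMA:   # sentence-level stopping condition
            break
        tokens = label_and_correct(tokens)
    return tokens
```

With these stand-ins, "this is an test" is corrected in the first round, after which the summed error probability drops below γ and iteration stops.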
Additionally, since GEC usually corrects only a small portion of a sentence (and there are therefore no errors at most input positions), the corrective label prediction task is an imbalanced classification problem. We alleviate this imbalanced classification issue by taking advantage of this prior knowledge and adding a fixed and preset confidence β to the label $KEP to keep a position unchanged when applying corrections:

t_i = argmax( P_GEL(h_i) + β · 1_{$KEP} ),

where 1_{$KEP} is the indicator vector of the $KEP label.
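The effect of the keep-confidence β can be illustrated on hypothetical label probabilities at a single position: the model slightly prefers $REP, but a small β tips the decision back to $KEP.

```python
# Hypothetical label probabilities at one sentence position.
LABELS = ["$KEP", "$DEL", "$APP", "$REP"]
PROBS = [0.40, 0.05, 0.10, 0.45]

def predict(probs, beta=0.0):
    """Pick the label with the highest probability after adding beta to $KEP."""
    biased = [p + (beta if l == "$KEP" else 0.0) for l, p in zip(LABELS, probs)]
    return LABELS[max(range(len(LABELS)), key=lambda i: biased[i])]
```

Here, `predict(PROBS)` chooses $REP, while `predict(PROBS, beta=0.1)` chooses $KEP, leaving the position unchanged.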

GAN-like Sequence Labeling Training
While we adopt sequence labeling instead of sequence-to-sequence modeling in this paper and therefore avoid the exposure bias problem caused by left-to-right sequence generation, our model still faces exposure bias because of the iterative correction process, which tasks the model with handling much more varied error rates in inference than in training, where it handles static data and does not use multiple rounds of correction. To address this issue, we borrow the idea of a GAN (Goodfellow et al., 2014) and propose a GAN-like iterative training approach for a sequence labeling GEC model. GANs, whose training objective can be formulated as a minimax game between a generator that creates increasingly realistic fake outputs and a discriminator that must differentiate these outputs from their real counterparts, have been suggested for sequence-to-sequence text generation (Zhang et al., 2020a; Li et al., 2018), as they do not suffer from exposure bias.

Algorithm 1: GAN-like sequence labeling training
1: for i in 1, ..., N do
2:   Initialize model parameters from the previous training stage, θ_i ← θ_{i-1}, when i > 1
3:   for j in 1, ..., M do
4:     for k in 1, ..., |D ∪ D_SYN| do
5:       Encode each sentence X_k as H_k
6:       P^k_GED = Softmax(MLP_GED(H_k))
7:       P^k_GEL = Softmax(MLP_GEL(H_k))
8:       loss_GED = CrossEntropy(P^k_GED, Y^k_err)
9:       loss_GEL = CrossEntropy(P^k_GEL, Y^k_label)
10:      loss = loss_GED + loss_GEL
11:      Update the model parameters θ_i with loss
12:    end for
13:  end for
14:  D_SYN = {}
15:  for k in 1, ..., |D| do
16:    Encode each sentence X_k as H_k
17:    Use P^k_GED and P^k_GEL to produce sampled sequence X^k_SYN
18:    Add (X^k_SYN, X^k_c) to D_SYN
19:  end for
20: end for

In our model, the GED module can be considered a discriminator, as it must differentiate whether tokens are erroneous, and by adding a sampling module to the GEL module, we can create a generator that outputs grammatical errors (rather than corrections) that are increasingly realistic.
We can then pair these sampled outputs with their golden sequences in the training dataset to create new training samples. This trains the model with more samples and more varied errors and alleviates the exposure bias issue. Separate cross-entropy losses are calculated for the Grammatical Error Detector and Labeler, and we detail the whole algorithm for our training process in Algorithm 1.

Table 1: Comparison of GEC models. The baseline comes from the model released by Omelianchuk et al. (2020).
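One training stage of this loop can be sketched as follows. This is a schematic with hypothetical helpers: `sample_errors` stands in for Gumbel-Softmax sampling from P_GEL, actual parameter updates are omitted, and only the data flow (train on real plus synthetic pairs, then refresh the synthetic pool against the golden targets) reflects the algorithm.

```python
import random

random.seed(1)

def sample_errors(golden, error_rate=0.3):
    """Stand-in for sampling errors from the labeler's distribution:
    corrupt tokens at a controllable rate (here, by uppercasing them)."""
    return [t.upper() if random.random() < error_rate else t for t in golden]

def gst_stage(real_pairs, syn_pairs):
    """One stage: gather training data, then regenerate the synthetic pool."""
    # 1) train on the union of real and synthetic pairs (updates omitted)
    data = real_pairs + syn_pairs
    # 2) pair freshly sampled errorful sentences with their golden targets
    new_syn = [(sample_errors(tgt), tgt) for _, tgt in real_pairs]
    return data, new_syn
```

Each stage thus yields one synthetic pair per golden sentence, and the synthetic pool is rebuilt from scratch so its errors track the current error distribution.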

Detailed Training Process
To synthesize new errors based on a genuine grammatical error distribution, we add a sampling module to the trained GEL module. Specifically, we use Gumbel-Softmax sampling, a simple and efficient way to draw samples z from a categorical distribution with class probabilities P_GEL using the Gumbel-Max trick (Gumbel, 1954; Maddison et al., 2014):

z = one_hot( argmax_i (g_i + log p_i) ),    (1)

where g_1, ..., g_{|C|} are i.i.d. samples drawn from Gumbel(0, 1). We use the softmax function as a continuous, differentiable approximation to argmax:

y_i = exp((log p_i + g_i) / τ) / ∑_{j=1}^{|C|} exp((log p_j + g_j) / τ),    (2)

where |C| is the number of classes and τ is the softmax temperature. Altering γ and β allows us to synthesize input samples with different error rates.
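Gumbel-Softmax sampling itself is straightforward to implement. The sketch below follows Equation (2) directly in plain Python (a deep learning framework would be used in practice so that the sample stays differentiable):

```python
import math
import random

random.seed(0)

def gumbel_softmax(log_probs, tau=1.0):
    """Draw one Gumbel-Softmax sample from class log-probabilities.

    Adds i.i.d. Gumbel(0, 1) noise to each log-probability, divides by the
    temperature tau, and applies a numerically stable softmax.
    """
    gumbels = [-math.log(-math.log(random.random())) for _ in log_probs]
    scores = [(lp + g) / tau for lp, g in zip(log_probs, gumbels)]
    m = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]
```

As τ → 0 the sample approaches a one-hot vector (the Gumbel-Max trick of Equation (1)); larger τ yields smoother, more uniform samples.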

Results and Analysis
Our results on the three test datasets are listed in Table 1. Results on all benchmarks are further improved using the GST approach, which demonstrates that GST can effectively alleviate the exposure bias issue. With GST, we achieve a new best result on the CoNLL-2014 test dataset, surpassing ensemble methods while using only a single model. To illustrate the benefits of sampling with Gumbel-Softmax, we replaced it with random sampling and with multinomial sampling; the comparison is shown in Table 2. Random sampling actually hampers performance, which shows that synthetic sentences not based on a genuine error distribution do not alleviate exposure bias. Both Gumbel-Softmax and multinomial sampling, which use a genuine error distribution, improve the model, though Gumbel-Softmax appears to be more suitable for sampling in sequence labeling modeling.
In Figure 2, we show how performance changes with increasing rounds of GST training. In the first few rounds, performance on the test datasets dropped as the model re-adapted to new errors; however, as the number of training rounds increased, performance gradually improved and finally stabilized.

Intermediate Outputs and Longer Training
In this experiment, we explored using intermediate outputs from our iterative inference process as additional training data to highlight the impact of generating new erroneous sentences by sampling from the real error distribution with our GST approach. For this experiment, we use our baseline architecture. As seen in the results in Table 3, whereas GST leads to a 0.6 F0.5 gain over the baseline, using intermediate outputs paired with golden sentences for additional training actually leads to worse performance, yielding a 0.3 F0.5 loss in comparison to the baseline.
To confirm that GST's performance gain is not due to the added training time, we also train the baseline for a commensurate number of additional steps but find that this has no effect on model performance. This experiment demonstrates that our model improves on the baseline without relying on additional training steps. We also note that, as our model is not significantly different in size from our baseline, our improvement is also not brought about by simply using a larger model.
Performance without Pre-trained Language Models

We additionally explored the performance of our system in the absence of contextualized pre-trained language models. As we expected, these models make our model much more resilient to the exposure bias problem, so without them, as seen in Table 4, the improvement brought about by GST is much more evident. In comparison to the baseline, using GST brings an improvement of 1.5 F0.5 points.

Conclusion
In this paper, we studied the exposure bias problem GEC sequence labeling models face. To alleviate this issue, we proposed a novel GAN-like training method for the GEC sequence labeling model. Through evaluation on three GEC benchmarks, we demonstrate that our novel training approach further improves a strong baseline model, illustrating the effectiveness of our training approach. Notably, with the help of pre-trained language models and our training approach, we achieved state-of-the-art results on the CoNLL-2014 benchmark.