Grammatical Error Correction with Contrastive Learning in Low Error Density Domains

Although grammatical error correction (GEC) has achieved good performance on texts written by learners of English as a second language, performance on low error density domains, where texts are written by English speakers of varying levels of proficiency, can still be improved. In this paper, we propose a contrastive learning approach to encourage the GEC model to assign a higher probability to a correct sentence while reducing the probability of incorrect sentences that the model tends to generate, so as to improve the accuracy of the model. Experimental results show that our approach significantly improves the performance of GEC models in low error density domains, when evaluated on the benchmark CWEB dataset.


Introduction
Grammatical error correction (GEC) is the task of correcting errors in a source sentence and generating a well-written and grammatically correct target sentence. Good results have been achieved by state-of-the-art GEC systems based on the seq2seq transformer architecture (Grundkiewicz et al., 2019; Choe et al., 2019; Omelianchuk et al., 2020). However, most prior approaches in GEC target English-as-a-second-language (ESL) datasets, where GEC systems are trained to correct errors made by ESL learners. In fact, grammatical and other writing errors are made not only by ESL speakers but also by native speakers. Therefore, correcting grammatical errors made by native speakers should also be considered, which helps to broaden the application of GEC.
Compared to ESL learners, native English speakers are less likely to make grammatical errors, so the density of errors in their sentences is much lower. As a result, the GEC model may end up over-correcting or failing to correct certain errors unique to native speakers.
To address the problem mentioned above, it is necessary to improve the ability of the model to discriminate grammatical features from ungrammatical features that differ only slightly. Recently, supervised contrastive learning (CL) was proposed (Khosla et al., 2020), which allows the model to learn discriminative features by pushing the features of positive samples closer together and the features of negative samples further apart. However, since GEC is a text generation task, it is not clear how to generate positive sample sentence pairs. To bridge this gap, we instead incorporate CL by increasing the probability of the model generating the right corrections and further reducing the probability of generating the wrong corrections, thereby improving the ability of the model to correct errors in low error density domains.
More specifically, we use the negative log-likelihood (NLL) loss to increase the probability of the model generating positive samples (the right corrections), and use a margin-based CL loss to widen the gap between the probability of positive samples and the probability of negative samples (the wrong corrections) predicted by the GEC model. In this paper, the negative samples are generated in two ways. The first kind of negative sample consists of the wrong corrections generated with high probability by the GEC model during beam search. The second kind of negative sample consists of erroneous sentences from the dataset that require some correction. Through the above negative sampling method, we make the model avoid over-correcting a correct sentence or neglecting to correct an erroneous sentence.
The main contributions of this paper are as follows:
• We propose a new loss function based on CL, which allows the model to achieve higher performance in low error density domains. To the best of our knowledge, our work is the first to incorporate CL in GEC.
• We design a negative sampling method with two strategies, which makes the GEC model avoid over-correcting a correct sentence or neglect to correct an erroneous sentence.
• Experimental results on the benchmark dataset show that our CL approach can significantly improve the performance of seq2seq GEC models compared to direct fine-tuning in low error density domains.

Method
In this section, we first introduce the background of grammatical error correction in Section 2.1, and then describe our contrastive learning method in Section 2.2.

Background of Grammatical Error Correction
Let s^(i) be an ungrammatical source sentence and t^(i) be the corrected grammatical target sentence. For a grammatical error correction (GEC) model parameterized by θ, the probability of the target sentence is factorized autoregressively as

\mathcal{P}(t^{(i)} \mid s^{(i)}; \theta) = \prod_{j=1}^{|t^{(i)}|} p(t_j^{(i)} \mid t_{<j}^{(i)}, s^{(i)}; \theta)    (1)

and the goal is to minimize the NLL for a set of M sentence pairs {(s^(i), t^(i))}_{i=1}^{M}, as follows:

\mathcal{L}_{NLL}(\theta) = -\sum_{i=1}^{M} \log p(t^{(i)} \mid s^{(i)}; \theta)    (2)

Given trained parameters θ̂, the hypothesis sentence t̂^(i) is generated using beam search to select the candidate with the highest probability, as follows:

\hat{t}^{(i)} = \arg\max_{t} p(t \mid s^{(i)}; \hat{\theta})    (3)

Contrastive Learning for GEC

The goal of our contrastive learning approach is to make the GEC model better distinguish grammatically correct features from grammatically incorrect features. Our proposed contrastive learning approach is described in Algorithm 1, which consists of three steps. Specifically, in the algorithm, the inputs D_T and D_F represent the datasets used for training (i.e., the standard GEC dataset) and fine-tuning (i.e., the low error density GEC dataset), respectively. In the first step, the GEC model is trained on D_T via Eq. 2 described in Section 2.1. In the second step, the negative sample dataset D̃_F is constructed using the negative sampling method that will be described in Section 2.2.3. In the third step, the model is fine-tuned via Eq. 5 to be described in Section 2.2.2.
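To make the three-step procedure of Algorithm 1 concrete, the following is a minimal Python-style sketch of the overall training flow; the callables (train_nll, build_negatives, finetune_cl) are illustrative placeholders rather than the names used in our implementation.

```python
def train_with_cl(train_nll, build_negatives, finetune_cl, D_T, D_F, k, gamma):
    """Sketch of the three steps of Algorithm 1; the callables passed in are
    assumed placeholder helpers, not the names used in our implementation."""
    # Step 1: train the GEC model on the standard GEC dataset D_T with the NLL loss (Eq. 2).
    model = train_nll(D_T)
    # Step 2: construct the negative sample dataset from the low error density dataset D_F
    # using the negative sampling method of Section 2.2.3.
    D_neg = build_negatives(model, D_F, k)
    # Step 3: fine-tune on D_F with the combined NLL + margin-based CL loss of Eq. 5.
    return finetune_cl(model, D_F, D_neg, gamma)
```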

Loss for Contrastive Learning
The idea of supervised contrastive learning is to make the features of samples from the same class close together, and the features of samples from different classes far apart (Khosla et al., 2020), thereby improving the feature discrimination of the model. However, since the GEC task is a text generation task instead of a classification task, there are no samples that belong to the same class.
To overcome this problem, we instead improve model feature discrimination by increasing the probability of the model generating positive samples (right corrections) and further reducing the probability of the model generating negative samples (wrong corrections).
Specifically, to discourage the model from generating ungrammatical sentences, we design a margin-based contrastive learning loss as follows:

\mathcal{L}_{CL}^{(i)}(\theta) = \sum_{n=1}^{N_i} \max\left(0,\ \gamma - \log p(t^{(i)} \mid s^{(i)}; \theta) + \log p(\tilde{t}_n^{(i)} \mid s^{(i)}; \theta)\right)    (4)

where (s^(i), t^(i)) is a positive sample pair and (s^(i), t̃_n^(i)) is a negative sample pair. For the i-th positive sample, it is possible to have N_i negative sample pairs, whose construction will be described in Section 2.2.3. We utilize \mathcal{L}_{CL}^{(i)}(\theta) to ensure that the margin of log-likelihood between a positive sample pair and a negative sample pair is higher than γ.
To further encourage the model to generate grammatically correct sentences, we combine \mathcal{L}_{CL}^{(i)}(\theta) with the NLL loss, and obtain the combined loss:

\mathcal{L}^{(i)}(\theta) = \mathcal{L}_{NLL}^{(i)}(\theta) + \mathcal{L}_{CL}^{(i)}(\theta)    (5)

where \mathcal{L}_{NLL}^{(i)}(\theta) = -\log p(t^{(i)} \mid s^{(i)}; \theta) is the NLL loss for the i-th sentence pair.
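As a minimal PyTorch sketch of Eq. 4 and Eq. 5, assuming the sentence-level log-probabilities of the positive and negative samples have already been computed; the reduction over negatives and any relative weighting of the two terms are implementation choices, not fixed by the equations above.

```python
import torch

def contrastive_loss(pos_logprob, neg_logprobs, gamma):
    # pos_logprob: scalar tensor, log p(t | s) of the positive (correct) sentence.
    # neg_logprobs: tensor of shape (N_i,), log p(t~_n | s) of each negative sentence.
    # Hinge on the margin (Eq. 4): penalize negatives whose log-likelihood comes
    # within gamma of (or exceeds) the positive's log-likelihood.
    violations = gamma - (pos_logprob - neg_logprobs)
    return torch.clamp(violations, min=0.0).sum()

def combined_loss(pos_logprob, neg_logprobs, gamma):
    # Combined objective (Eq. 5): NLL of the positive sample plus the CL loss.
    nll = -pos_logprob
    return nll + contrastive_loss(pos_logprob, neg_logprobs, gamma)
```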

Negative Sampling Method
We choose the wrong corrections that the model tends to generate and the erroneous sentences that the model neglects to correct as the negative samples. In this way, the model can learn more discriminative grammatical features. More formally, given a ground-truth sentence pair (s^(i), t^(i)) from D_F, a set of negative sample pairs (s^(i), t̃^(i)) is automatically constructed by the following two strategies (see the sketch after this list):
• For a positive sample pair (s^(i), t^(i)), we feed s^(i) to the model parameterized by θ̂, and choose the top k output sentences with the highest probability generated by beam search. Each sentence among these top k output sentences that is not identical to the target t^(i) is selected as a negative sentence t̃^(i) and forms a negative sample pair (s^(i), t̃^(i)).
• If s^(i) is not identical to t^(i) in the positive sample pair (s^(i), t^(i)) (i.e., some edits are made to s^(i) to generate the corrected sentence t^(i)), we further form a new negative sample pair (s^(i), s^(i)).
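The sketch below illustrates the two strategies; beam_search is an assumed helper (e.g., a wrapper around the toolkit's generation routine) that returns the top-k hypotheses of the trained model for a given source sentence, and is not part of our released code.

```python
def build_negative_samples(beam_search, dataset, k):
    """Construct negative sample pairs for each (source, target) pair in `dataset`.
    `beam_search(source, k)` is an assumed helper returning the top-k output
    sentences of the trained GEC model for `source`."""
    negatives = []
    for source, target in dataset:
        # Strategy 1: high-probability wrong corrections found by beam search.
        for hypothesis in beam_search(source, k):
            if hypothesis != target:
                negatives.append((source, hypothesis))
        # Strategy 2: if the source itself requires correction, the unchanged source
        # is a wrong "correction" and also forms a negative pair.
        if source != target:
            negatives.append((source, source))
    return negatives
```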

Experiments
In this section, we demonstrate the effectiveness of our CL approach.

Datasets
To evaluate the effectiveness of our CL approach in low error density domains, we conduct experiments on the public CWEB dataset (Flachs et al., 2020), and use the BEA-2019 training data (BEA-train) as the training set, consisting of NUCLE (Dahlmeier et al., 2013), FCE (Yannakoudakis et al., 2011), Lang-8 (Tajiri et al., 2012), and W&I (Bryant et al., 2019). Detailed statistics of the datasets are shown in Table 1. CWEB is a low error density dataset consisting of two domains, S and G. Compared to G, S focuses more on professional writing and contains fewer errors. CWEB-dev is the combination of the first 1,000 sentences from the CWEB-S development set (2,862 sentences in total) and the first 1,000 sentences from the CWEB-G development set (3,867 sentences in total), and the remaining 4,729 sentences are regarded as CWEB-train, similar to the setting used in (Flachs et al., 2020). When testing on CWEB-S/G-test, we use BEA-train for training and CWEB-train for fine-tuning.
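As a concrete illustration of this split (an illustrative sketch only; dataset loading and file formats are omitted), the dev/fine-tuning partition can be built by simple slicing:

```python
def make_cweb_splits(cweb_s_dev, cweb_g_dev, n_dev=1000):
    """Build CWEB-dev and CWEB-train from the two CWEB development sets:
    the first 1,000 sentences of each form CWEB-dev, and the remaining
    2,862 + 3,867 - 2,000 = 4,729 sentences form CWEB-train."""
    cweb_dev = cweb_s_dev[:n_dev] + cweb_g_dev[:n_dev]
    cweb_train = cweb_s_dev[n_dev:] + cweb_g_dev[n_dev:]
    return cweb_dev, cweb_train
```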

GEC Systems
In this paper, we employ our CL approach on two state-of-the-art seq2seq GEC systems, i.e., GEC-PD (Kiyono et al., 2019) and GEC-BART (Katsumata and Komachi, 2020), to verify its effectiveness. Detailed descriptions of these two systems follow.
GEC-PD uses a Transformer-based framework (Vaswani et al., 2017) with the Transformer-big setting. This system is first pre-trained on 70 million synthetic parallel sentences. Then, it is further trained on the erroneous portion of BEA-train, obtained by choosing only sentence pairs whose source sentence is not identical to the target sentence, which consists of 561,525 sentence pairs. During fine-tuning, we use the training setting in (Kiyono et al., 2019) but change its optimizer to Adam. Note that we do not apply any of the post-processing steps used in GEC-PD, because our goal is to compare our CL approach against direct fine-tuning.
GEC-BART builds its GEC system based on the BART-large model (Lewis et al., 2020) and is trained on the same BEA-train erroneous portion as GEC-PD. For fine-tuning, we choose the same setting as (Katsumata and Komachi, 2020).
In Table 2, the DI results of GEC-PD follow (Flachs et al., 2020); for the GEC-BART system, we obtain the DI and NLL results by following the original setting in (Katsumata and Komachi, 2020). CL− and CL denote the systems fine-tuned with our CL approach by minimizing the loss in Eq. 5 without and with the second strategy of the negative sampling method, respectively. For the detailed training setting, please see Appendix A.4. Statistically significant improvements (p < 0.001) of the CL approach over the NLL approach and the CL− system are marked with an asterisk (*) and a dagger (†), respectively.

Settings and Hyper-parameter Selection
In this paper, we implement the GEC systems based on publicly available code, and fine-tune the models using an NVIDIA V100 GPU. Unless otherwise stated, we use the same hyper-parameters as the original GEC systems. For evaluation, we use the ERRANT scorer (Bryant et al., 2017) for all datasets and carry out statistical significance tests using a one-tailed sign test with bootstrap resampling on 100 samples. There are two hyper-parameters in our CL approach: the number k of top-ranked candidates during beam search in Section 2.2.3, and the margin parameter γ in the loss function in Eq. 4. We select γ in the range (0.1, 1.0) with a step size of 0.05, and k in {2, 3, 4}, using grid search. We obtain the best results on CWEB-dev with k = 3, γ = 0.25 for GEC-PD, and k = 3, γ = 0.85 for GEC-BART.
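A simple grid search over these two hyper-parameters might look as follows (a sketch only; evaluate_on_dev is an assumed helper that fine-tunes with the CL loss for a given (k, γ) and returns the F0.5 score on CWEB-dev):

```python
def grid_search(evaluate_on_dev):
    """Select k and gamma by grid search on CWEB-dev; evaluate_on_dev(k, gamma)
    is an assumed helper that fine-tunes with the CL loss and returns F0.5."""
    gammas = [round(0.1 + 0.05 * i, 2) for i in range(19)]  # 0.10, 0.15, ..., 1.00
    best_k, best_gamma, best_score = None, None, float("-inf")
    for k in (2, 3, 4):
        for gamma in gammas:
            score = evaluate_on_dev(k, gamma)
            if score > best_score:
                best_k, best_gamma, best_score = k, gamma, score
    return best_k, best_gamma, best_score
```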

CWEB Results
The results of our CL approach with fine-tuning on CWEB-train are shown in Table 2. Since each CWEB sentence was annotated by two annotators, following the setting in (Flachs et al., 2020), we first calculate the F0.5 score based on each individual annotator, and report the average score as the final result.
In the S domain, GEC-PD with CL achieves the best performance, with an F0.5 score of 31.34%, and also achieves the largest improvement of 4.88% compared with NLL. In the G domain, GEC-PD with CL again achieves the best performance, with an F0.5 score of 33.03%, while GEC-BART achieves the largest improvement of 4.78% compared with NLL.
Compared with NLL, our CL approach significantly increases recall and achieves competitive precision for both systems, except for GEC-PD in the G domain. This is likely because GEC-PD is pre-trained on a large amount of synthetic data. Although CL fails to increase the precision of GEC-PD in the G domain, the overall F0.5 score still increases and the increase is statistically significant.

Ablation Study
We also carry out an ablation study to show the importance of the second strategy in the negative sampling method in low error density domains. The performance of CL without and with the second strategy of the negative sampling method is shown in Table 2.
The results show that after adding the second strategy of the negative sampling method, both precision and recall increase for both GEC systems. This shows that adding erroneous sentences that the model neglects to correct as negative samples is effective in low error density domains.

Effect on Over-Correction and Ignored Correction
We use the Overdone Edit (OE) ratio and the Ignored Edit (IE) ratio to measure over-correction and ignored correction, respectively. Specifically, we use the closed interval [start, end] to denote the range of an edit. An edit in the gold edits is counted as an IE if its range does not intersect with the range of any model-generated edit. Similarly, an edit in the model-generated edits is counted as an OE if its range does not intersect with the range of any gold edit. Model-generated edits that intersect with gold edits but are not correct are counted as wrong edits and fall into neither of the above two categories. The IE ratio is calculated by dividing the number of IEs by the number of gold edits, and the OE ratio is calculated by dividing the number of OEs by the number of model-generated edits. The results of the OE ratio and IE ratio are shown in Table 3.
Our CL approach successfully reduces both the IE ratio and the OE ratio for both systems in the S and G domains, except for the case of GEC-PD in the G domain. This result demonstrates that CL can effectively alleviate both the over-correction and the ignored-correction problems.
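To make the definitions concrete, the following is a small self-contained sketch of how these ratios can be computed from edit ranges (edits are represented here only by their closed intervals; checking the correctness of overlapping edits is omitted).

```python
def ranges_intersect(a, b):
    # Closed intervals [start, end]: True if the two edit ranges overlap.
    return a[0] <= b[1] and b[0] <= a[1]

def oe_ie_ratios(gold_edits, model_edits):
    """Compute the Overdone Edit (OE) and Ignored Edit (IE) ratios from edit ranges.
    Edits that intersect a counterpart but are wrong count as wrong edits and are
    excluded from both categories (they do intersect something)."""
    ignored = sum(1 for g in gold_edits
                  if not any(ranges_intersect(g, m) for m in model_edits))
    overdone = sum(1 for m in model_edits
                   if not any(ranges_intersect(m, g) for g in gold_edits))
    ie_ratio = ignored / len(gold_edits) if gold_edits else 0.0
    oe_ratio = overdone / len(model_edits) if model_edits else 0.0
    return oe_ratio, ie_ratio
```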

Grammatical Error Correction
The state-of-the-art approach in GEC uses sequence-to-sequence learning with transformer neural networks (Grundkiewicz et al., 2019; Choe et al., 2019; Omelianchuk et al., 2020). Several task-specific techniques have been proposed for seq2seq GEC models. Zhao et al. (2019) incorporated a copy mechanism into transformer networks (Vaswani et al., 2017), since many words in a source sentence are often correct and should be kept. Diverse ensembles (Chollampatt and Ng, 2018a), rescoring (Chollampatt and Ng, 2018b), and iterative decoding (Omelianchuk et al., 2020; Lichtarge et al., 2019) have also been applied to improve the accuracy of GEC.

Contrastive Learning
Contrastive learning has been used to learn good representations by contrasting positive samples with negative samples. Recent work demonstrates that contrastive learning can boost the performance of semi-supervised learning and self-supervised learning in computer vision.
In natural language processing, contrastive learning has also been used. In word2vec (Mikolov et al., 2013), a center word and a word in its surrounding context are regarded as a positive sample and their vector representations are pushed together, while a center word and a randomly chosen word are regarded as a negative sample and their vector representations are pushed further apart. Besides word2vec, contrastive learning has also been used in natural language inference (Cui et al., 2020), language modeling (Liza and Grzes, 2018), and knowledge graph embeddings (Bose et al., 2018).
Most of the above methods work at the sample level and have to generate both positive and negative samples. However, since positive samples are hard to generate in the GEC task, the above methods are not suitable for GEC. Compared to the above methods, our approach does not need to generate extra positive samples. Although Yang et al. (2019) propose a sentence-level margin loss-based method for machine translation that reduces word omission errors and likewise does not need positive samples, their negative samples are generated by omitting words at the token level and cannot be used in GEC. In contrast, our approach uses beam search to generate erroneous sentences as negative samples at the sentence level, which effectively prevents the model from making the mistakes it tends to make and is thus more suitable for the GEC task.

Conclusion
In this paper, we propose a contrastive learning approach and a corresponding negative sampling method to improve the performance of seq2seq GEC models in low error density domains. By assigning a higher probability to grammatical corrections and reducing the probability of wrong corrections that the model tends to generate, we improve the performance of GEC models in low error density domains.

A.1 Experimental Details
In this part, we will introduce the software packages we have used, implementation details and the training time required for each epoch.
Software configurations: All models are implemented based on the Fairseq and PyTorch packages. More specifically, we use Python 3.7 and PyTorch 1.7.0 (or above).
Implementation details: Our implementation of the loss function for contrastive learning is based on the cross-entropy loss with label smoothing, which is widely used to prevent the model from overfitting.
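As an illustration of this building block (a hedged sketch, not our exact Fairseq implementation; the smoothing weight and padding index shown are assumptions), a label-smoothed token-level NLL can be computed as follows:

```python
import torch

def label_smoothed_nll(log_probs, target, epsilon=0.1, padding_idx=1):
    """Token-level NLL with label smoothing (epsilon and padding_idx are assumptions).
    log_probs: (seq_len, vocab_size) log-probabilities from the decoder.
    target:    (seq_len,) gold token indices; padding positions are ignored."""
    nll = -log_probs.gather(dim=-1, index=target.unsqueeze(-1)).squeeze(-1)
    smooth = -log_probs.mean(dim=-1)            # uniform-distribution term
    mask = target.ne(padding_idx)
    loss = (1.0 - epsilon) * nll + epsilon * smooth
    return loss[mask].sum()
```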
Note that each CWEB sentence was annotated by two annotators with two possible corrections. We only use the first correction as the target for negative sampling.

A.2 Dev Set Performance
In this part, we will introduce the validation performance for fine-tuning on the two datasets.
CWEB: The validation performance for fine-tuning on the CWEB dataset using our contrastive learning approach is shown in the table below. All results are obtained by using the optimal hyper-parameters. The F0.5 scores in this table represent the validation scores on CWEB-dev.

A.3 Performance against Both Annotations
The performance of fine-tuning the GEC systems when calculated against both annotations (non-averaged) using the ERRANT toolkit is shown in the table below.