Focal Training and Tagger Decouple for Grammatical Error Correction

In this paper, we investigate how to improve tagging-based Grammatical Error Correction models. We address two issues of current tagging-based approaches: the label imbalance issue and the tagging entanglement issue. We propose to down-weight the loss of correctly classified labels using Focal Loss and to decouple the error detection layer from the label tagging layer through an extra self-attention-based matching module. Experiments on three recent Chinese Grammatical Error Correction datasets show that our proposed methods are effective. We further analyze the choice of hyper-parameters for Focal Loss and for inference tweaking.


Introduction
Grammatical Error Correction (GEC) has been receiving increasing interest from the natural language processing community with the surging popularity of intelligent writing assistants like Grammarly. In the English language, a series of benchmarks (Ng et al., 2013, 2014; Bryant et al., 2019) have been created for the evaluation of different methods. In many other languages, various emerging datasets with language-specific challenges are also attracting plenty of attention (Trinh and Rozovskaya, 2021; Korre and Pavlopoulos, 2022; Náplava et al., 2022; Zhang et al., 2022; Xu et al., 2022; Jiang et al., 2022).
Existing methods for GEC can be categorized into sequence-to-sequence approaches, tagging-based approaches, and hybrid approaches. Sequence-to-sequence approaches require a larger amount of training data and usually rely on synthetic data for pretraining (Rothe et al., 2021; Stahlberg and Kumar, 2021; Kaneko et al., 2020). Tagging-based approaches adopt editing operations between the source text and the target text as training objectives (Malmi et al., 2019; Awasthi et al., 2019). These methods are faster at inference and can achieve performance competitive with sequence-to-sequence approaches. Hybrid models separate the tagging process and the insertion process into two stages and can easily change word order with an extra pointer network module (Mallinson et al., 2022). In average cases, hybrid approaches can achieve sub-linear inference time.
In this work, we are interested in tagging-based models due to their simplicity and high efficiency, and we would like to investigate how to further improve their performance. In the literature, there is also work on improving the performance of existing tagging-based models. For example, Tarnavskyi et al. (2022) explored ensembles of recent Transformer encoders in large configurations with various vocabulary sizes. For general-purpose improvements, existing methods range from optimizing training schemes to changing inference techniques. For training, Li et al. (2021) explore how to enhance a model by generating valuable training instances and applying task-specific pretraining strategies. For inference, Sun and Wang (2022) propose Align-and-Predict Decoding (APD) to offer more flexibility in the precision-recall trade-off. From the perspective of system combination, Qorib et al. (2022) propose a simple logistic regression algorithm to combine GEC models effectively.
Different from the methods discussed above, we focus on improving tagging-based models from the perspective of model design and learning. Currently, GECToR (Omelianchuk et al., 2020) is one of the representative tagging-based models. GECToR contains a pretrained transformer-based encoder with two linear classification layers as the tagger. One of the linear layers is used for label tagging, and the other for error detection. However, we identify the following issues with current tagging-based models: (1) Label imbalance. The training labels contain a large portion of easy-to-learn labels, and the distribution is highly skewed. Existing learning methods use cross entropy with label smoothing as the loss function, which is sub-optimal for this scenario. (2) Tagging entanglement. In current tagging-based models, sequence labeling and error detection are two linear classification layers over the same hidden representation. This entanglement may also hurt the performance of models.
To solve the problems discussed above, we propose the following modifications to tagging-based models: (1) We use Focal Loss (Lin et al., 2017) to counteract class imbalance and down-weight the loss assigned to correctly classified labels. (2) We decouple error detection from label tagging using an extra attention-based matching module (Wang and Jiang, 2017).
We then verify the effectiveness of the proposed methods on three recent Chinese grammatical error correction datasets. Through our experiments, we find that both the focal loss and the matching mechanism contribute to the performance gain.

Method
Suppose we have an edit operation set $O$ for the manipulation of text at the token level. Given a piece of source context denoted as $x = (x_1, x_2, \ldots, x_N)$ and its corrected target sequence $w = (w_1, w_2, \ldots, w_M)$, to construct a mapping from $(x, w)$ to $O$, we use a tagging scheme $T$ to first compute alignments between the two sequences. Then we assign candidate operations to each token in the sequence. The tagged label sequence is denoted as $y^l = \{(y_1, y_2, \ldots, y_N) \mid y_i \in O\} = T(x, w)$. Its corresponding error detection target is denoted as $y^d$.
Tagging Scheme. We use the tagging scheme from GECToR (Omelianchuk et al., 2020) with vocabularies and operations adapted by MuCGEC (Zhang et al., 2022). Specifically, the scheme computes an optimal token-level alignment between $x$ and $w$. Then, for each aligned pair of tokens, there are four choices of tagging labels: (1) KEEP for identical tokens, (2) DELETE if the token comes from $x$ only, (3) REPLACE_w if the token from $x$ is replaced by one from $w$, (4) APPEND_w if the token comes from $w$ only; for example, APPEND_， and REPLACE_识 in Figure 1. Note that if multiple insertions appear within one alignment, only the first one is used for training. The error detection labels are constructed from the tagging labels: a token is labeled COR (correct) if tagged as KEEP and ERR (error) otherwise.
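To make the scheme concrete, here is a minimal sketch of deriving tags from a token-level alignment. The alignment format and function names are our own illustration, not the GECToR/MuCGEC implementation; for readability, insertions occupy their own alignment slots rather than attaching to the preceding source token as in GECToR.

```python
def tags_from_alignment(alignment):
    """Map aligned (source, target) token pairs to edit-operation tags.

    `alignment` is a list of (src_tok, tgt_tok) pairs where a missing
    side is None; the real tooling computes this alignment itself.
    """
    tags = []
    for src_tok, tgt_tok in alignment:
        if src_tok is not None and src_tok == tgt_tok:
            tags.append("KEEP")                  # identical tokens
        elif tgt_tok is None:
            tags.append("DELETE")                # token only in x
        elif src_tok is None:
            tags.append(f"APPEND_{tgt_tok}")     # token only in w
        else:
            tags.append(f"REPLACE_{tgt_tok}")    # token substituted
    return tags

def detection_from_tags(tags):
    """Error detection targets follow directly from the tags."""
    return ["COR" if tag == "KEEP" else "ERR" for tag in tags]
```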
Encoder. We use a transformer-based encoder to process the tokenized input text. We enclose the context with the special tokens [CLS] and [SEP] and pass it into the BERT model. We use the last layer of BERT as the encoded hidden representation of the context. Since our tagging system is consistent with GECToR, we also use a mismatched encoder to obtain hidden representations of the original words, denoted as $H = (h_0, \ldots, h_N)$.
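As an illustration of mismatched encoding, the sketch below pools subword hidden states back to the original word positions using a HuggingFace fast tokenizer. This is only an approximation of the idea: GECToR's implementation relies on AllenNLP's mismatched embedder, and the paper's encoder is StructBERT-Large rather than the bert-base-chinese stand-in used here.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# bert-base-chinese only keeps the sketch self-contained; the paper
# uses StructBERT-Large as the transformer encoder.
tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModel.from_pretrained("bert-base-chinese")

def encode_words(words):
    """Return one hidden vector per original word (first-subword pooling)."""
    enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]   # (seq_len, dim)
    word_ids = enc.word_ids(0)                       # subword -> word index
    first_piece = {}
    for pos, wid in enumerate(word_ids):
        if wid is not None and wid not in first_piece:
            first_piece[wid] = pos                   # first subword of word
    return torch.stack([hidden[first_piece[i]] for i in range(len(words))])
```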
Tagger. Our tagger contains two separate linear classification heads. The label tagging head operates over the encoder's hidden representation $H$ directly:
$$p^l = \mathrm{softmax}(W^l H + b^l).$$
The error detection head is decoupled from the label tagging head using an input constructed from $H$ with a matching mechanism (Wang and Jiang, 2017) over its self-attended representation $\tilde{H} = \mathrm{SelfAttn}(H)$:
$$p^d = \mathrm{softmax}\big(W^d\, m(H, \tilde{H}) + b^d\big),$$
where $m(\cdot, \cdot)$ denotes the matching function.

Training Objective. In this paper, we choose Focal Loss to down-weight correctly classified labels:
$$\mathcal{L}_{\mathrm{FL}} = -\sum_{i} (1 - p_{i, y_i})^{\gamma} \log p_{i, y_i},$$
where $\gamma$ is a hyper-parameter that controls the loss assigned to these labels. Both label tagging and error detection contribute to the final loss, which is a linear combination of the label tagging loss $\mathcal{L}^l$ and the error detection loss $\mathcal{L}^d$:
$$\mathcal{L} = \mathcal{L}^l + \lambda \mathcal{L}^d,$$
where $\lambda$ is a positive hyper-parameter.
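Below is a minimal PyTorch sketch of the decoupled tagger and the focal-loss objective. The concrete matching features $[H; \tilde{H}; H - \tilde{H}; H \odot \tilde{H}]$ are our assumption in the spirit of Wang and Jiang (2017); the paper does not pin down the exact form of $m(\cdot, \cdot)$.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoupledTagger(nn.Module):
    def __init__(self, dim, num_labels, num_heads=8):
        super().__init__()
        self.label_head = nn.Linear(dim, num_labels)   # label tagging head
        self.self_attn = nn.MultiheadAttention(dim, num_heads,
                                               batch_first=True)
        # detection head sees matching features, not H itself (assumed form)
        self.detect_head = nn.Linear(4 * dim, 2)       # COR / ERR

    def forward(self, H):                              # H: (batch, seq, dim)
        label_logits = self.label_head(H)
        H_attn, _ = self.self_attn(H, H, H)            # self-attended repr.
        M = torch.cat([H, H_attn, H - H_attn, H * H_attn], dim=-1)
        detect_logits = self.detect_head(M)
        return label_logits, detect_logits

def focal_loss(logits, targets, gamma=2.0):
    """Focal loss (Lin et al., 2017) without the alpha balancing term."""
    log_p = F.log_softmax(logits, dim=-1)
    log_pt = log_p.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return (-(1 - log_pt.exp()) ** gamma * log_pt).mean()

# Final objective, with lam standing in for lambda:
# loss = focal_loss(label_logits.flatten(0, 1), y_label.flatten()) \
#      + lam * focal_loss(detect_logits.flatten(0, 1), y_detect.flatten())
```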

Experiments
We evaluate our method on three recent Chinese Grammatical Error Correction datasets.

Datasets
We use three recent grammatical error correction datasets for the Chinese language: MuCGEC (Zhang et al., 2022), FCGEC (Xu et al., 2022), and MCSCSet (Jiang et al., 2022). The evaluation metric reported in this paper is the span-level correction $F_{0.5}$ score computed with ChERRANT, a Chinese version of ERRANT. Specifically, ChERRANT computes an optimal sequence of char-level edits with the minimal edit distance given an input sentence and a correction. Consecutive char-level edits are then merged into span-level edits, resulting in the following error types: (1) Missing, (2) Redundant, (3) Substitution, (4) Word-Order.
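For reference, the $F_{0.5}$ score weights precision more heavily than recall:
$$F_{0.5} = \frac{(1 + 0.5^2)\, P \cdot R}{0.5^2 \cdot P + R},$$
where $P$ is precision and $R$ is recall over span-level edits.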
We analyze the validation sets of all used datasets using ChERRANT and show the error type distribution in Figure 2. The distribution indicates the difficulty of each dataset, which will be discussed in Section 3.3.

Settings
We use the GECToR model released by MuCGEC as our checkpoint. The model uses StructBERT-Large (Wang et al., 2020) as the transformer encoder, and it is the best tagging-based model on MuCGEC. It has 7,375 labels for token-level operations and 2 labels for error detection. We evaluate our model on the official benchmark websites for MuCGEC and FCGEC. For all our experiments, we use a learning rate of $1 \times 10^{-5}$ with batch size 128 and train for three epochs. The hyper-parameter $\lambda$ is set to 1, and the default $\gamma$ for Focal Loss is set to 2. We use the maximum $F_{0.5}$ score on the validation set to choose the best model for evaluation. To be consistent with MuCGEC, we apply iterative refinement for five iterations to obtain the final corrected results, as sketched below. No inference tweaking tricks are used for our main results in Section 3.3; we conduct further analysis on inference tweaking in Section 3.5.
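A minimal sketch of the iterative refinement loop, assuming a `predict_tags` function that returns one tag per source token in the tagging scheme of the Method section; the released implementation may differ in its exact stopping criterion.

```python
def apply_edits(tokens, tags):
    """Apply one round of tag-defined edits to the token sequence."""
    out = []
    for tok, tag in zip(tokens, tags):
        if tag == "KEEP":
            out.append(tok)
        elif tag == "DELETE":
            pass                                   # drop the token
        elif tag.startswith("REPLACE_"):
            out.append(tag[len("REPLACE_"):])
        elif tag.startswith("APPEND_"):
            out.extend([tok, tag[len("APPEND_"):]])
    return out

def iterative_correct(predict_tags, tokens, max_iter=5):
    """Re-tag and re-edit the sentence until it stabilizes or max_iter."""
    for _ in range(max_iter):
        tags = predict_tags(tokens)                # hypothetical model call
        if all(tag == "KEEP" for tag in tags):     # nothing left to fix
            break
        tokens = apply_edits(tokens, tags)
    return tokens
```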
The training costs are computed on an NVIDIA GeForce RTX 3090 GPU. For MuCGEC, the cost is 14 GPU hours; for FCGEC, 3.5 GPU hours; for MCSCSet, 3 GPU hours. Our code has been released on GitHub.

Main Results
We use the GECToR model as the baseline for comparison. For our proposed methods, we show incremental results of adding Focal Loss (FL) and Tagger Decouple (TD), denoted as GECToR + FL and GECToR + FL + TD.
We report evaluation scores on the test split of each dataset in Table 2. The baseline scores for MuCGEC and FCGEC are quoted directly from their original papers. The baseline score for MCSCSet is provided by us. We then list the scores of our proposed methods.
The table shows that training with Focal Loss improves performance on all datasets. If we further decouple error detection from label tagging, extra gains can be achieved consistently.
It is worth noting that the error type distribution reflects the complexity of a specific dataset. For example, MCSCSet is easier than the other two, even though it comes from a different domain, since its errors are mostly of the Substitution type.

Analysis of choices of γ
It remains to be answered whether we should choose a larger $\gamma$ to make the model more aggressive on harder labels. We conduct experiments on MuCGEC using different values of $\gamma$. To evaluate the generalizability of the trained models, we further adopt a zero-shot setting using the test split of FCGEC. We do not use MCSCSet for zero-shot evaluation due to its low domain similarity with MuCGEC and its low error type diversity.
In Table 3, we list results for GECToR and GECToR + FL using different values of $\gamma$. On MuCGEC, a larger $\gamma$ helps the model do better in evaluation. However, if we take the zero-shot setting into consideration, the performance on FCGEC does not rise consistently with the increase of $\gamma$. This indicates that a larger $\gamma$ tends to make the model overfit the training data.

Analysis of inference tweaking
Inference tweaking has been used as a post-processing technique to further improve the performance of tagging-based models. The method searches for two hyper-parameters $(\delta, \beta)$ over the validation set: $\delta$ is a threshold on the sentence-level minimum error probability, and $\beta$ is a positive confidence bias for keeping the source token.
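A sketch of how the two knobs could act at inference time, under our reading of the description above; the released GECToR code may place them differently.

```python
import torch

def tweak_and_decode(label_probs, err_probs, keep_id, delta=0.0, beta=0.0):
    """
    label_probs: (seq_len, num_labels) tag probabilities for one sentence
    err_probs:   (seq_len,) probability of the ERR detection class
    keep_id:     index of the KEEP label
    delta gates the whole sentence by its maximum error probability;
    beta biases the KEEP label so the model keeps source tokens more often.
    """
    if err_probs.max() < delta:            # sentence deemed error-free
        return torch.full((label_probs.size(0),), keep_id, dtype=torch.long)
    biased = label_probs.clone()
    biased[:, keep_id] += beta             # positive confidence bias for KEEP
    return biased.argmax(dim=-1)
```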
Considering that inference tweaking improves $F_{0.5}$ scores by trading off precision and recall, we conduct experiments to compare how our proposed methods perform against it. We use the validation and test splits of MuCGEC for illustration. In Table 4, we list the best scores achieved after applying inference tweaking and place the difference values in brackets. All scores on the validation split increase by roughly 0.5 points. However, on the test split, the tweaked results do not rise consistently. Although inference tweaking is effective, there is no guarantee that the $(\delta, \beta)$ searched over the validation set works for each specific model.

Limitations
In this work, we have focused on improving the performance of tagging-based Grammatical Error Correction. Our work has the following limitations: (1) We work on three recent Chinese Grammatical Error Correction datasets, but there are many emerging datasets in various languages. We will add support for these languages in our GitHub repository and make all resources publicly accessible. (2) We point out a limitation of inference tweaking, but it remains to be explored how to explain the phenomenon and derive better tweaking methods.

Conclusion
In conclusion, focal training and tagger decoupling are effective in improving current tagging-based Grammatical Error Correction models. However, it is also important to choose a suitable $\gamma$ for Focal Loss, considering the generalizability of the model. For inference tweaking, a widely adopted post-processing technique, whether there will be a significant performance gain depends on the model.

Figure 1: Model structure with error detection decoupled from label tagging.

Figure 2: Error type distribution over the validation sets of the used datasets. For MCSCSet, the numbers are divided by 10 to make them fit into the figure.

Table 2: Comparison of our proposed methods and the GECToR model on the test split.

Table 3: Comparison of our proposed methods with different $\gamma$ values on MuCGEC and FCGEC.

Table 4: Performance differences over the validation split and test split after applying inference tweaking.