From Spelling to Grammar: A New Framework for Chinese Grammatical Error Correction

Chinese Grammatical Error Correction (CGEC) aims to generate a correct sentence from an erroneous sequence in which different kinds of errors are mixed. This paper divides the CGEC task into two steps, namely spelling error correction and grammatical error correction. Specifically, we propose a novel zero-shot approach for spelling error correction that is simple but effective, obtaining high precision to avoid error accumulation in the pipeline structure. To handle grammatical error correction, we design part-of-speech (POS) features and semantic class features to enhance the neural network model, and propose an auxiliary task of predicting the POS sequence of the target sentence. Our proposed framework achieves a 42.11 F0.5 score on the CGEC dataset without using any synthetic data or data augmentation methods, outperforming the previous state-of-the-art by a wide margin of 1.30 points. Moreover, our model learns meaningful POS representations that distinguish words of different parts of speech and capture reasonable POS transition rules.


Introduction
Grammatical error correction (GEC) takes erroneous sequences as input and generates correct sentences. In recent years, the English GEC task has attracted wide attention from researchers. By employing pre-trained models (Kaneko et al., 2020; Katsumata and Komachi, 2020) or incorporating synthetic data (Grundkiewicz et al., 2019; Lichtarge et al., 2019), sequence-to-sequence models achieve remarkable performance on the English GEC task. Besides, several sequence labeling approaches have been proposed to cast text generation as token-level edit prediction (Malmi et al., 2019; Awasthi et al., 2019; Omelianchuk et al., 2020).
Chinese grammatical error correction (CGEC) is less addressed. Previous works adopt ensemble methods by combining seq2seq networks with heuristic rules (Zhou et al., 2018; Fu et al., 2018) or sequence editing approaches (Hinson et al., 2020; Zhang et al., 2022). Different from English, the Chinese language uses function words instead of affixes to express forms and tenses, making it hard to design detailed Chinese-specific edit labels. The simple label strategy (for example, Keep, Delete, Append_X) has been proved not competitive with the seq2seq model on the CGEC task (Chen et al., 2020; Zhang et al., 2022).
Generally, the errors occurring in Chinese texts can be divided into spelling errors and grammar errors. For example, in Figure 1, "主" is a spelling error which should be substituted with "注", while "和 (and)" relates to a grammatical error and should be deleted. According to our statistics on HSK data, which is collected from the writing section of the Chinese proficiency exam, the proportion of spelling errors is about 18.58%. Unfortunately, most previous works mix the two kinds of errors and adopt the same model to handle them. Moreover, since tokens are the basic elements for sentence understanding and processing, spelling errors will influence the usage of high-level features in the CGEC task. Although various linguistic features have been investigated to improve many natural language processing tasks, deep syntactic and semantic knowledge is rarely explored in CGEC.
Therefore, we propose a new framework for CGEC with two steps: Spelling error correction and semantic-enriched Grammatical error correction (SG-GEC). We propose a novel zero-shot method for Chinese spelling error correction, which takes advantage of the pre-trained BERT and Chinese phonetic information and, while straightforward, achieves satisfying precision. Further, we introduce semantic knowledge into the seq2seq model to correct grammatical errors. We carefully analyze the reliability and utility of part-of-speech (POS) tags in erroneous-correct sentence pairs, and design an effective method to integrate POS and semantic class representations into the neural network model. Moreover, we introduce an auxiliary task of POS sequence prediction, where a Conditional Random Field (CRF) layer is added to ensure the validity of generated POS sequences and stimulate the model to learn grammar-level corrections.
We conduct extensive experiments on the NLPCC CGEC dataset (Zhao et al., 2018). Experimental results show that our proposed zero-shot spelling error correction module achieves a precision of 60.25, which lays a good foundation for further leveraging word-level features. With the pre-trained BART for initialization, our model achieves a new state-of-the-art result of 42.11 F0.5 score, outperforming all previous approaches including pre-trained models and ensemble methods. We also evaluate model performance on the CGED-2020 test dataset (Rao et al., 2020) and obtain satisfying results.
To sum up, our contributions are as follows:

• We present a new framework for CGEC, which first conducts a preliminary spelling error correction and then performs grammatical error correction with semantic features.
• We propose a novel zero-shot Chinese spelling error correction method, which is straightforward and achieves a high precision.
• We effectively inject semantic knowledge into CGEC at both the encoder and decoder, by incorporating POS and semantic class features into the input embeddings, and introducing an auxiliary task of POS sequence generation in the decoding phase.
• Our proposed model obtains a new state-of-the-art result on the CGEC task, outperforming previous works by a wide margin without using any data augmentation method.

Observation and Intuition
Various types of linguistic features have been exploited in NLP and bring improvements on different tasks. However, it remains an open issue to introduce linguistic features into GEC. Different from other NLP tasks, GEC takes erroneous sentences as input, so extra features extracted from them might bring noise that harms the performance of the GEC model.

Part-of-Speech and Grammar Errors
Part-of-speech represents the syntactic function of a word in context, which is closely connected with grammar. To bring POS features to the GEC task, the reliability and sensitivity of POS tags to grammar errors should be carefully examined. We conduct such an analysis on the NLPCC dataset, using Jieba as the POS tagger. According to our statistics, 88.2% of erroneous sentences have POS sequences different from those of their paired correct sentences, demonstrating that the POS feature is sensitive to grammatical errors. An example is given in Figure 1. We compute the LCS (Longest Common Subsequence) between erroneous-correct sentence pairs and divide tokens in the erroneous sentence into two types: Corr-tokens, which appear in the LCS, and Err-tokens, which do not. Consequently, 98.1% of Corr-tokens have correct POS tags, proving that the POS tagger provides precise features for the correct part of erroneous sentences. As for the 1.9% of Corr-tokens with wrong POS tags (red colour in Figure 1), the average distance between them and the nearest Err-token (orange colour) is 2.38 tokens. In contrast, the average distance between Corr-tokens with correct POS tags and the nearest Err-token is 8.59 tokens. This suggests that Corr-tokens with wrong POS tags lie next to the erroneous parts of sentences.
All these statistical results demonstrate that the POS feature is sensitive to the erroneous parts of sentences while remaining robust for the correct parts.
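The Corr-token / Err-token analysis above can be sketched as follows. This is an illustrative reimplementation, not the paper's code: we approximate the LCS with `difflib.SequenceMatcher` (which finds a common subsequence greedily rather than a guaranteed-longest one), and the function names are ours.

```python
# Classify tokens of an erroneous sentence against its corrected counterpart,
# approximating the LCS with difflib's matching blocks.
from difflib import SequenceMatcher

def classify_tokens(err_tokens, corr_tokens):
    """Label each token 'Corr' if it falls in the (approximate) LCS with the
    correct sentence, else 'Err'."""
    matcher = SequenceMatcher(a=err_tokens, b=corr_tokens, autojunk=False)
    in_lcs = set()
    for block in matcher.get_matching_blocks():
        in_lcs.update(range(block.a, block.a + block.size))
    return ["Corr" if i in in_lcs else "Err" for i in range(len(err_tokens))]

def distance_to_nearest_err(labels, i):
    """Distance in tokens from position i to the nearest Err-token."""
    err_pos = [j for j, lab in enumerate(labels) if lab == "Err"]
    return min(abs(i - j) for j in err_pos) if err_pos else None

# Toy example: token 'X' is the erroneous insertion.
labels = classify_tokens(list("abXcd"), list("abcd"))
print(labels)  # ['Corr', 'Corr', 'Err', 'Corr', 'Corr']
```

With labels like these, the paper's statistics (share of Corr-tokens with wrong POS tags, mean distance to the nearest Err-token) reduce to simple aggregation over the tagged corpus.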

Semantic Class and Grammar Errors
Word semantic class is a context-free feature which tells which class each word belongs to according to a semantic dictionary. The dictionary is organized in a tree structure consisting of different levels of semantic classes. For example, the top-3 level semantic classes of soup are entity, water and boiled water. By introducing semantic class knowledge, the model can learn the correlations between different semantic classes and thus correct some semantic collocation errors, such as "冬阴功对外国人的喜爱 (Seafood soup enjoys foreigners.)". In this example, a kind of food is incorrectly used as the subject performing the action "enjoy".
We leverage HIT-CIR Tongyici Cilin (Extended) to provide semantic class knowledge.

Zero-shot Spelling Error Correction
We present a new framework for CGEC, as shown in Figure 2, which consists of Spelling Error Correction (SEC) and Grammatical Error Correction (GEC). Firstly, we propose a simple zero-shot method for SEC.
Formally, the input erroneous sentence is represented as $X = (x_1, x_2, \dots, x_n)$, and the target correct sentence is denoted as $Y = (y_1, y_2, \dots, y_m)$, where $n$ and $m$ are the sentence lengths. We use $\hat{X} = (\hat{x}_1, \hat{x}_2, \dots, \hat{x}_n)$ to represent the output of the zero-shot spelling error correction module. Specifically, if a token $x_i$ in $X$ has a relatively high probability of being written incorrectly, it is substituted with a [MASK] token. Then, the pre-trained BERT model (Devlin et al., 2019) is employed to generate the top-3 candidate tokens $V = (v_1, v_2, v_3)$ with the highest probability. Among the candidates, we select the token that most likely appears in the [MASK] position; if more than one candidate belongs to $SimSet(x_i)$, we choose the one with the highest score generated by BERT. According to previous studies, over 80% of spelling errors in Chinese are related to phonological similarity (Liu et al., 2010), so we set $SimSet(x_i)$ to be the collection of homophones of $x_i$. Figure 2 gives an example in the left part. In the given sentence, 金 (golden) is suspected to be written incorrectly and is substituted with [MASK]. The top three tokens with the highest BERT scores are 去 (last), 今 (this) and 每 (every). Among them, 今/jin shares the same pinyin with 金/jin, so we replace 金 with 今 in the original sentence.
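The homophone-constrained selection step can be sketched as below. This is a minimal stand-in, not the paper's implementation: `bert_topk` is a hypothetical dict of candidate-to-score pairs that a real BERT fill-mask model would produce, and `sim_set` stands for the homophone set that a pinyin dictionary would supply.

```python
# Select a correction for a suspected token: among BERT's top-k candidates
# for the [MASK] position, prefer tokens phonologically similar to the
# original (its homophones); otherwise leave the token unchanged.
def select_correction(original, bert_topk, sim_set, k=3):
    # keep only the top-k candidates by BERT score
    topk = sorted(bert_topk, key=bert_topk.get, reverse=True)[:k]
    # among them, keep only tokens in SimSet(original)
    homophones = [tok for tok in topk if tok in sim_set]
    if homophones:
        return max(homophones, key=bert_topk.get)
    return original  # no plausible homophone candidate

# The Figure 2 example: 金 (golden) is masked; 今 shares the pinyin "jin".
candidates = {"去": 0.41, "今": 0.33, "每": 0.12}  # toy BERT scores
print(select_correction("金", candidates, sim_set={"今", "津", "斤"}))  # 今
```

Falling back to the original token when no candidate is a homophone is what keeps this step conservative, and hence high-precision.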
A remaining problem in SEC is how to decide whether a token $x_i$ is likely to be written incorrectly. Intuitively, punctuation and commonly used Chinese characters, whose occurrences in the training dataset exceed a threshold $k_c$, are less likely to be written incorrectly. We directly keep these tokens unchanged to improve the precision of the SEC module and reduce the computational expense. To find an appropriate threshold value $k_c$, we conduct experiments on the test set of SIGHAN-2015 (Tseng et al., 2015a), which is designed specifically for Chinese spelling error correction and contains 1100 examples collected from Chinese language learners.
As shown in Figure 3, the precision score on the SIGHAN test dataset peaks at 66.1 when $k_c = 80{,}000$ and gradually declines after $k_c > 120{,}000$. To restrain error accumulation in the pipeline structure, we want our SEC module to achieve high precision. Accordingly, we set $k_c = 80{,}000$ in our experiments.
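The frequency filter above can be sketched as follows; the function name and the toy corpus counts are illustrative, not from the paper.

```python
# Flag positions that remain candidates for masking: tokens that are neither
# punctuation nor more frequent than k_c in the training corpus.
from collections import Counter
import string

def suspicious_positions(tokens, freq, k_c):
    keep = set(string.punctuation) | set("，。！？、；：")  # ASCII + common Chinese punctuation
    return [i for i, tok in enumerate(tokens)
            if tok not in keep and freq[tok] <= k_c]

freq = Counter({"的": 500_000, "是": 300_000, "金": 12_000, "，": 900_000})
print(suspicious_positions(["是", "金", "，", "的"], freq, k_c=80_000))  # [1]
```

Only the flagged positions are handed to the masking-and-reranking step, which is what keeps the module cheap as well as precise.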

Integrating Semantics for Grammatical Error Correction
We adopt the Transformer encoder-decoder architecture for grammatical error correction. To inject semantic knowledge into CGEC, we incorporate semantic knowledge representations at the encoder, and design POS sequence generation as an auxiliary task in the decoding phase.
We add the semantic knowledge embedding $E_{semantics}$ to the original word embedding to form the input of the encoder:

$$E_{input} = E_{word} + E_{semantics}$$

In the decoding phase, we take the hidden state $h_t$ at timestep $t$ to predict the $t$-th token in the target sentence:

$$P^v_t(w) = \mathrm{softmax}(W h_t + b)$$

where $P^v_t(w)$ is the generation probability of each token $w$ in the vocabulary.

Injecting Semantic Features
The semantic knowledge is composed of POS tags and semantic classes. Please note that, in this stage, the input sequence is $\hat{X}$, on which spelling error correction has already been performed.
We leverage a POS tagger to obtain the POS tag sequence $\hat{X}^p = (\hat{x}^p_1, \hat{x}^p_2, \dots, \hat{x}^p_n)$ for the tokens in $\hat{X}$, and compute the embedding of the POS tag sequence with a learnable embedding table. As shown in Table 1, there are different levels of semantic classes to specify a word. We use $\hat{X}^{c,l} = (\hat{x}^{c,l}_1, \hat{x}^{c,l}_2, \dots, \hat{x}^{c,l}_n)$ to represent the $l$-th level class feature of each token. The high-level semantic class brings precise information, but if only the high-level class feature were used, the model would treat words as isolated groups and ignore their relations at low levels. Therefore, the semantic class representation of a token is calculated as the concatenation of the embeddings of its first $k$ levels. Considering that POS can be regarded as a kind of coarse semantic knowledge located at the lowest level of the semantic class hierarchy, we concatenate the POS embedding and the semantic class embeddings to obtain the semantic representation:

$$e^{sem}_i = [e^p_i; e^{c,1}_i; \dots; e^{c,k}_i]$$

To make the dimension of the semantic embedding equal to that of the word embedding $d_E$, the dimension of the POS embedding $d_p$ and that of each semantic class embedding $d_c$ are set such that $d_p + k \cdot d_c = d_E$.
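A toy sketch of this feature construction is shown below. We assume, for illustration only, an even split of $d_E$ across the POS embedding and the $k$ class levels (the text only requires the dimensions to sum to $d_E$); the lookup tables are random stand-ins for what the model would learn jointly.

```python
# Build a per-token semantic representation by concatenating a POS embedding
# with embeddings of the first k semantic-class levels, sized to match d_E.
import random

d_E, k = 12, 2
d_p = d_c = d_E // (k + 1)  # illustrative even split: d_p + k * d_c == d_E

random.seed(0)
def table(names, dim):
    """Random embedding table: tag name -> vector (stand-in for learned weights)."""
    return {n: [random.uniform(-1, 1) for _ in range(dim)] for n in names}

pos_emb = table(["n", "v", "p"], d_p)
class_emb = [table(["entity", "act"], d_c) for _ in range(k)]  # one table per level

def semantic_embedding(pos_tag, class_tags):
    vec = list(pos_emb[pos_tag])
    for level, tag in enumerate(class_tags):
        vec += class_emb[level][tag]
    return vec

vec = semantic_embedding("n", ["entity", "entity"])
assert len(vec) == d_E  # matches the word embedding dimension
```

In the real model this vector is added to the word embedding before the encoder, so matching $d_E$ is what makes the addition well-defined.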

Predicting POS Sequence
As described in Section 2.1, wrong POS tags are usually close to the erroneous parts of sentences, which indicates that token-level error correction shares the same target with POS-level error correction. Moreover, POS-level errors are more general, since various types of token-level errors might map to the same error at the POS level. Inspired by this observation, we design a sub-task to predict the error-free POS sequence. At timestep $t$, the generation probability of each POS tag is computed with a linear transformation and softmax:

$$P^p_t = \mathrm{softmax}(W_p h_t + b_p)$$

The cross entropy loss is commonly used to push the model toward a target sequence. However, besides being close to the golden POS sequence, the generated POS sequence itself should be well-formed. To model the dependencies among neighboring POS tags, we adopt Conditional Random Fields (CRF) (Lafferty et al., 2001), under which the likelihood of the target POS sequence $Y^p = (y^p_1, y^p_2, \dots, y^p_m)$ is computed as:

$$P(Y^p \mid X) = \frac{1}{Z(X)} \exp\Big(\sum_{t=1}^{m}\big(s(y^p_t, h_t) + T(y^p_{t-1}, y^p_t)\big)\Big)$$

where $s(\cdot)$ is the emission score, $T(\cdot)$ is the learned transition score between neighboring tags, and dynamic programming (Forney, 1973; Lafferty et al., 2001) is utilized to calculate the normalizing factor $Z(X)$.
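The CRF likelihood can be illustrated with a brute-force sketch over a tiny tag set. All scores below are toy numbers, and the exhaustive enumeration of $Z(X)$ stands in for the dynamic-programming computation used in practice.

```python
# Brute-force linear-chain CRF over POS tags: score(y) sums per-timestep
# emission scores and pairwise transition scores; Z(X) sums exp(score) over
# all possible tag sequences so the result is a proper distribution.
from itertools import product
from math import exp

TAGS = ["n", "v"]
T = {"n": {"n": 0.1, "v": 1.0}, "v": {"n": 0.8, "v": -0.5}}  # transition scores

def score(seq, emit):
    s = sum(emit[t][y] for t, y in enumerate(seq))          # emissions
    s += sum(T[a][b] for a, b in zip(seq, seq[1:]))         # transitions
    return s

def sequence_prob(seq, emit):
    Z = sum(exp(score(list(c), emit)) for c in product(TAGS, repeat=len(seq)))
    return exp(score(seq, emit)) / Z

emit = [{"n": 2.0, "v": 0.1}, {"n": 0.3, "v": 1.5}]          # toy per-step scores
probs = [sequence_prob(list(c), emit) for c in product(TAGS, repeat=2)]
assert abs(sum(probs) - 1.0) < 1e-9  # probabilities over sequences sum to 1
```

Note how the transition scores reward the sequence ("n", "v") beyond what the emissions alone would: this is exactly the neighboring-tag dependency the CRF layer contributes.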

Training Objective
As shown in Figure 2, our model is trained to generate the target sentence and the POS sequence simultaneously, and thus the final loss is computed as the weighted sum of the sentence generation loss and the POS sequence prediction loss.

Experimental Setup

Dataset and Evaluation Metric
We conduct experiments on the dataset of the NLPCC-2018 shared task (Zhao et al., 2018), which contains 1.12 million training samples collected from the language learning platform Lang-8 (https://lang-8.com/) and 2,000 human-annotated samples for test. We randomly selected 5,000 instances from the training data as the development set. Besides, we converted the format of the CGED-2020 test dataset (Rao et al., 2020) to suit our task, and manually corrected 283 word-order errors in CGED-2020 to obtain error-free sentences (please refer to Appendix A). We evaluate our model on CGED-2020 (1,457 samples) as a supplement.
For the NLPCC-2018 test dataset, we segment model outputs with the official PKUNLP tool and adopt the official MaxMatch (M2) scorer (Dahlmeier and Ng, 2012) to calculate precision, recall and the F0.5 score. For the CGED-2020 test dataset, we apply a simple char-based evaluation using ChERRANT to avoid the influence of different word segmentation tools.
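For reference, the F0.5 metric used by such scorers is the precision-weighted F-measure over edit counts; with β = 0.5, precision counts twice as much as recall. A minimal sketch with toy counts (not real scorer output):

```python
# F_beta from edit counts: tp = correct edits, n_proposed = system edits,
# n_gold = reference edits. beta = 0.5 emphasizes precision over recall.
def f_beta(tp, n_proposed, n_gold, beta=0.5):
    p = tp / n_proposed if n_proposed else 0.0
    r = tp / n_gold if n_gold else 0.0
    if p == 0.0 and r == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * p * r / (b2 * p + r)

# 30 correct edits out of 50 proposed, against 80 gold edits:
print(round(f_beta(30, 50, 80), 4))  # 0.5357
```

The precision emphasis is why a conservative SEC front-end (few but accurate corrections) helps the final score.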

Training Details
Our model is implemented with Fairseq. We average the parameters of the last 5 checkpoints. We use BART-base-chinese to initialize our model, use the BERT tokenizer for word tokenization, and replace some [unused] tokens with Chinese punctuation. Please refer to Appendix B for more parameter settings.
We also refer to the following previous works as baselines:

ESD-ESC uses a pipeline structure to first detect erroneous spans and then generate the correct text for the annotated spans (Chen et al., 2020).
HRG proposes a heterogeneous approach composed of an LM-based spelling checker, an NMT-based model and a sequence editing model (Hinson et al., 2020).
MaskGEC adds random noise to source sentences dynamically during training (Zhao and Wang, 2020).
The S2A model combines the outputs of a seq2seq framework and a token-level action sequence prediction module (Li et al., 2022).
Results and Analysis

Overall Performance
Table 1 reports the main evaluation results of our proposed model on the NLPCC-2018 and CGED test datasets, compared with previous work.
Our proposed SG-GEC model obtains a new state-of-the-art result with a 42.11 F0.5 score, outperforming the previous best single / ensemble model by 5.14 / 1.30 points. Meanwhile, our SG-GEC model surpasses GECToR, which achieves the SOTA result on the English GEC task. Compared with the fine-tuned BART-base method, our strategy brings a performance gain of 2.53 points. What's more, our SG-GEC model achieves significantly better precision among single models, which is vital for some real-world applications. Without using pre-trained language models, our method outperforms the baseline Transformer by a large margin of 7.01 F0.5 points.
Meanwhile, when initialized with the pre-trained BART, our proposed framework clearly surpasses MaskGEC. This demonstrates that our SG-GEC model brings additional semantic knowledge which is more beneficial to the strong BART model than simple data augmentation methods.
Our model consistently outperforms all other models when evaluated on CGED-2020 test dataset, which proves the generality of our model.

Ablation Study
We conduct an ablation study on the NLPCC dataset to evaluate the effect of each module, as shown in Table 2. Zero-shot spelling error correction (SEC) is a vital step of our framework, so we conduct further experiments to illustrate the effect of this module and list the results in Table 3. Our proposed SEC module greatly increases the number of corrected spelling errors, with 90 more tokens than BART and 97 more tokens than the semantic-enriched BART. During the pre-training of the BART model, input tokens are substituted with [MASK] symbols and new tokens are generated without special constraints. In contrast, our SEC module intentionally masks misspelled tokens and takes phonetic similarity as a constraint when generating new tokens; it therefore corrects more spelling errors and achieves a high precision score.
If semantic feature embeddings are directly added to BART without the SEC module, the number of corrected spelling errors drops from 120 to 113. This is because spelling errors influence word segmentation and thus lead to erroneous POS and semantic class features at the positions of misspelled tokens. In contrast, our high-precision SEC lays a solid foundation for the subsequent semantic information injection. After applying the SEC module, BART + SEC + SemF (semantic features) obtains a larger improvement in model performance.
We also compare our zero-shot SEC module with a BERT model fine-tuned on the spelling error correction dataset SIGHAN-2015 (Tseng et al., 2015a). Our SEC module strictly focuses on correcting spelling errors and achieves high precision. In contrast, besides spelling errors, the fine-tuned BERT model automatically corrects other types of errors, leading to high recall but a low precision score. As the first step of the pipeline, this low precision introduces substantial noise into the subsequent module and thus damages the final performance.

Analysis on POS representations
The part-of-speech feature is closely connected with grammar. In our model, we inject POS embeddings in the encoder and predict the correct POS sequence in the decoder, which enables our model to learn better POS representations.
To investigate the POS representations, we find the nearest neighbours of each POS tag by computing the cosine distance between embedding vectors, and list the results in Table 4. For each POS tag, most of the nearest tokens have the corresponding part-of-speech. This demonstrates that our POS embeddings capture general features of tokens sharing the same part-of-speech, which benefits our model and shows potential for other NLP applications.

In our model, the CRF layer is essential for capturing neighboring POS dependencies in target sequences. We visualize the POS transition matrix learnt by the CRF layer in Figure 4. Interestingly, several grammar rules can be found. For example, a preposition is usually followed by a noun, pronoun or place name, but has a low probability of transiting to punctuation (the end of a sentence). An adjective usually occurs before a noun but seldom connects with a preposition. This POS knowledge enables our model to generate grammatical sentences.
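The nearest-neighbour probe can be sketched as follows; the two-dimensional vectors are toy values, whereas the real embeddings come from the trained model.

```python
# Rank token embeddings by cosine similarity to a (POS) embedding vector.
from math import sqrt

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def nearest_tokens(query_vec, token_vecs, top=2):
    return sorted(token_vecs, key=lambda t: cosine(query_vec, token_vecs[t]),
                  reverse=True)[:top]

# Toy setup: verbs (跑 run, 吃 eat) cluster along one axis, the noun 书 (book)
# along the other; probing with a "verb-like" direction retrieves the verbs.
vecs = {"跑": [0.9, 0.1], "书": [0.1, 0.9], "吃": [0.8, 0.3]}
print(nearest_tokens([1.0, 0.0], vecs))  # ['跑', '吃']
```

The same routine, applied to the learned POS-tag embeddings against all token embeddings, produces tables like Table 4.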

Case Study
Table 5 provides an example output of our SG-GEC model compared with BART. Our SEC module first corrects the spelling error in the sentence; benefiting from this, the seq2seq model subsequently corrects the grammatical error. In contrast, the BART model fails to correct both the spelling and grammatical errors in this sentence. More cases are listed in Appendix F.
We list two cases in Table 6 to show the effect of our semantic class feature. When the object and verb are mismatched (Case 1) or the verb is missing (Case 2), our model can correct these errors thanks to the information provided by the semantic class feature. When meeting rarely used words, for example 罪魁祸首 (chief culprit), the semantic class feature might provide extra information learned from examples containing 主犯 (principal criminal) or 要犯 (important criminal), and help the model replace the verb 了解 (know about) with 了结 (kill).
Related Work

Grammatical Error Correction
Seq2seq generation models and edit label prediction models are the two mainstream approaches to the GEC task. Benefiting from rapid gains in hardware and high-quality datasets, Transformer-based seq2seq models (Junczys-Dowmunt et al., 2018; Katsumata and Komachi, 2020; Kaneko et al., 2020) outperform traditional CNN- and RNN-based structures (Xie et al., 2016; Yuan and Briscoe, 2016; Chollampatt and Ng, 2018). Copy mechanisms and sub-tasks have also been introduced into seq2seq models (Zhao et al., 2019). LaserTagger (Malmi et al., 2019) treats GEC as a text editing task and predicts Keep, Delete and Append_# for each token in the erroneous sentence to represent different edit operations. PIE (Awasthi et al., 2019) and GECToR (Omelianchuk et al., 2020) manually design detailed English-specific labels regarding case and tense. Synthetic data has been generated to enhance model performance (Ge et al., 2018; Grundkiewicz et al., 2019; Lichtarge et al., 2019). Besides the two mainstream structures, ESD-ESC (Chen et al., 2020) first detects erroneous spans and generates correct content only for the annotated spans, and the TtT model (Li and Shi, 2021) directly predicts each token of the correct sentence given the erroneous sentence.

The CGEC task is less addressed. The release of the NLPCC-2018 dataset (Zhao et al., 2018) attracted much attention from participating teams, where the top 3 systems were AliGM (Zhou et al., 2018), YouDao (Fu et al., 2018) and BLCU (Ren et al., 2018). HRG combines a spelling checker, an NMT-based model and a sequence editing model (Hinson et al., 2020); however, the spelling checker in HRG is based on a language model which cannot make full use of context. Zhao and Wang (2020) proposed the data augmentation method MaskGEC, which adds random noise to input sentences dynamically during training. The S2A model combines seq2seq and sequence editing models by combining the prediction probabilities of words and edit labels (Li et al., 2022). Zhang et al. (2022) ensemble seq2seq and sequence editing models via an edit-wise voting mechanism and achieve the state-of-the-art on the NLPCC-2018 dataset.

Chinese Spelling Error Correction
Chinese spelling error correction was first tackled with CRF or HMM models (Tseng et al., 2015b; Zhang et al., 2015). In recent neural network models, phonological and graphic knowledge has been introduced to help detect and correct Chinese spelling errors (Hong et al., 2019; Huang et al., 2021; Cheng et al., 2020). The pre-trained BERT model is also utilized to generate candidate sentences (Hong et al., 2019; Zhang et al., 2020).
Different from these models, we locate possibly misspelled tokens with rules instead of a neural network, and directly choose the homophone of the masked token from the candidates generated by BERT. Knowledge is utilized more explicitly in our module. More importantly, our method is zero-shot, without using any labeled data.

Limitation
Our zero-shot spelling error correction module is specifically designed for the Chinese language. Meanwhile, the POS tagger and the semantic class vocabulary used in the SG-GEC model cannot be directly applied to other languages. To some degree, this makes SG-GEC a language-specific model. We tried to find matching resources for English and conducted experiments on an English GEC dataset; the results are reported in Appendix E. They demonstrate that the main ideas of our work, introducing semantic features after spelling checking and employing the sub-task of POS correction with a CRF layer, can also benefit GEC in other languages.

E Experiment on English GEC dataset
For the English GEC task, following Bryant et al. (2019), we use the Lang-8 Corpus of Learner English (Mizumoto et al., 2011), FCE (Yannakoudakis et al., 2011), NUCLE (Dahlmeier et al., 2013) and W&I+LOCNESS (Bryant et al., 2019) as training data. In Chinese, a token might be incorrectly written as its homophone, while in English, spelling mistakes are usually caused by missing or mis-written letters. Our phonological-knowledge-based zero-shot spelling error correction module therefore cannot be directly applied to English. Spelling errors in English produce out-of-vocabulary words, which makes them easier to detect and correct than in Chinese. Therefore, we simply utilize a spelling checker based on a dictionary and edit distance to substitute for the zero-shot SEC module in SG-GEC.
We use NLTK as the POS tagger. For semantic class knowledge, we could not find exactly matching resources for English, so we design two alternative solutions:

• zero-class-feature: We set the embedding of semantic class features to zero during both training and inference. By introducing the POS feature and the sub-task of POS correction with a CRF layer while setting the semantic class embeddings to zero, our model achieves a 61.82 F0.5 score, which outperforms the BART model.

Semantic class features provided by WordNet also slightly improve the performance of the model. However, WordNet focuses on modeling relations between words instead of a classification of words, so the same-level semantic class features of two different words might differ in granularity. For example, the root hypernym of "people" is "entity.n.01" while the root hypernym of "get" is "get.v.01", which might confuse the model.

The experimental results on the English GEC dataset demonstrate that our proposed SG-GEC model can also benefit GEC in other languages.

F More Case Studies
We list five more cases in Table 12 to demonstrate the effectiveness of our pipeline structure. The BART model easily misses grammatical errors (Case 1, Case 2) or spelling errors (Case 3, Case 4) because it mixes spelling and grammatical error correction together, and it might be misguided by erroneous tokens (Case 5).

Figure 1: An example of an erroneous-correct sentence pair. Black: correct tokens. Orange: erroneous tokens. Red: correct tokens with wrong POS tags.

Figure 2: Overview of our new framework for CGEC, composed of Spelling Error Correction (left part) and Grammatical Error Correction (right part). M refers to the [MASK] symbol in the BERT model.

Figure 3: Results of the zero-shot spelling correction module evaluated on the SIGHAN test dataset.

Figure 4: POS transition matrix in the CRF layer. Darker colour refers to higher transition probability.

Table 1: Performance comparison on the NLPCC-2018 test dataset.

Table 2: Ablation study on the NLPCC dataset.

Table 3: Effect of spelling error correction. Num. refers to the number of corrected spelling error tokens. SEC refers to the zero-shot spelling error correction module. Bsec is a BERT model fine-tuned on the spelling error correction datasets SIGHAN15 and HybridSet. + SemF denotes integrating semantic features.

Table 5: Case study of our model. Red / blue colour refers to correction of spelling / grammatical errors.

Table 6: Case study of our model. Group words refer to words which share the same semantic class as the keyword. Blue / red colour refers to the verb / object. Green colour refers to the modification of the verb or object.

Table 9: Effect of different types of sequence generation as a sub-task. POS pred / Class Lv.k pred refers to employing a sub-task to predict the POS / k-th level semantic class sequence. w/o CRF indicates that the standard cross entropy loss is applied without the CRF.

• Wordnet-class-feature: We use WordNet (https://wordnet.princeton.edu/) to obtain the semantic class features of a word by recursively searching the hypernyms of the word. The numbers of values in the 1st / 2nd / 3rd level semantic classes are 148 / 685 / 9852. We use BART-base (https://huggingface.co/facebook/bart-base) to initialize our model.

Table 10: Experiments on the English GEC dataset.

Table 10 demonstrates that the spelling checker brings little benefit to the fine-tuned BART model on the English GEC task. One reason is that spelling errors in English cause out-of-vocabulary words, which are easy to detect. As shown in Table 11, misspelled out-of-vocabulary words are usually split into BPE units by the BART model.

Table 11: Examples of misspelled words in the English GEC dataset.