An Error-Guided Correction Model for Chinese Spelling Error Correction

Although existing neural network approaches have achieved great success on Chinese spelling correction, there is still room for improvement. The model is required to avoid over-correction and to distinguish a correct token from its phonologically and visually similar ones. In this paper, we propose an error-guided correction model (EGCM) to improve Chinese spelling correction. Borrowing the powerful ability of BERT, we propose a novel zero-shot error detection method to perform a preliminary detection, which guides our model to attend more to the probably wrong tokens in encoding and to avoid modifying the correct tokens during generation. Furthermore, we introduce a new loss function to integrate the error confusion set, which enables our model to distinguish easily misused tokens. Moreover, our model supports highly parallel decoding to meet real application requirements. Experiments are conducted on widely used benchmarks. Our model outperforms state-of-the-art approaches by a remarkable margin, in both correction quality and computation speed.


Introduction
Chinese spelling correction (CSC) has attracted wide attention in recent years, as it is significant for many real applications, such as search engines (Martins and Silva, 2004), optical character recognition (OCR) (Afli et al., 2016), and automatic speech recognition (ASR) (Hinton et al., 2012).
Given an input sentence with spelling errors, the model is trained to detect and correct these errors and output a correct sentence. According to Liu et al. (2010), phonologically and visually similar characters are the major contributing factors for errors in Chinese text. As shown in Figure 1, in the first example, the error is caused by the misuse of "派" (send) and "拍" (take), which have similar Chinese pronunciations. In the second example, the error is caused by the misuse of "门" (door) and "们" (they), which have similar shapes.
First, given an input sequence, only a small fragment might be misspelled. However, most previous models are totally blind to the errors at the start, so they attend to all tokens equally in encoding and generate every token from left to right during inference. As a result, previous models are inefficient and prone to over-correction: as they obtain a stronger ability to correct errors, they also tend to modify correct tokens by mistake.
Second, the confusion set, in which a set of phonologically and visually similar tokens is defined for each Chinese token, provides valuable knowledge for spelling correction, as shown in Figure 2. But the methodology for using it should be further improved. For example, Liu et al. (2021) propose a confusion set based masking strategy, in which a token is removed and replaced with a random character from its confusion set. As the model randomly chooses a token from the confusion set each time, some tokens might be ignored. Besides, this method cannot pay more attention to the tokens that are more likely to be misused. Wang et al. (2019) propose to generate a character from the confusion set rather than the entire vocabulary. Under this hard restriction, the model cannot generate tokens that are not in the confusion set.
Third, when a CSC model is deployed in real applications, the time cost of inference is a critical concern. However, most previous models try to improve the generation quality but ignore the computation speed.
To address the issues mentioned above, we propose an Error-Guided Correction Model (EGCM) for CSC. Firstly, taking advantage of the strong ability of BERT (Devlin et al., 2018), we propose a novel zero-shot error detection method to perform a preliminary detection, which provides precise guidance signals to the correction model. Following the guidance, our model attends more to the probably wrong tokens in encoding and fixes the probably correct tokens during generation to avoid over-correction. Furthermore, we introduce a new loss function that effectively integrates the confusion set. With this loss function, the model learns to distinguish every similar token in the confusion set from the target token, and the most similar token, which has a high possibility of being misused, is given more attention. To speed up inference, we apply a mask-predict strategy (Ghazvininejad et al., 2019) to support parallel decoding, where the tokens with low generation probability are masked and predicted iteratively.
We conduct extensive experiments on the widely used benchmark dataset SIGHAN (Wu et al., 2013; Yu and Li, 2014; Tseng et al., 2015). Experimental results show that our model significantly outperforms all previous approaches, achieving a new state-of-the-art performance for Chinese spelling correction. Moreover, our model has a distinct speed advantage over other models: it is 6.3 times faster than the standard Transformer and 1.5 times faster than the recent non-autoregressive model TtT (Li and Shi, 2021).
We summarize our contributions as follows: • We propose a novel zero-shot error detection method, which guides the correction model to attend more to the probably wrong tokens in encoding and to fix the probably correct tokens in inference to avoid over-correction.
• We propose a new loss function to take advantage of the confusion set, which enables our model to distinguish similar tokens and attach more importance to the easily misused tokens.
• We apply an error-guided mask-predict decoding strategy for spelling correction, which supports highly parallel decoding and greatly accelerates the computation speed.
• We integrate all modules into a unified model, which achieves a new state-of-the-art performance for both correction quality and inference speed.

Related Work
CSC is a task that detects and corrects wrong tokens in Chinese sentences. It is an active topic, and a variety of approaches have been proposed to tackle it (Wang et al., 2019; Cheng et al., 2020; Li and Shi, 2021; Xu et al., 2021; Liu et al., 2021; Huang et al., 2021). Earlier work in CSC focused mainly on unsupervised methods, which typically adopt a confusion set to find correction candidates and employ a language model to select the correct one (Chen et al., 2013; Yu and Li, 2014). Recently, sequence translation and sequence tagging have become the two most widely used methods in CSC. Wang et al. (2018) treat the CSC task as a sequence labeling problem and use a bidirectional LSTM to predict the correct characters. Liu et al. (2021); Ji et al. (2021); Xu et al. (2021); Lv et al. (2022) try to enrich the representation generated by the encoder by introducing visual and phonetic features; a softmax operation is then utilized to find a substitution for each token in the sentence. With the rapid development of neural machine translation (Vaswani et al., 2017), seq2seq encoder-decoder frameworks have been introduced to the CSC task (Ji et al., 2017; Chollampatt et al., 2016; Wang et al., 2019).
Recent work tends to utilize character similarity as external knowledge. The confusion set, in which similar characters are stored, is widely used (Liu et al., 2021; Zhang et al., 2020; Wang et al., 2019; Yu and Li, 2014; Cheng et al., 2020; Lv et al., 2022). There are several ways of using the confusion set. The first is to augment the training data by replacing the original token with its similar tokens (Liu et al., 2021; Zhang et al., 2020). Wang et al. (2019) propose to generate a character from the confusion set rather than the entire vocabulary. Yu and Li (2014) propose to produce candidates by retrieving the confusion set and then filter them via language models. Cheng et al. (2020) use similarity graphs derived from the confusion set and apply graph convolution operations to absorb the information from neighboring characters in the graph.

Methodology
The proposed Error-Guided Correction Model (EGCM) is illustrated in Figure 3. We apply the conditional masked language model (CMLM) (Ghazvininejad et al., 2019) as the backbone, which is an encoder-decoder architecture trained with a masked language model objective (Devlin et al., 2018; Conneau and Lample, 2019). In the CMLM architecture, the source wrong sentence with n tokens is denoted as X = (x_1, x_2, x_3, ..., x_n), and the target sentence is denoted as Y = (y_1, y_2, y_3, ..., y_n). Several tokens in Y are replaced with [MASK]; these masked tokens construct the set Y_mask, and the rest of the tokens in Y that are unmasked construct the set Y_obs. For Chinese spelling correction, given a source sentence X and the set of unmasked target tokens Y_obs, the objective is to predict the probability P(y|X, Y_obs) and generate a token y for each y ∈ Y_mask.
We first propose a zero-shot spelling error detection method that provides two guidance signals to the correction model, as shown in Figure 4. The first guidance signal is the Guidance Attention Mask, used in the error-focused encoder, in which the probably correct tokens are masked to push our model to attend more to the wrong tokens. The second guidance signal is the Guidance for Inference, which serves as the start of decoding to avoid modifying correct tokens by mistake. Moreover, we introduce a new loss function to take advantage of the confusion set. During inference, we apply an error-guided mask-predict strategy in which the correct tokens are fixed and the probably wrong tokens are masked and re-predicted iteratively.

Zero-shot Error Detection
Given a sentence X = (x 1 , x 2 , x 3 , ..., x n ) that contains n tokens, we want to make a preliminary decision on which tokens are probably wrong and which are correct.
As shown in Figure 4, we first construct an n × n matrix by repeating the original sentence n times, where the k-th token is masked in the k-th row of the matrix (k from 1 to n). Then, we employ BERT (Devlin et al., 2018) to predict each masked position conditioned on the unmasked tokens in the same row. Thus, for each position from x_1 to x_n in the sentence X, we obtain the predicted tokens along with their probabilities. The tokens with the top-k probabilities are selected as modification candidates. If the original token x_i occurs in the candidate list, the token is considered correct. Otherwise, the token is probably wrong and needs to be corrected.
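The detection procedure described above can be sketched as follows. Here `topk_predict` is a stand-in for the real BERT masked-LM scorer (its name and signature are our assumption, not code from the paper):

```python
from typing import Callable, List

def zero_shot_detect(tokens: List[str],
                     topk_predict: Callable[[List[str], int, int], List[str]],
                     k: int = 2) -> List[bool]:
    """Return one flag per token: True means the token is probably wrong.

    topk_predict(masked_row, i, k) stands in for a BERT masked language
    model that returns the k most likely fillers for position i.
    """
    flags = []
    for i in range(len(tokens)):
        # Build the i-th row of the n x n matrix: mask only position i.
        masked_row = tokens[:i] + ["[MASK]"] + tokens[i + 1:]
        candidates = topk_predict(masked_row, i, k)
        # A token is considered correct iff it appears among the top-k
        # candidates predicted for its own position.
        flags.append(tokens[i] not in candidates)
    return flags
```

With a real masked LM, the whole n-row batch can be scored in one forward pass, so the preliminary detection costs a single inference step.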
Based on the output of error detection, we construct two guidance signals, namely the Guidance Attention Mask and the Guidance for Inference, as shown in Figure 4. The Guidance Attention Mask (GAM) is a matrix constructed by: GAM_ij = 0 if x_ij is probably wrong, and GAM_ij = −∞ if x_ij is probably correct, where x_ij denotes the j-th token in the i-th sentence and GAM_ij denotes the element in the i-th row and the j-th column of GAM. The Guidance for Inference (GFI) is constructed by masking all the probably wrong tokens in the original sentence. Further, GAM is projected into the error-focused encoder, and GFI is utilized to initialize the decoder.
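Under the assumption that GAM acts as an additive attention mask (0 keeps a position visible, −∞ hides it), the two guidance signals could be built as below; the function name and data layout are illustrative, not taken from the paper:

```python
from typing import List

MASK = "[MASK]"
NEG_INF = float("-inf")

def build_guidance(batch_tokens: List[List[str]],
                   wrong_flags: List[List[bool]]):
    """Build the Guidance Attention Mask (GAM) and the Guidance for
    Inference (GFI) from zero-shot detection output.

    Assumption: GAM is additive -- 0 keeps a probably-wrong token
    visible, -inf hides a probably-correct one, so self-attention in
    the error-focused encoder concentrates on the wrong tokens.
    """
    # GAM[i][j]: mask value for token j of sentence i.
    gam = [[0.0 if flag else NEG_INF for flag in flags]
           for flags in wrong_flags]
    # GFI: mask the probably-wrong tokens in each original sentence.
    gfi = [[MASK if flag else tok for tok, flag in zip(toks, flags)]
           for toks, flags in zip(batch_tokens, wrong_flags)]
    return gam, gfi
```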

Error-aware Encoder
We adopt the Transformer (Vaswani et al., 2017) as the basis of our encoder.

Integrating Error Confusion Set for Training
During training, the tokens in Y_mask are randomly selected from the target correct sentence, as shown in Figure 3. To better fit the requirements of correcting both single-character errors and multi-character errors in Chinese spelling correction, we adopt two masking strategies, namely mask-separate and mask-range. In mask-separate, we first sample the number of masked tokens from a uniform distribution over [1, len(X)], and then randomly choose that number of tokens. In mask-range, we select l ∈ [2, 3] and randomly select a span of length l. We replace the tokens in Y_mask with a special [MASK] token, which is the generation target of the model. There are three attention blocks in each Transformer decoder layer. After the self-attention block, the decoder first attends to H_s, the representation of the source wrong sentence. Then, the decoder attends to H_ef, the representation of the sentence with correct tokens masked. The output of each decoder layer is then input into the next decoder layer.
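A minimal sketch of the two masking strategies, assuming the targets are token lists and a seeded random generator is passed in (the function names are ours):

```python
import random
from typing import List

MASK = "[MASK]"

def mask_separate(target: List[str], rng: random.Random) -> List[str]:
    """Sample the mask count uniformly from [1, len(target)], then mask
    that many randomly chosen positions (single-character errors)."""
    n = rng.randint(1, len(target))
    positions = set(rng.sample(range(len(target)), n))
    return [MASK if i in positions else t for i, t in enumerate(target)]

def mask_range(target: List[str], rng: random.Random) -> List[str]:
    """Mask one random contiguous span of length 2 or 3
    (multi-character errors)."""
    l = min(rng.choice([2, 3]), len(target))
    start = rng.randint(0, len(target) - l)
    return [MASK if start <= i < start + l else t
            for i, t in enumerate(target)]
```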
where H_0 = Embedding(Y_obs), and Q, K, V represent the Query, Key, and Value matrices. Y_obs is the set of unmasked tokens in the target sentence. The decoder generates the output probability distribution P over the vocabulary V for each of the n positions, where n denotes the sequence length. We optimize the model over every token in Y_mask. Besides the traditional loss function, we introduce a new loss to integrate the confusion set knowledge.
We employ Maximum Likelihood Estimation (MLE) to conduct parameter learning and utilize the negative log-likelihood (NLL) as the loss function, computed as: L_nll = − Σ_{y_i ∈ Y_mask} log P(y_i | X, Y_obs). To make full use of the confusion set knowledge, we introduce a new loss function L_cs. We adopt the confusion set constructed by Lv et al. (2022).
For each token y_i in Y_mask, we find the set of similar tokens of y_i based on the confusion set, namely Y_conf. The tokens in Y_conf are regarded as negative samples of y_i. We use these negative samples to help our model better learn the difference between the target token and its similar ones. The optimization objective for the confusion loss L_cs is defined as: L_cs = − Σ_{y_i ∈ Y_mask} Σ_{y_c ∈ Y_conf} log(1 − P(y_c | X, Y_obs)), where y_c denotes a similar token of y_i in the confusion set.
Overall, the final optimization objective of our model is: L = L_nll + γ · L_cs, where γ is a hyperparameter that balances the two loss functions.
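The combined objective can be sketched as below. The NLL term is the standard masked-LM loss; the exact form of the confusion term is not fully specified in the text, so penalizing log(1 − p) over the negative samples is one plausible instantiation rather than the paper's verified formula:

```python
import math
from typing import Dict, List, Set

def nll_loss(probs: List[Dict[str, float]], targets: List[str],
             masked: Set[int]) -> float:
    """Standard negative log-likelihood over the masked positions."""
    return -sum(math.log(probs[i][targets[i]]) for i in masked)

def confusion_loss(probs: List[Dict[str, float]], targets: List[str],
                   masked: Set[int],
                   confusion: Dict[str, Set[str]]) -> float:
    """Sketch of the confusion loss: push down the probability mass the
    model assigns to every confusable token of each masked target.
    Confusables with higher predicted probability incur a larger
    penalty, so the most easily misused token gets the most attention."""
    loss = 0.0
    for i in masked:
        for y_c in confusion.get(targets[i], set()):
            loss -= math.log(1.0 - probs[i].get(y_c, 0.0))
    return loss

def total_loss(probs, targets, masked, confusion,
               gamma: float = 2.0) -> float:
    """L = L_nll + gamma * L_cs (gamma = 2 per the paper's tuning)."""
    return (nll_loss(probs, targets, masked)
            + gamma * confusion_loss(probs, targets, masked, confusion))
```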

Error-Guided Generation
In the inference stage, we apply a mask-predict approach (Ghazvininejad et al., 2019), in which the tokens with low probability are masked and re-predicted within a constant number of iterations.
To provide the model with a good starting point for generation, we exploit the Guidance for Inference (GFI) as the initialization for decoding. GFI produces a draft sentence in which the probably wrong tokens are masked and the probably correct ones remain unmasked. During generation, the unmasked tokens are fixed, and only the masked tokens are considered for modification in each iteration. Fixing these correct tokens effectively teaches our model to avoid over-correction. Figure 5 shows how our model corrects a wrong sentence in 3 iterations.
The model runs for a pre-determined number of iterations T. The number of [MASK] tokens in the draft sentence is denoted as N_ori. Accordingly, the number of tokens that are masked in the t-th iteration follows a linear decay: N_t = N_ori · (T − t) / T. Y^(0)_mask is the set of masked tokens in the Guidance for Inference. At a later iteration t, we choose the N_t tokens among the masked tokens of the previous iteration t−1 that have the lowest probability scores, where p_i is the probability score of y_i calculated in Equations 13 and 14.
Y^(t)_mask is the set of masked tokens that are probably wrong at the t-th iteration, and Y^(t)_obs is the set of unmasked tokens that are considered correct and fixed in later iterations. At each iteration, the model predicts the probably wrong tokens in Y^(t)_mask conditioned on the source text X and Y^(t)_obs. We select the prediction with the highest probability for each masked token y_i ∈ Y^(t)_mask and update its probability score accordingly: y_i = argmax_{w ∈ V} P(y_i = w | X, Y^(t)_obs) and p_i = max_{w ∈ V} P(y_i = w | X, Y^(t)_obs), where P(y_i = w | X, Y^(t)_obs) is the conditional probability of y_i being predicted as the token w in the vocabulary V.
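The error-guided mask-predict loop can be sketched as follows. Here `predict` stands in for the trained decoder, and the linear-decay schedule N_t = N_ori · (T − t) / T is assumed; tokens left unmasked by the Guidance for Inference are never re-masked:

```python
import math
from typing import Callable, List, Tuple

MASK = "[MASK]"

def mask_predict(source: List[str], gfi: List[str],
                 predict: Callable[[List[str], List[str]],
                                   List[Tuple[str, float]]],
                 T: int = 10) -> List[str]:
    """Error-guided mask-predict decoding (sketch).

    predict(source, draft) stands in for the decoder: it returns one
    (best_token, probability) pair per position.
    """
    fixed = [tok != MASK for tok in gfi]   # GFI-observed tokens stay put
    n_ori = gfi.count(MASK)
    draft = list(gfi)
    scores = [math.inf] * len(gfi)
    for t in range(1, T + 1):
        # Fill every currently masked position with its best prediction.
        for i, (tok, p) in enumerate(predict(source, draft)):
            if draft[i] == MASK:
                draft[i], scores[i] = tok, p
        # Linear decay: re-mask the N_t lowest-probability free tokens.
        n_t = n_ori * (T - t) // T
        if n_t == 0:
            break
        free = sorted((i for i in range(len(draft)) if not fixed[i]),
                      key=lambda i: scores[i])
        for i in free[:n_t]:
            draft[i] = MASK
    return draft
```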

Dataset and Metrics
Training dataset Following Liu et al. (2021), the training data is composed of 10K manually annotated samples from SIGHAN (Wu et al., 2013) and 271K automatically generated samples from Wang et al. (2018). Evaluation dataset Following previous work, the SIGHAN15 test set (Tseng et al., 2015) is used to evaluate the proposed model. For statistics of the datasets, please refer to Appendix A.
Metrics We evaluate model performance on detection and correction at the sentence level, with accuracy, precision, recall, and F1 scores. We compute these metrics using the script from Cheng et al. (2020). Moreover, following Liu et al. (2021), we also report sentence-level results evaluated by the SIGHAN official tool.

Comparing Methods
We compare the performance of our model with several strong baseline methods as follows: Confusionset introduces a copy mechanism into seq2seq and generates characters from the confusion set (Wang et al., 2019). FASPell utilizes a denoising autoencoder to generate candidates (Hong et al., 2019). SpellGCN incorporates phonological and visual knowledge via a graph convolutional network (Cheng et al., 2020). Chunk proposes a chunk-based decoding method with global optimization (Bao et al., 2020). SM BERT uses a soft-masking technique to connect the detection and correction networks (Zhang et al., 2020). TtT employs a Transformer encoder with a Conditional Random Fields layer stacked on top (Li and Shi, 2021). PLOME proposes a confusion set based masking strategy (Liu et al., 2021). REALISE leverages multimodal information and mixes it selectively (Xu et al., 2021). PHMOSpell integrates pinyin and glyph features with a multi-modal method (Huang et al., 2021). ECSpell adopts the Error Consistent masking strategy for pretraining (Lv et al., 2022). MLM-phonetics integrates phonetic features by leveraging pre-training and fine-tuning (Zhang et al., 2021). RoBERTa-DCN generates candidates via a Pinyin Enhanced Generator (Wang et al., 2021). SpellBert employs a graph neural network to introduce visual and phonetic features (Ji et al., 2021). GAD learns the global relationships between the potentially correct input characters and the candidates of potentially erroneous characters (Guo et al., 2021). BERT We also implement classical methods for comparison: we fine-tune the Chinese BERT model (Devlin et al., 2018) on the CGEC corpus directly.

Hyperparameter Setting
We follow most of the standard hyperparameters for Transformers in the base configuration (Vaswani et al., 2017) and follow the weight initialization scheme from BERT (Devlin et al., 2018). For regularization, we use 0.3 dropout and 0.01 L2 weight decay. The hyperparameter γ, which weights the confusion loss, is set to 2 after tuning. The Adam optimizer (Kingma and Ba, 2014) with β = (0.9, 0.999) and ε = 1e-6 is used to conduct parameter learning. The learning rate is set to 5e-5, and the model is trained with learning rate warm-up and linear decay.

Overall Performance
Table 1 reports the performance of our proposed EGCM and the baseline models on the SIGHAN15 test set. For a fair comparison, we also employ the pre-trained model cBERT (Liu et al., 2021), which has the same architecture as BERT and is pre-trained via the confusion set based masking strategy. Our model with pre-trained cBERT (Pre-Tn EGCM) outperforms all existing approaches, achieving an 81.6 F1 on detection and a 79.9 F1 on correction. Compared with the BERT baseline, Pre-Tn EGCM achieves a 5.5% gain on detection F1 and a 6.5% gain on correction F1. Among un-pretrained methods, EGCM also outperforms all competitor models by a wide margin.
We also evaluate model performance using the official tool and report the results in Table 2. Our model Pre-Tn EGCM obtains the best results for both detection and correction. In particular, it greatly outperforms previous methods in precision.
It should be emphasized that our model EGCM is trained on the 270K HybridSet yet outperforms several models that are pre-trained on large-scale synthetic data, such as PLOME (Liu et al., 2021), which is pre-trained on 162 million sentences. This demonstrates that our model effectively learns to correct spelling errors without relying on massive pre-training data. An example output of our EGCM compared with BERT is listed in Appendix B.

Ablation Study
We explore the contribution of each component in our EGCM model by conducting ablation studies with the following settings: (1) removing the error-focused encoder described in Section 3.2; (2) removing the confusion set loss L_cs in Equation 9;
(3) initializing the start sequence of inference with all [MASK] tokens instead of using the Guidance for Inference. The results are shown in Table 3. Specifically, the confusion set loss contributes the biggest improvement to our model, with 4.2 points for detection and 4.1 points for correction. The drop in performance when removing the error-focused encoder indicates that this encoder does learn to pay attention to the probably wrong tokens of the sentence and impels our model to correct the wrong tokens actively. Also, without the Guidance for Inference as the start of decoding, the performance drops especially on precision, which indicates that fixing the tokens that are correct can effectively avoid over-correction and improve precision.

Evaluation on Zero-shot Error Detection
We employ a zero-shot detection approach to perform a preliminary detection, in which all tokens are divided into two groups: the probably wrong tokens and the probably correct ones. In the inference stage, the probably correct tokens are left unmasked and will not be modified, which avoids over-correction, while the probably wrong ones are masked and re-predicted. We want to ensure that the unmasked tokens are truly correct and need no modification, and at the same time, that as many of the errors in the sentence as possible are masked.
As shown in Table 4, P_error&mask/error denotes the percentage of errors that are masked, and P_correct/unmask denotes the percentage of truly correct tokens among the unmasked tokens. In our zero-shot error detection, the BERT-predicted tokens with the top-k probabilities are selected as candidates; if the original token is not in the candidate list, it is considered wrong. We conduct experiments with different values of k. Our method achieves promising results with high accuracy, which guarantees correct signals for further processing. Obviously, the smaller k is, the more tokens are masked and the fewer tokens are fixed, which might lead to over-correction.
We want as many errors as possible to be masked while keeping the total number of masked tokens small. Therefore, in our model, we set k = 2.

Analysis on Different Confusion Sets
To further prove the effectiveness of the proposed confusion loss, and to show that this loss function generalizes, we conduct experiments on three different confusion sets, including the one proposed by Lv et al. (2022) and the two from Wang et al. (2018), as shown in Table 5. For all three confusion sets, our model outperforms the models that utilize the same confusion set. Compared with previous methods, our model takes every token in the confusion set into consideration by computing its probability of being misused. Moreover, the results indicate that our model has strong generalization ability and is not limited to any specific confusion set.

Analysis on Decoding Iterations
With a predefined decoding iteration count T = 10, we show the F1 score at intermediate iterations t (t < T) to illustrate how the mask-predict strategy detects and corrects the wrong tokens step by step. As shown in Figure 6, the F1 scores of detection and correction improve as the decoding iteration increases. This indicates that, by masking and re-predicting the tokens of low probability in each iteration, our model corrects the tokens that were wrongly predicted in previous iterations. And as the number of unmasked tokens increases, more information is given to help predict the hard masked tokens. With 8 iterations, our model achieves state-of-the-art performance.

Analysis on Computing Efficiency
Chinese spelling correction can be applied in many real-life applications, such as writing assistants and search engines. Therefore, the time efficiency of models is a key point to be considered. We run both the baseline models and our model on a single NVIDIA RTX 2080 GPU. Table 6 depicts the time cost per sample of our model compared with some previous approaches. Our model runs faster than all previous approaches. Experimental results show that our model not only achieves superior performance against state-of-the-art approaches but is also cost-saving and green.

Limitation
In this paper, we use the results of zero-shot spelling error detection as a guidance signal. The sentence with probably wrong tokens masked and the other tokens fixed is used as the start of decoding. This means that if a wrong token is not assigned a [MASK] token, it will never be corrected in later iterations. Even though our experiments show that up to 94% of the wrong tokens are masked in the guidance signal, some wrong tokens are still missed by our model. Limiting the number of tokens that are free to be modified is one of our ways to improve precision, but we are also looking forward to a way to further improve recall.

What's more, even though we make full use of the confusion set, we still think that is not enough. We currently use a confusion set in which every token has a set of predefined similar tokens, and these sets of similar tokens are isolated from each other. However, Chinese has various kinds of spelling errors, and the target token might not be in the predetermined similar-token set of the original token; the model can never learn to correct this kind of mistake. We think a better data structure for the confusion set needs to be proposed, in which the sets are not isolated and we are able to calculate the similarity distance between each pair of tokens using particular algorithms, for example, Union-Find on a dynamic graph. This kind of dynamic confusion knowledge can help avoid ignoring probably misused tokens.

Figure 1 :
Figure 1: Examples of Chinese spelling errors. Misspelled characters and their corresponding corrections are marked in red.

Figure 2 :
Figure 2: An example of the confusion set.

Figure 3 :
Figure 3: The architecture of our proposed model. "M" denotes the [MASK] token. The Guidance for Inference and the Guidance Attention Mask are generated from zero-shot error detection, as shown in Figure 4.

Figure 4 :
Figure 4: An example of how our model obtains guidance signals using a zero-shot method. The original wrong token is marked in red.

Figure 5 :
Figure 5: An example of Error-Guided Generation. In the Guidance for Inference, the masked tokens are highlighted. In later iterations, the highlighted tokens are those with the lowest probabilities, which are masked and re-predicted. The wrong tokens are marked in red.

Figure 6 :
Figure 6: Results of different decoding iterations

Figure 7 :
Figure 7: Case study. The wrong tokens are marked in red.
The Guidance Attention Mask is used as an extra attention mask in calculating self-attention in Encoder_ef, which informs the model which erroneous part of the sentence should be focused on. Encoder_ef utilizes the Guidance Attention Mask to expose the probably wrong tokens and divert the attention of our model from the correct tokens. The output of Encoder_s is input into the error-focused encoder Encoder_ef, and the outputs of Encoder_s and Encoder_ef are calculated as H_s and H_ef, respectively.

Table 1 :
Performance on the SIGHAN15 test set. Best results are in bold. The first group lists the models that are not pretrained, and the second group lists the methods that are pretrained (denoted with "*").

Table 2 :
Performance on the SIGHAN15 test set evaluated by the official tool. Best results are in bold.

Table 4 :
An evaluation on Zero-shot error detection.

Table 5 :
Effects of different confusion sets.
and the two confusion sets from Wang et al. (2018). For each confusion set, we compare our model with the models that use the same confusion set in a different way, as shown in Table 5.

Table 6 :
Comparisons of the computing efficiency.