Correcting Chinese Spelling Errors with Phonetic Pre-training

Chinese spelling correction (CSC) is an important yet challenging task. Existing state-of-the-art methods either use only a pre-trained language model or incorporate phonological information as external knowledge. In this paper, we propose a novel end-to-end CSC model that integrates phonetic features into a language model by leveraging the powerful pre-training and fine-tuning paradigm. Instead of conventionally masking words with a special token when training the language model, we replace words with their phonetic features and sound-alike words. We further propose an adaptively weighted objective to jointly train error detection and correction in a unified framework. Experimental results show that our model achieves significant improvements on the SIGHAN datasets and outperforms the previous state-of-the-art methods.


Introduction
Spelling errors are common in practice, and such errors are amplified in downstream tasks. Spelling correction is therefore important to many NLP applications such as search optimization (Martins and Silva, 2004; Gao et al., 2010), machine translation (Belinkov and Bisk, 2017), part-of-speech tagging (Van Rooy and Schäfer, 2002; Sakaguchi et al., 2012), etc. Spelling correction requires a comprehensive grasp of word similarity, language modeling, and reasoning, making it one of the most challenging tasks in NLP.
In this paper, we focus on Chinese spelling correction (CSC). Unlike alphabetic languages, Chinese characters cannot be typed without the help of input systems, such as Chinese Pinyin (a pronunciation-based input method) or automatic speech recognition (ASR). Thus typos of similarly pronounced characters occur quite often in Chinese text. According to , 83% of Chinese spelling errors on the Internet result from phonologically similar characters. As illustrated in Figure 1, the character "德(de, German)" is incorrectly typed as one of its homophones, "的(de, of)".

Figure 1: The Chinese character "德(de, German)" is incorrectly typed as its homophone "的(de, of)". The CSC model produces a fluent but incorrect sentence by replacing the character with "英(ying, English)", without considering the phonetic similarity.

Traditional methods of CSC first detect misspelled characters and generate candidates via a language model, and then use a phonetic model or rules to filter wrong candidates (Chang, 1995; Chen et al., 2013; Dong et al., 2016). To improve CSC performance, studies mainly focus on two issues: 1) how to improve the language model (Wu et al., 2010; Dong et al., 2016; Zhang et al., 2020) and 2) how to utilize external knowledge of phonological similarity (Jia et al., 2013; Yu and Li, 2014; Cheng et al., 2020). The language model is used to generate fluent sentences, and the phonetic features prevent the model from producing predictions whose pronunciation deviates from that of the original word. As illustrated in Figure 1, the original wrong sentence contains an incorrect word, "的(de, of)". The CSC model produces a fluent but incorrect sentence by replacing "的(de, of)" with "英(ying, English)". However, the pronunciations of these two words are totally different, because the model ignores phonetic features.
Recent studies tackle the issue using deep neural networks. Hong et al. (2019) used the pre-trained language model BERT (Devlin et al., 2019) to generate candidates and trained a classifier with phonetic features to select the final correction. Wang et al. (2019) considered CSC as a sequence-to-sequence task and generated candidates from a confusion set instead of the entire vocabulary. These methods take phonetic information as external knowledge, but the discrete candidate selection prevents the language model from learning directly via backpropagation. Zhang et al. (2020) proposed an end-to-end CSC model by modifying the mask mechanism of BERT. However, they did not use any phonological information, which is important for exploring word similarity.
In this paper, we propose a novel end-to-end model for Chinese spelling correction. The model incorporates phonetic information into the language model and leverages the pre-training and fine-tuning framework. Concretely, we first modify the learning task of the pre-trained masked language model (Devlin et al., 2019). Rather than replacing characters with the indiscriminate symbol "[MASK]", we mask characters with their pinyin or similarly pronounced characters. This enables the language model to explore the similarity between characters and pinyin. Then we fine-tune on error correction data with a model of two networks: a detection network that predicts the probability of spelling error for each word, and a correction network that generates the correction, taking as input a fusion of the word embedding and pinyin embedding weighted by those probabilities. We jointly optimize the detection and correction networks in a unified framework.
The contributions of this paper are summarized as follows: • We propose a novel end-to-end CSC model that incorporates phonetic features into language representation. The model encodes the Chinese characters and Pinyin tokens in a shared space.
• The integration of phonological information greatly facilitates CSC. Experimental results on the benchmark SIGHAN datasets show that our model outperforms the previous state-of-the-art methods.

Related Work

Earlier work on CSC follows the pipeline of error detection, candidate generation, and candidate selection (Wu et al., 2010; Jia et al., 2013; Chen et al., 2013; Chiu et al., 2013; Xin et al., 2014; Yu and Li, 2014; Dong et al., 2016). These methods mainly employ unsupervised language models and rules to select candidates. With the development of end-to-end networks, some work proposed to optimize the error correction performance directly as a sequence-labeling task with conditional random fields (CRF) (Wu et al., 2018) and recurrent neural networks (RNN) (Zheng et al., 2016; Yang et al., 2017). Wang et al. (2019) used a sequence-to-sequence framework with a copy mechanism to copy the correction results directly from a prepared confusion set for the erroneous words. Cheng et al. (2020) built a graph convolutional network (GCN) on top of BERT (Devlin et al., 2019), with the graph constructed from a confusion set. Zhang et al. (2020) proposed a soft-masked BERT model that first predicts the probability of spelling error for each word, and then uses the probabilities to perform a soft-masked word embedding for correction. However, they did not use any phonetic information.
Our work is most related to Zhang et al. (2020), but with some important differences. We will further discuss this in Section 3.4.

Methods
Formally, the Chinese spelling correction task maps a sequence x^w = (x^w_1, x^w_2, ..., x^w_N), which may contain spelling errors, to a correct sequence ŷ = (ŷ_1, ŷ_2, ..., ŷ_N), where both x^w_i and ŷ_i are Chinese characters. We propose an end-to-end CSC model which consists of two components, detection and correction. The detection module takes x^w as input and predicts the probability of spelling error for each character. The correction module takes as input the combination of the embedding of x^w and that of its corresponding pinyin sequence x^p = (x^p_1, x^p_2, ..., x^p_N), and predicts the correct sequence y. We propose a method to fuse the x^w and x^p embeddings using the probability of spelling error as weights.
Following the pre-train and fine-tune framework, we first pre-train a masked language model, MLM-phonetics, by learning to predict characters from similarly pronounced characters and pinyin. Then in fine-tuning, we jointly optimize the detection and correction modules.
In this section, we first introduce the model architecture (Sec. 3.1), the optimization method (Sec. 3.2), and the pre-training of MLM-phonetics (Sec. 3.3), then summarize the novelty of our method (Sec. 3.4).

Model Architecture

Detection Module Given a source sequence x^w = (x^w_1, x^w_2, ..., x^w_N), the goal of the detection module is to check whether a character x^w_i (1 ≤ i ≤ N) is correct or not. For this labelling problem, we use classes 1 and 0 to label misspelled characters and correct characters, respectively.

We formalize the detection module as follows:

y^d = f_det(E(e^w)),

where e^w = (e^w_1, e^w_2, ..., e^w_N) is the word embedding of x^w, E is a pre-trained encoder, and f_det is a fully-connected layer that maps the sentence representation to a binary sequence y^d = (y^d_1, y^d_2, ..., y^d_N), y^d_i ∈ {0, 1}. We use p^err_i = p(y^d_i = 1 | x^w; θ_d) to denote the probability that character x^w_i is erroneous, where θ_d denotes the parameters of the error detection module.
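As a concrete illustration, the detection module can be sketched in a few lines (a minimal sketch under our own assumptions about shapes; a real implementation would use a trained Transformer encoder rather than random states):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def detect(hidden_states, W, b):
    """f_det as a single fully-connected layer: maps encoder states of
    shape (N, H) to per-character error probabilities p_err."""
    logits = hidden_states @ W + b   # (N,)
    return sigmoid(logits)           # p_err_i: probability x_w_i is erroneous

# toy usage: a 5-character sentence with hidden size 8
rng = np.random.default_rng(0)
states = rng.normal(size=(5, 8))     # stand-in for E(e_w)
p_err = detect(states, rng.normal(size=8), 0.0)
```

The sigmoid output keeps every p^err_i strictly inside (0, 1), which matters later when these values weight the embedding fusion.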
Correction Module The goal of the correction module is to generate correct characters based on the output of the detection module.
We not only use the word embeddings as input, but also use pinyin embeddings to integrate the phonetic information. Concretely, we first generate the pinyin sequence x^p using the PyPinyin tool, obtain the pinyin embedding e^p from the embedding layer, and fuse it with the word embedding e^w by linear combination:

e^m_i = p^err_i · e^p_i + (1 − p^err_i) · e^w_i.

This combination uses the spelling error probability predicted by the detection module as a weight to balance the importance of the semantic feature (character embedding) and the phonetic feature (pinyin embedding). There are two special cases: if p^err_i = 0, indicating that the character x^w_i is detected to be correct, the model uses only its word embedding in e^m; if p^err_i = 1, meaning that the character is detected to be erroneous, the model uses only its pinyin embedding.
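The fused embedding can be sketched as follows (our own minimal sketch; the pinyin sequence itself would come from the PyPinyin tool mentioned above, but here we pass precomputed embeddings):

```python
import numpy as np

def fuse_embeddings(e_w, e_p, p_err):
    """e_m_i = p_err_i * e_p_i + (1 - p_err_i) * e_w_i.

    e_w, e_p: (N, H) character and pinyin embeddings; p_err: (N,)
    spelling-error probabilities from the detection module."""
    w = p_err[:, None]               # broadcast weights over the hidden dim
    return w * e_p + (1.0 - w) * e_w

# the two special cases from the text:
e_w, e_p = np.ones((3, 4)), np.zeros((3, 4))
assert np.allclose(fuse_embeddings(e_w, e_p, np.zeros(3)), e_w)  # correct char
assert np.allclose(fuse_embeddings(e_w, e_p, np.ones(3)), e_p)   # erroneous char
```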
Finally, the correction result y is predicted through a fully-connected layer f_crt:

y = f_crt(E(e^m)),

where E is the shared pre-trained encoder and e^m = (e^m_1, ..., e^m_N) is the fused embedding.

Joint Fine-tuning
There are two objectives for our model: to train the detection parameters and to adjust the detection and correction modules to achieve an optimal balance. We jointly optimize the detection loss L_d and the correction loss L_c:

L_d = −Σ_i log p(ŷ^d_i | x^w; θ_d),

L_c = −Σ_i p(y^d_i | x^w; θ_d) · log p(ŷ_i | x^w, x^p; θ_c),

where θ_d and θ_c are the parameters of the detection and correction modules, respectively, ŷ^d_i is the ground-truth detection result, and y^d_i is the prediction by the detection module; both are binary values of 0 or 1. In particular, the correction loss is the negative log-likelihood weighted by the probability of the detection result, p(y^d_i | x^w; θ_d) ∈ (0.5, 1]. This distinguishes the responsibilities of the two tasks. When the detection module gives a low-confidence prediction, that is, p(y^d_i | x^w; θ_d) approaches 0.5, e^m fuses the semantic features and phonetic features with similar weights. But we hope that the detection module can provide a clear judgement of right or wrong, i.e., p(y^d_i | x^w; θ_d) approaches 1, so that e^m is dominated by either semantic features or phonetic features. In such a case, the correction of erroneous words will not be interfered with by the semantic features in e^m, and vice versa. Therefore, we penalize low-confidence predictions given by the detection module. Concretely, when the probability of the detection result is low, L_c decreases and the model focuses more on optimizing L_d. And when the detection probability is high, the model optimizes L_d and L_c in balance.
The adaptive weighting objective enables us to jointly train our model with the sum of the two loss functions:

L = L_d + L_c.

We compare different weighting strategies with our adaptive weighting in the experiments.
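The adaptive weighted objective can be sketched as below (our own reading of the loss definitions above; the variable names are illustrative):

```python
import numpy as np

def adaptive_joint_loss(det_conf, det_gold_prob, corr_gold_prob):
    """L = L_d + L_c, with L_c weighted by detection confidence.

    det_conf:       (N,) p(y_d_i | x_w), confidence of the *predicted*
                    detection label, in (0.5, 1].
    det_gold_prob:  (N,) probability assigned to the gold detection label.
    corr_gold_prob: (N,) probability the correction head assigns to the
                    gold character.
    """
    L_d = -np.sum(np.log(det_gold_prob))
    # low detection confidence shrinks L_c, shifting focus toward L_d
    L_c = -np.sum(det_conf * np.log(corr_gold_prob))
    return L_d + L_c

# with everything else equal, a confident detector weights the
# correction loss more heavily than an uncertain one
lo = adaptive_joint_loss(np.full(4, 0.51), np.full(4, 0.9), np.full(4, 0.8))
hi = adaptive_joint_loss(np.full(4, 0.99), np.full(4, 0.9), np.full(4, 0.8))
```

This mirrors the behaviour described above: near-0.5 detection probabilities damp the correction term, so gradient pressure concentrates on sharpening detection first.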

Pre-training MLM-phonetics
In this section, we introduce our pre-trained language model, MLM-phonetics, that 1) integrates phonetic features and 2) solves the problems of using standard masked language model in our CSC architecture.
The pre-train and fine-tune framework (Devlin et al., 2019) has proven effective in facilitating downstream NLP tasks including sentence classification, question answering, etc. But the inputs of these tasks follow the same distribution as pre-training, whereas the input sentences in CSC contain errors and thus differ from the pre-training samples. Some work has side-stepped this input divergence by avoiding feeding error sentences directly to pre-trained models. For example, Zhang et al. (2020) use a bidirectional GRU for error detection before a BERT-based correction network.
In order to take advantage of the pre-training technique, we modify the pre-training task. In the pre-training of a standard masked language model (MLM-base), the model is trained to predict 15% randomly selected characters, which are replaced with the [MASK] token, a random character, or themselves at sampling rates of 80%, 10%, and 10%, respectively.
To avoid input divergence and integrate phonetic features, we propose two pre-training replacements: confused-Hanzi and noisy-pinyin. We use Figure 3 to illustrate these replacements: • [MASK] replacement trains the reasoning ability of the language model by restoring masked characters according to the context alone.
• Random-Hanzi replacement trains MLM-base to correct words from random ones (e.g., to predict "得(de)" from "不(bu)"), which is a more difficult task than correcting from similarly pronounced characters. However, due to the different input distribution, this strategy is of little help to CSC.
• Confused-Hanzi replacement trains MLM-phonetics to recover original characters from their commonly confused characters in the confusion set (e.g., to predict "好(hao)" from "豪(hao)"). It provides the model a way to access samples with typos.
• Noisy-pinyin replacement trains MLM-phonetics to predict the original characters from the pinyin of their commonly confused characters in the confusion set (e.g., to predict "得(de)" from "de"). It helps cluster similarly pronounced characters with their corresponding pinyin tokens.
The first three replacements are used for pre-training the standard MLM-base, and the last two are proposed in our method to model the similarity between characters and pinyin tokens. In the pre-training of MLM-phonetics, our data generator randomly chooses 20% of the token positions in the training samples. If the i-th token is chosen, we empirically replace it with (1) the [MASK] token 40% of the time, (2) the noisy-pinyin of this token 30% of the time, and (3) a confused-Hanzi from its confusion set 30% of the time. MLM-phonetics is then trained to predict the original sentence from the sentence with replacements.
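The sampling scheme above can be sketched as a small data generator (our own illustrative sketch; `confusion_set` and `to_pinyin` are hypothetical lookups that a real pipeline would back with a curated confusion set and the PyPinyin tool):

```python
import random

def corrupt(tokens, confusion_set, to_pinyin, rate=0.2, seed=0):
    """Sketch of the MLM-phonetics corruption scheme: sample 20% of the
    positions; replace with [MASK] 40% of the time, noisy-pinyin 30%,
    and a confused-Hanzi 30%.

    confusion_set: dict mapping a character to its confusable characters.
    to_pinyin:     callable mapping a character to its pinyin string.
    """
    rng = random.Random(seed)
    out = list(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= rate:          # leave ~80% of positions untouched
            continue
        r = rng.random()
        if r < 0.4:
            out[i] = "[MASK]"             # [MASK] replacement
        elif r < 0.7:
            out[i] = to_pinyin(tok)       # noisy-pinyin replacement
        else:                             # confused-Hanzi replacement
            out[i] = rng.choice(confusion_set.get(tok, [tok]))
    return out

# toy usage with a tiny hypothetical confusion set
sent = list("他的德语说得很好")
noised = corrupt(sent, {"德": ["得", "的"]}, lambda c: "de" if c in "德的得" else "x")
```

The model is then trained to restore the original sentence from `noised`.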
The two proposed pre-training tasks smooth out the input divergence between pre-training and fine-tuning the CSC model. The confused-Hanzi replacement simulates the input of the detection module, and the two replacements together help the pre-trained model adapt to the fused embedding (Eq. 3).

Novelty of our method
Our method is most related to Zhang et al. (2020), but differs in the following aspects. First, our model combines the embeddings of pinyin and characters to prevent information loss, which resembles the human correction process of predicting the correction from the pronunciation of the problematic word. In contrast, Zhang et al. (2020) have to add a residual connection before emitting the final correction, or the model would forget the phonetic information of the erroneous words after combining their embeddings with that of [MASK].
Second, we share the pre-trained encoder between detection and correction by proposing new pre-training tasks, while Zhang et al. (2020) used an un-pre-trained bidirectional GRU for detection to avoid the input divergence between pre-training and fine-tuning.
Third, we propose an adaptive weighting policy in jointly training the error detection and correction. This policy encourages the model to produce clear detection results, making the fused embedding dominated by either semantic features or phonetic features, which is close to the pre-training task. On the contrary, Zhang et al. (2020) proposed to linearly combine the detection and correction loss with a fixed hyper-parameter.

Experiments
We carry out experiments on the SIGHAN dataset, a benchmark for CSC.

Data Processing
The training set consists of two parts: 1) a pre-training corpus of 0.3 billion Chinese sentences, and 2) a CSC training corpus of 281K sentence pairs. The former is used for pre-training MLM-phonetics, and the latter is used to fine-tune the CSC model initialized with MLM-phonetics.
For the pre-training corpus, we collect a variety of data, such as encyclopedia articles, news, scientific papers, and movie subtitles, from a search engine. The CSC training data used in our experiments is the same as in Wang et al. (2019) and Cheng et al. (2020), including three human-annotated training datasets (Tseng et al., 2015) and an automatically generated dataset.

Model Settings
We compare our method with previous state-of-the-art methods: • FASPell (Hong et al., 2019) first generates candidates for each character in the input sentence through a pre-trained MLM, then uses a filtering model with visual and phonetic similarity features to select the best candidate.
• Pointer Networks (Wang et al., 2019) uses a seq2seq system based on the constraint that each correct word is contained in the confusion set of the erroneous character.
• Soft-Masked BERT (Zhang et al., 2020), for each token in the sentence, linearly combines its embedding with the embedding of [MASK], and predicts the error character based on a fine-tuned masked language model.
• SpellGCN (Cheng et al., 2020) incorporates two similarity graphs into a pre-trained sequence-labeling model via graph convolutional network. The two graphs are derived from a confusion set and correspond to pronunciation and shape similarities.
• ERNIE (Sun et al., 2020) directly finetunes the standard masked language model on the CSC training data.
• MLM-phonetics, our proposed method, uses an end-to-end system based on a pre-trained language model with phonetic features.
Pointer Network uses LSTMs in both the encoder and decoder. All the other methods treat CSC as a sequence tagging problem with a pre-trained 12-layer Transformer as the encoder. FASPell and Soft-Masked BERT use the pre-trained BERT, while ERNIE and MLM-phonetics use the pre-trained ERNIE for initialization. We use sentence-level and character-level F1-scores to evaluate the different systems. At the sentence level, a prediction is considered correct only if all the errors in the sentence are detected or corrected. Therefore, sentence-level evaluation is stricter and results in lower scores. Following Cheng et al. (2020), we use the scripts of Hong et al. (2019) to calculate the sentence-level results. Table 1 shows the detection and correction performance on the three SIGHAN test sets. All the methods provide sentence-level results except Pointer Network, which provides results at the character level.
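For reference, sentence-level correction F1 can be sketched as below (our own simplified sketch; the actual numbers in the paper come from the scripts of Hong et al. (2019)):

```python
def sentence_level_f1(sources, golds, preds):
    """A predicted sentence counts as a true positive only if the model
    changed the source and the result exactly matches the gold sentence."""
    tp = fp = fn = 0
    for src, gold, pred in zip(sources, golds, preds):
        if pred != src:          # the model flagged and edited the sentence
            if pred == gold:
                tp += 1
            else:
                fp += 1
        elif gold != src:        # an erroneous sentence left uncorrected
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

# toy usage: one perfect correction, one missed error, one clean sentence
f1 = sentence_level_f1(["ab", "cd", "ef"], ["ab", "cx", "ey"], ["ab", "cx", "ef"])
```

Because one residual error fails the whole sentence, this metric is strictly harsher than per-character scoring, which explains the lower sentence-level numbers.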

Overall Results
Our method, MLM-phonetics, significantly outperforms the other systems. For example, on SIGHAN15 the detection F1-score improves by 2.5 points (77.7→80.2) and the correction F1-score by 1.6 points (75.9→77.5) compared with the previous best method, SpellGCN. Our method also achieves over 6 points of improvement over ERNIE in correction F1-score, verifying the effectiveness of our pre-training strategy.
All the listed methods except Pointer Network use a pre-trained model for initialization, but only FASPell, SpellGCN, and our method take phonetic information into consideration. FASPell uses isolated phonetic features and language model, which inevitably leads to a performance decline. SpellGCN incorporates phonetic knowledge into the language model by building a graph convolutional network on top of BERT. It proves effective, but the graph is derived from a prepared confusion set, so the performance of the model depends on the completeness of this set. As shown in Table 1, the precision of SpellGCN is close to that of MLM-phonetics, but there is a significant gap in recall. Our method, with the help of additional pinyin tokens, integrates phonetic features into the word embedding, thus increasing the generalization of the model. Soft-Masked BERT corrects sentences without phonetic features; its detection and correction performance is inferior to ours. This may be partly due to the lack of phonological similarity, as well as the difference in model architecture. It is also notable that the training data of our method MLM-phonetics is consistent with that of Pointer Network, Soft-Masked BERT*, SpellGCN, and ERNIE, but Soft-Masked BERT and FASPell use different training data. See Appendix C for some case studies.

Pre-training Tasks
In order to analyze the effect of the three replacement tasks ([MASK], confused-Hanzi, noisy-pinyin) in the pre-training of MLM-phonetics, we compare three models, each pre-trained with only two of the three tasks at equal probability.
The testing curves on SIGHAN15 during fine-tuning are plotted in Figure 4. MLM-phonetics shows the best performance, achieving a correction F1-score of 77.5 in the 7th epoch. However, it is interesting that its performance is inferior to the pre-trained model without (w/o) noisy-pinyin at the beginning. This is caused by the pre-training and fine-tuning discrepancy.
The model w/o noisy-pinyin only learns to predict the original characters from [MASK] and confused-Hanzi in pre-training, so the pinyin embedding is not initialized until fine-tuning. The pinyin embedding can therefore be viewed as noise in the embedding fusion of the fine-tuning stage. Such an embedding is close to the model's pre-training input distribution, thus the pre-trained model w/o noisy-pinyin performs well at the beginning. MLM-phonetics, in contrast, is trained to reconstruct words from either Hanzi embeddings or pinyin embeddings in pre-training, but it needs to predict from a fusion of them in fine-tuning, so it requires longer training to adapt. As training continues, the model benefits from the embedding fusion and finally achieves a 0.6-point improvement (76.9→77.5) over the pre-trained model w/o noisy-pinyin.
Besides, the other two pre-trained models perform relatively poorly. The pre-trained model w/o confused-Hanzi suffers from the input divergence between pre-training and fine-tuning: the model is not trained to correct spelling errors until the fine-tuning stage. The pre-trained model w/o [MASK] performs the worst, which shows the importance of using [MASK] prediction to enhance semantic comprehension.

Balancing the Objectives of Detection and Correction
Next, we explore the impact of the weighting strategy that balances the two objectives in fine-tuning. In our CSC model, both detection and correction are sequence labeling tasks, and we use the detection probability to balance the two, as depicted in Eq. (6). In contrast, Zhang et al. (2020) balance the two tasks with a fixed hyper-parameter λ: λL_uc + (1 − λ)L_d, in which L_uc is the un-weighted negative log-likelihood of correction:

L_uc = −Σ_i log p(ŷ_i | x^w, x^p; θ_c).

The results of the two strategies are shown in Table 2. Our method is generally better than using a fixed hyper-parameter for the combination. Among the three systems with fixed hyper-parameters, the system with λ = 0.8 achieves the highest correction F1-score and the one with λ = 0.5 achieves the best detection F1-score. Note that the detection F1-score is evaluated based on the correction result (i.e., only the corrected characters are regarded as detected), rather than on the prediction of the detection module. Therefore, it is not surprising that the setting λ = 0.2, which spends most of its weight on detection, obtains the worst detection F1-score. This also hints that detection and correction need to be coordinated: setting λ to 0.2 may improve the detection module itself, but a poor correction module will bring down the final detection performance.
Our method, in contrast, balances L_c and L_d dynamically according to the confidence given by the detection module and achieves the best performance. Compared with the fixed hyper-parameter strategy with λ = 0.8, our F1-scores improve by 1.6 points (78.6→80.2) in detection and 0.9 points (76.6→77.5) in correction, indicating the effectiveness of our dynamic balancing strategy in alleviating the imbalance between the two tasks.

Error Analysis
To analyze the prediction errors, we collect the incorrectly predicted samples and classify them into two classes: • Detection Error: the detection module produces a wrong prediction, i.e., y^d_i ≠ ŷ^d_i.
• Correction Error: the detection module produces a correct prediction, but the correction module fails to generate the right character, i.e., y^d_i = ŷ^d_i and y_i ≠ ŷ_i.
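The taxonomy above can be sketched as a simple per-character classifier (our own illustrative sketch; inputs are the predicted and gold detection labels and characters):

```python
def classify_errors(y_d, y_d_gold, y, y_gold):
    """Split wrongly predicted characters into Detection Errors
    (y_d_i != gold detection label) and Correction Errors (detection is
    right but the generated character is wrong)."""
    detection_err, correction_err = [], []
    for i, (d, dg, c, cg) in enumerate(zip(y_d, y_d_gold, y, y_gold)):
        if d != dg:
            detection_err.append(i)
        elif c != cg:
            correction_err.append(i)
    return detection_err, correction_err

# toy usage: position 1 is a detection error, position 2 a correction error
det_e, cor_e = classify_errors([0, 0, 1], [0, 1, 1], ["a", "b", "c"], ["a", "b", "d"])
```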
We summarize the two classes on the SIGHAN15 testset and the proportion of the Detection Error and Correction Error is 83.6% and 16.4%, respectively. This reveals that most of the false predictions are Detection Errors.
We further explore the reason behind the poor detection performance. Is it mainly because many errors cannot be detected (false negatives), or because the detection module incorrectly flags correct characters (false positives)? We decompose the 83.6% of detection errors into these two types and find that false negatives and false positives account for 41.1% and 42.5%, respectively. The proportions of the two error types are almost equal. A possible reason is that some homonyms are indistinguishable, such as "的", "地", and "得". All three characters are pronounced "de", and in many sentences any of these candidates makes sense, phonetically or semantically. This problem has also been noted by Cheng et al. (2020), who apply further fine-tuning to reduce the indistinguishability. In this case, the detection module produces many predictions that differ from the ground-truth results, affecting the detection performance.

Conclusion
In this paper, we propose a novel end-to-end framework for CSC with phonetic pre-training. Inspired by traditional pipeline systems, the model incorporates phonetic information of characters into pre-training. We first pre-train a masked language model with phonetic features to improve the model's ability to understand sentences with misspellings and to model the similarity between characters and pinyin tokens. We then propose an end-to-end framework to integrate detection and correction in one model. Experiments on a benchmark dataset show that our model significantly outperforms the previous state-of-the-art. The CSC model with phonetic features can be used to reduce errors for speech recognition and translation systems. In the future, we plan to apply CSC to more challenging scenarios, such as streaming ASR error correction for automatic simultaneous translation, as well as variable-length correction.

A Datasets
All the datasets we used are listed in Table 3. The pre-training corpus includes 0.3 billion sentences, and the remaining four corpora contain 281K <error, correct> sentence pairs in total. The three SIGHAN datasets are human-annotated and the fourth is automatically generated.

B Difference between BERT and ERNIE
We evaluate the performance of two pre-trained models, BERT (Devlin et al., 2019) and ERNIE (Sun et al., 2020), on the SIGHAN test set. For both, we use the released base version (12 layers with a hidden size of 768). The zero-shot performance is listed in Table 4; in this setup, we directly use the released models for error correction without fine-tuning. ERNIE has a prominent advantage over BERT in both detection and correction. This is caused by a prediction problem of BERT: most of the time, BERT corrects the first character to the period symbol ("。"). We conjecture this stems from a Chinese data pre-processing bug in BERT: when a paragraph is divided into multiple sentences, the ending period of a sentence is placed at the beginning of the next one. As a result, for many sentences BERT incorrectly predicts the beginning character to be the period symbol.
We then fine-tune the two pre-trained models on the 281K CSC training data. Table 5 shows that the performance of the two models is basically the same: the difference between BERT and ERNIE is +0.4, -2.0, and +1.6 on SIGHAN13, SIGHAN14, and SIGHAN15, respectively. Therefore, the difference between BERT and ERNIE after fine-tuning is trivial. The 281K sentence pairs can be downloaded at https://github.com/ACL2020SpellGCN/SpellGCN/tree/master/data/merged. The released model of ERNIE: https://github.com/PaddlePaddle/ERNIE. The released model of BERT: https://github.com/google-research/bert.

C Ablation Study
We compare our method with ERNIE and Soft-Masked BERT trained on the identical datasets shown in Table 3. Table 6 and Table 7 show that MLM-phonetics is better at generating semantically coherent and similar-sounding corrections.
Table 6: An example from the SIGHAN15 test set. Errors are marked in red and correct corrections in blue. All three methods accurately detect the misspelled words, but only MLM-phonetics yields the correct result. ERNIE changes "吵翻" ("chao fan") to the differently pronounced "烤餐" ("kao can"), and Soft-Masked BERT* changes "吵翻" to "炒法" ("chao fa", fried method), which sounds similar but is not as good as "炒饭" (fried rice) in terms of semantic coherence.

Table 7: Another example from the SIGHAN15 test set. Again, MLM-phonetics predicts a phonetically similar and semantically coherent correction, but both ERNIE and Soft-Masked BERT* replace "斑" (ban) with "方" (fang), which is semantically coherent but sounds greatly different.