PLOME: Pre-training with Misspelled Knowledge for Chinese Spelling Correction

Chinese spelling correction (CSC) is a task to detect and correct spelling errors in texts. CSC is essentially a linguistic problem, so the ability to understand language is crucial to this task. In this paper, we propose a Pre-trained masked Language model with Misspelled knowledgE (PLOME) for CSC, which jointly learns how to understand language and correct spelling errors. To this end, PLOME masks the chosen tokens with similar characters according to a confusion set rather than with the fixed token "[MASK]" as in BERT. Besides character prediction, PLOME also introduces pronunciation prediction to learn misspelled knowledge at the phonic level. Moreover, since phonological and visual similarity knowledge is important to this task, PLOME utilizes GRU networks to model such knowledge based on characters' phonics and strokes. Experiments are conducted on widely used benchmarks, where our method outperforms state-of-the-art approaches by a remarkable margin. We release the source code and pre-trained model for further use by the community (https://github.com/liushulinle/PLOME).


Introduction
Chinese spelling correction (CSC) aims to detect and correct spelling errors in texts (Yu and Li, 2014). It is a challenging yet important task in natural language processing, which plays an important role in various NLP applications such as search engines (Martins and Silva, 2004) and optical character recognition (Afli et al., 2016). In Chinese, spelling errors can be divided into two main types: phonological errors and visual errors, caused respectively by the misuse of phonologically similar and visually similar characters. According to Liu et al. (2010), about 83% of errors are phonological and 48% are visual. Figure 1 illustrates examples of such errors. The first case is caused by the misuse of "没(gone)" and "美(beautiful)", which share the same phonics, and the second case by the misuse of "人(human)" and "入(enter)", which have very similar shapes.
Chinese spelling correction is challenging because completely solving it requires human-level language understanding (Zhang et al., 2020). Therefore, language models play an important role in CSC; in fact, one of the mainstream solutions to this task is based on language models (Chen et al., 2013; Yu and Li, 2014; Tseng et al., 2015). Currently, the latest approaches (Zhang et al., 2020; Cheng et al., 2020) build on BERT (Devlin et al., 2019), a masked language model. In these approaches, however, the (masked) language models are pre-trained independently of the CSC task and therefore learn no task-specific knowledge during pre-training, making them sub-optimal for CSC.
Chinese spelling errors are mainly caused by the misuse of phonologically or visually similar characters, so knowledge of the similarity between characters is crucial to this task. Some work leveraged a confusion set, i.e. a set of similar characters, to fuse such information (Wang et al., 2018, 2019; Zhang et al., 2020). However, confusion sets are usually generated by heuristic rules or manual annotation, so their coverage is limited. To circumvent this problem, Hong et al. (2019) computed the similarity based on characters' strokes and phonics. However, the similarity was measured via rules rather than learned by the model, so such knowledge was not fully utilized.
In this paper, we propose PLOME, a Pre-trained masked Language mOdel with Misspelled knowledgE, for Chinese spelling correction. The following characteristics make PLOME more effective than vanilla BERT for CSC. First, we propose the confusion set based masking strategy, in which each chosen token is replaced by a random similar character according to a confusion set rather than by the fixed token "[MASK]" as in BERT; PLOME thus jointly learns semantics and misspelled knowledge during pre-training. Second, the proposed model takes each character's strokes and phonics as input, which enables PLOME to model the similarity between arbitrary characters. Third, PLOME learns misspelled knowledge at both the character and phonic levels by jointly recovering the true character and phonics of masked tokens.
We conduct experiments on the widely used benchmark dataset SIGHAN (Wu et al., 2013; Tseng et al., 2015). Experimental results show that PLOME significantly outperforms all the compared approaches, including the latest Soft-Masked BERT (Zhang et al., 2020) and SpellGCN (Cheng et al., 2020).
We summarize our contributions as follows: (1) PLOME is the first task-specific language model designed for Chinese spelling correction. The proposed confusion set based masking strategy enables our model to jointly learn semantics and misspelled knowledge during pre-training. (2) PLOME incorporates phonics and strokes, which enables it to model the similarity between arbitrary characters. (3) PLOME is the first to model this task at both the character and phonic levels.

Related Work
Chinese spelling correction is a challenging task in natural language processing that plays an important role in many applications, such as search engines (Martins and Silva, 2004; Gao et al., 2010), automatic essay scoring (Burstein and Chodorow, 1999; Lonsdale and Strong-Krause, 2003), and optical character recognition (Afli et al., 2016; Wang et al., 2018). It has been an active topic, and various approaches have been proposed in recent years (Yu and Li, 2014; Wang et al., 2018, 2019; Zhang et al., 2020; Cheng et al., 2020).
Early work on CSC followed the pipeline of error identification, candidate generation and candidate selection. Some researchers focused on unsupervised approaches, which typically adopted a confusion set to find candidates and employed a language model to select the correct one (Chang, 1995; Huang et al., 2000; Chen et al., 2013; Yu and Li, 2014; Tseng et al., 2015). However, these methods failed to condition the correction on the input sentence. In order to model the input context, discriminative sequence tagging methods (Wang et al., 2018) and sequence-to-sequence generative models (Chollampatt et al., 2016; Ji et al., 2017; Ge et al., 2018; Wang et al., 2019) were introduced.
BERT (Devlin et al., 2019) is a bidirectional language model based on the Transformer encoder (Vaswani et al., 2017). It has been demonstrated effective in a wide range of applications, such as question answering (Yang et al., 2019), information extraction, and semantic matching (Reimers and Gurevych, 2019). Recently, it has dominated research on CSC (Hong et al., 2019; Zhang et al., 2020; Cheng et al., 2020). Hong et al. (2019) adopted the DAE-Decoder paradigm with BERT as the encoder. Zhang et al. (2020) introduced a detection network to generate a masking vector for the BERT-based correction network. Cheng et al. (2020) employed a graph convolutional network (GCN) (Kipf and Welling, 2016) combined with BERT to model character interdependence. However, BERT is designed and pre-trained independently of the CSC task and is thus sub-optimal. To improve performance, we propose a task-specific language model for CSC.

Approach
We introduce PLOME and its detailed implementation in this section. Figure 2 illustrates the framework of PLOME. Similar to BERT (Devlin et al., 2019), the proposed model follows the pre-training and fine-tuning paradigm. In the following subsections, we first introduce the confusion set based masking strategy, then present the architecture of PLOME and the learning objectives, and finally show the details of fine-tuning.

Confusion Set based Masking Strategy
In order to train PLOME, we randomly mask some percentage of the input tokens and then recover them. Devlin et al. (2019) replaced the chosen tokens with a fixed token "[MASK]", which is nonexistent in downstream tasks. In contrast, we remove this token and replace each chosen token with a random character that is similar to it. Similar characters are obtained from a publicly available confusion set (Wu et al., 2013), which contains two types of similar characters: phonologically similar and visually similar. Since phonological errors are about twice as frequent as visual errors (Liu et al., 2010), these two types of similar characters are assigned different chances of being chosen during masking. Following Devlin et al. (2019), we mask 15% of the tokens in the corpus in total. In addition, we use a dynamic masking strategy, where the masking pattern is regenerated every time a sequence is fed into the model.

Always replacing chosen tokens with characters from the confusion set would cause two problems. (1) The model would tend to make a correction decision for every input, since all the tokens to be predicted during pre-training would be "misspelled". To circumvent this problem, some percentage of the selected tokens are left unchanged. (2) The size of the confusion set is limited, whereas misspelling in real texts may be caused by the misuse of an arbitrary pair of characters. To improve generalization, we replace some percentage of the chosen tokens with random characters from the vocabulary. To sum up, if the i-th token is chosen, we replace it with (i) a random phonologically similar character 60% of the time, (ii) a random visually similar character 15% of the time, (iii) the unchanged i-th token 15% of the time, or (iv) a random token from the vocabulary 10% of the time. Table 1 presents examples of different masking strategies.

Random Masking: 他想明天浩(hao)南京看奶奶。
Unchanging: 他想明天去(qu)南京看奶奶。

Table 1: Examples of different masking strategies. The chosen token is marked in red, and the corresponding phonics is given in brackets.
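To make the procedure concrete, the following is a minimal sketch of how this masking strategy could be implemented. The confusion dictionaries and vocabulary are hypothetical stand-ins for the Wu et al. (2013) resources; the actual pre-training code may differ.

```python
import random

def mask_token(token, phono_confusion, visual_confusion, vocab):
    """Return a (possibly corrupted) replacement for one chosen token."""
    r = random.random()
    if r < 0.60:    # 60%: phonologically similar character
        return random.choice(phono_confusion.get(token, [token]))
    elif r < 0.75:  # 15%: visually similar character
        return random.choice(visual_confusion.get(token, [token]))
    elif r < 0.90:  # 15%: keep the original token unchanged
        return token
    else:           # 10%: random character from the vocabulary
        return random.choice(vocab)

def dynamic_mask(tokens, phono_confusion, visual_confusion, vocab, rate=0.15):
    """Regenerate the masking pattern each time a sequence is fed to the model."""
    masked, targets = list(tokens), [None] * len(tokens)
    for i in range(len(tokens)):
        if random.random() < rate:  # 15% of tokens are chosen in total
            targets[i] = tokens[i]  # the model must recover the original character
            masked[i] = mask_token(tokens[i], phono_confusion,
                                   visual_confusion, vocab)
    return masked, targets
```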

Embedding Layer
As shown in Figure 2, the final embedding of each character is the sum of its character embedding, position embedding, phonic embedding and shape embedding. The former two are obtained by looking up embedding tables, where the vocabulary size and embedding dimension are the same as those in BERT base (Devlin et al., 2019).

Phonic Embedding In Chinese, phonics (also known as Pinyin) represents the pronunciation of a character as a sequence of lowercase letters with a diacritic. In this paper, we use the Unihan Database to obtain the character-phonics mapping (the diacritic is removed). To model the phonological relationship between characters, we feed the letters of each character's phonics into a 1-layer GRU (Bahdanau et al., 2014) network to generate the phonic embedding, where similar phonics are expected to have similar embeddings. An example is given in the middle part of Figure 3.
Shape Embedding We use the Stroke Order to represent the shape of a character: a sequence of strokes indicating the order in which the strokes of a Chinese character are written, where a stroke is a single movement of a writing instrument on a writing surface. In this paper, stroke data is obtained from the Chaizi Database. To model the visual relationship between characters, the stroke order of each character is fed into another 1-layer GRU network to generate the shape embedding. An example is given in the bottom part of Figure 3.
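A sketch of how the two embeddings could be computed, assuming PyTorch; the symbol vocabulary sizes are illustrative, while the 32-dim symbol embeddings and 768-dim hidden states follow the settings in Section 4.1.

```python
import torch
import torch.nn as nn

class SequenceEmbedder(nn.Module):
    """Encodes a symbol sequence (pinyin letters or strokes) into one vector."""
    def __init__(self, symbol_vocab_size, symbol_dim=32, hidden_dim=768):
        super().__init__()
        self.symbol_emb = nn.Embedding(symbol_vocab_size, symbol_dim, padding_idx=0)
        self.gru = nn.GRU(symbol_dim, hidden_dim, num_layers=1, batch_first=True)

    def forward(self, symbol_ids):
        # symbol_ids: (batch, seq_len) ids of one character's letters or strokes
        x = self.symbol_emb(symbol_ids)
        _, h_n = self.gru(x)   # h_n: (1, batch, hidden_dim)
        return h_n.squeeze(0)  # last hidden state serves as the embedding

# Two separate GRU instances: one over pinyin letters (phonic embedding),
# one over stroke sequences (shape embedding).
phonic_encoder = SequenceEmbedder(symbol_vocab_size=30)  # ~26 letters + specials
shape_encoder = SequenceEmbedder(symbol_vocab_size=40)   # stroke types + specials
```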

Transformer Encoder
The transformer encoder has the same architecture as that in BERT base (Devlin et al., 2019). The number of transformer layers (Vaswani et al., 2017) is 12, the size of the hidden units is 768 and the number of attention heads is 12. For more detailed configurations, please refer to Devlin et al. (2019).

Output Layer
As illustrated in Figure 2, our model makes two predictions for each chosen character.
Character Prediction Similar to BERT, PLOME predicts the original character for each masked token based on the embedding generated by the last transformer layer. The probability of the character predicted for the i-th token in a given sentence is defined as:

$p_c(y_i = j \mid X) = \mathrm{softmax}(W_c h_i + b_c)_j \quad (1)$

where $p_c(y_i = j \mid X)$ is the conditional probability that the true character of the i-th token $x_i$ is predicted as the j-th character in the vocabulary, $h_i$ denotes the embedding output by the last transformer layer for $x_i$, $W_c \in \mathbb{R}^{n_c \times 768}$ and $b_c \in \mathbb{R}^{n_c}$ are the parameters for character prediction, and $n_c$ is the size of the vocabulary.
Pronunciation Prediction Chinese has about 430 distinct pronunciations (represented by phonics) in total but more than 2,500 commonly used characters; thus, many characters share the same pronunciation. Moreover, some pronunciations, such as "jing" and "jin", are so similar that they are easily confused. As a result, phonological errors dominate Chinese spelling errors: in practice, about 80% of spelling errors are phonological (Zhang et al., 2020). In order to learn misspelled knowledge at the phonic level, PLOME also predicts the true pronunciation for each masked token, where the pronunciation is represented by phonics without the diacritic. The probability of the predicted pronunciation is defined as:

$p_p(g_i = k \mid X) = \mathrm{softmax}(W_p h_i + b_p)_k \quad (2)$

where $p_p(g_i = k \mid X)$ is the conditional probability that the correct pronunciation of the masked character $x_i$ is predicted as the k-th phonics in the phonic vocabulary, $h_i$ denotes the embedding output by the last transformer layer for $x_i$, $W_p \in \mathbb{R}^{n_p \times 768}$ and $b_p \in \mathbb{R}^{n_p}$ are the parameters for pronunciation prediction, and $n_p$ is the size of the phonic vocabulary.
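Equations 1 and 2 amount to two linear projections followed by a softmax over the respective vocabularies. A minimal sketch in PyTorch, where the vocabulary sizes are placeholders (21,128 is the standard Chinese BERT vocabulary size; 430 follows the pronunciation count above):

```python
import torch
import torch.nn as nn

hidden_size, n_c, n_p = 768, 21128, 430

char_head = nn.Linear(hidden_size, n_c)  # W_c, b_c in Equation 1
pron_head = nn.Linear(hidden_size, n_p)  # W_p, b_p in Equation 2

h = torch.randn(2, 16, hidden_size)        # last transformer layer output
p_c = torch.softmax(char_head(h), dim=-1)  # per-token character distribution
p_p = torch.softmax(pron_head(h), dim=-1)  # per-token pronunciation distribution
```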

Learning
The learning process is driven by optimizing two objectives, corresponding to character prediction and pronunciation prediction, respectively:

$\mathcal{L}_c = -\sum_i \log p_c(y_i = l_i \mid X)$

$\mathcal{L}_p = -\sum_i \log p_p(g_i = r_i \mid X)$

where $\mathcal{L}_c$ is the objective for character prediction, $l_i$ is the true character for $x_i$, $\mathcal{L}_p$ is the objective for pronunciation prediction, and $r_i$ is the true pronunciation. The overall objective is defined as the sum of the two:

$\mathcal{L} = \mathcal{L}_c + \mathcal{L}_p$
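A sketch of the combined objective, assuming PyTorch; the names are illustrative, and positions that were not chosen for masking are skipped via `ignore_index`.

```python
import torch.nn.functional as F

def plome_loss(char_logits, pron_logits, char_labels, pron_labels):
    """char_logits: (batch, seq, n_c); pron_logits: (batch, seq, n_p);
    labels hold l_i / r_i for chosen tokens and -100 elsewhere."""
    L_c = F.cross_entropy(char_logits.transpose(1, 2), char_labels,
                          ignore_index=-100)
    L_p = F.cross_entropy(pron_logits.transpose(1, 2), pron_labels,
                          ignore_index=-100)
    return L_c + L_p  # overall objective L = L_c + L_p
```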

Fine-tuning Procedure
The subsections above presented the details of the pre-training procedure; in this subsection, we introduce the fine-tuning procedure. PLOME is designed for the CSC task, which aims to detect and correct spelling errors in Chinese texts. Formally, given a character sequence X = {x 1 , x 2 , ..., x n } consisting of n characters, the model is expected to generate a target sequence Y = {y 1 , y 2 , ..., y n }, where errors are corrected.
Training The learning objective is exactly the same as in the pre-training procedure (see Section 3.5). The procedure is similar to pre-training except that: (1) the masking operation introduced in Section 3.1 is eliminated; (2) all input characters need to be predicted, rather than only the chosen tokens as in pre-training.
Inference As illustrated in Section 3.4, PLOME predicts both a character distribution and a pronunciation distribution for each masked token. We define the joint distribution as:

$p_j(y_i = j \mid X) = p_c(y_i = j \mid X) \cdot p_p(g_i = j_p \mid X)$

where $p_j(y_i = j \mid X)$ is the probability that the original character of $x_i$ is predicted as the j-th character, jointly considering the character and pronunciation predictions, $p_c$ and $p_p$ are defined in Equation 1 and Equation 2, respectively, and $j_p$ is the pronunciation of the j-th character. To compute this, we construct an indicator matrix $I \in \mathbb{R}^{n_c \times n_p}$, where $I_{i,j}$ is set to 1 if the pronunciation of the i-th character is the j-th phonics, and 0 otherwise. Then the joint distribution can be computed as:

$p_j = p_c \odot (I p_p)$

where $\odot$ is the element-wise product and $p_c \in \mathbb{R}^{n_c}$, $p_p \in \mathbb{R}^{n_p}$ are the two predicted distributions for a token.
We use the joint probability as the predicted distribution. For each input token, the character with the highest joint probability is selected as the final output: $\hat{y}_i = \arg\max_j p_j(y_i = j \mid X)$. The joint distribution simultaneously takes the character and pronunciation predictions into consideration and is thus more accurate; we verify this in Section 4.5.
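The inference step can be sketched as follows, assuming PyTorch; the toy indicator matrix and distributions are illustrative only.

```python
import torch

def joint_predict(p_c, p_p, I):
    """p_c: (n_c,) character distribution; p_p: (n_p,) pronunciation
    distribution; I: (n_c, n_p) indicator matrix described above."""
    # (I @ p_p)[j] gathers p_p at the pronunciation of character j, i.e. p_p[j_p].
    p_j = p_c * (I @ p_p)     # element-wise product: p_j = p_c ⊙ (I p_p)
    return torch.argmax(p_j)  # character with the highest joint probability

# Toy example: 5 characters, 3 pronunciations.
n_c, n_p = 5, 3
I = torch.zeros(n_c, n_p)
I[0, 1] = I[1, 1] = I[2, 0] = I[3, 2] = I[4, 0] = 1.0  # toy pronunciation map
p_c = torch.softmax(torch.randn(n_c), dim=-1)
p_p = torch.softmax(torch.randn(n_p), dim=-1)
print(joint_predict(p_c, p_p, I))
```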

Experiments
In this section, we present the details of pre-training PLOME and the fine-tuning results on the most widely used benchmark dataset.

Pre-training
Dataset We use wiki2019zh as the pre-training corpus, which consists of one million Chinese Wikipedia pages. Moreover, we also collect three million news articles from a Chinese news platform. We split these pages and articles into sentences and obtain 162.1 million sentences in total. We then concatenate consecutive sentences into text fragments of at most 510 characters, which are used as the training instances.
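A minimal sketch of the fragment construction described above; it is a simplification, and the actual preprocessing may handle over-long sentences differently. The 510-character cap leaves room for the [CLS]/[SEP] tokens in a 512-token input.

```python
def build_fragments(sentences, max_chars=510):
    """Concatenate consecutive sentences into fragments of at most max_chars."""
    fragments, current = [], ""
    for sent in sentences:
        if len(current) + len(sent) <= max_chars:
            current += sent
        else:
            if current:
                fragments.append(current)
            current = sent[:max_chars]  # over-long sentences truncated here
    if current:
        fragments.append(current)
    return fragments
```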
Parameter Settings We denote the dimension of character embeddings, letter (in phonics) embeddings and stroke embeddings as $d_c$, $d_l$ and $d_s$, respectively, and the dimension of the hidden states in the phonic and shape GRU networks as $h_p$ and $h_s$. We set $d_c = 768$, $d_l = d_s = 32$ and $h_p = h_s = 768$. The configuration of the transformer encoder is exactly the same as that in BERT base (Devlin et al., 2019), and the learning rate is set to 5e-5. These parameters are set based on experience because of the large cost of pre-training; better performance might be achieved if a parameter tuning technique (e.g., grid search) were employed. Moreover, instead of training PLOME from scratch, we adopt the parameters of the Chinese BERT released by Google to initialize the transformer blocks.

Fine-tuning
Training Data Following Cheng et al. (2020), the training data is composed of 10K manually annotated samples from SIGHAN (Wu et al., 2013; Tseng et al., 2015) and 271K automatically generated samples from Wang et al. (2018).

Evaluation Data We use the latest SIGHAN test dataset (Tseng et al., 2015), as in Zhang et al. (2020), to evaluate the proposed model; it contains 1,100 texts and 461 error types.
Evaluation Metrics Following previous work (Cheng et al., 2020; Zhang et al., 2020), we use precision, recall and F1 scores as the evaluation metrics. Besides character-level evaluation, we also report sentence-level metrics on the detection and correction sub-tasks. We evaluate these metrics using the script from Cheng et al. (2020) (https://github.com/ACL2020SpellGCN/SpellGCN).
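For illustration, a simplified sketch of sentence-level correction metrics; it approximates, rather than reproduces, the referenced evaluation script, and the inputs are assumed to be aligned lists of source, predicted and gold sentences.

```python
def sentence_level_f1(sources, predictions, golds):
    """Correction-level precision/recall/F1 over whole sentences."""
    tp = fp = fn = 0
    for src, pred, gold in zip(sources, predictions, golds):
        changed = pred != src    # the model edited this sentence
        erroneous = gold != src  # the sentence actually contains errors
        if changed and erroneous and pred == gold:
            tp += 1  # every error fixed, no new errors introduced
        elif changed:
            fp += 1  # edited a correct sentence, or edited it wrongly
        elif erroneous:
            fn += 1  # missed an erroneous sentence
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```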
Parameter Settings Following Cheng et al. (2020), we set the maximum sentence length to 180, the batch size to 32 and the learning rate to 5e-5. All experiments are conducted for 4 runs and the averaged metrics are reported. The code and trained models are publicly available (https://github.com/liushulinle/PLOME).

Baseline Models
We use the following methods for comparison.
Hybrid (Wang et al., 2018) uses a BiLSTM-based model trained on an automatically generated dataset.
PN (Wang et al., 2019) is a Seq2Seq model incorporating a pointer network.
FASPell (Hong et al., 2019) adopts the DAE-Decoder paradigm and employs BERT as the denoising auto-encoder.
SKBERT (Zhang et al., 2020) introduces the soft-masking strategy into BERT to improve the performance of error detection.

SpellGCN (Cheng et al., 2020) combines a GCN network with BERT to model the relationships between characters in the given confusion set.
Besides, we implement a baseline model, cBERT (confusion set based BERT), whose input and encoder layers are the same as those in BERT base (Devlin et al., 2019). Its output layer is similar to PLOME's but only performs the character prediction defined in Equation 1. cBERT is also pre-trained via the confusion set based masking strategy.

Main Results

Table 2 illustrates the performance of the proposed method and the baseline models. The results of recently proposed models are presented in the first group, and the results of pre-trained and fine-tuned models are presented in the second and third groups, respectively. From this table, we observe that:

1) Without fine-tuning, the pre-trained models in the middle group achieve relatively good results and even outperform the supervised approach PN by remarkable gains. This indicates that the confusion set based masking strategy enables our model to learn task-specific knowledge during pre-training.

2) Comparing the fine-tuned models, cBERT outperforms BERT on all metrics. In particular, the F1 scores of the sentence-level evaluations improve by more than 4 absolute points. Such an improvement is remarkable given the large amount of training data (281K texts), which indicates that the proposed masking strategy provides essential knowledge that cannot be learned through fine-tuning alone.
3) With the incorporation of phonic and shape embeddings, PLOME-Finetune outperforms cBERT-Finetune by 2.3% and 2.8% absolute in sentence-level detection and correction, respectively. This indicates that characters' phonics and strokes provide useful information that can hardly be learned from the confusion set.

4) SpellGCN and our approach use the same confusion set from Wu et al. (2013) but adopt different strategies to learn the knowledge contained in it. SpellGCN builds a GCN network to model this information, whereas PLOME learns it from huge-scale data during pre-training. PLOME achieves better performance on all metrics, indicating that our approach is more effective at modeling such knowledge.
Previous work (Wang et al., 2019; Cheng et al., 2020) conducted the character-level evaluation only on positive sentences, i.e., those containing at least one error (sentence-level metrics were evaluated on the whole test set); as a result, the precision scores are very high. The character-level results in Table 2 are evaluated in the same manner for fair comparison. To make a more comprehensive evaluation, we report the results evaluated on the whole test set in Table 3. Moreover, following Cheng et al. (2020), we also report the sentence-level results evaluated by the official SIGHAN tool. We observe that PLOME consistently outperforms BERT and SpellGCN on all metrics.
To make more comprehensive comparisons, we also evaluate the proposed model on SIGHAN13 (Wu et al., 2013) and SIGHAN14. Following Cheng et al. (2020), we perform 6 additional fine-tuning epochs on SIGHAN13, as its data distribution differs from the other datasets. Table 5 illustrates the results, from which we observe that PLOME consistently outperforms all the compared models.

Effects of Prediction Strategy
As illustrated in Sections 3.4 and 3.6, PLOME predicts three distributions for each character: the character distribution $p_c$, the pronunciation distribution $p_p$ and the joint distribution $p_j$. The latter two relate to pronunciation prediction, which is introduced for the first time in this work. In this subsection, we investigate the performance of PLOME with each of them as the final output. Since the CSC task requires character prediction, we only compare the effects of the character prediction $p_c$ and the joint prediction $p_j$.

Table 4 presents the experimental results, from which we observe that the joint distribution outperforms the character distribution on all evaluation metrics; the gap in precision is especially pronounced. The joint distribution simultaneously takes the character and pronunciation predictions into consideration, and thus its predictions are more accurate.

Table 4: The performance of PLOME with the character prediction $p_c$ and the joint prediction $p_j$ as output.

Effects of Initialization Strategy
Generally speaking, the initialization strategy has a great influence on the performance of deep models. In this subsection, we investigate the effects of different initialization strategies in the pre-training procedure. For comparison, we implement four baselines based on cBERT and PLOME. Table 6 illustrates the results, where methods named "*-Rand" initialize all parameters randomly and methods named "*-BERT" initialize the transformer encoder with the BERT released by Google. From the table, we observe that both cBERT and PLOME initialized with BERT achieve better performance; in particular, the recall score improves significantly on all evaluations. We believe two reasons may explain this phenomenon: 1) the rich semantic information in BERT effectively improves the generalization ability; 2) PLOME is composed of two 1-layer GRU networks and a 12-layer transformer encoder, containing more than 110M parameters in total, and training such a large-scale model from scratch easily gets trapped in local optima.

Table 6: The performance of cBERT and PLOME with different initialization strategies. *-Rand denotes that all parameters are randomly initialized; *-BERT denotes that parameters are initialized from BERT.

Phonic/Shape Embedding Visualization
In this subsection, we investigate whether the phonic and shape GRU networks learn meaningful representations of characters. To this end, we generate the phonic and shape embeddings for each character with the GRU networks in Figure 2 and then visualize them. Figure 4 illustrates the 30 characters nearest to '锭' according to the cosine similarity of the 768-dim embeddings generated by the shape GRU network, visualized via t-SNE (Maaten and Hinton, 2008). On one hand, nearly all the characters similar to '锭', such as '啶' and '绽', are included in the figure; on the other hand, similar characters lie very close to each other (labeled by circles). These observations indicate that the learned shape embeddings model shape similarity well. Figure 5 shows the analogous situation for the phonic embeddings related to 'ding', demonstrating their ability to model phonic similarity.
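A sketch of how such a visualization could be produced, assuming NumPy and scikit-learn; the embedding dictionary is a hypothetical stand-in for the trained GRU outputs.

```python
import numpy as np
from sklearn.manifold import TSNE

def nearest_neighbors(query, embeddings, k=30):
    """Return the k characters nearest to `query` by cosine similarity;
    `embeddings` maps characters to their 768-dim GRU embeddings."""
    q = embeddings[query]
    sims = {c: np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v))
            for c, v in embeddings.items() if c != query}
    return sorted(sims, key=sims.get, reverse=True)[:k]

def project_2d(chars, embeddings):
    """Project the selected characters' embeddings to 2D via t-SNE."""
    X = np.stack([embeddings[c] for c in chars])
    return TSNE(n_components=2, perplexity=5).fit_transform(X)
```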

Converging Speed of Various Models
In this subsection, we investigate the converging speed of various models during fine-tuning. Figure 6 shows the test curves of the character-level detection metrics for BERT, cBERT and PLOME. Thanks to the confusion set based masking strategy, cBERT and PLOME learn task-specific knowledge during pre-training, and therefore achieve much better performance than BERT at the beginning of training. As training goes on, the gap gradually narrows during the first 35,000 steps and then remains stable at about 6% (86% vs. 80%). In addition, the proposed model needs far fewer training steps to achieve relatively good performance: PLOME needs only 7K steps to reach a score of 80%, whereas BERT needs 47K steps.

Figure 6: The test curves for character-level detection metrics of various models in the fine-tuning procedure.

Conclusions
We propose PLOME, a pre-trained masked language model with misspelled knowledge for CSC. To the best of our knowledge, PLOME is the first task-specific language model for CSC; it jointly learns semantics and misspelled knowledge thanks to the confusion set based masking strategy. Since previous work demonstrated that phonological and visual similarity between characters is essential to this task, we introduce phonic and shape GRU networks to model such features. Moreover, PLOME is also the first model that makes decisions by jointly considering the predicted character and pronunciation distributions. Experimental results show that PLOME outperforms all the compared models by remarkable margins.