SpellBERT: A Lightweight Pretrained Model for Chinese Spelling Check

Chinese Spelling Check (CSC) aims to detect and correct Chinese spelling errors. Many models utilize a predefined confusion set to learn a mapping between correct characters and their visually or phonetically similar misuses, but the mapping may be out-of-domain. To that end, we propose SpellBERT, a pretrained model with graph-based extra features that is independent of the confusion set. To explicitly capture the two erroneous patterns, we employ a graph neural network to introduce radical and pinyin information as visual and phonetic features. To better fuse these features with character representations, we devise masked-language-model-like pre-training tasks. With this feature-rich pre-training, SpellBERT with only half the size of BERT shows competitive performance and achieves a state-of-the-art result on the OCR dataset, where most of the errors are not covered by the existing confusion set.


Introduction
Chinese Spelling Check (CSC) aims to detect and correct Chinese spelling errors in sentences. It is a nontrivial task because of the nature of this ideographic language: Chinese has a large vocabulary including at least 3,500 common characters, which leads to a huge search space and an unbalanced distribution of errors.
Though it is hard to cover most of the misuses, their patterns can be roughly reduced to visual or phonetic errors (Chang, 1995), as shown in Figure 1. The former type of errors has similar shapes to the correct characters and is often caused by optical character recognition (OCR) or morphology-based input methods. The other type has similar pronunciation to the original characters and is usually caused by automatic speech recognition (ASR) or phonetic-based input methods. Previous work (Hsieh et al., 2013; Yu and Li, 2014; Wang et al., 2019a; Cheng et al., 2020) tends to employ a predefined confusion set to find and filter correction candidates. The confusion set is constructed from statistics of errors (Liu et al., 2010) and provides a mapping between visually similar pairs and phonetically similar pairs, in accord with the erroneous patterns. However, these models only learn a shallow mapping from the confusion set, and their performance depends heavily on its quality. An up-to-date and in-domain confusion set is hard to find.
In this paper, we devise two pre-training tasks to explicitly model the two aforementioned erroneous patterns. To model visual errors, we introduce radical features: Chinese characters can be decomposed into components called radicals. For phonetic errors, we employ pinyin, a description of pronunciation, as features. We fuse these visual and phonetic features with character representations via a relational graph convolutional network (Schlichtkrull et al., 2018). Like the masked language model in BERT (Devlin et al., 2019), we randomly replace some characters and then predict the original visual and phonetic features from the corrupted input. Our model, SpellBERT, can intrinsically learn to correct errors based on visual or phonetic patterns rather than a simple mapping. On the OCR dataset, where only a few errors are covered by the confusion set, we achieve a state-of-the-art result, which indicates that SpellBERT generalizes well without depending on a confusion set.
For deployment in resource-constrained scenarios, a lightweight model is necessary. SpellBERT has only half the size of BERT and is thus more efficient in such scenarios.
In summary, SpellBERT is independent of the confusion set in both the training and inference phases. With only half the size of BERT, SpellBERT shows competitive performance and generalizes well.

Related Work
Current methods consider CSC as either a sequence generation or a sequence labeling problem. Wang et al. (2019b) introduce a copy mechanism to generate the corrected sequence. Bao et al. (2020) unify single-character and multi-character correction with a chunk-based generative model.
Pretrained models (PTMs) have achieved success on sequence labeling tasks. The masked language model (MLM) is introduced as a pre-training task to predict masked or replaced words conditioned on context. This mode of MLM is intuitively well suited to being adapted to predicting and correcting spelling errors, and significant progress has been made by the power of PTMs (Hong et al., 2019). On top of MLM, the confusion set is applied to narrow the search space when predicting correct characters. Cheng et al. (2020) constructed a graph from the confusion set to help the final prediction. Nguyen et al. (2020) proposed an adaptable confusion set, but its training process is not end-to-end.
Ideally, a CSC corpus can be constructed without limit by replacing words based on the confusion set. Wang et al. (2018) generated 270k samples by OCR- and ASR-based approaches. Later work created 5 million augmented samples, and Li et al. (2021) created 9 million by a substitution-based method. Another line of work corrupted input sentences by randomly replacing characters with noisy pinyin, and the resulting pre-training task fit CSC well.
More recently, some methods have also utilized phonetic and visual features for CSC. Liu et al. (2021) employed a GRU (Bahdanau et al., 2014) to encode pinyin sequences and Chinese stroke sequences as extra features. Xu et al. (2021) had a similar design but encoded pictures of characters to obtain visual features. Other work enriched character representations with knowledge from audio and visual modalities. Our method differs from all of these works: for phonetic features, we regard pinyin as a whole rather than as a sequence; for visual features, we use radicals, which are higher-level features than strokes; and we incorporate these extra features via a graph neural network.

Approach
We treat CSC as a sequence labeling problem. An input sequence with n characters is represented as X = {x_1, x_2, · · · , x_n}. Our goal is to transform it into a target sequence Y = {y_1, y_2, · · · , y_n}, during which incorrect characters are detected and corrected. Obviously, the input and output share the same vocabulary, and most of the output characters can be directly copied from the input. The framework of our model is shown in Figure 2. It contains three parts, i.e., a BERT-based encoder, a feature-fusing module, and a component for pre-training. We will progressively elaborate our design in detail.

An MLM-based Backbone
Many attribute the success of BERT (Devlin et al., 2019) to its MLM pre-training task. BERT randomly masks or replaces some tokens and then predicts the original tokens. Regarding the masked and replaced tokens as spelling errors, BERT can be naturally adapted into a spelling checker. Each input character x_i is indexed to its embedding representation e_i by the BERT-embedding-layer. Then e_i is passed through the BERT-encoder-layers to obtain a representation h_i as follows:

e_i = BERT-Embedding-Layer(x_i)    (1)
h_1, · · · , h_n = BERT-Encoder-Layers(e_1, · · · , e_n)    (2)

where e_i, h_i ∈ R^{1×d} and d is the hidden dimension. After that, h_i is compared with all character embeddings to obtain a predicted distribution ŷ_i over the vocabulary as follows:

ŷ_i = softmax(h_i E^⊤)    (3)

where E ∈ R^{V×d}, ŷ_i ∈ R^{1×V}, and V is the vocabulary size. Here E refers to the BERT-embedding-layer, and the i-th row of E corresponds to e_i in Equation 1. Finally, we take as the correction result for x_i the character x_k whose embedding e_k has the highest similarity with h_i.
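The prediction step above can be sketched in PyTorch as follows. This is a minimal illustration with toy sizes and random tensors standing in for the tied embedding table E and the encoder outputs h_i, not the real model:

```python
import torch

torch.manual_seed(0)
V, d, n = 100, 32, 8          # toy sizes; the real model uses BERT's vocabulary and d = 768

E = torch.randn(V, d)         # tied embedding table (the BERT-embedding-layer)
h = torch.randn(n, d)         # encoder outputs h_i for an n-character input

logits = h @ E.t()            # similarity of each h_i with every character embedding
y_hat = torch.softmax(logits, dim=-1)   # predicted distribution over the vocabulary
corrections = y_hat.argmax(dim=-1)      # the character x_k with the highest similarity
```

Because E is tied to the embedding layer, no separate output projection is needed, which keeps the model small.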

Fusing Visual and Phonetic Features
The above backbone lacks task-specific modeling. Chinese spelling errors can be roughly classified into two patterns: visual errors have similar shapes to the correct characters, while phonetic errors have similar pronunciation. Some works utilize an external confusion set that predefines mappings between visually similar pairs and phonetically similar pairs (Yu and Li, 2014; Wang et al., 2019a; Cheng et al., 2020). These models rely on the confusion set to filter candidates, but the confusion set might be out-of-date or out-of-domain.
To model the two erroneous patterns, we infuse the character representations e_i with visual and phonetic features by incorporating radical and pinyin information. Chinese characters can be decomposed into components called radicals, and visual errors often share radicals with the correct character. Pinyin is a sequence of pronunciation descriptions for a Chinese character, and phonetic errors often share pinyin with the correct character. Based on these extra features, our model can automatically learn visually similar and phonetically similar mappings.
We employ a relational graph convolutional network (R-GCN; Schlichtkrull et al., 2018) to infuse multiple types of features into the character representations e_i in Equation 1. We view characters as nodes, so the input sequence X can naturally be organized as a line graph. Both radicals and pinyin are viewed as graph nodes as well. If a radical or pinyin belongs to a certain character, we construct a connection between them as an edge.
We regard these connections as different types depending on the pair of nodes they link. Besides, we construct edges between neighboring characters because local context information is beneficial for better incorporating pinyin and radical features. As a result, we define the following types of edges:
• An edge between a character and a radical
• An edge between a character and a pinyin
• An edge between a character and a neighboring character within a fixed-length context
• An edge between a character and itself
We initialize the feature of a character node with the character embedding e_i in Equation 1. To represent and update the features of radical nodes and pinyin nodes, we also construct an extra embedding table, which is initialized by averaging their most related character embeddings. As shown in Figure 2, these features diffuse over the relational graph as follows:

ê_i = σ( Σ_{r∈R} Σ_{j∈N_i^r} (1 / c_{i,r}) W_r e_j )    (4)

where e_i denotes the character embedding of x_i and e_j the feature of a connected node j; r denotes the edge type; N_i^r refers to the set of nodes connected to i by edges of type r; W_r is the transformation layer of edge type r; and c_{i,r} is a problem-specific normalization constant, which is set to |N_i^r| here. The final ê_i can be viewed as the character representation enhanced by radical and pinyin information. Finally, we combine the enhanced representation with the original character embedding, and Equation 2 is updated as follows:

h_1, · · · , h_n = BERT-Encoder-Layers(e_1 + ê_1, · · · , e_n + ê_n)    (5)

where h_i denotes the final representation of each character.
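The R-GCN propagation step can be sketched in plain PyTorch as below. The graph here is a toy example (one character with one radical node, one pinyin node, one neighbor, plus self-loops), with random features; the model actually uses DGL, so this is only an illustration of the per-node update:

```python
import torch

torch.manual_seed(0)
d = 16                                     # toy feature dimension
num_nodes, num_rels = 6, 4                 # nodes cover characters, radicals, and pinyin

x = torch.randn(num_nodes, d)              # initial node features (character/radical/pinyin)
W = torch.randn(num_rels, d, d)            # one transformation W_r per edge type

# edges[r] lists (source j, destination i) pairs for relation r; illustrative only
edges = {
    0: [(3, 0)],                           # character 0 <- its radical (node 3)
    1: [(4, 0)],                           # character 0 <- its pinyin (node 4)
    2: [(1, 0), (0, 1)],                   # neighboring characters within the context window
    3: [(i, i) for i in range(num_nodes)], # self-loops
}

# c_{i,r} = |N_i^r|: count in-neighbors per (node, relation)
deg = torch.zeros(num_nodes, num_rels)
for r, pairs in edges.items():
    for j, i in pairs:
        deg[i, r] += 1.0

# message passing: sum over relations and neighbors of W_r e_j / c_{i,r}
out = torch.zeros(num_nodes, d)
for r, pairs in edges.items():
    for j, i in pairs:
        out[i] += (W[r] @ x[j]) / deg[i, r]

x_hat = torch.relu(out)                    # enhanced representations ê_i
```

A real implementation would vectorize this with scatter operations (or use DGL's relational graph modules), but the arithmetic per node is the same.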

Enhanced Pretraining Tasks for CSC
It has been shown that external information can be better integrated into BERT by pre-training-like tasks (Peters et al., 2019; Zhang et al., 2019; Ma et al., 2020). Considering that the radical and pinyin features are externally added by design, we devise two more pre-training-like tasks: radical prediction and pinyin prediction. In MLM, Devlin et al. (2019) randomly mask a percentage of input tokens and then predict these tokens. In radical and pinyin prediction, we randomly mask the connections from characters to their radicals and pinyin and then predict the masked connections. Through reconstructing connections, the model can learn better representations that contain not only contextual information but also visual and phonetic information.
As in MLM, we randomly choose 15% of the characters to process. If a character is chosen, our potential practices are as follows:
• Keep it unchanged 10% of the time. Then predict the character itself, its radicals, and its pinyin. This matches downstream fine-tuning, where each character can directly see all of its radicals and pinyin.
• Replace it with [MASK] 60% of the time and mask all of its connections with a probability of 80%. Then predict the masked character and the masked connections.
• Replace it with a confusing character sampled from the confusion set 30% of the time and mask all of its connections with a probability of 80%. Then predict the original character and its connections. This forces our model to correct characters based on the false radicals and pinyin of errors. Note that we use the confusion set only in this stage, to construct misspellings.
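The corruption procedure above can be sketched as follows. The function and variable names (`corrupt`, `confusion_set`) are hypothetical, and the additional 80% masking of radical/pinyin connections is only indicated in a comment:

```python
import random

def corrupt(chars, confusion_set, mask_token="[MASK]"):
    """Corrupt a character sequence following the 15% / (10%, 60%, 30%) scheme."""
    corrupted, targets = list(chars), []
    for i, c in enumerate(chars):
        if random.random() >= 0.15:       # only 15% of characters are processed
            continue
        targets.append((i, c))            # the original character is always a target
        p = random.random()
        if p < 0.10:
            pass                          # 10%: keep unchanged, still predict features
        elif p < 0.70:
            corrupted[i] = mask_token     # 60%: replace with [MASK]
        else:                             # 30%: replace with a confusable character
            corrupted[i] = random.choice(confusion_set.get(c, [c]))
        # In the latter two branches, each radical/pinyin connection of position i
        # would additionally be masked with probability 0.8 (omitted in this sketch).
    return corrupted, targets
```

For example, `corrupt("今天天气", {"天": ["夭"]})` returns a possibly corrupted character list together with the positions and originals to predict.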
In our graph, edges have no representations, and the graph is utilized only between the BERT-embedding-layer and the BERT-encoder-layers. We therefore transform the task of edge prediction into token classification. For each character x_i, we take one of its pinyin and radicals as the ground truth and negatively sample other pinyin and radicals that do not belong to the character. We use the feature embeddings of these pinyin and radicals as a classification layer to compute their similarities with h_i from the BERT-encoder-layers in Equation 2. Related embeddings are drawn close to each other, while unrelated embeddings are pushed apart.
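This edge-prediction-as-token-classification objective can be sketched as follows. The positive and negative feature embeddings below are random stand-ins for the actual radical/pinyin embedding table, and the sizes are illustrative:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d, num_neg = 32, 4

h_i = torch.randn(d)                 # encoder output for character x_i
pos = torch.randn(1, d)              # embedding of one true radical/pinyin of x_i
neg = torch.randn(num_neg, d)        # embeddings of negatively sampled features

candidates = torch.cat([pos, neg], dim=0)   # row 0 is the ground truth
logits = candidates @ h_i                   # similarities used as classification logits
loss = F.cross_entropy(logits.unsqueeze(0), torch.tensor([0]))
# Minimizing this loss pulls the true feature embedding toward h_i
# and pushes the sampled negatives away.
```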

Reducing Parameters
Given the need for computational efficiency in deployment, it is necessary to obtain a lightweight model. We only use 4 layers of BERT to initialize, pre-train, and fine-tune our model, which reduces the total number of parameters from 110M to 55M. We also measure the inference speed of our lightweight model, and the experimental results show that it has better time efficiency than a 12-layer BERT.

Pre-training Setup
We use BERT-base (Devlin et al., 2019) as initialization and utilize only its first 4 layers. Our model is implemented with PyTorch (Paszke et al., 2019) and DGL (Wang et al., 2019c). We randomly select 1M sentences provided by Xu (2019) as the pre-training corpus and pad the sentences to a max length of 128. We set the learning rate to 5e-5 and the batch size to 1024, and pre-train for 10K steps on 4 RTX 3090 GPUs for around 2 days.

Dataset and Fine-tuning Setup
We conduct CSC experiments on three widely used datasets, SIGHAN14, SIGHAN15 (Tseng et al., 2015), and OCR (Hong et al., 2019), and mark them as csc14, csc15, and ocr.
The original corpora of csc14 and csc15 were collected from essays written by learners of Chinese as a foreign language and were in Traditional Chinese. Wang et al. (2019a) and Nguyen et al. (2020) transformed them into Simplified Chinese and used the augmented data provided by Wang et al. (2018). Because our pre-training corpus is in Simplified Chinese, we follow the latter setting and directly use the corpus provided by Cheng et al. (2020). Under this setting, the training sets of csc14 and csc15 and the augmented data provided by Wang et al. (2018) are combined into a new training set. We evaluate our model on the test sets of csc14 and csc15 separately.
ocr is a Simplified Chinese dataset whose sentences are much shorter and are extracted from the entertainment domain. We use only the data from ocr for training and testing; it has 4,575 sentences in total.
For the different datasets, we find the following ranges of hyperparameters work well: the batch size is chosen from {32, 64}, the learning rate from {1e-5, 2e-5, 3e-5}, and the number of epochs ranges from 5 to 20.
On csc14 and csc15, we evaluate our model at the sentence level with the official tool (Tseng et al., 2015). On ocr, the metric is at the edit level, using a different official tool (Wu et al., 2013).

Results and Analysis
Main Results As shown in Table 1, we compare SpellBERT with recent work and a 4-layer BERT baseline. All of them are BERT-based, which means that their numbers of parameters are at least twice as many as ours. However, by fusing pinyin and radical features through the feature-rich pre-training, SpellBERT still achieves the best performance on all three datasets.

Effectiveness of Modules
We also remove the graph and the pre-training stage respectively to test their effectiveness. The results show that pre-training generally brings significant improvements on all datasets, which suggests that pre-training is an effective approach for CSC. The contribution of the graph mechanism is less impressive, but it makes it possible to transfer only our encoder parameters to other architectures.
Impact of Confusion Set Notice that our improvements over previous work are more obvious on ocr than on csc14 and csc15. Firstly, there is inevitable noise when converting the data into Simplified Chinese, and the noise ratio is 1.5% and 0.9% for csc14 and csc15 respectively. The other reason is that previous work such as Nguyen et al. (2020) depends on the confusion set, while most of the errors in ocr are not covered by it. SpellBERT nevertheless improves the performance on this dataset, which indicates that SpellBERT can generalize well across corpora without depending on the confusion set. The ablation studies further demonstrate that our proposed modules help deal with unseen errors.
Efficiency Analysis With only half the number of parameters of a 12-layer BERT, SpellBERT has the best space efficiency among BERT-based works. To verify time efficiency, we measure the absolute time consumption per sentence, following Hong et al. (2019). The results in Table 3 indicate that SpellBERT achieves a speedup of at least 1.5×.

Conclusion
In this work, we propose SpellBERT, a lightweight pretrained model for Chinese spelling check. We incorporate pinyin and radicals as phonetic and visual features and design two pre-training tasks to encourage the pretrained model to explicitly capture erroneous patterns. Experiments show that SpellBERT achieves competitive performance compared to large pretrained models. Moreover, SpellBERT can be used without a confusion set in the fine-tuning and inference phases, which makes it more convenient to use and better able to handle errors not covered by existing confusion sets.