PHMOSpell: Phonological and Morphological Knowledge Guided Chinese Spelling Check

Chinese Spelling Check (CSC) is a challenging task due to the complex characteristics of Chinese characters. Statistics reveal that most Chinese spelling errors are phonological or visual errors. However, previous methods rarely utilize the phonological and morphological knowledge of Chinese characters, or rely heavily on external resources to model their similarities. To address these issues, we propose a novel end-to-end trainable model called PHMOSpell, which improves the performance of CSC with multi-modal information. Specifically, we derive pinyin and glyph representations for Chinese characters from the audio and visual modalities respectively, and integrate them into a pre-trained language model through a well-designed adaptive gating mechanism. To verify its effectiveness, we conduct comprehensive experiments and ablation tests. Experimental results on three shared benchmarks demonstrate that our model consistently outperforms previous state-of-the-art models.


Introduction
Chinese Spelling Check (CSC) is a fundamental task in Chinese Natural Language Processing (NLP), which aims to automatically detect and correct spelling errors in Chinese sentences. These errors typically consist of human writing errors and machine recognition errors produced by automatic speech recognition (ASR) or optical character recognition (OCR) systems. CSC serves as a preliminary component for downstream tasks such as information retrieval (IR) in search engines, and thus significantly affects the final performance of these tasks.
Chinese is an ideographic language that contains numerous characters and has no between-word delimiters. These characteristics make spelling check more difficult than in alphabetical languages such as English. Specifically, for error detection, Chinese words usually consist of several characters and have no clear word boundaries, which makes it impossible to detect spelling errors using an individual word or character alone; they must be placed in a specific sentence to capture contextual semantic information. For error correction, selecting correct candidates from a tremendous character set remains a great challenge. In contrast to English words, which are composed of a small set of alphabet letters, there are more than 10k Chinese characters, and 3.5k of them are frequently used (Wang et al., 2019b). Besides, unlike in English, almost all Chinese spelling errors are real-word errors, meaning the misspelled character is also a valid character in the vocabulary (Kukich, 1992; Jia et al., 2013; Yu and Li, 2014). Since a great number of Chinese characters are similar either in phonology or morphology, they are easily misused for each other. According to Liu et al. (2011), 76% of Chinese spelling errors belong to phonological similarity and 46% belong to visual similarity. Table 1 presents examples of these two common error types. The pronunciation and the shape of a Chinese character can be characterized by its pinyin and its radicals (graphical components, of which there are about 216 in Chinese), respectively.

Table 1: Examples of a p-s (phonological similarity) error and a v-s (visual similarity) error from SIGHAN13 (Wu et al., 2013).

p-s error:
  wrong sentence: 人们必(pinyin: bi4)生去追求的目标。
  ground truth:   人们毕(pinyin: bi4)生去追求的目标。
v-s error:
  wrong sentence: 迎接每一个固(radicals: 古,口)难。
  ground truth:   迎接每一个困(radicals: 木,口)难。

The ground truth of the p-s error means "The goal that people pursue throughout their lives" and the ground truth of the v-s error means "Get prepared for every difficulty".

Previous methods have made attempts to fuse these two types of information into the process of CSC (Jin et al., 2014; Hong et al., 2019; Nguyen et al., 2020). However, pinyin or radicals in these methods were used as external resources or heuristic filters and could not be trained with the model in an end-to-end manner. More recently, Cheng et al. (2020) proposed SpellGCN, which incorporated phonological and morphological similarities into a pre-trained language model via a graph convolutional network (GCN). However, their similarity graphs relied on specific confusion sets. Since confusion sets are unable to cover all characters, SpellGCN can only fuse limited information. Furthermore, they used only a simple aggregation strategy for feature fusion.
To tackle the above issues, we propose a novel framework called PHMOSpell. PHMOSpell incorporates pinyin and glyph features into a pre-trained language model via an adaptive gating module for CSC. These features are derived from intermediate representations of Tacotron2, a dominant model for the text-to-speech (TTS) task, and of VGG19 (Simonyan and Zisserman, 2014) from computer vision (CV). We combine them with the semantic representation from a pre-trained language model via the proposed adaptive gating module, enabling the model to be trained end-to-end. Comprehensive experiments and ablation tests are conducted on three shared benchmarks to prove that the latent representations in our method capture not only semantic but also phonological and morphological information. Experimental results demonstrate that our method outperforms all baseline methods on all three benchmarks.
The contributions of this paper are threefold: 1) We derive both phonological and morphological knowledge of Chinese characters from multiple modalities and apply it to CSC. 2) We design a novel adaptive gating mechanism, which effectively incorporates the multi-modal information into a pre-trained language model in an end-to-end trainable way. 3) We achieve state-of-the-art performance on three benchmark datasets with the proposed model.

Related Work
CSC has received active research attention in recent years. Previous studies on CSC can be divided into three categories: rule-based methods, statistics-based methods and deep-learning-based methods. Mangu and Brill (1997) proposed a rule-based approach for automatically acquiring linguistic knowledge from a small set of easily understood rules. Jiang et al. (2012) arranged a new grammar system of rules to solve both Chinese grammar errors and spelling errors. Xiong et al. (2015)'s HANSpeller was based on an extended HMM, ranker-based models and a rule-based model. Among statistics-based methods, the Noisy Channel Model (Moore, 2000, 2008; Chiu et al., 2014; Noaman et al., 2016; Bao et al., 2020) is the most widely used. Statistics-based methods usually narrowed the candidate choice by utilizing a predefined confusion set (Hsieh et al., 2013; Wang et al., 2019a), which contains a set of similar character pairs. These similar characters were used to replace each other, and language models were leveraged to measure the quality of the modified sentences (Yu and Li, 2014; Xie et al., 2015). More recently, deep learning has achieved excellent results on many NLP tasks, including CSC. Wang et al. (2019a) proposed an end-to-end confusion-set-guided encoder-decoder model, which treated CSC as a sequence-to-sequence task and infused confusion set information via a copy mechanism. FASpell (Hong et al., 2019) employed BERT (Devlin et al., 2019) as a denoising autoencoder (DAE) for CSC. SpellGCN (Cheng et al., 2020) constructed two similarity graphs over the characters in confusion sets and employed a graph convolutional network on these graphs to capture the pronunciation/shape similarities between characters. Soft-Masked BERT (Zhang et al., 2020) was proposed to combine a Bi-GRU-based detection network and a BERT-based correction network, where the former passes its prediction results to the latter using a soft masking mechanism.
Nguyen et al. (2020) applied TreeLSTM (Tai et al., 2015; Zhu et al., 2015) to the tree structure of character radicals to obtain hierarchical character embeddings, which were used as an adaptable filtering component for candidate selection.

Problem Formulation
Generally, CSC can be regarded as a revision task on Chinese sentences. Given a Chinese sentence X = {x_1, x_2, ..., x_n} of length n, the model needs to detect spelling errors at the character level and output the corresponding correct sentence Y = {y_1, y_2, ..., y_n}. Although CSC can be viewed as a kind of sequence-to-sequence (Seq2Seq) task, it differs from other Seq2Seq tasks (e.g., Text Summarization, Machine Translation) in that the input and output sequences are equal in length: most or even all of the characters in the input sequence remain unchanged, and only a few of them need to be corrected.
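The formulation above can be illustrated with a toy example: because input and output are equal in length, detecting errors reduces to a position-wise comparison. The snippet below uses the p-s example from Table 1.

```python
# Toy illustration of the CSC formulation: the input and output
# sequences are equal in length, and only a few positions change.
def detect_errors(x, y):
    """Return the character positions where the correction differs."""
    assert len(x) == len(y), "CSC input and output must be equal in length"
    return [i for i, (xi, yi) in enumerate(zip(x, y)) if xi != yi]

# p-s error example from Table 1: 必 (bi4) should be 毕 (bi4).
wrong   = "人们必生去追求的目标。"
correct = "人们毕生去追求的目标。"

positions = detect_errors(wrong, correct)
print(positions)  # [2] -- only the third character is misspelled
```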

Model
Our model consists of three feature extractor modules and an adaptive gating module used to fuse the extracted features. Figure 1 illustrates the architecture of our model. Given a sentence, our model first extracts a pinyin feature, a glyph feature and a context-sensitive semantic feature for every character, then integrates the three features via the adaptive gating module. Finally, the integrated representation of each character is fed into a fully-connected layer to calculate probabilities over the whole vocabulary, and the character with the highest probability is picked as the substitute.
In the following subsections, we will elaborate the implementation of each module.

Pinyin Feature Extractor
Neural TTS models such as Tacotron2 have achieved high-quality performance in producing natural-sounding synthetic speech. We propose to generate the phonological representations of Chinese characters through a TTS model so that CSC can benefit from realistic pronunciation similarities between characters. In this paper, we leverage Tacotron2, a recurrent sequence-to-sequence mel-spectrogram prediction network, to model the phonological representations, since its location-sensitive attention creates an effective time alignment between the character sequence and the acoustic sequence. When training a Chinese TTS system with Tacotron2, characters are first converted to a pinyin sequence as the phoneme form. The sequence is then embedded by the encoder using an embedding layer, and the hidden representations are consumed by the decoder to predict the corresponding mel spectrogram one frame at a time. Motivated by this, we train Tacotron2 separately on public Chinese female voice datasets with teacher forcing. During training, we utilize the pinyin transcriptions and mel spectrograms as input to learn the pinyin representations. We then extract the pinyin embedding layer of the encoder as our pinyin feature extractor to generate the phonological representations for CSC. Given a Chinese sentence X, our model first converts it to a pinyin sequence using pypinyin. The dense feature sequence F_p = {f^p_1, f^p_2, ..., f^p_n} is then obtained by using the pinyin feature extractor as a lookup table, where f^p_i ∈ R^{d_p} and d_p is the dimension of the pinyin feature.
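The lookup-table step can be sketched as follows. This is purely illustrative: the experimental setup reports 1920 pinyin entries with d_p = 512, but the toy vocabulary, ids and random weights below are hypothetical stand-ins for the embedding weights extracted from Tacotron2's encoder.

```python
import numpy as np

# Sketch of using the trained pinyin embedding layer as a lookup table.
# The vocabulary and random weights are illustrative stand-ins, not the
# trained Tacotron2 encoder weights.
PINYIN_VOCAB = {"ren2": 0, "men5": 1, "bi4": 2}   # hypothetical ids
d_p = 512
rng = np.random.default_rng(0)
pinyin_embedding = rng.normal(size=(len(PINYIN_VOCAB), d_p))

def pinyin_features(pinyin_seq):
    """Map a pinyin sequence (e.g. produced by pypinyin) to features F_p."""
    ids = [PINYIN_VOCAB[p] for p in pinyin_seq]
    return pinyin_embedding[ids]                   # shape (n, d_p)

F_p = pinyin_features(["ren2", "men5", "bi4"])
print(F_p.shape)  # (3, 512)
```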

Glyph Feature Extractor
As Chinese characters are composed of graphical components, it is intuitive that representations for Chinese characters could benefit from the spatial layout of these components. Motivated by Meng et al. (2019)'s and Sehanobish and Song (2019)'s exploration of glyph images for Chinese named entity recognition (NER) and Chinese word segmentation (CWS), we employ a glyph feature extractor to extract glyph features for Chinese characters. We make use of the 8106 Chinese glyph images released by Sehanobish and Song (2019). To take advantage of powerful pre-trained models and avoid training from scratch, VGG19 (Simonyan and Zisserman, 2014) pretrained on ImageNet is adopted as the backbone of the glyph feature extractor. Following Meng et al. (2019), we further finetune it with the objective of recovering character identifiers from glyph images to address domain adaptation. After that, we drop the last classification layer and use the outputs of VGG19's last max-pooling layer as glyph features. For a given sentence X, our glyph feature extractor first retrieves images for its characters and then generates glyph features F_g = {f^g_1, f^g_2, ..., f^g_n}, where f^g_i ∈ R^{d_g} is the glyph feature of the i-th character x_i and d_g is the dimension of the glyph feature.
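The glyph feature dimension reported later (d_g = 25088) follows directly from the VGG19 architecture, assuming the standard 224x224 input resolution used by ImageNet-pretrained VGG19:

```python
# Why d_g = 25088: VGG19 halves the spatial resolution at each of its
# five max-pooling stages, so a standard 224x224 glyph image leaves the
# last pooling layer as a 512-channel 7x7 map. Flattened, that matches
# the glyph feature dimension reported in the experimental setup.
channels, size, num_pools = 512, 224, 5
for _ in range(num_pools):
    size //= 2            # each max pool halves height and width
d_g = channels * size * size
print(size, d_g)  # 7 25088
```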

Semantic Feature Extractor
Beyond the phonological and morphological information, we adopt an empirically dominant pre-trained language model to capture semantic information from context. Following (Hong et al., 2019; Cheng et al., 2020; Zhang et al., 2020), BERT is employed as the backbone of our semantic feature extractor. Given an input sentence X, the extractor outputs the hidden states F_s = {f^s_1, f^s_2, ..., f^s_n} at the final layer of BERT as semantic features, where f^s_i ∈ R^{d_s} and d_s is the dimension of the semantic feature.

Adaptive Gating
Most previous methods for CSC simply used addition or concatenation to fuse different features. However, these fusion strategies ignore the relationship between the features. To tackle this issue, we propose an innovative adaptive gating mechanism that serves as a gate to finely control the fusion of features. It is defined as follows:

G_p = σ(F_p W_p + b_p),  G_g = σ(F_g W_g + b_g),

where W_p ∈ R^{d_p×d_s}, b_p ∈ R^{n×d_s}, W_g ∈ R^{d_g×d_s}, b_g ∈ R^{n×d_s} are parameters to be learned, σ is a nonlinear activation function (a ReLU function in our implementation), and "·" below represents element-wise multiplication. We employ the proposed gating mechanism to control how much information in the pinyin and glyph features is fused with the semantic feature and transferred to the classifier module. The enriched feature F_e ∈ R^{n×d_s} is calculated as follows:

F_e = λ_p (G_p · F_s) + λ_g (G_g · F_s),

where λ_p and λ_g with λ_p + λ_g = 1 are coefficients. Finally, we add a residual connection between F_e and F_s by linear combination:

F_es = F_e + F_s.
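A minimal numpy sketch of the gating computation is given below. It is reconstructed from the shape constraints stated in the text (W_p: d_p×d_s, b_p: n×d_s, ReLU activation, element-wise product with F_s, residual connection), with toy dimensions; it is not the authors' released implementation.

```python
import numpy as np

# Minimal sketch of the adaptive gating module, reconstructed from the
# shapes given in the text; toy dimensions, illustrative only.
def relu(x):
    return np.maximum(x, 0.0)

def adaptive_gating(F_p, F_g, F_s, W_p, b_p, W_g, b_g, lam_p=0.8, lam_g=0.2):
    assert abs(lam_p + lam_g - 1.0) < 1e-9    # coefficients must sum to 1
    gate_p = relu(F_p @ W_p + b_p)            # (n, d_s) gate from pinyin
    gate_g = relu(F_g @ W_g + b_g)            # (n, d_s) gate from glyph
    F_e = lam_p * gate_p * F_s + lam_g * gate_g * F_s  # gated fusion
    return F_e + F_s                          # residual connection -> F_es

n, d_p, d_g, d_s = 4, 8, 16, 12               # toy sizes, not the paper's
rng = np.random.default_rng(0)
F_es = adaptive_gating(
    rng.normal(size=(n, d_p)), rng.normal(size=(n, d_g)),
    rng.normal(size=(n, d_s)),
    rng.normal(size=(d_p, d_s)), np.zeros((n, d_s)),
    rng.normal(size=(d_g, d_s)), np.zeros((n, d_s)),
)
print(F_es.shape)  # (4, 12)
```

Because each gate multiplies the semantic feature element-wise, a gate value near zero suppresses the corresponding semantic dimension, letting the model decide per character how much pinyin or glyph evidence to admit.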

Training
During the training process, the representation F_es is fed into a fully-connected layer for the final classification, which is defined as follows:

Y_p = softmax(F_es W_fc + b_fc),

where W_fc ∈ R^{d_s×V}, b_fc ∈ R^{n×V} are learnable parameters of the fully-connected layer, V is the size of the vocabulary and Y_p is the predicted sentence given the erroneous sentence X. The goal of training is to match the predicted sequence Y_p and the ground-truth sequence Y_g. Overall, the learning process is driven by minimizing the negative log-likelihood of the characters:

L = − Σ_{i=1}^{n} log P(ŷ_i = y_i | X),

where ŷ_i, y_i are the i-th characters of Y_p and Y_g, respectively.
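The classifier and loss above can be sketched in numpy as follows; toy dimensions and random inputs, so this is a reconstruction for illustration rather than the authors' code.

```python
import numpy as np

# Sketch of the final classifier and training objective: a
# fully-connected layer maps F_es to logits over the vocabulary, and the
# loss is the summed negative log-likelihood of the gold character at
# each position.
def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def nll_loss(F_es, W_fc, b_fc, y_gold):
    probs = softmax(F_es @ W_fc + b_fc)     # (n, V) distribution per character
    return -np.log(probs[np.arange(len(y_gold)), y_gold]).sum()

n, d_s, V = 3, 8, 20                         # toy sizes
rng = np.random.default_rng(0)
loss = nll_loss(rng.normal(size=(n, d_s)), rng.normal(size=(d_s, V)),
                np.zeros((n, V)), np.array([1, 5, 7]))
print(loss > 0)  # True
```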

Inference
At inference time, we select the candidate with the highest probability given by the model as the correction for each character. The detection task is accomplished by checking whether the picked candidate differs from the input character.

Datasets
To investigate the effectiveness of our proposed method, we conduct extensive experiments on three shared benchmark datasets for the CSC task. Specifically, we make use of the training datasets from SIGHAN13 (Wu et al., 2013), SIGHAN14 and SIGHAN15 (Tseng et al., 2015). We also include 271K training samples automatically generated by OCR-based and ASR-based methods, as in (Cheng et al., 2020; Nguyen et al., 2020). We employ the test datasets of SIGHAN13, SIGHAN14 and SIGHAN15 for evaluation. Following the same data preprocessing procedure as (Cheng et al., 2020; Nguyen et al., 2020), characters in all SIGHAN datasets are converted to simplified form using OpenCC (https://github.com/BYVoid/OpenCC). We adopt SIGHAN's standard split of training and test data. Detailed statistics of the data are presented in Table 2.

Baseline Methods
We compare our method against several recently proposed methods to investigate the potential of our framework. They are listed below:

• FASPell (Hong et al., 2019): This method employs BERT as a denoising autoencoder to generate candidates for wrong characters and filters out visually/phonologically irrelevant candidates with a confidence-similarity decoder.
• SpellGCN (Cheng et al., 2020): This method learns the pronunciation/shape relationships between characters by applying a graph convolutional network to two similarity graphs. It predicts candidates for corrections by combining graph representations with semantic representations from BERT.

• HeadFilt (Nguyen et al., 2020): This method uses an adaptable filter learned from hierarchical character embeddings to estimate the similarity between characters and filter candidates produced by BERT.
• BERT: This method finetunes BERT with the training data and selects the character with the highest probability for correction.

Evaluation Metrics
We adopt sentence-level metrics for evaluation, which are widely used in previous work on the CSC task. Sentence-level metrics are stricter than character-level metrics, since all errors in a sentence need to be detected and corrected. Accuracy, precision, recall and F1 score are calculated for error detection and error correction, respectively.
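The sentence-level counting described above can be sketched as follows. Exact counting conventions vary slightly across CSC papers, so this follows one common convention for the correction metrics and is illustrative rather than the official scorer.

```python
# Sketch of sentence-level correction metrics: a sentence counts as a
# true positive only when every error in it is fixed. One common
# convention; not the official SIGHAN scorer.
def sentence_level_f1(sources, predictions, golds):
    flagged = erroneous = tp = 0
    for src, pred, gold in zip(sources, predictions, golds):
        if pred != src:
            flagged += 1            # the model modified this sentence
        if gold != src:
            erroneous += 1          # the sentence actually has errors
            if pred == gold:
                tp += 1             # fully and correctly corrected
    precision = tp / flagged if flagged else 0.0
    recall = tp / erroneous if erroneous else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# One erroneous sentence, fully corrected -> perfect scores.
p, r, f1 = sentence_level_f1(["人们必生"], ["人们毕生"], ["人们毕生"])
print(p, r, f1)  # 1.0 1.0 1.0
```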

Experimental Setup
Our model is implemented based on Hugging Face's PyTorch implementation of transformers. We initialize the weights of the semantic feature extractor using bert-base-chinese and the weights of the glyph feature extractor using the pretrained VGG19 from the torchvision library. Weights of the adaptive gating module are randomly initialized. We train our model using the AdamW optimizer for 5 epochs with learning rate 1e-4. The batch size is 64 for training and 32 for evaluation. The best (λ_p, λ_g) are (0.6, 0.4) for SIGHAN13 and (0.8, 0.2) for SIGHAN14 and SIGHAN15. We train Tacotron2 using its open-source implementation for 130k steps with default parameters, except that the decay step is set to 15000. Our pinyin vocabulary contains 1920 entries and the dimension of the pinyin feature is 512. Characters in the 8106 glyph images are written in the Hei Ti font. We finetune VGG19 on the glyph images for 50 epochs with a batch size of 32 and a learning rate of 5e-4. The dimension of the glyph feature is 25088. All experiments are conducted on 2 Tesla V100 GPUs with 16GB memory each.

Main Results

Our method outperforms all baseline methods and achieves new state-of-the-art performance on all three datasets. Compared with the best baseline method (HeadFilt), the improvements of our method are 1.0%, 5.0% and 2.9% on detection-level F1 and 0.5%, 3.7% and 1.6% on correction-level F1, respectively, which verifies the effectiveness of our method. We observe that our method substantially outperforms SpellGCN on precision and F1 scores, which indicates that our method is superior to SpellGCN in fusing similarity knowledge. Although SpellGCN incorporates such knowledge, it relies on a predefined confusion set, which limits its generalization. Firstly, similarity knowledge cannot be obtained adequately since the confusion set is limited and unable to cover all characters. Secondly, the confusion set is manually constructed and has no gold standard, which may bring about cascading errors.
Our method also achieves better F1 scores than HeadFilt, apparently because HeadFilt only leverages morphological knowledge in its post-filtering component. Finally, our method consistently beats vanilla BERT on all three datasets in terms of all metrics, which demonstrates the importance of incorporating phonological and morphological knowledge into the semantic space for the CSC task.

Ablation Study
To study the effectiveness of each component in our method, we carry out ablation tests on the three datasets. All ablation experiments involving the pinyin and glyph features use equal weights for the two features (λ_p = λ_g) to avoid any unnecessary bias they might introduce. Table 4 presents the results. First, replacing adaptive gating with a simple aggregation strategy leads to worse performance for both detection and correction, which demonstrates the benefit of using adaptive gating. We then remove the pinyin feature extractor or the glyph feature extractor from the model. The performance degrades more when removing the pinyin feature than when removing the glyph feature, which implies that phonological information is more crucial for CSC. This is consistent with the finding that most Chinese spelling errors are caused by phonological similarity (Liu et al., 2011). The result degrades further when removing both features and the adaptive gating module, and this trend intuitively indicates that both phonological and morphological information contribute to the final performance.

Effect of Hyper Parameters
In this subsection, we conduct experiments to analyze the effect of the feature weights and of the dimension of the pinyin feature. Figure 2 shows how different weights influence the performance of the model. In this comparison, the value of λ_p (λ_g) is varied from 0.0 (1.0) to 1.0 (0.0) in steps of 0.2. We plot the detection-level and correction-level F1 scores on the three datasets in Figure 2. The results consistently show that our model performs better when λ_p is set larger (e.g., 0.6 for SIGHAN13 and 0.8 for SIGHAN14 and SIGHAN15).

The previous ablation tests show that the pinyin feature has more influence on the performance than the glyph feature. We therefore further experiment with varying the dimension of the pinyin feature, since it directly impacts the quality of the feature. Figure 3 (correction-level F1 score (%) w.r.t. the dimension of the pinyin feature) shows that larger dimensions generally perform better. However, it should be noted that the performance degrades when the dimension exceeds 512. This is reasonable due to the bias-variance phenomenon explained in (Yin and Shen, 2018): a feature with a small dimensionality cannot capture all possible pinyin relations (high bias), while a feature with a large dimensionality includes too much noise (high variance). One must make a trade-off in dimensionality selection to obtain high-quality features.

Features Visualization
To understand the effectiveness of our features more intuitively, we reduce the features from a high-dimensional space to a low-dimensional space and visualize some of them using t-SNE (Van der Maaten and Hinton, 2008). Figure 4 illustrates the embeddings of pinyin whose initials begin with "d", "f", "h" and "j". One can see from the figure that the embeddings form several clusters based on their pronunciations. Pinyin embeddings with more similar pronunciations (e.g., "fu4" and "hu2") are closer in distance than dissimilar ones (e.g., "hu2" and "dao4"). This suggests that the model has learned an alignment between the pinyin feature and the realistic acoustic feature. We also plot glyph embeddings of characters with the radicals "口" and "土" on the left side and characters with the radical "口" on the outside in Figure 5. They show the same trend as the pinyin embeddings. Above all, this further verifies the effectiveness of the phonological and morphological knowledge derived from multiple modalities.

Figure 4: The scatter of pinyin that are similar in pronunciation. Pinyin whose initials begin with "d", "f", "h", "j" are shown in red, purple, blue, orange, respectively.

Figure 5: The scatter of characters that are similar in shape. Characters with "口" and "土" on the left side are shown in red and orange; characters with "口" on the outside are shown in blue.
Error Analysis

We also manually analyze the error cases of our model on the test datasets and find two common types of errors. One type is continuous errors, where several consecutive characters in a sentence are wrong. For example, in the sentence "...他们有时候，有一点捞到...", "捞到 (caught)" constitutes continuous errors and should be "唠叨" (the correct sentence means "Sometimes they are a little nagging"). The model fails to correct such continuous errors since the meaning of the whole sentence is more heavily disturbed. The other type of errors requires strong external knowledge to correct. For instance, "心智 (mind)" in the poem "...天将降大任于斯人也，必先苦其心智，劳其筋骨... (...When Heaven is going to give a great responsibility to someone, it will first fill his mind with suffering, toil his sinews and bones...)" is erroneous but semantically plausible in Chinese. The model is still unable to correct it to "心志 (mind)" because it lacks knowledge of the poem.