Read, Listen, and See: Leveraging Multimodal Information Helps Chinese Spell Checking

Chinese Spell Checking (CSC) aims to detect and correct erroneous characters in user-generated Chinese text. Most Chinese spelling errors arise from misusing characters that are semantically, phonetically, or graphically similar to the correct ones. Previous work noticed this phenomenon and tried to exploit such similarity for the task, but relied on either heuristics or handcrafted confusion sets to predict the correct character. In this paper, we propose a Chinese spell checker called ReaLiSe that directly leverages the multimodal information of Chinese characters. The ReaLiSe model tackles the CSC task by (1) capturing the semantic, phonetic, and graphic information of the input characters, and (2) selectively mixing the information in these modalities to predict the correct output. Experiments on the SIGHAN benchmarks show that the proposed model outperforms strong baselines by a large margin.


Introduction
The Chinese Spell Checking (CSC) task aims to identify erroneous characters and generate candidates for correction. It has attracted much research attention due to its fundamental and wide applications, such as search query correction (Martins and Silva, 2004; Gao et al., 2010), optical character recognition (OCR) (Afli et al., 2016), and automatic essay scoring (Dong and Zhang, 2016). Recently, rapid progress (Zhang et al., 2020; Cheng et al., 2020) has been made on this task thanks to the success of large pretrained language models (Devlin et al., 2019). In alphabetic languages such as English, spelling errors often occur because one or more characters are wrong, so the written word falls outside the dictionary (Tachibana and Komachi, 2016). In contrast, any Chinese character that can be typed in a computer system is valid, so Chinese spelling errors are de facto misused characters in the context of computer-based language processing. Considering the formation of Chinese characters, a few of them were originally pictograms or phono-semantic compound characters (Jerry, 1988). Thus, in Chinese, spelling errors are not only misused characters with confusing semantic meaning, but also characters that are phonetically or graphically similar to the correct ones (Liu et al., 2010, 2011). Table 1 shows two examples of Chinese spelling errors. In the first example, phonetic information of "平" (flat) is needed to get the correct character "瓶" (bottle), since they share the same pronunciation "píng". The second example needs not only phonetic but also graphic information of the erroneous character "轻" (light).

Table 1: Two examples of Chinese spelling errors and their candidate corrections. "Sent./Cand./Trans." are short for sentence/candidates/translation respectively. The wrong/candidate/correct characters with their pronunciation and translation are in red/orange/blue color.
The correct one, "经" (go), has the same right radical as "轻" and similar pronunciation ("qīng" and "jīng"). Therefore, considering the intrinsic nature of Chinese, it is essential to leverage the phonetic and graphic knowledge of the Chinese characters along with the textual semantics for the CSC task.
In this paper, we propose REALISE (Read, Listen, and See), a Chinese spell checker which leverages the semantic, phonetic and graphic information to correct the spelling errors. The REALISE model employs three encoders to learn informative representations from textual, acoustic and visual modalities. First, BERT (Devlin et al., 2019) is adopted as the backbone of the semantic encoder to capture the textual information. For the acoustic modality, Hanyu Pinyin (pinyin), the romanization spelling system for the sounds of Chinese characters, is used as the phonetic features. We design a hierarchical encoder to process the pinyin letters at the character-level and the sentence-level. Meanwhile, for the visual modality, we build character images with multiple channels as the graphic features, where each channel corresponds to a specific Chinese font. Then, we use ResNet (He et al., 2016) blocks to encode the images to get the graphic representation of characters.
With the representation of three different modalities, one challenge is how to fuse them into one compact multimodal representation. To this end, a selective modality fusion mechanism is designed to control how much information of each modality can flow to the mixed representation. Furthermore, as the pretrain-finetune procedure has been proven to be effective on various NLP tasks (Devlin et al., 2019;Dong et al., 2019;Sun et al., 2020), we propose to pretrain the phonetic and the graphic encoders by predicting the correct character given input in the corresponding modality.
We conduct experiments on the SIGHAN benchmarks (Wu et al., 2013; Tseng et al., 2015). By leveraging multimodal information, REALISE outperforms all previous state-of-the-art models by a large margin. Compared to previous methods that use a confusion set (Lee et al., 2019) to capture character similarity relationships, such as the SOTA SpellGCN (Cheng et al., 2020), REALISE achieves average F1 improvements of 2.4% at the detection level and 2.6% at the correction level. Further analysis shows that our model performs better on errors that are not covered by the handcrafted confusion sets, indicating that leveraging the phonetic and graphic information of Chinese characters better captures the easily-misused characters.
To summarize, the contributions of this paper include: (i) we propose to leverage phonetic and graphic information of Chinese characters besides textual semantics for the CSC task; (ii) we introduce the selective fusion mechanism to integrate multimodal information; (iii) we propose acoustic and visual pretraining tasks to further boost the model performance; (iv) to the best of our knowledge, the proposed REALISE model achieves the best results on the SIGHAN CSC benchmarks.
Related Work

Chinese Spell Checking
The CSC task is to detect and correct spelling errors in Chinese sentences. Early works design various rules to deal with different errors (e.g., Chu and Lin, 2015). Later, traditional machine learning algorithms were brought to this field, such as Conditional Random Fields and Hidden Markov Models (Wang and Liao, 2015). More recently, neural methods have made great progress in CSC. Wang et al. (2018) treat the CSC task as a sequence labeling problem and use a bidirectional LSTM to predict the correct characters. With the great success of large pretrained language models (e.g., BERT (Devlin et al., 2019)), Hong et al. (2019) propose the FASpell model, which uses a BERT-based denoising autoencoder to generate candidate characters and applies empirical measures to select the most likely ones. Besides, the Soft-Masked BERT model (Zhang et al., 2020) leverages a cascading architecture in which a GRU detects the erroneous positions and BERT predicts the correct characters.
Previous works (Yu and Li, 2014; Cheng et al., 2020) using the handcrafted Chinese character confusion set (Lee et al., 2019) aim to correct errors by exploiting the similarity of easily-misused characters. One line of work leverages the pointer network (Vinyals et al., 2015) to pick the correct character from the confusion set. Cheng et al. (2020) propose the SpellGCN model, which models character similarity by applying Graph Convolutional Networks (GCNs) (Kipf and Welling, 2016) to the confusion set. However, the character confusion set is predefined and fixed: it can neither cover all similarity relations nor distinguish the degrees of similarity between Chinese characters. In this work, we discard the predefined confusion set and directly use multimodal information to discover the subtle similarity relationships among all Chinese characters.

Figure 1: Architecture overview of the REALISE model. The semantic, phonetic and graphic encoders are used to capture the information in textual, acoustic and visual modalities. The fusion module selectively fuses the information from the three encoders. In the example input, to correct the erroneous character "轻" (qīng, light), we need not only the contextual text information, but also the phonetic and graphic information of the character itself.

Multimodal Learning
There has been much research on integrating information from different modalities to achieve better performance. Tasks such as Multimodal Sentiment Analysis (Zadeh et al., 2016; Zhang et al., 2019), Visual Question Answering (Antol et al., 2015; Chao et al., 2018) and Multimodal Machine Translation (Hitschler et al., 2016; Barrault et al., 2018) have made much progress. Recently, multimodal pretraining models have been proposed, such as VL-BERT (Su et al., 2020), Unicoder-VL, and LXMERT (Tan and Bansal, 2019). To incorporate the visual information of Chinese characters into language models, Meng et al. (2019) design a Tianzige-CNN to facilitate NLP tasks such as named entity recognition and sentence classification. To the best of our knowledge, this paper is the first work to leverage multimodal information to tackle the CSC task.

The REALISE Model
In this section, we introduce the REALISE model, which utilizes semantic, phonetic, and graphic information to distinguish the similarities of Chinese characters and correct spelling errors. As shown in Figure 1, multiple encoders are first employed to capture valuable information from the textual, acoustic and visual modalities. Then, we develop a selective modality fusion module to obtain context-aware multimodal representations. Finally, the output layer predicts the probabilities of error corrections.

The Semantic Encoder
We adopt BERT (Devlin et al., 2019) as the backbone of the semantic encoder. BERT provides rich contextual word representations through unsupervised pretraining on large corpora. The input tokens X = (x_1, . . . , x_N) are first projected into H_0^t through the input embedding. The computation of the Transformer (Vaswani et al., 2017) encoder layers can then be formulated as:

H_l^t = Transformer_l(H_{l-1}^t),  l = 1, . . . , L,

where L is the number of Transformer layers. Each layer consists of a multi-head attention module and a feed-forward network with residual connections (He et al., 2016) and layer normalization (Ba et al., 2016). The output of the last layer is used as the contextualized semantic representation of the input tokens in the textual modality.

The Phonetic Encoder
Hanyu Pinyin (pinyin) is the romanization system for "spelling out" the sounds of Chinese characters, and we use it to compute the phonetic representations in this paper. The pinyin of a Chinese character consists of three parts: initial, final, and tone. The initials (21 in total) and finals (39 in total) are written with letters of the English alphabet. The 5 kinds of tones (taking the final "a" as an example, {ā, á, ǎ, à, a}) can be mapped to the numbers {1, 2, 3, 4, 0}. Although the pinyin vocabulary over all Chinese characters is of a fixed size, we use a sequence of letters in REALISE to capture the subtle phonetic differences between Chinese characters. For example, the pinyin of "中" (middle) and "棕" (brown) are "zhōng" and "zōng" respectively: the two characters have very similar sounds but quite different meanings. We thus represent pinyin as a symbol sequence, e.g., {z, h, o, n, g, 1} for "中". We denote the pinyin of the i-th character x_i in the input sentence as p_i = (p_{i,1}, p_{i,2}, . . . , p_{i,U_i}), where U_i is the number of pinyin symbols. In REALISE, we design a hierarchical phonetic encoder, which consists of a character-level encoder and a sentence-level encoder.
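For illustration, the decomposition of a tone-marked pinyin syllable into the symbol sequence described above can be sketched in Python as follows (a minimal didactic sketch; the helper name and the diacritic table are ours, not part of the paper):

```python
# Map tone-marked vowels to (plain vowel, tone number); tone 0 is the neutral tone.
TONE_MARKS = {
    "ā": ("a", 1), "á": ("a", 2), "ǎ": ("a", 3), "à": ("a", 4),
    "ō": ("o", 1), "ó": ("o", 2), "ǒ": ("o", 3), "ò": ("o", 4),
    "ē": ("e", 1), "é": ("e", 2), "ě": ("e", 3), "è": ("e", 4),
    "ī": ("i", 1), "í": ("i", 2), "ǐ": ("i", 3), "ì": ("i", 4),
    "ū": ("u", 1), "ú": ("u", 2), "ǔ": ("u", 3), "ù": ("u", 4),
    "ǖ": ("ü", 1), "ǘ": ("ü", 2), "ǚ": ("ü", 3), "ǜ": ("ü", 4),
}

def pinyin_to_symbols(syllable: str) -> list:
    """Split a tone-marked pinyin syllable into its letters plus a
    trailing tone-number symbol, e.g. "zhōng" -> [z, h, o, n, g, 1]."""
    letters, tone = [], 0  # tone 0 = neutral tone
    for ch in syllable:
        if ch in TONE_MARKS:
            plain, tone = TONE_MARKS[ch]
            letters.append(plain)
        else:
            letters.append(ch)
    return letters + [str(tone)]
```

With this representation, "中" ("zhōng") and "棕" ("zōng") differ in exactly one symbol, which is the kind of subtle phonetic difference the character-level encoder consumes.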
The Character-level Encoder models the basic pronunciation and captures the subtle sound differences between characters. It is a single-layer uni-directional GRU (Cho et al., 2014), which encodes the pinyin of the i-th character x_i as:

h_{i,j}^a = GRU(h_{i,j-1}^a, E(p_{i,j})),

where E(p_{i,j}) is the embedding of the pinyin symbol p_{i,j}, and h_{i,j}^a is the j-th hidden state of the GRU. The last hidden state is used as the character-level phonetic representation of x_i.
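To make the recurrence concrete, the following is a minimal GRU cell in NumPy (our own didactic sketch, not the paper's PyTorch implementation; the weights are random stand-ins for trained parameters, and the input rows play the role of the symbol embeddings E(p_{i,j})). It returns the last hidden state, which serves as the character-level phonetic representation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_encode(embeddings, Wz, Uz, Wr, Ur, Wh, Uh):
    """Run a single-layer uni-directional GRU over a sequence of
    pinyin-symbol embeddings and return the last hidden state."""
    h = np.zeros(Uz.shape[0])
    for x in embeddings:                          # x plays the role of E(p_{i,j})
        z = sigmoid(Wz @ x + Uz @ h)              # update gate
        r = sigmoid(Wr @ x + Ur @ h)              # reset gate
        h_tilde = np.tanh(Wh @ x + Uh @ (r * h))  # candidate state
        h = (1 - z) * h + z * h_tilde             # interpolate old and new
    return h

# Toy usage: 6 pinyin symbols ({z, h, o, n, g, 1}), embedding dim 8, hidden dim 16.
rng = np.random.default_rng(0)
emb = rng.normal(size=(6, 8))
W = [rng.normal(scale=0.1, size=(16, 8)) for _ in range(3)]
U = [rng.normal(scale=0.1, size=(16, 16)) for _ in range(3)]
h_last = gru_encode(emb, W[0], U[0], W[1], U[1], W[2], U[2])
```

In practice one would use a framework GRU (e.g., `torch.nn.GRU`); the sketch only shows why the final hidden state summarizes the whole pinyin symbol sequence of one character.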
The Sentence-level Encoder is a 4-layer Transformer with the same hidden size as the semantic encoder. It is designed to obtain a contextualized phonetic representation for each Chinese character. As the independent character-level phonetic vectors carry no order information, we first add a positional embedding to each vector. Then we pack these phonetic vectors together and apply the Transformer layers to compute the contextualized representations in the acoustic modality, denoted as H^a = (h_1^a, h_2^a, . . . , h_N^a). Note that, owing to the Transformer architecture, this representation is also normalized.

The Graphic Encoder
We apply ResNet (He et al., 2016) as the graphic encoder. It has 5 layers of ResNet blocks (denoted as ResNet5) followed by a layer normalization (Ba et al., 2016) operation. We formulate this procedure as:

h_i^v = LayerNorm(ResNet5(I_i)),

where I_i is the image of the i-th character x_i in the input sentence, and LayerNorm denotes layer normalization.
In order to extract graphic information effectively, each block in ResNet5 halves the width and height of the image and increases the number of channels, so the final output is a vector whose length equals the number of output channels (i.e., both height and width become 1). We set the number of output channels to the hidden size of the semantic encoder for the follow-up modality fusion, and denote the visual-modality representation of the input sentence as H^v = (h_1^v, h_2^v, . . . , h_N^v). The character image of x_i is read from preset font files. Since the scripts of Chinese characters have evolved over thousands of years, to capture as many graphic relationships between characters as possible, we select three fonts, namely the Gothic typefaces (黑体, hēitǐ) in both Simplified and Traditional Chinese, and the Small Seal Script (小篆, xiǎozhuàn). The three fonts correspond to the three channels of the character images, whose size is set to 32 × 32 pixels.
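Building the multi-channel character image can be sketched with Pillow as follows (a sketch under our own assumptions: the function name is ours, and the commented font paths are hypothetical placeholders for the Simplified/Traditional Gothic and Small Seal Script files, which must be supplied by the user):

```python
import numpy as np
from PIL import Image, ImageDraw, ImageFont

def render_char(ch, fonts, size=32):
    """Render one character with each font into a (len(fonts), size, size)
    float array in [0, 1], one font per channel."""
    channels = []
    for font in fonts:
        img = Image.new("L", (size, size), color=0)  # black background
        draw = ImageDraw.Draw(img)
        draw.text((0, 0), ch, fill=255, font=font)   # white glyph
        channels.append(np.asarray(img, dtype=np.float32) / 255.0)
    return np.stack(channels)

# Hypothetical font paths -- replace with the actual three font files:
# fonts = [ImageFont.truetype(p, 28)
#          for p in ("hei_sc.ttf", "hei_tc.ttf", "xiaozhuan.ttf")]
# image = render_char("轻", fonts)   # shape (3, 32, 32), one channel per font
```

The resulting (3, 32, 32) tensor is what ResNet5 consumes; special tokens such as [CLS] get all-zero tensors instead (see Section 4.2).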

Selective Modality Fusion Module
After applying the previously mentioned semantic, phonetic and graphic encoders, we obtain the representation vectors H^t, H^a and H^v in the textual, acoustic and visual modalities. To predict the final correct Chinese characters, we develop a selective modality fusion module to integrate these vectors from the different modalities. This module fuses information at two levels, i.e., the character level and the sentence level.
First, for each modality, a selective gate unit is employed to control how much information can flow into the mixed multimodal representation. For example, if a character is misspelled due to its similar pronunciation to the correct one, then more information from the acoustic modality should flow into the mixed representation. The gate values are computed by a fully-connected layer followed by a sigmoid function, whose input is the concatenation of the character representations from the three modalities and the mean of the semantic encoder output H^t, which captures the overall semantics of the input sentence. Denoting the gate values of the i-th character for the textual, acoustic and visual modalities as g_i^t, g_i^a and g_i^v, the mixed multimodal representation h_i is computed as:

h_i = g_i^t ⊙ h_i^t + g_i^a ⊙ h_i^a + g_i^v ⊙ h_i^v.

The mixed representations are then fed into L Transformer layers for sentence-level fusion, and the output layer computes the probability distribution over the character vocabulary with softmax(W_o h_i' + b_o), where L is the number of Transformer layers, h_i' is the i-th output of the last fusion layer, and W_o and b_o are learnable parameters.
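The character-level gating step can be sketched as follows (our own NumPy sketch, not the paper's implementation; it assumes one scalar gate per modality per character, and the weight matrix Wg and bias bg stand in for the trained fully-connected layer):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def selective_fusion(Ht, Ha, Hv, Wg, bg):
    """Fuse per-character textual/acoustic/visual vectors (each N x d).
    Gates are computed from the concatenation of the three character
    vectors and the mean-pooled semantic output; Wg maps 4*d -> 3 gates."""
    mean_t = Ht.mean(axis=0)                 # overall sentence semantics
    mixed = np.zeros_like(Ht)
    for i in range(Ht.shape[0]):
        feats = np.concatenate([Ht[i], Ha[i], Hv[i], mean_t])
        g = sigmoid(Wg @ feats + bg)         # (g_t, g_a, g_v), each in (0, 1)
        mixed[i] = g[0] * Ht[i] + g[1] * Ha[i] + g[2] * Hv[i]
    return mixed

# Toy usage: 5 characters, hidden size 8.
rng = np.random.default_rng(1)
Ht, Ha, Hv = rng.normal(size=(3, 5, 8))
Wg = rng.normal(scale=0.1, size=(3, 4 * 8))
mixed = selective_fusion(Ht, Ha, Hv, Wg, np.zeros(3))
```

The per-character gates make the visualization in Section 4.5 possible: for a phonetically confused character, a trained model pushes g_a toward 1.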

Acoustic and Visual Pretraining
While acoustic and visual information is essential to the CSC task, equally important is how to associate them with the correct character. In order to learn the acoustic-textual and visual-textual relationships, we propose to pretrain the phonetic and the graphic encoders.
For the phonetic encoder, we design an Input Method pretraining objective, in which the encoder must recover the Chinese character sequence given the input pinyin sequence, much as Chinese input methods do. We add a linear layer on top of the encoder to transform the hidden states into probability distributions over the Chinese character vocabulary. We pretrain the phonetic encoder on the pinyin of the sentences with spelling errors in the training data, making it recover the character sequences without spelling errors. For the graphic encoder, we design an Optical Character Recognition (OCR) pretraining objective: given a Chinese character image, the graphic encoder learns to predict the corresponding character from the Chinese character vocabulary. This resembles the OCR task, except that recognition is conducted only at the character level and on typed scripts. During pretraining, we likewise add a linear layer on top to perform the classification.
Finally, we load the pretrained weights of the semantic encoder, phonetic encoder, and graphic encoder, and conduct the final training process with the CSC training data.

Experiments
In this section, we introduce experimental details and results on the SIGHAN benchmarks (Wu et al., 2013;Tseng et al., 2015). We then verify the effectiveness of our model by conducting ablation studies and analyses.

Data and Metrics
Following previous works (Cheng et al., 2020), we use the SIGHAN training data and the generated pseudo data (Wang et al., 2018) for training. Results are reported at the detection level and the correction level. At the detection level, a sentence is considered correct if and only if all the spelling errors in the sentence are detected successfully. At the correction level, the model must not only detect but also correct all the erroneous characters. We report accuracy, precision, recall and F1 scores at both levels.
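The sentence-level scoring can be sketched as follows (our own reading of the protocol described above; published CSC evaluation scripts differ in details such as how unflagged sentences are counted, so treat this as an illustration rather than the official scorer):

```python
def sentence_metrics(srcs, golds, preds):
    """Sentence-level detection/correction precision, recall and F1.
    A sentence is 'flagged' if the model changes it; detection requires
    the predicted error positions to match the gold positions exactly,
    and correction additionally requires the output to equal the gold."""
    def positions(a, b):
        return {i for i, (x, y) in enumerate(zip(a, b)) if x != y}

    stats = {"det": [0, 0, 0], "cor": [0, 0, 0]}  # [tp, fp, fn]
    for src, gold, pred in zip(srcs, golds, preds):
        gold_pos, pred_pos = positions(src, gold), positions(src, pred)
        has_err, flagged = bool(gold_pos), bool(pred_pos)
        det_hit = has_err and pred_pos == gold_pos
        cor_hit = det_hit and pred == gold
        for key, hit in (("det", det_hit), ("cor", cor_hit)):
            if flagged and hit:
                stats[key][0] += 1        # true positive
            elif flagged:
                stats[key][1] += 1        # false positive
            if has_err and not hit:
                stats[key][2] += 1        # false negative
    out = {}
    for key, (tp, fp, fn) in stats.items():
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        out[key] = (p, r, 2 * p * r / (p + r) if p + r else 0.0)
    return out
```

Note that under this all-or-nothing criterion, a sentence with two errors of which only one is fixed counts as both a false positive and a false negative.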

Implementation Details
The REALISE model is implemented with the PyTorch framework (Paszke et al., 2019) and the Transformers library (Wolf et al., 2020). The architecture of the semantic encoder is the same as the BERT_BASE (Devlin et al., 2019) model (i.e., 12 Transformer layers with 12 attention heads and a hidden size of 768). We initialize the semantic encoder with the weights of the BERT-wwm model (Cui et al., 2019). For the phonetic sentence-level encoder, we set the number of layers to 4 and initialize its position embedding with BERT's position embedding. The selective modality fusion module has 3 Transformer layers, i.e., L = 3, and the prediction matrix W_o is tied with the word embedding matrix of the semantic encoder. All embeddings and hidden states have a dimension of 768. We use the Pillow library to render the Chinese character images. For special tokens (e.g., [CLS] and [SEP] of BERT), we use all-zero tensors as their image inputs. We train REALISE with the AdamW (Loshchilov and Hutter, 2017) optimizer for 10 epochs, with a learning rate of 5e-5, a batch size of 32, and learning rate warm-up followed by linear decay.
The annotation quality of the SIGHAN13 test set is relatively poor: many mixed usages of the auxiliaries "的", "地", and "得" are not annotated (Cheng et al., 2020), so a well-performing model may obtain low scores on it. To alleviate this problem, Cheng et al. (2020) propose to continue finetuning the model on the SIGHAN13 training set before testing. We argue that this is not good practice because it reduces model performance. Instead, we use a simple and effective post-processing method: we remove all detections and corrections of "的", "地", and "得" from the model output and then evaluate against the ground truth of the SIGHAN13 test set.
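This post-processing amounts to reverting any model edit that touches one of the three auxiliaries before scoring; a minimal sketch (the function name is ours):

```python
AUXILIARIES = {"的", "地", "得"}

def strip_de_corrections(src: str, pred: str) -> str:
    """Revert any model edit where the original or the predicted character
    is one of 的/地/得, so these unannotated auxiliary confusions do not
    count as detections or corrections when evaluating on SIGHAN13."""
    assert len(src) == len(pred), "CSC keeps sentence length unchanged"
    out = []
    for s, p in zip(src, pred):
        if s != p and (s in AUXILIARIES or p in AUXILIARIES):
            out.append(s)  # undo the edit at this position
        else:
            out.append(p)  # keep the model's character
    return "".join(out)
```

Because the same filter is applied to every system's output, the comparison across models on SIGHAN13 remains fair.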

Baselines
We compare REALISE with the following baselines. KUAS, NTOU (Chu and Lin, 2015), NCTU-NTUT (Wang and Liao, 2015), HanSpeller++ and LMC (Xie et al., 2015) mainly utilize heuristics or traditional machine learning algorithms, such as n-gram language models, Conditional Random Fields and Hidden Markov Models. Sequence Labeling (Wang et al., 2018) treats CSC as a sequence labeling problem and applies a BiLSTM model. FASpell (Hong et al., 2019) utilizes a denoising autoencoder (DAE) to generate candidate characters. Soft-Masked BERT (Zhang et al., 2020) utilizes a detection model to help the correction model learn the right context. SpellGCN (Cheng et al., 2020) incorporates the predefined character confusion sets into a BERT-based correction model through Graph Convolutional Networks (GCNs). BERT (Devlin et al., 2019) directly fine-tunes the BERT_BASE model with the CSC training data.

Main Results

Table 3 shows the evaluation scores at the detection and correction levels on the SIGHAN 13/14/15 test sets. The REALISE model performs significantly better than all previous state-of-the-art models on all test sets. By capturing valuable information from the acoustic and visual modalities, REALISE yields consistent gains over BERT by a large margin. Specifically, at the correction level, REALISE exceeds BERT by 5.2% F1 on SIGHAN13, 3.8% F1 on SIGHAN14, and 4.4% F1 on SIGHAN15. The results on SIGHAN13 are improved significantly with the simple post-processing described in Section 4.2.

There are several successful applications of BERT on the CSC task, such as FASpell and SpellGCN, which also consider Chinese character similarity. They attempt to use the similarity as the confidence for filtering candidates, or to construct similarity graphs from predefined confusion sets. Instead, in our method, multiple encoders are directly applied to derive more informative representations from the acoustic and visual modalities. Compared with SpellGCN (Cheng et al., 2020), the SOTA CSC model, REALISE achieves average F1 improvements of 2.4% at the detection level and 2.6% at the correction level. This indicates that, compared with other extensions of BERT, the explicit utilization of the multimodal information of Chinese characters is more beneficial to the CSC task.

Table 4: Ablation results of the REALISE model averaged over the SIGHAN test sets. We apply the following changes to REALISE: removing the phonetic encoder (-Phonetic), removing the graphic encoder (-Graphic), using only one font to build the graphic inputs (-Multi-Fonts), removing acoustic and visual pretraining (-Pretraining), and replacing the selective modality fusion mechanism with simple summation (-Selective-Fusion).
With the simple post-processing described in Section 4.2, the results of each model on the SIGHAN13 test set improve significantly. Comparing with BERT and SpellGCN, we can see that, after the post-processing, the REALISE model remains ahead of all the baseline models.

Ablation Study
We explore the contribution of each component in REALISE by conducting ablation studies with the following settings: 1) removing the phonetic encoder, 2) removing the graphic encoder, 3) using only one font (the Gothic typeface in Simplified Chinese) for the graphic encoder, 4) removing the acoustic and visual pretraining objectives, and 5) replacing the selective modality fusion mechanism with simple summation.

Table 4 shows the scores averaged over the three SIGHAN test sets (full ablation results can be found in Appendix A.1). The main motivation of this paper is to discover character similarity relationships by incorporating acoustic and visual information. When the phonetic or graphic encoder is removed, the model performance drops at both levels but still significantly outperforms BERT, suggesting that the checking model benefits from the multimodal information. No matter which component we remove, the performance of REALISE drops, which demonstrates the effectiveness of each part of our model.

Figure 2: Selective modality fusion visualization. "I" is the input sentence, "O" is the output of REALISE (also the ground truth), and "T" is the translation. g_t, g_a, g_v are the gate values for the textual, acoustic, and visual modalities respectively. We highlight the wrong/correct characters in red/blue color.

Figure 2 gives two examples to analyze the selective modality fusion module. In the first example, the acoustic and visual gate values of "埤", i.e., g_a and g_v, are much larger than those of most other characters, since "埤(pí)" and "啤(pí)" have the same pronunciation and the same right radical "卑". This shows that the selective fusion module can judge whether to introduce phonetic or graphic information into the mixed representation. The second example shows a similar trend for the pronunciation of "带(dài)" and "戴(dài)". More selective fusion visualizations can be found in Appendix A.2.

Besides, we calculate the averaged gate values of erroneous characters for each modality on SIGHAN15. The largest is the textual modality, with a value almost equal to 1.0; the second is the acoustic modality, with an averaged value of 0.334; and the smallest is the visual modality, with a value of 0.229. This means that the information from the semantic encoder is the most important for correcting spelling errors. The acoustic modality is more important than the visual modality, which is consistent with the fact that spelling errors caused by similar pronunciations are more frequent than errors caused by similar character shapes (Liu et al., 2010).

Table 5: Two example corrections by REALISE, with English translations.
In: 我打算去法国流行，你要不要跟我一起去？ (I am going to popular to France, would you like to go with me?)
Out: 我打算去法国旅行，你要不要跟我一起去？ (I am going to travel to France, would you like to go with me?)
In: 回国之后，我跟快去你家。 (After returning home, I will go to your house with.)
Out: 回国之后，我很快去你家。 (After returning home, I will go to your house soon.)

Case Study
In the first example in Table 5, "流" is the erroneous character. Ignoring Chinese character similarities, there are multiple candidate corrections for "流"; for instance, replacing it with "游" yields the translation "I am going to parade in France". However, REALISE's output is the best correction, because "流(liú)" and "旅(lǚ)" have similar pronunciations. In the second example, not only phonetic but also visual information is important for correcting "跟(gēn)" to "很(hěn)": the two characters share the same final "en" in pronunciation and have the same right radical "艮". The errors in the above examples are not corrected by SpellGCN, since they are not defined as confusable character pairs in the handcrafted confusion sets (Lee et al., 2019). Specifically, in the SIGHAN15 test set, 16% of the erroneous-corrected character pairs are not in the predefined confusion sets. SpellGCN corrects 64.6% of them, while REALISE performs better, correcting 73.5%. For the easily-confused pairs that are in the predefined sets, SpellGCN corrects 82.5% and REALISE corrects 85.8%. This indicates that leveraging the multimodal information of Chinese characters helps the model generalize better in capturing character similarity relationships.
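The split reported above can be reproduced with a simple counting routine over gold error records (our own sketch; the record format and function name are assumptions, not the paper's analysis code):

```python
def correction_rate_by_coverage(records, confusion_set):
    """Split gold error records by whether the (wrong -> correct) pair
    appears in a predefined confusion set, and report the fraction of
    each bucket that the model corrected.
    records: iterable of (wrong_char, correct_char, was_corrected)."""
    buckets = {"in_set": [0, 0], "out_of_set": [0, 0]}  # [corrected, total]
    for wrong, correct, fixed in records:
        key = "in_set" if correct in confusion_set.get(wrong, ()) else "out_of_set"
        buckets[key][1] += 1
        buckets[key][0] += int(fixed)
    return {k: (c / t if t else 0.0) for k, (c, t) in buckets.items()}

# Toy usage with a tiny confusion set:
confusion = {"平": {"瓶"}}
records = [("平", "瓶", True), ("流", "旅", True), ("跟", "很", False)]
rates = correction_rate_by_coverage(records, confusion)
```

Applying this to SIGHAN15 with the Lee et al. (2019) confusion set yields the 73.5% vs. 64.6% out-of-set comparison quoted above.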

Conclusion
In this paper, we propose a model called REALISE for Chinese spell checking. Since spelling errors in Chinese are often semantically, phonetically or graphically similar to the correct characters, REALISE leverages information from the textual, acoustic and visual modalities to detect and correct the errors. The REALISE model captures information in these modalities using tailored semantic, phonetic and graphic encoders, and a selective modality fusion mechanism is proposed to control the information flow across modalities. Experiments on the SIGHAN benchmarks show that the proposed REALISE outperforms baseline models using only textual information by a large margin, which verifies that leveraging acoustic and visual information helps the Chinese spell checking task.