Investigating Glyph-Phonetic Information for Chinese Spell Checking: What Works and What’s Next?

While pre-trained Chinese language models have demonstrated impressive performance on a wide range of NLP tasks, the Chinese Spell Checking (CSC) task remains a challenge. Previous research has explored using information such as glyphs and pronunciations to improve the ability of CSC models to distinguish misspelled characters, achieving good accuracy on public datasets. However, the generalization ability of these CSC models is not well understood: it is unclear whether they incorporate glyph-phonetic information and, if so, whether this information is fully utilized. In this paper, we aim to better understand the role of glyph-phonetic information in the CSC task and suggest directions for improvement. Additionally, we propose a new, more challenging, and practical setting for testing the generalizability of CSC models. Our code will be released at https://github.com/piglaker/ConfusionCluster.


Introduction
Spell checking (SC) is the process of detecting and correcting spelling errors in natural human texts. For some languages, such as English, SC is relatively straightforward, thanks to tools like the Levenshtein distance and a well-defined vocabulary. For Chinese, however, spell checking (CSC) is a more challenging task, due to the nature of the Chinese language. Chinese has a large vocabulary of at least 3,500 common characters, which creates a vast search space and an unbalanced distribution of errors (Ji et al., 2021). Moreover, substitutions or combinations of characters can significantly alter the meaning of a Chinese sentence while leaving it grammatically correct. The CSC task therefore requires the output to retain as much of the original meaning and wording as possible. Figure 1 shows different types of errors and the corresponding target characters. Previous work has attempted to incorporate inductive bias to model the relationship between Chinese character glyphs, pronunciation, and semantics (Xu et al., 2021).
In recent years, pre-trained language models (PLMs) have shown great success in a wide range of NLP tasks. Since the publication of BERT (Devlin et al., 2018), using PLMs for CSC has become the mainstream approach, with examples including FASpell (Hong et al., 2019), Soft-Masked BERT (Zhang et al., 2020), SpellGCN (Cheng et al., 2020), and PLOME (Liu et al., 2021). Some researchers have focused on the special features of Chinese characters in terms of glyphs and pronunciations, aiming to improve the ability to distinguish misspelled characters by incorporating glyph-phonetic information (Ji et al., 2021; Liu et al., 2021; Xu et al., 2021). Despite these advances, however, the generalization of CSC models to real-world applications remains limited. How can we improve the generalization ability of CSC models? Can current models recognize and utilize glyph-phonetic information to make predictions? Re-examining previous work, we identified some previously unexplored issues and potential directions for future research.
Q1: Do existing Chinese pre-trained models encode the glyph-phonetic information of Chinese characters? Chinese writing is morpho-semantic, and its characters carry additional semantic information. Before studying existing CSC models, we investigate whether mainstream Chinese pre-trained language models are capable of capturing glyph-phonetic information at all.
Q2: Do existing CSC models fully utilize the glyph-phonetic information of misspelled characters to make predictions? Intuitively, introducing glyph-phonetic information into the CSC task should help identify misspelled characters and improve model performance. However, there has been little research on whether existing CSC models actually use glyph-phonetic information in this way.
Empirically, our main observations are summarized as follows: • We show that Chinese PLMs such as BERT encode glyph-phonetic information even without its explicit introduction during pre-training, which can inform the design of future Chinese pre-trained models. We also propose a simple probe task for measuring how much glyph-phonetic information is contained in a Chinese pre-trained model. We hope this detailed empirical study will give follow-up researchers more guidance on how to better incorporate glyph-phonetic information in CSC tasks and pave the way for new state-of-the-art results in this area.

Glyph Information
Learning glyph information from Chinese character forms has gained popularity with the rise of deep neural networks. After word embeddings (Mikolov et al., 2013b) were proposed, early studies (Sun et al., 2014; Shi et al., 2015; Yin et al., 2016) used radical embeddings to capture semantics, modeling graphic information by splitting characters into radicals. Another approach to modeling glyph information is to treat characters as images, using convolutional neural networks (CNNs) as glyph feature extractors (Liu et al., 2010; Shao et al., 2017; Dai and Cai, 2017; Meng et al., 2019). With pre-trained language models, glyph and phonetic information are introduced end-to-end. ChineseBERT (Sun et al., 2021) is a pre-trained Chinese NLP model that flattens the image vector of input characters to obtain the glyph embedding and achieves significant performance gains across a wide range of Chinese NLP tasks.

Phonetic Information
Previous research has explored using phonetic information to improve natural language processing (NLP) tasks. Liu et al. propose using both textual and phonetic information in neural machine translation (NMT) by combining them in the input embedding layer, making NMT models more robust to homophone errors. There is also work on incorporating phonetic embeddings through pre-training. Zhang et al. propose a novel end-to-end framework for CSC with phonetic pre-training, which improves the model's ability to understand sentences with misspellings and to model the similarity between characters and pinyin tokens. Sun et al. apply a CNN and max-pooling layer to the pinyin sequence to derive the pinyin embedding.

Task Description
Under the language-model framework, Chinese Spell Checking is often modeled as a conditional token prediction problem. Formally, let X = {c_1, c_2, ..., c_T} be an input sequence with potentially misspelled characters c_i. The goal of the task is to discover and correct these errors by estimating the conditional probability P(y_i | X) for each misspelled character c_i.
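Under this formulation, a corrector scores candidate characters at every position and outputs the argmax. A minimal stdlib sketch of such greedy decoding (the candidate distributions and the 门/们 homophone example are illustrative, not taken from the paper):

```python
def correct(sentence, score):
    """Greedy CSC decoding: at each position i, output the candidate
    character y that maximizes the model's estimate of P(y_i | X)."""
    return [max(score[i], key=score[i].get) for i in range(len(sentence))]

# Toy per-position candidate distributions for the input "我门"
# ("door" 门 mistyped for its homophone 们 in 我们, "we"); values illustrative.
scores = [{"我": 0.99, "你": 0.01},
          {"门": 0.30, "们": 0.70}]
print(correct(["我", "门"], scores))  # ['我', '们']
```

In practice `score` would come from a language model's softmax over the vocabulary at each position; here it is a hand-made table.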

CSC Datasets
We conduct experiments on the benchmark SIGHAN dataset (Wu et al., 2013; Yu et al., 2014; Tseng et al., 2015), which was built from the writings of Chinese learners and contains 3,162 texts and 461 error types. However, previous studies have reported poor annotation quality in SIGHAN13 and SIGHAN14 (Wu et al., 2013; Yu et al., 2014), with many errors, such as the mixed usage of auxiliary characters, remaining unannotated (Cheng et al., 2020). To address these issues and enable fair comparison of different models, we apply our probe experiment to the entire SIGHAN dataset but use only the cleaner SIGHAN15 for the metrics in our review. The statistics of the dataset are detailed in the appendix.

Methods for CSC
To investigate the role of glyph-phonetic information in CSC, we conduct a probe experiment using different Chinese PLMs as the initial parameters of the baseline. The models we use are detailed in the appendix. For our first probe experiment, we use the out-of-the-box BERT model as a baseline. We feed the corrupted sentence into BERT and obtain a prediction for each token. If the predicted token at an output position differs from the corresponding input token, we consider BERT to have detected and corrected an error (Zhang et al., 2022). We also consider two previous pre-training methods that introduce glyph and phonetic information for CSC. PLOME (Liu et al., 2021) is a pre-trained masked language model that jointly learns to understand language and correct spelling errors. It masks chosen tokens with similar characters according to a confusion set and introduces phonetic prediction, using GRU networks to learn misspelling knowledge at the phonetic level. ReaLiSe (Xu et al., 2021) leverages the multimodal information of Chinese characters, using a universal encoder for vision and a sequence modeler for pronunciation and semantics.
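The out-of-the-box detection rule can be sketched as follows; `predicted_tokens` stands in for BERT's per-position argmax decoding, which we do not reproduce here:

```python
def detect_and_correct(input_tokens, predicted_tokens):
    """Out-of-the-box baseline rule: any position where the model's top
    prediction differs from the input token is a detected error, and the
    prediction itself serves as the correction."""
    detected = [i for i, (inp, pred)
                in enumerate(zip(input_tokens, predicted_tokens))
                if inp != pred]
    return list(predicted_tokens), detected

# "我门是朋友" with a model predicting "我们是朋友" (predictions are toy values)
output, errors = detect_and_correct(list("我门是朋友"), list("我们是朋友"))
print(errors)  # [1]
```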

Metrics
For convenience, all Chinese Spell Checking metrics in this paper are sentence-level scores (Cheng et al., 2020). We mix the original SIGHAN training set with the enhanced training set of 270k examples generated by OCR- and ASR-based approaches (Wang et al., 2018), which has been widely used in the CSC task.
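A sentence-level correction score of the kind referenced above can be computed as follows. The exact convention varies across papers; the reading used here (a prediction is positive if the model changed the sentence, and a true positive if the changed output exactly equals the target) is one plausible version, not necessarily the cited one:

```python
def sentence_level_scores(sources, predictions, targets):
    """Sentence-level correction precision/recall/F1, under one common
    convention: a sentence is a positive prediction if the model changed
    it, and a true positive if the output exactly equals the target."""
    tp = sum(p != s and p == t
             for s, p, t in zip(sources, predictions, targets))
    predicted = sum(p != s for s, p in zip(sources, predictions))
    gold = sum(t != s for s, t in zip(sources, targets))
    precision = tp / predicted if predicted else 0.0
    recall = tp / gold if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Three toy sentences: one corrected right, one left unchanged, one changed wrong.
srcs = ["我门是", "你好啊", "天汽好"]
prds = ["我们是", "你好啊", "天气坏"]
tgts = ["我们是", "你好啊", "天气好"]
print(sentence_level_scores(srcs, prds, tgts))  # (0.5, 0.5, 0.5)
```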

Experiment-I: Probing for Character Glyph-Phonetic Information
In this section, we conduct a simple MLP-based probe to explore the presence of glyph and phonetic information in Chinese PLMs and to quantify the extent to which token embeddings capture it. We consider glyph and phonetic information separately in this experiment.

Glyph Probe
For glyphs, we train a binary classifier probe to predict whether one character is contained within another, using the frozen embeddings of these characters from Chinese PLMs as input. As shown in the upper part of Figure 2, a successful probe will predict that "称" contains "尔" at the glyph level but does not contain "产" (it is difficult to define whether two characters are visually similar, so we use containment as a shortcut). For the glyph probe experiment, we consider the static, non-contextualized embeddings of the following Chinese PLMs: BERT (Cui et al., 2019), RoBERTa (Cui et al., 2019), ChineseBERT (Sun et al., 2021), MacBERT (Cui et al., 2020), CPT (Shao et al., 2021), GPT-2 (Radford et al., 2019), BART (Shao et al., 2021), and T5 (Raffel et al., 2020). We also use Word2vec (Mikolov et al., 2013a) as a baseline and a completely randomly initialized embedding as a control. See Appendix C.1 for details on the models used in this experiment.
The vocabularies of different Chinese PLMs are similar. For convenience, we consider only the characters that appear in the vocabulary of BERT, and we also remove characters that are rare or structurally too complex. The details of our probe datasets are given in Appendix C.2.
We divide each character w into components {u_1, u_2, ..., u_i} using a character-splitting tool; for example, "称" is divided into "禾" and "尔". The set of all characters (e.g., "称") is W = {w_1, w_2, ..., w_d}, where d is the number of characters. The set of all components (e.g., "禾", "尔") is U = {u_1, u_2, ..., u_c}, where c is the number of components. If u_i occurs in w_i, i.e., it is a glyph-level component of w_i, then (u_i, w_i) is a positive example, and otherwise a negative example. We construct a positive dataset D_pos = {(u_1, w_1), (u_2, w_1), ..., (u_i, w_d)}, pairing each component u with its character w, and a balanced negative dataset D_neg of the same size, in which each u_n is randomly selected from U. We mix D_pos and D_neg and split the result into training and test sets at a ratio of 80:20, ensuring that each character appears on only one side.
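The pair construction and character-disjoint split described above can be sketched as follows. The `decompose` mapping stands in for the external character-splitting tool, and the component lists below are illustrative:

```python
import random

def build_glyph_probe_dataset(decompose, seed=0):
    """Build balanced positive/negative (component, character, label)
    triples and split them 80:20 by character, so that no character
    appears on both sides. `decompose` maps each character to its glyph
    components, e.g. {"称": ["禾", "尔"], ...} (assumed given)."""
    rng = random.Random(seed)
    chars = sorted(decompose)
    components = sorted({u for us in decompose.values() for u in us})
    data = []
    for w in chars:
        for u in decompose[w]:
            data.append((u, w, 1))                      # positive pair
            negatives = [v for v in components if v not in decompose[w]]
            data.append((rng.choice(negatives), w, 0))  # balanced negative
    rng.shuffle(chars)
    cut = int(0.8 * len(chars))
    train_chars = set(chars[:cut])
    train = [ex for ex in data if ex[1] in train_chars]
    test = [ex for ex in data if ex[1] not in train_chars]
    return train, test

# Toy decomposition table (five characters, two components each).
decompose = {"称": ["禾", "尔"], "程": ["禾", "呈"], "和": ["禾", "口"],
             "你": ["亻", "尔"], "产": ["立", "厂"]}
train, test = build_glyph_probe_dataset(decompose)
```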
We train the probe on these PLMs' static, non-trainable embeddings. For each pair (u_i, w_i), we take the embeddings of u_i and w_i and concatenate them as the input x_i. The classifier trains an MLP to predict the logit ŷ_i = MLP(x_i). To control variables as much as possible and mitigate the effects of other factors on the probe, we also vary the number of MLP layers; the results are detailed in Appendix C.3.
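Under our reading, the probe applies an MLP to the concatenated pair of frozen embeddings and a sigmoid to obtain a containment probability. A single-hidden-layer, pure-Python sketch (the embedding sizes and hand-set weights are toy values, not the paper's configuration):

```python
import math

def mlp_probe(e_u, e_w, W1, b1, W2, b2):
    """One-hidden-layer probe on concatenated frozen embeddings:
    p = sigmoid(W2 . ReLU(W1 . [e_u ; e_w] + b1) + b2)."""
    x = list(e_u) + list(e_w)  # concatenation of the two embeddings
    h = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    z = sum(w * hi for w, hi in zip(W2, h)) + b2
    return 1.0 / (1.0 + math.exp(-z))

# Toy 2-d embeddings and hand-set weights (purely illustrative).
p = mlp_probe([0.1, 0.2], [0.3, 0.4],
              W1=[[1, 0, 0, 0], [0, 0, 0, 1]], b1=[0.0, 0.0],
              W2=[1.0, 1.0], b2=0.0)
```

In the actual experiment the weights would be learned with binary cross-entropy while the character embeddings stay frozen.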

Phonetic Probe
For phonetics, we train another binary classifier probe to predict whether two characters have similar pronunciations, again using the frozen character embeddings from Chinese PLMs as input. 'Similar' here means the pinyin is exactly the same, though the tones may differ. As shown in the lower part of Figure 2, a successful probe will predict that "称" (cheng) has a similar pronunciation to "程" (cheng) but not to "产" (chan). The pronunciation information for Chinese characters comes from the pypinyin toolkit.
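The tone-insensitive similarity test can be sketched as follows. The small `PINYIN` table is a hand-made stand-in for the pypinyin toolkit, and the tone-number notation (e.g. `cheng1`) is an assumption of this sketch:

```python
def similar_pronunciation(pinyin_a, pinyin_b):
    """'Similar' as defined above: identical pinyin once tone marks are
    ignored. Pinyin strings use tone-number notation, e.g. 'cheng1'."""
    def strip_tone(p):
        return p.rstrip("012345")
    return strip_tone(pinyin_a) == strip_tone(pinyin_b)

# Toy lookup table standing in for the pypinyin toolkit.
PINYIN = {"称": "cheng1", "程": "cheng2", "产": "chan3"}
print(similar_pronunciation(PINYIN["称"], PINYIN["程"]))  # True
print(similar_pronunciation(PINYIN["称"], PINYIN["产"]))  # False
```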
We consider the same static, non-contextualized embeddings of Chinese PLMs as in the glyph probe. We also mainly analyze characters in the vocabulary of BERT, restricting ourselves to common characters.
The dataset construction is similar to that of the glyph probe. To create positive examples, for each character w_i in the character list W we find a character u_i with a similar pronunciation, making (u_i, w_i) a positive example. For each positive example we also find a character s_i whose pronunciation differs from that of w_i, giving a negative example (s_i, w_i). For instance, "称" (cheng) and "程" (cheng) form a positive pair, while "称" (cheng) and "产" (chan) form a negative pair. The split ratio and other settings are the same as in the glyph probe.
As in the glyph probe, we train the probe on the PLMs' static, non-trainable embeddings and concatenate the embeddings of each pair as input.

Results and Analysis
The following conclusions can be drawn from Figure 3.
Chinese PLMs encode the glyph information of characters. From the results we can see that, for glyphs, all models outperform the control. The control's accuracy is close to 50%, meaning no glyph information is encoded in its input embeddings and the model guesses randomly. Comparing Word2vec with the Chinese PLMs, we find that large-scale pre-trained models have a significant advantage, suggesting that large-scale pre-training leads to better character representations. In addition, the results of these Chinese PLMs are concentrated in a small interval. Notably, although ChineseBERT explicitly introduces glyph-phonetic information, it shows no advantage on the glyph probe.
PLMs can hardly distinguish the phonetic features of Chinese characters. In our experiments, the control group performs at approximately 50% accuracy on the phonetic probe. Unlike in the glyph probe, the accuracies of Word2vec and the other Chinese PLMs are also low here. However, the introduction of phonetic embeddings allows ChineseBERT to perform significantly better than the other models. Our analysis suggests that current Chinese PLMs may carry limited phonetic information.

Figure 3: Results of the probe for Chinese PLMs. Language models built on different paradigms are roughly comparable in perceiving glyph information but weak on phonetics. Notably, ChineseBERT performs markedly better on this probe, probably because it explicitly introduces glyph and pronunciation information at the embedding stage.

Model training on the CSC task does not enrich glyph and phonetic information. We perform the same two probes using models fine-tuned on the SIGHAN dataset, aiming to investigate whether training on the CSC task adds glyph or phonetic information to the embeddings; the results are shown in Table 1. The difference between the fine-tuned and untrained models is almost negligible, indicating that the relevant information is primarily encoded during the pre-training stage.

Experiment-II: Probing for Homonym Correction
In this experiment, we explore the extent to which existing models can make use of the information in misspelled characters. To do this, we propose a new probe, Correction with Misspelled Character Coverage Ratio (CCCR), which investigates whether the model adjusts its prediction probability distribution according to the glyph-phonetic information of misspelled characters when making predictions.

Correction with Misspelled Character Coverage Ratio
Measuring how models utilize misspelled characters. We propose a method to evaluate a model's ability to make predictions using the additional information in misspelled characters, and to assess whether the model contains glyph-phonetic information.
Assume C is the set of all possible finite-length sentences C_i over the language L: C = {C_0, ..., C_i, ...}, where C_i = {c_{i,1}, ..., c_{i,n}, ...} and c_{i,j} ∈ L. Let C_i^{n,a} denote the sentence C_i with its n-th character replaced by a: C_i^{n,a} = {c_{i,1}, ..., c_{i,n-1}, a, c_{i,n+1}, ...}. For a representation-learning model w, let H_w(C) be the hidden states it produces, and let X_i be an example in C. The probability of the token at position i is then P(y_i | X_i).
Let the dataset D be a subset of C; we can then approximate the model's probability over D. CCCR is composed of two parts, MLM and Homonym. The former identifies the samples that need information from the misspelled character to be corrected, while the latter identifies the samples for which the model adjusts its output distribution. We take the intersection to obtain the frequency with which the model adjusts its distribution on exactly those samples whose distribution should be adjusted.
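Under this reading of the definition, CCCR is the fraction of MLM samples that also fall in Homonym. A sketch over sample-ID sets (representing the two groups as sets is our assumption):

```python
def cccr(mlm, homonym):
    """CCCR as read above: among samples that need misspelled-character
    information to be corrected (`mlm`), the fraction on which the model
    actually adjusts its output distribution (`homonym`)."""
    return len(mlm & homonym) / len(mlm) if mlm else 0.0

print(cccr({1, 2, 3, 4}, {2, 3, 5}))  # 0.5
```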
Homonym. Analogously to MLM, for an input sentence C_i ∈ D with C_i = {c_1, c_2, ..., c_misspelled, ..., c_T}, where the position of c_misspelled is a spelling error, C_i ∈ Homonym if it satisfies the adjustment condition illustrated in Figure 4. Correction with Misspelled Character Coverage Ratio (CCCR). The measured ratio describes a lower bound on the probability that the model uses the information of the misspelled characters for sentences C_i in the dataset D.
Baseline. Independently, we give an estimation method for the base value. Given a model w, noise, a dataset D, and ground-truth corrections y, the baseline of CCCR is estimated as follows. The baseline can be understood as the probability that a model with no glyph-phonetic information at all guesses the correct answer. No such language model exists, however, so instead of feeding the misspelled characters into the model, we artificially have the model guess answers at random, weighted over the remaining candidates, which is equivalent to the probability of guessing correctly.
This probability is comparable to CCCR. CCCR restricts the condition for y to overtake the noise; in the baseline case, considering a rearrangement of the candidates, the probability of y overtaking the noise can likewise be re-normalized.

Isolation Correction Setting Experiment
In the previous section, we tested CCCR on models fine-tuned on the SIGHAN dataset and found that their CCCR approaches 92%. The results are shown in Table 3. As shown in Table 4, we analyze the overlap of correction pairs between the training and test sets of the SIGHAN dataset.
To test model generalization, we design the Isolation Correction Task, which removes all overlapping correction pairs from the training set and duplicate pairs from the test set. With isolation, the training set shrinks by about 16%. We believe such a setup better tests the generalizability of a model and is both more challenging and more practical. Within the CCCR probe, we explore whether the model relies on glyph-phonetic information rather than merely memorizing content, using the isolated SIGHAN dataset. Between CCCR and F1 score we observe a mismatch, a phenomenon we refer to as stereotype: correction pairs memorized during training harm the generalization of models.
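The isolation procedure can be sketched as follows; the (misspelled, correct, sentence) triple format is assumed for illustration:

```python
def isolate(train, test):
    """Isolation Correction setting: remove from the training set every
    example whose (misspelled, correct) pair also occurs in the test
    set, and de-duplicate pairs within the test set. Examples are
    (misspelled_char, correct_char, sentence) triples (format assumed)."""
    test_pairs = {(m, c) for m, c, _ in test}
    iso_train = [ex for ex in train if (ex[0], ex[1]) not in test_pairs]
    seen, iso_test = set(), []
    for m, c, s in test:
        if (m, c) not in seen:
            seen.add((m, c))
            iso_test.append((m, c, s))
    return iso_train, iso_test

train = [("门", "们", "s1"), ("侍", "等", "s2")]
test = [("门", "们", "t1"), ("门", "们", "t2"), ("她", "他", "t3")]
iso_train, iso_test = isolate(train, test)
print(len(iso_train), len(iso_test))  # 1 2
```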

Results and Analysis
We conduct experiments on three generic Chinese PLMs (BERT, RoBERTa, and ChineseBERT) and two CSC models (PLOME and ReaLiSe). We compare the difference in metrics between the initial model and the model fine-tuned on the isolation training set; the results are shown in Table 2. CCCR and F1 values mismatch. Our experimental results show that CCCR and F1 values mismatch for CSC models. In the isolation training setting, the F1 values of PLOME and ReaLiSe are both significantly lower than their performance in Table 3, indicating that their ability to make correct predictions rests primarily on memorizing correction pairs from the training set. Their CCCR values remain high, however, suggesting that they can discriminate glyph-phonetic information but cannot exploit it to correct effectively.

Stereotype harms the generalization ability of the model in isolation correction experiments
These results suggest that the correction performance of current models depends primarily on memorization, and that a strong reliance on memory hinders generalization. The poor performance in the isolation setting indicates that none of the current methods generalize well, which presents a significant challenge for future CSC research. We recommend that future work in this field adopt the isolation experiment setting to address this challenge.

Conclusion
In this paper, we have explored the role of glyph-phonetic information from misspelled characters in Chinese Spell Checking (CSC). Based on our experimental results, we reach the following conclusions:
• Current Chinese PLMs encode some glyph information but little phonetic information.
• Existing CSC models do not fully utilize the glyph-phonetic information of misspelled characters to make predictions.
• There is a large overlap between the training and test sets of the SIGHAN dataset, which is not conducive to testing the generalizability of CSC models. We propose a more challenging and practical setting to test generalizability on the CSC task.
Our detailed observations can provide valuable insights for future research in this field. A more explicit treatment of glyph-phonetic information is clearly necessary, and researchers should consider how to fully utilize it to improve the generalizability of their CSC models. We welcome follow-up researchers to verify the generalizability of their models using our proposed new setting.

Limited number of CSC models tested
During our research, we encountered difficulties reproducing previous models, due to unmaintained open-source projects or an inability to reproduce the results claimed in the papers. As a result, we were unable to test all of the available models.

Limited datasets for evaluating model performance
Few datasets are currently available for the CSC task, and the mainstream SIGHAN dataset is relatively small. The limited size of the data used to compute the metrics may not accurately reflect model performance. Furthermore, we found that the quality of the test set is poor, its domain coverage is narrow, and there is a large gap between the test set and real-world scenarios.
to enhance the ability to model the Chinese corpus. We consider the base model. Model card: 'junnyu/ChineseBERT-base', under the Joint Laboratory of HIT and iFLYTEK Research.
MacBERT (Cui et al., 2020) suggests that the [MASK] token should not be used for masking; similar words should be used instead, because [MASK] rarely appears in the fine-tuning phase. We again consider the base model. Model card: 'hfl/chinese-macbert-base', under the Joint Laboratory of HIT and iFLYTEK Research.
CPT (Shao et al., 2021) proposes a pre-trained model that accounts for both understanding and generation. Its single-input, multiple-output structure allows CPT to be used flexibly, separately or in combination, for different downstream tasks, fully exploiting the model's potential. We consider the base model. Model card: 'fnlp/cpt-base', under Fudan NLP.
BART-Chinese (Lewis et al., 2019; Shao et al., 2021) proposes a pre-training model that combines bidirectional and autoregressive approaches. BART first corrupts the original text with arbitrary noise and then learns to reconstruct the original text. In this way, BART not only handles text generation well but also performs well on comprehension tasks. We consider the base model. Model card: 'fnlp/bart-base-chinese', under Fudan NLP.
T5-Chinese (Raffel et al., 2020; Zhao et al., 2019) leverages a unified text-to-text format that treats various NLP tasks as text-to-text tasks, i.e., tasks with text as input and text as output, and attains state-of-the-art results on a wide variety of NLP tasks. We consider the base model. Model card: 'uer/t5-base-chinese-cluecorpussmall', under UER.

C.2 The Statistics of Probe Dataset
We remove some rare characters for two reasons. First, these characters are rarely encountered as misspellings in the CSC task. Second, they appear infrequently in the PLMs' training corpora, which we believe would make it excessively challenging for the PLMs to learn them effectively. The statistics are shown in Table 7 and Table 8. The results of the pre-trained models finally concentrate in the interval 0.75-0.76; Chinese pre-trained models of the BERT family are slightly less effective when the number of MLP layers is small, and become similar to the other Chinese pre-trained models with more than three layers.

Figure 1: An example of different errors affecting CSC results. Red/green/blue represent the misspelled character, the expected correction, and the unexpected correction, respectively.

Figure 2: Examples of the input and label in the Experiment-I MLP probe. We highlight the two characters in red/blue.
Figure 4: Taking BERT as an example, the first half shows examples of MLM and Homonym respectively; the bottom half shows the change in the probability distribution predicted by the model in this example.

Figure 5: Results of the CCCR probe. We observe that CCCR and F1 values mismatch. For the pre-trained CSC models, we observe a phenomenon we call stereotype: they maintain a high CCCR under the isolation setting while performing worse on the F1 score, implying that stereotypes formed during pre-training weaken the generalization of the model.

Figure 6: Results for each model with 1-5 layers of MLP.

Table 1: Results of the probe for models trained on the CSC task. We find that training on a spell-checking dataset does not enhance the models' glyph perception capability.

Table 2: Model performance in the isolation correction setting of SIGHAN15. '-Initial' means without any training.

Table 3: Model performance on the original version of SIGHAN15, after fine-tuning. The CCCR of models fine-tuned on the CSC dataset is very high; we found this is caused by overlapping pairs in the training data.

Table 4: The overlap of correction pairs between the train and test sets, and the statistics of the isolated SIGHAN set.

C.3 Probing Results from Models with Different Numbers of MLP Layers
From the experimental results, it can be seen that the number of MLP layers has little effect on the results.

Table 7: The statistics of the dataset for the glyph probe.

Table 8: The statistics of the dataset for the phonetic probe.