Exploration and Exploitation: Two Ways to Improve Chinese Spelling Correction Models

Sequence-to-sequence learning with neural networks has empirically proven to be an effective framework for Chinese Spelling Correction (CSC), which takes a sentence with some spelling errors as input and outputs the corrected one. However, CSC models may fail to correct spelling errors covered by the confusion sets, and will also encounter unseen ones. We propose a method that continually identifies the weak spots of a model to generate more valuable training instances, and we apply a task-specific pre-training strategy to enhance the model. The generated adversarial examples are gradually added to the training set. Experimental results show that such an adversarial training method, combined with the pre-training strategy, can improve both the generalization and robustness of multiple CSC models across three different datasets, achieving state-of-the-art performance on the CSC task.


Introduction
Chinese Spelling Correction (CSC) aims to detect and correct spelling mistakes in Chinese texts. Many Chinese characters are visually or phonologically similar, while their semantic meanings may differ greatly. Spelling errors are usually caused by careless writing, automatic speech recognition, and optical character recognition systems. The CSC task has received steady attention over the past two decades (Chang, 1995; Xin et al., 2014; Wang et al., 2018; Hong et al., 2019). Unlike English, Chinese texts are written without whitespace to delimit words, and it is hard to identify whether and which characters are misspelled without the information of word boundaries. The context should be taken into account to reconstruct the word boundaries when correcting spelling mistakes, which makes CSC a long-standing challenge for the Chinese NLP community.

† These authors contributed equally to this work.
Many early CSC systems follow the same recipe with minor variations, adopting a three-step strategy: detect the positions of spelling errors; generate candidate characters for these positions; and select the most appropriate candidate to replace the misspelling (Yeh et al., 2013; Yu and Li, 2014; Zhang et al., 2015; Wang et al., 2019). Recently, a sequence-to-sequence (seq2seq) learning framework with neural networks has empirically proven to be effective for CSC, transforming a sentence with errors into the corrected one (Zhang et al., 2020; Cheng et al., 2020b).
However, even though training a CSC model with the seq2seq framework normally requires a huge amount of high-quality training data, it is still unreasonable to assume that all possible spelling errors have been covered by the confusion sets (i.e., sets of characters paired with visually or phonologically similar characters with which they can potentially be confused) extracted from the training samples. New spelling errors occur every day. A good CSC model should be able to exploit what it has already seen in the training instances in order to achieve reasonable performance on easy spelling mistakes, but it should also explore in order to generalize well to possible unseen misspellings.
In this study, we would like to pursue both the exploration (unknown misspellings) and exploitation (the spelling errors covered by the confusion sets) when training the CSC models. To encourage a model to explore unknown cases, we propose a character substitution-based method to pre-train the model. The training data generator chooses about 25% of the character positions at random for prediction. If a character is chosen, we replace it with the character randomly selected from its confusion set (90% of the time) or a random character (10% of the time). Then, the model is asked to predict the original character.
Because of the combination of spelling errors and the various contexts in which they occur, even though the confusion sets are given and fixed, models may still fail to correct characters that are replaced by some character from their confusion sets. To better exploit what the models have experienced during the training phase, we generate more valuable training data via adversarial attack (i.e., tricking models into making false predictions by adding imperceptible perturbations to the input (Szegedy et al., 2014)), targeting the weak spots of the models, which improves both the quality of the training data used to fine-tune the CSC models and their robustness against adversarial attacks. Inspired by adversarial attack and defense in NLP (Jia and Liang, 2017; Zhao et al., 2018; Cheng et al., 2020a; Wang and Zheng, 2020), we propose a simple but efficient method for adversarial example generation: we first identify the most vulnerable characters, those with the lowest generation probabilities estimated by a pre-trained model, and replace them with characters from their confusion sets to create the adversarial examples.
Once the adversarial examples are obtained, they can be merged with the original clean data to train the CSC models. The examples generated by our method are more valuable than those already existing in the training set because they are generated to target the weak spots of the current models. Through extensive experimentation, we show that such adversarial examples can improve both the generalization and robustness of CSC models. If a model pre-trained with our proposed character substitution-based method is further fine-tuned by adversarial training, its robustness can be improved by about 3.9% without suffering too much loss (less than 1.1%) on the clean data.

Problem Definition
Chinese Spelling Correction aims to identify incorrectly used characters in Chinese texts and give their correct versions. Given an input Chinese sentence X = {x_1, ..., x_n} consisting of n characters, which may contain some spelling errors, the model takes X as input and outputs a sentence Y = {y_1, ..., y_n} in which all the incorrect characters are expected to be corrected. This task can be formulated as a conditional generation problem by modeling and maximizing the conditional probability P(Y|X).
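Because the BERT-based models used below predict every output character in parallel from the full input, this conditional probability factorizes over positions. The factorization below is a standard non-autoregressive formulation, made explicit here rather than stated in the original:

```latex
P(Y \mid X) = \prod_{i=1}^{n} P\left(y_i \mid X\right)
```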

Base Models
We use vanilla BERT (Devlin et al., 2019) and two recently proposed BERT-based models (Cheng et al., 2020b; Zhang et al., 2020) as our base models. When applying BERT to the CSC task, the input is a sentence with spelling errors, and the output representations are fed into an output layer to predict the target tokens. We tie the input and output embedding layers, and all the parameters are fine-tuned using task-specific corpora. Soft-Masked BERT (Zhang et al., 2020) uses a Bi-GRU network to detect errors and applies a BERT-based network to correct them. SpellGCN (Cheng et al., 2020b) captures visual and phonological similarity knowledge through a specialized graph convolutional network, whose final output replaces the parameters of BERT's output layer.
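The weight tying mentioned above can be illustrated with a minimal NumPy sketch (not the authors' implementation): a single embedding matrix E both embeds the input characters and, transposed, serves as the output projection, so every position emits a distribution over the full character vocabulary.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, hidden, seq = 50, 8, 6

# One matrix E serves as both the input embedding and the (tied) output layer.
E = rng.normal(size=(vocab, hidden))
W = rng.normal(size=(hidden, hidden))   # stand-in for the BERT encoder

ids = rng.integers(0, vocab, size=seq)  # input characters (possibly misspelled)
h = np.tanh(E[ids] @ W)                 # contextual representations (seq, hidden)
logits = h @ E.T                        # tied output projection     (seq, vocab)
pred = logits.argmax(axis=-1)           # predicted target character per position
```

Tying keeps the output layer's parameter count at zero beyond the embedding itself, which is the usual motivation for this design.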
These models achieved state-of-the-art or near state-of-the-art performance on the CSC task. However, we found that their performance and robustness can be further improved through pre-training and adversarial training, which help the models explore unseen spelling errors and exploit their own weak spots.

Pre-training Method
We collected unlabeled sentences from Wikipedia and Weibo corpora (Shang et al., 2015), covering both formal and informal Chinese texts. Training example pairs are generated by substituting characters in clean sentences, and models are trained to predict the original characters. According to Chen et al. (2011), a sentence contains no more than two spelling errors on average, so we select and replace 25% of the characters in a sentence. Each chosen Chinese character is substituted with a character randomly selected from its confusion set (90% of the time) or a random Chinese character (10% of the time). The latter helps models explore unknown misspellings not covered by the confusion sets.
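The generation procedure above can be sketched as follows; `corrupt`, its arguments, and the toy confusion set are illustrative names rather than the authors' code, and characters without a confusion-set entry fall back to a random substitute here.

```python
import random

def corrupt(sentence, confusion, vocab, rate=0.25, seed=None):
    """Replace ~`rate` of the characters: 90% of the time with a
    confusion-set character, otherwise with a random character.
    Returns a (noisy, clean) training pair."""
    rng = random.Random(seed)
    chars = list(sentence)
    for i, c in enumerate(chars):
        if rng.random() >= rate:                  # keep ~75% of positions untouched
            continue
        if c in confusion and rng.random() < 0.9:
            chars[i] = rng.choice(confusion[c])   # visually/phonologically similar
        else:
            chars[i] = rng.choice(vocab)          # random character
    return "".join(chars), sentence

confusion = {"他": ["她", "它"]}
vocab = ["我", "你", "好", "天"]
noisy, clean = corrupt("他说今天天气好", confusion, vocab, seed=1)
```

Each (noisy, clean) pair then serves as an (input, target) example for the character-prediction pre-training objective.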

Adversarial Example Generation and Adversarial Training
To efficiently identify and alleviate the weak spots of trained CSC models, we designed an adversarial attack algorithm for CSC tasks, which replaces the tokens in a sentence with spelling mistakes.
The adversarial example generation algorithm in this paper can be divided into two main steps: (1) determine the vulnerable tokens to change, and (2) replace them with the spelling mistakes that are most likely to occur in their contexts (Algorithm 1).
For the i-th position of input sentence X, the positional score s_i can be obtained from the logit output o_i as follows:

s_i = o_i^{y_i} − max_{r ≠ y_i} o_i^{r}

where o_i^{r} denotes the logit output of character r at the i-th position, and y_i denotes the i-th character of the ground-truth sentence Y. The lower the positional score, the less confident the model is in its prediction at that position, so attacking this position is more likely to change the model output. Once the positional score of each character in the input sentence is calculated, we sort the positions in ascending order of positional score. This ordering reduces the number of substitutions and preserves the original semantics as much as possible.
Once a vulnerable position is determined, the token at that position is replaced with one of its phonologically or visually similar characters. The confusion set D maps each character to a set of visually or phonologically similar characters. In order to fool the target CSC model while maintaining the context, the character with the highest logit output in the confusion set is used as the replacement.
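Both steps, scoring positions and picking the confusion-set substitute, can be sketched over the model's logits. In this toy version, characters are integer IDs, a `budget` cap stands in for the λ-controlled substitution limit, and the helper names are hypothetical.

```python
import numpy as np

def positional_scores(logits, y):
    """s_i: gold-character logit minus the best competing logit
    (lower score = less confident = more vulnerable position)."""
    return np.array([logits[i, yi] - np.delete(logits[i], yi).max()
                     for i, yi in enumerate(y)])

def attack(x, y, logits, confusion, budget=1):
    """Replace the most vulnerable positions with the confusion-set
    character the model assigns the highest logit."""
    order = np.argsort(positional_scores(logits, y))  # ascending: weakest first
    adv, changed = list(x), 0
    for i in order:
        cands = confusion.get(y[i], [])
        if not cands or changed >= budget:
            continue
        adv[i] = max(cands, key=lambda c: logits[i, c])  # highest-logit substitute
        changed += 1
    return adv

# Toy example with a 3-character vocabulary: position 1 has the
# smallest margin (2.0 - 1.9), so it is attacked first.
logits = np.array([[5.0, 1.0, 0.0],
                   [2.0, 1.9, 1.8]])
adv = attack([0, 0], [0, 0], logits, {0: [1, 2]}, budget=1)
```

A full implementation would additionally re-run the model after each substitution and stop as soon as the output changes, as Algorithm 1 does.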
Given a sentence in the training set, its adversarial examples are generated by substituting a few characters based on the algorithm described above. Adversarial training was conducted with these examples, improving the robustness of CSC models by alleviating their weak spots and exploiting knowledge about easy spelling mistakes from the confusion sets to help the models generalize better.
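The resulting training set mixes clean and adversarial pairs. A minimal sketch, assuming the 2:1 clean-to-adversarial ratio reported in the experiments (the function names are illustrative):

```python
def make_training_set(clean_pairs, attack_fn):
    """Turn one third of the clean (input, target) pairs into adversarial
    pairs and merge them with the remaining clean data, giving a 2:1
    clean-to-adversarial ratio."""
    split = len(clean_pairs) // 3
    adv_pairs = [(attack_fn(x, y), y) for x, y in clean_pairs[:split]]
    return clean_pairs[split:] + adv_pairs

# Toy usage with an identity "attack" standing in for Algorithm 1.
pairs = [("他说你好", "她说你好")] * 6
augmented = make_training_set(pairs, lambda x, y: x)
```

Because the attack depends on the current model's weak spots, this construction can be repeated as training progresses, gradually adding fresh adversarial examples.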

Datasets
Statistics of the datasets used are shown in Table 1.
Pretraining data We generated a large corpus with the character substitution-based method described above. Models were first pre-trained on these nine million sentence pairs and then fine-tuned using the training data mentioned below.

Algorithm 1 Adversarial Attack Algorithm
Input: X = {x_1, x_2, ..., x_n}, input Chinese sentence; Y = {y_1, y_2, ..., y_n}, the corresponding ground truth; λ, proportion of characters that can be changed; f, a target CSC model; D, a confusion set created based on visually or phonologically similar characters;
Output: X̂ = {x̂_1, x̂_2, ..., x̂_n}, adversarial example;
1: X̂ ← X
2: if f(X) ≠ Y then
3:   return X̂   ▷ the model already fails on X
4: end if
5: compute the positional score s_i for each position i
6: k ← ⌊λn⌋
7: {p_1, ..., p_k} ← sort the positions p_i in ascending order of s_{p_i} (1 ≤ p_i ≤ n and y_{p_i} is a Chinese character)
8: for each i ∈ [1, k] do
9:   if x̂_{p_i} ≠ y_{p_i} then
10:    continue   ▷ the position already contains an error
11:  x̂_{p_i} ← the character in D(y_{p_i}) with the highest logit output
12:  if f(X̂) ≠ Y then return X̂
13: end for
14: return X̂

Test data Models' performance in the detection and correction stages was evaluated at the sentence level on three benchmark datasets, using detection and correction F1 scores. Characters in these datasets were converted into simplified Chinese characters using OpenCC 2. We revised the processed datasets because one simplified Chinese character may correspond to multiple traditional Chinese characters.

Table 2: Performance of three models trained with the proposed pretraining strategy and adversarial training method. "CLEAN" stands for the testing results on the clean data, and "ATTACK" denotes the F1 scores under test-time attacks. "DET" and "COR" denote the F1 scores of detection and correction. The F1 scores were increased by 4.1% on average by our pre-training method across the various models on the different datasets. The models' robustness was also improved by about 3.9% without suffering too much loss (less than 1.1%) on the clean data.

Models and Hyper-parameter Settings
For BERT and Soft-Masked BERT, we used the BERT model pre-trained on Chinese text provided by transformers 3 and fine-tuned it. The Adam optimizer was used with a learning rate of 2e-5, except for adversarial training on the SIGHAN 13 dataset, where it was 1e-5. We followed Zhang et al. (2020) to set our hyper-parameters. The size of the hidden state of the Bi-GRU in Soft-Masked BERT was 256. Similarly, we followed the hyper-parameter settings of SpellGCN (Cheng et al., 2020b) except for the batch size, which was reduced to eight due to GPU memory limits. The BERT model used in SpellGCN was provided by the repository of BERT 4.
We conducted adversarial training on base models obtained through pre-training and fine-tuning. The threshold λ was tuned on the validation set for each dataset. The number of sentence pairs used directly for training was twice the number used to generate adversarial examples.

Results and Analysis
As shown in Table 2, through pre-training particularly designed for CSC, the models achieve better results on three benchmark datasets. The average improvement in correction F1 score was 4.3% over the base CSC models, which shows that our pre-training method contributes significantly to improving the models. Notably, BERT achieves state-of-the-art results on the three datasets through our method.

Figure 1 shows the trade-off between generalization and robustness during adversarial training. As the threshold increases, the robustness of BERT also increases, with a slight performance decrease on the clean dataset (less than 0.7%).
Experiments under adversarial attacks were conducted with the base, pre-trained, and adversarially trained models (λ = 0.02). We found that CSC models are vulnerable to adversarial examples, as expected. The average drop in F1 score across the three base models was 51.6%. Under the attacks, the F1 scores of the adversarially trained models decreased less (44.1%), which indicates that adversarial training can substantially improve the robustness of CSC models. Compared with the other models, BERT is more robust against adversarial attacks (-41.2%). The more serious robustness issues of the other models may be related to the modules added on top of BERT, which increase the number of parameters and make the models more likely to overfit on the CSC datasets.

Conclusion
In this paper, we have described a character substitution-based method that creates large-scale pseudo data to pre-train the models, encouraging them to explore unseen misspellings. We also proposed a data augmentation method for training CSC models by continually adding adversarial examples, generated specifically to alleviate the weak spots of the current model, to the training set. With the proposed pre-training strategy and adversarial training method, we can pursue both exploration and exploitation when training CSC models. Experimental results demonstrate that CSC models trained with data augmented by these pseudo data and adversarial examples can be substantially improved in both generalization and robustness.