Global Attention Decoder for Chinese Spelling Error Correction

Recent progress has been made in using the BERT framework for Chinese spelling error correction (CSC). However, most existing methods correct characters based only on local contextual information, without considering the influence of erroneous characters elsewhere in the sentence. Imposing attention on erroneous contextual information can mislead the model and degrade the overall performance of CSC. To address this issue, we propose a Global Attention Decoder (GAD) approach for CSC. Specifically, the proposed method learns the global relationships between the potentially correct input characters and the candidates of potentially erroneous characters. Rich global contextual information is thereby obtained to alleviate the impact of local erroneous contextual information. In addition, a BERT with Confusion set guided Replacement Strategy (BERT CRS) is designed to narrow the gap between BERT and CSC. The candidates generated by BERT CRS cover the correct character with more than 99.9% probability. To demonstrate the effectiveness of our proposed framework, we test our method on three human-annotated datasets. The experimental results show that our approach outperforms all competitor models by a large margin of up to 6.2%, achieving state-of-the-art performance on all datasets.


Introduction
Spelling error correction plays an important role in the NLP domain. A good spelling error correction system is key to improving the performance of upper-layer applications. Spelling error correction aims to detect and correct erroneous characters/words. These spelling errors mainly come from human writing, speech recognition, and optical character recognition (OCR) (Afli et al., 2016) systems. In Chinese, errors usually stem from the phonological, visual, or semantic similarity of characters/words. According to (Cheng et al., 2020), about 83% and 48% of errors are related to phonological and visual similarity respectively. Although much research has made great progress, Chinese spelling error correction (CSC) still remains a challenging task. Moreover, because Chinese is composed of pictographic characters without word delimiters, methods developed for languages like English can hardly be applied to Chinese. In addition, the meaning of the same character may change greatly across contexts.

Input: 餐厅的换经费产适合约会 (The restaurant's swap property is suitable for dates)
BERT CRS: 餐厅的环经非常适合约会 (The restaurant's ring is perfect for dates)
+GAD: 餐厅的环境非常适合约会 (The restaurant environment is perfect for dates)

Table 1: A sample from SIGHAN 2014, with the incorrect and correct characters marked in red and green respectively. Since "经" is highly related to "费" in its context, BERT CRS has difficulty correcting it. The GAD method learns the global relationship between "环" and "境" in the candidates of the erroneous input characters "换" and "经" respectively (see Fig. 1). Rich global contextual information is learned to alleviate the impact of the local noisy contextual information.
Many methods have been proposed for the CSC task, which mainly fall into two categories: 1) methods based on language models (Yeh et al., 2013; Yu and Li, 2014; Xie et al., 2015); 2) methods based on seq2seq models (Wang et al., 2018, 2019). In particular, with the emergence of the pre-trained BERT model, many methods (Hong et al., 2019; Zhang et al., 2020; Cheng et al., 2020) have been proposed and have made great progress. Almost all of these methods leverage a confusion set, which contains groups of phonologically or visually similar characters. Specifically, (Yu and Li, 2014) proposed to generate candidates based on the confusion set and select the candidate with the highest language model probability. (Cheng et al., 2020) introduced a graph convolutional network that captures similarity and prior dependencies among characters using the confusion set. (Wang et al., 2019) proposed a pointer network to generate a character from the confusion set. Previous methods predict each character or word based on its local context, which may contain noisy information (other errors). So far, no method has been proposed to alleviate the impact of this noisy information.

[Figure 1: The framework of our proposed global attention decoder method. Error words and their detection probabilities are marked in red, e.g. "换经费产" and the corresponding error detection probabilities at the bottom right. Attention weights are represented by color shade on the right.]
In this paper, we first introduce a BERT with confusion set guided replacement strategy (BERT CRS), which narrows the gap between BERT and the CSC task. Then, we propose a novel global attention decoder (GAD) built on our BERT CRS model (see Fig. 1), which learns rich global contextual representations to alleviate the influence of erroneous contextual information during correction. Specifically, to reduce the impact of local erroneous context, we introduce as additional inputs the candidates of potentially erroneous characters and the hidden states generated by BERT CRS. Next, the global attention component learns the relationships among the candidates to obtain global hidden states and latent global attention weights over the candidates. Then, a weighted sum over the candidates of each character yields a rich global contextual hidden state. Finally, a fully-connected layer generates the correct characters. As shown in Table 1, our proposed method corrects all the spelling errors. It is worthwhile to highlight the following aspects of the proposed approach:

• To narrow the gap between BERT and CSC, we introduce a BERT with confusion set guided replacement strategy, which contains a decision network and a fully-connected layer to simulate the detection and correction subtasks of CSC respectively.
• We propose a global attention decoder model, which learns the global relationships between the potentially correct input characters and the candidates of potentially erroneous characters. Rich global contextual information is learned to effectively alleviate the influence of local erroneous contextual information.
• Experiments on three benchmark datasets demonstrate that our method outperforms the state-of-the-art methods by a large margin of up to 6.2%.

Related Work
There is vast prior research on the Chinese spelling error correction (CSC) task. Below, we discuss the algorithms of different periods.
N-gram period. Early research in CSC followed a pipeline of error detection, candidate generation, and candidate selection. Almost all proposed methods (Yeh et al., 2013; Yu and Li, 2014; Xie et al., 2015; Tseng et al., 2015) employed an unsupervised n-gram language model to detect errors. Next, a confusion set, i.e. external knowledge of the similarity between characters, was introduced to confine the candidates. Finally, the candidate with the highest n-gram language model probability was chosen as the correction. Specifically, (Yeh et al., 2013) proposed an inverted-index-based n-gram method to map a potentially misspelled character to its corresponding characters. (Xie et al., 2015) utilized the confusion set to replace characters and then evaluated the modified sentence via a joint bi-gram and tri-gram language model. In (Jia et al., 2013; Xin et al., 2014), a graph model is used to represent the sentence, and a single source shortest path (SSSP) algorithm is performed on the graph to correct spelling errors. Other work viewed CSC as a sequential labeling problem and employed conditional random fields or hidden Markov models (Tseng et al., 2015; Wang et al., 2018).
Deep learning period. With the development of deep learning methods (Vaswani et al., 2017; Zhang et al., 2020; Hong et al., 2019; Wang et al., 2019; Song et al., 2017; Guo et al., 2016), great progress has been made across NLP tasks. BERT (Devlin et al., 2018), XLNet (Yang et al., 2019), RoBERTa, and ALBERT (Lan et al., 2019) achieve superior performance on almost all NLP tasks. The confusion set remains an important component in recent CSC research, but with further upgrades. Specifically, in (Hong et al., 2019), a pre-trained masked language model is employed as the encoder, and a confidence-similarity decoder uses similarity scores to select candidates instead of the confusion set. (Cheng et al., 2020) proposed a specialized graph convolutional network to incorporate phonological and visual similarity knowledge into the BERT model. In (Zhang et al., 2020), a GRU-based detection network is introduced and connected to a BERT-based correction network by a soft-masking technique. Other work (Wang et al., 2019) employed a Seq2Seq model with a copy mechanism, which generates a new sentence considering extra candidates from the confusion set.

The Proposed Approach
In this section, we first elaborate the problem formulation. Then, we briefly describe how our BERT CRS model narrows the gap between BERT (Devlin et al., 2018) and Chinese spelling error correction (CSC). Finally, we introduce our novel global attention decoder (GAD) framework.

Problem Formulation
CSC aims to detect and correct errors in Chinese text.
Given a sequence X = {x_1, x_2, ..., x_n}, where n denotes the number of characters, our BERT CRS model encodes it into a continuous representation space V = {v_1, v_2, ..., v_n}, where v_i \in R^d is the d-dimensional contextual feature of the i-th character. A decision network \Phi_d models V to fit a sequence Z = {z_1, z_2, ..., z_n}, where z_i denotes the detection label of the i-th character: z_i = 1 means the character is incorrect and z_i = 0 means it is correct. A fully-connected layer on top of BERT CRS serves as the correction network \Phi_c, which models V to fit a sequence Y = {y_1, y_2, ..., y_n}, where y_i is the correction label of the i-th character. Instead of a simple fully-connected layer as the decoder, our GAD additionally models the candidates c = {c_1, c_2, ..., c_n} to alleviate the impact of local erroneous contextual information, where c represents the potentially correct input characters and the candidates of potentially erroneous characters:

c_i = \begin{cases} \text{top-}k(\Phi_c(v_i)), & \text{if } P(z_i = 1) > t \\ \{x_i\}, & \text{otherwise} \end{cases} \quad (1)

where k is the number of candidates and t is the error-probability threshold for characters.
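To make Eq. (1) concrete, the following is a minimal sketch of the candidate construction under the stated assumptions; the tensor names (det_probs, cor_logits, input_ids) are illustrative, not from the paper.

```python
import torch

def build_candidates(det_probs, cor_logits, input_ids, k=4, t=0.25):
    """Sketch of Eq. (1): det_probs holds P(z_i = 1) from the decision
    network, cor_logits the vocabulary logits from the correction
    network Phi_c, input_ids the input characters."""
    topk_ids = cor_logits.topk(k, dim=-1).indices     # (n, k) candidate ids
    keep = det_probs <= t                             # potentially correct positions
    # for potentially correct characters, keep the input id (repeated k
    # times so the candidate tensor stays rectangular)
    kept = input_ids.unsqueeze(-1).expand(-1, k)      # (n, k)
    return torch.where(keep.unsqueeze(-1), kept, topk_ids)
```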

BERT CRS approach for CSC
In this section, we take advantage of previous models (Devlin et al., 2018; Cui et al., 2020) and introduce a replacement strategy using the confusion set that narrows the gap between BERT and the CSC model. We call this model BERT CRS (BERT with Confusion set guided Replacement Strategy). Compared with the original BERT training tasks, BERT CRS makes several modifications.
• We drop the NSP task and adopt a decision network for detecting erroneous information, which is similar to the detection sub-task of CSC.
• Like MacBERT (Cui et al., 2020), instead of masking with the [MASK] token, we introduce a confusion set guided replacement strategy, replacing characters with phonologically or visually similar ones for masking purposes. In the rare case that a character has no confusion character, we keep the [MASK] token. This strategy is similar to the correction sub-task of CSC.

• We select 23% of the input characters for masking. To keep the detection targets balanced (0 for unreplaced, 1 for replaced), we set probabilities of 35%, 30%, 30%, and 5% for keeping the original character, replacing with a confusion character, masking with the [MASK] token, and replacing with a random character, respectively. Overall, the resulting replacing and masking probabilities are approximately the same as the masking probabilities of BERT (a minimal sketch of the corruption procedure is given after this list).
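The sketch below illustrates the corruption procedure with the probabilities stated above; the function name and the dictionary-based confusion set are assumptions for illustration.

```python
import random

def corrupt_for_pretraining(tokens, confusion_set, vocab,
                            mask_token="[MASK]", select_rate=0.23):
    """Sketch of the confusion set guided replacement strategy: of the
    23% selected positions, 35% are kept unchanged, 30% replaced by a
    confusion character, 30% masked with [MASK], 5% replaced randomly."""
    corrupted, labels = list(tokens), [0] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() >= select_rate:
            continue                                  # position not selected
        r = random.random()
        if r < 0.35:
            continue                                  # keep original, detection label 0
        labels[i] = 1                                 # detection target: replaced
        if r < 0.65 and confusion_set.get(tok):       # confusion-set replacement
            corrupted[i] = random.choice(confusion_set[tok])
        elif r < 0.95:                                # mask with [MASK] (also the
            corrupted[i] = mask_token                 # fallback when no confusion
        else:                                         # character exists)
            corrupted[i] = random.choice(vocab)       # random character
    return corrupted, labels
```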
With a model trained by the confusion set guided replacement strategy, the top-k candidate characters come almost entirely from the confusion set. This prepares the input for our GAD model.
Learning. Similar to RoBERTa, the confusion set guided replacement strategy is applied dynamically during training. Error detection and correction are optimized simultaneously in the learning process:

L_d = -\sum_{i=1}^{n} \log P(z_i \mid X) \quad (2)

L_c = -\sum_{i=1}^{n} \log P(y_i \mid X) \quad (3)

L = \lambda L_d + (1 - \lambda) L_c \quad (4)

where L_d and L_c are the objectives of the detection and correction losses respectively, L is the overall objective that linearly combines L_d and L_c, and \lambda \in [0, 1] denotes the coefficient of the detection loss L_d. In particular, \lambda = 0 means that the detection loss is not considered.
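A short sketch of this joint objective, assuming the cross-entropy forms of Eqs. (2)-(4):

```python
import torch.nn.functional as F

def bert_crs_loss(det_logits, det_labels, cor_logits, cor_labels, lam=0.1):
    """Combined objective L = lam * L_d + (1 - lam) * L_c; lam = 0.1 is
    the value chosen later in the ablation study."""
    loss_d = F.binary_cross_entropy_with_logits(det_logits, det_labels.float())
    loss_c = F.cross_entropy(cor_logits.view(-1, cor_logits.size(-1)),
                             cor_labels.view(-1))
    return lam * loss_d + (1.0 - lam) * loss_c
```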

Global Attention Decoder
In this paper, we propose a global attention decoder (GAD) model to alleviate the impact of local erroneous contextual information. Our GAD is an extension of the transformer layer (Vaswani et al., 2017), as shown in Fig. 2.
Self-Attention. The self-attention mechanism is part of the transformer layer; it takes the output of the previous transformer layer or the input embedding layer as input and produces hidden states with richer semantic representations, as shown in the left part of Fig. 1. The token representation VA_i^l at the i-th position of the l-th self-attention layer is defined as:

VA_i^l = \sum_{p=1}^{n} a_i^p (V_p^{l-1} W^V) \quad (5)

where a_i^p is the attention weight from the i-th to the p-th token with \sum_{p=1}^{n} a_i^p = 1, V_p^{l-1} is the p-th token representation of the (l-1)-th layer, and W^V is a learnable projection matrix. This strategy effectively encodes rich token- and sentence-level features. However, for CSC, spelling error information is also encoded into the hidden states, and imposing attention on erroneous contextual information can mislead the model and degrade overall performance.
Global Attention. Instead of using only the local input information (see Eq. 5), we consider the potentially correct inputs and the candidates of potentially erroneous characters to learn their latent relationships, which alleviates the influence caused by local erroneous context. Specifically, as shown in Fig. 2, we consider two input sources:

• the contextual representation V, which contains rich semantic information;

• the top-k candidates c generated by the correction network \Phi_c. To reduce confusion during GAD learning, we only generate candidates for the potentially erroneous characters (see Eq. 1).
To model these two sources of information, we first embed the candidates into a continuous representation using the word embedding matrix E from BERT CRS. Then, dense and layer-norm layers are introduced to model V and E(c) into the input state GI. Our global attention then learns the latent relationships between the candidates c. The representation GA_{i,j} of the j-th candidate of the i-th token is defined as:

GA_{i,j} = \sum_{p=1}^{n} \sum_{q=1}^{k} a_{i,j}^{p,q} (GI_{p,q} W_g^V) \quad (6)

where W_g^V is a learnable projection matrix, a_{i,j}^{p,q} is the attention weight from the j-th candidate of the i-th token to the q-th candidate of the p-th token with \sum_{p=1}^{n} \sum_{q=1}^{k} a_{i,j}^{p,q} = 1, and GI_{p,q} denotes the input state of the q-th candidate of the p-th token. A masking strategy is adopted between candidates from the same token. Finally, the global attention state GA_i at the i-th position is defined as:

GA_i = \sum_{j=1}^{k} \beta_{i,j} GA_{i,j} \quad (7)

where \beta_{i,j} is the global attention weight of the j-th candidate of the i-th token, quantifying the global relevance of the feature GA_{i,j}; a_{i,j}^{p,q} and \beta_{i,j} are obtained by normalizing the relevance scores e_{i,j}^{p,q} and e_{i,j} respectively. As in the standard transformer layer, feed-forward and layer normalization sublayers encode GA into the final global continuous representation. Moreover, we adopt the multi-head technique of the transformer layer in our global attention (a single-head sketch is given below).
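Below is a minimal single-head sketch of the global attention component of Eqs. (6)-(7). The dot-product form of the scores e_{i,j}^{p,q}, the single scoring vector w_beta for the \beta weights, and keeping self-attention under the same-token mask are assumptions; the paper does not spell out these parameterizations.

```python
import torch
import torch.nn.functional as F

def global_attention(GI, W_v, w_beta):
    """GI: (n, k, d) input states of k candidates per token; W_v: (d, d)
    value projection; w_beta: (d,) scoring vector for the beta weights."""
    n, k, d = GI.shape
    flat = GI.reshape(n * k, d)                   # every candidate is a query/key
    scores = flat @ flat.t() / d ** 0.5           # unnormalized scores e_{i,j}^{p,q}
    # mask attention between different candidates of the same token
    tok = torch.arange(n).repeat_interleave(k)    # token index of each candidate
    same = tok[:, None] == tok[None, :]
    same.fill_diagonal_(False)                    # keep self-attention (assumption)
    scores = scores.masked_fill(same, float("-inf"))
    a = F.softmax(scores, dim=-1)                 # attention weights a_{i,j}^{p,q}
    GA = (a @ (flat @ W_v)).reshape(n, k, d)      # candidate states GA_{i,j}, Eq. (6)
    beta = F.softmax(GA @ w_beta, dim=1)          # global weights beta_{i,j}
    return (beta.unsqueeze(-1) * GA).sum(dim=1)   # Eq. (7): (n, d) output
```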
Learning. Given the hidden states V and the candidates c generated by our BERT CRS, our GAD model fits the correct sequence Y in the learning process:

L_g = -\sum_{i=1}^{n} \log P(y_i \mid \Phi_g(V, c)) \quad (8)

where \Phi_g is our GAD network and L_g denotes the overall objective of GAD.

Experiments
In this section, we evaluate our algorithm on the task of Chinese spelling error correction (CSC). We first present the training data, test data, and evaluation metrics. Secondly, we present our main results compared with previous state-of-the-art baselines. Then we conduct ablation studies to analyze the effectiveness of the proposed components. Finally, case studies are explored.

Datasets
We consider three publicly available SIGHAN datasets from the 2013, 2014, and 2015 (Tseng et al., 2015) Chinese Spell Check Bake-offs. Following (Cheng et al., 2020), we adopt the standard split of SIGHAN training and test data. We also follow the same data pre-processing, which converts the characters in the datasets from traditional Chinese to simplified Chinese using OpenCC.
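For reference, the conversion step can be done as follows; the exact config name ("t2s" here) depends on the OpenCC binding used, so treat this as a sketch.

```python
import opencc

# traditional Chinese -> simplified Chinese
converter = opencc.OpenCC("t2s")
print(converter.convert("餐廳的環境非常適合約會"))  # 餐厅的环境非常适合约会
```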
As training data, we additionally collect 3 million unlabeled sentences from the news, wiki, and encyclopedia-QA domains to pre-train our BERT CRS model. Following (Wang et al., 2019), we also include an additional 271K samples as labeled training data, which are generated by an automatic method (Wang et al., 2018). The statistics of the data are shown in Table 2.

Baselines
To evaluate the performance of our proposed algorithm, we compare it with the following baseline methods.
• JBT (Xie et al., 2015): This method utilizes the confusion set to replace the characters and then evaluates the modified sentence via a Joint Bi-gram and Tri-gram LM.
• Seq2Seq (Wang et al., 2019): This method introduces a Seq2Seq model with a copy mechanism to consider the extra candidates from the confusion set.
• FASpell (Hong et al., 2019): This model changes the paradigm by utilizing a similarity metric to select candidates instead of a pre-defined confusion set.
• Soft-Masked BERT (Zhang et al., 2020): This method proposes a detection network, which is connected to the error correction model by a soft-masking technique.
• SpellGCN (Cheng et al., 2020): This model incorporates phonological and visual similarity knowledge into language models for CSC via a specialized graph convolutional network.
• BERT (Devlin et al., 2018): BERT with the word embedding matrix on top as the correction decoder for the CSC task.

Implementation Details
Training Details. Our code is based on the Transformers repository (https://github.com/huggingface/transformers). We first pre-train our BERT CRS model on the 3 million unlabeled sentences, starting from the pre-trained whole-word-masking BERT. This procedure runs for 5 epochs with a batch size of 1024, a learning rate of 5e-5, and a maximum sequence length of 512. Then, we fine-tune our BERT CRS model on all labeled training data for 6 epochs with a batch size of 32 and a learning rate of 2e-5. Next, we fix our BERT CRS model and set the number of candidates k to 4 and the error detection probability threshold t to 0.25. Finally, we fine-tune our GAD model for 3 epochs with a batch size of 32 and a learning rate of 5e-5. For the SIGHAN13 dataset, we perform additional fine-tuning for 6 epochs because the data distribution of SIGHAN13 differs from the other datasets, e.g. "的", "得" and "地" are rarely distinguished.
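The schedule above, collected in one place for convenience (the dictionary layout is just illustrative):

```python
# Hyperparameters as reported in the training details above.
TRAINING_STAGES = {
    "pretrain_bert_crs": dict(epochs=5, batch_size=1024, lr=5e-5, max_seq_len=512),
    "finetune_bert_crs": dict(epochs=6, batch_size=32, lr=2e-5),
    "finetune_gad":      dict(epochs=3, batch_size=32, lr=5e-5, k=4, t=0.25),
    "extra_sighan13":    dict(epochs=6),  # additional fine-tuning for SIGHAN13
}
```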
Evaluation Metrics. To evaluate performance, we employ character- and sentence-level accuracy, precision, recall, and F1, following (Cheng et al., 2020), as commonly used in the CSC task. In addition, we use the official evaluation tool, which reports sentence-level false positive rate (FPR), precision, recall, F1, and accuracy.
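As a rough illustration, sentence-level correction precision/recall/F1 are commonly computed as below; this is a sketch of the usual convention, and the official SIGHAN tool may differ in details.

```python
def sentence_level_correction_prf(srcs, preds, golds):
    """srcs/preds/golds are lists of source, predicted, and gold sentences."""
    tp = sum(1 for s, p, g in zip(srcs, preds, golds) if p != s and p == g)
    changed = sum(1 for s, p in zip(srcs, preds) if p != s)   # model made an edit
    with_err = sum(1 for s, g in zip(srcs, golds) if g != s)  # sentence has errors
    prec = tp / changed if changed else 0.0
    rec = tp / with_err if with_err else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1
```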

Main Results
We compare our model with the state-of-the-art methods on the three test datasets; the results are shown in Tab. 3 and Tab. 4, which report character-level and sentence-level results respectively. BERT CRS outperforms almost all methods on the three datasets, and combined with GAD it achieves the best performance. Specifically, with the same amount of labeled training data, our method improves over the previous best results (SpellGCN) by 0.4%, 2.1%, and 0.7% in character-level correction F1 respectively. At the sentence level, our model outperforms SpellGCN by margins of 6.2%, 2.2%, and 0.6% in correction F1 respectively. In addition, although Soft-Masked BERT uses 5 million examples generated by a replacement strategy as extra training data, our method outperforms it by a large margin on the SIGHAN15 test set.
We further compare the official evaluation results of BERT CRS and GAD with BERT and SpellGCN on SIGHAN15, shown in Tab. 6. Our proposed BERT CRS+GAD achieves better performance than SpellGCN by a margin of 0.2% in correction-level F1. In addition, the FPR is 13.1% (BERT CRS+GAD) vs. 13.2% (SpellGCN).

Ablation Studies
In this sub-experiment, we explore the impact of several components, including the coefficient λ and the learning rate lr in BERT CRS, and the number of candidates k in GAD.

The Effect of BERT CRS. Our BERT CRS introduces the confusion set guided replacement strategy into the BERT model. Compared with the BERT model, for the character-level metrics in Table 3, BERT CRS improves correction F1 by margins of 6.6%, 2.5%, and 1.8% respectively; the sentence-level metrics in Table 4 show a similar trend. We also show the effect of the coefficient λ and the learning rate during fine-tuning on all labeled data in Table 5. First, we fix the learning rate at 2e-5 and tune λ ∈ {0.1, 0.5, 1} on SIGHAN15. The best performance is achieved with λ = 0.1. In addition, performance varies considerably with λ: placing more weight on the detection loss yields unsatisfactory performance, which may be caused by the imbalance of detection labels during fine-tuning. In the following experiments, we set λ = 0.1. We also tune the learning rate over {2e-5, 5e-5}; 2e-5 performs better, so we set the learning rate to 2e-5 in our experiments.

The Effect of GAD. When GAD is combined with the BERT CRS model, performance improves under both character- and sentence-level metrics, as shown in Table 3 and Table 4. Specifically, at the sentence level, BERT CRS+GAD outperforms BERT CRS by margins of 0.4%, 0.8%, and 0.6% in correction F1 respectively.
We also study the impact of the candidate number k. Since k is the key parameter that determines the coverage of the correct character among the candidates, it affects the performance of our algorithm. We study the performance with k ∈ {3, 4, 5} on SIGHAN15. As shown in Table 5, more candidates may degrade performance. According to our statistics, there are 161,365 erroneous characters in all test data, of which 106, 75, and 64 are not covered by the candidates for k = 3, 4, and 5 respectively; thus the candidates generated by the BERT CRS model cover the correct character with more than 99.9% probability. Considering the trade-off between candidate coverage and performance, we set k = 4 in our experiments.

Case Study
To further analyze our approach, we show some correction results on the test data (see Table 7). Three categories of spelling errors are selected: 1) continuous character errors; 2) single character errors; 3) no character errors. In the continuous character error instance, "介绍" (introduce) was misspelled as "借少" (borrow less). Due to the influence of the erroneous characters, BERT CRS fails to correct them all, whereas BERT CRS+GAD alleviates the impact of the local erroneous context and corrects all of them. In the single character error instance, "抱" (pick up) was misspelled as "包" (pack); BERT CRS+GAD learns richer global contextual information than BERT CRS and corrects it. In the no-character-error instance, "提议" (suggestion) has the same meaning as "建议" (suggestion), yet BERT CRS miscorrects it. These cases show that our GAD learns rich global contextual information to alleviate the impact of the local erroneous context for CSC.

We also examined some incorrect cases to further analyze our model. For example, for the sentence "希望您帮我素取公平，得到他们适当的赔偿" (I hope you can help me x for justice and get proper compensation from them), where the misspelled word "素取" (x) is not comprehensible, our GAD changes "素取" to "争取" (strive for), which is appropriate in the context, but the ground truth "诉取" (sue for) is more suitable because the context involves litigation. Our GAD model also lacks the ability to infer strong contextual correlations, as described in (Zhang et al., 2020).

Table 7: Examples of CSC results; the incorrect and correct characters are marked in red and green respectively. The first line in each block is the input sentence; the second and third lines are corrected by BERT CRS and BERT CRS+GAD respectively; the rest is the English translation of the correct sentence.

Conclusions
In this paper, we propose a novel global attention decoder (GAD). Conditioned on the potentially correct input characters and the candidates of potentially erroneous characters, GAD reforms the self-attention mechanism to learn their global relationships and obtains rich global contextual information to alleviate the influence caused by erroneous context. In addition, a BERT with Confusion set guided Replacement Strategy (BERT CRS) is designed to narrow the gap between BERT and spelling error correction. Experimental results on three datasets show that our BERT CRS outperforms almost all previous state-of-the-art methods, and higher performance is obtained by combining it with our GAD.