An Alignment-Agnostic Model for Chinese Text Error Correction

This paper investigates how to correct Chinese text errors of three types: mistaken, missing, and redundant characters, which are common for native Chinese speakers. Most existing models based on the detect-correct framework can correct mistaken characters but cannot deal with missing or redundant ones, because the lengths of sentences before and after correction differ, leading to inconsistency between model inputs and outputs. Although Seq2Seq-based and sequence tagging methods provide solutions to this problem and have achieved relatively good results in English contexts, they do not perform well in Chinese according to our experimental results. In our work, we propose a novel detect-correct framework that is alignment-agnostic, meaning that it can handle both text-aligned and non-aligned occasions, and it can also serve as a cold-start model when no annotated data are provided. Experimental results on three datasets demonstrate that our method is effective and achieves the best performance among existing published models.


Introduction
Chinese text error correction plays an important role in many NLP-related scenarios (Martins and Silva, 2004; Afli et al., 2016; Wang et al., 2018; Burstein and Chodorow, 1999). For native Chinese speakers, common errors include mistaken characters, missing characters, and redundant characters. Mistaken characters are wrong characters that need to be replaced. Missing characters are absent characters that need to be inserted at the identified positions. Redundant characters are useless or repeated characters that need to be deleted. Corrections for mistaken characters do not change the sentence length, while corrections for the other two types do. If a text contains only mistaken errors, we call it a text-aligned situation; if it contains missing or redundant errors, we call it a text non-aligned situation.
For the text-aligned situation, many approaches apply the detect-correct framework, which first detects the positions of wrong characters and then corrects them (Hong et al., 2019; Cheng et al., 2020). Despite the competitive performance of such methods, they cannot deal with the text non-aligned situation involving missing and redundant errors. Among text non-aligned situations, reversed-order errors and complex structural changes with multiple errors are out of our scope: first, because we target common mistakes made by native Chinese speakers, which differ from those of foreign Chinese learners addressed in the Chinese grammatical error correction (GEC) task (Qiu and Qu, 2019); second, because such complex errors are beyond our model settings. The two mainstream model schemes for the text non-aligned situation are Seq2Seq-based and sequence tagging-based. The former is inspired by machine translation and takes wrong sentences as input and correct sentences as output (Zhao et al., 2019; Kaneko et al., 2020; Chollampatt et al., 2019; Zhao and Wang, 2020; Lichtarge et al., 2019; Ge et al., 2018; Junczys-Dowmunt et al., 2018). Such approaches require a large amount of training data and may generate uncontrollable results (Kiyono et al., 2019; Koehn and Knowles, 2017). The latter takes wrong sentences as input and per-token modification operations as output (Awasthi et al., 2019; Malmi et al., 2019; Omelianchuk et al., 2020). However, as Chinese has more than 20,000 characters, which give rise to a vast number of combinations of token operations, it is difficult for sequence tagging models to cover all combinations and produce results with high coverage.
To address the above issues, we propose an alignment-agnostic detect-correct model, which not only handles text non-aligned errors that current detect-correct methods cannot, but also relieves the huge search space problem that leads to uncontrollable or low-coverage results in Seq2Seq- and sequence tagging-based methods. We conduct experiments comparing our alignment-agnostic model with other models on three datasets: CGED 2020, SIGHAN 2015, and SIGHAN-synthesized. Experimental results show that our model performs better than the other models.
The contributions of our work include (1) the proposal of a novel detect-correct architecture for Chinese text error correction, (2) empirical verification of the effectiveness of the alignment-agnostic model, and (3) easy reproduction and fast adaptation to practical scenarios with limited annotated data.

Problem Description
Chinese text error correction can be formalized as follows. Given a sequence of n characters X = (x_1, x_2, x_3, ..., x_n), the goal is to transform it into an m-character sequence Y = (y_1, y_2, y_3, ..., y_m), where n and m can be equal or not. The task can be viewed as a sequence transformation problem with a mapping function f : X → Y.

Figure 1: Architecture of the alignment-agnostic model
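To make the alignment distinction concrete, the following toy examples (our own hypothetical sentences, not drawn from the paper's datasets) show how each error type affects the length of the mapping f : X → Y:

```python
# Hypothetical examples of the three error types.
# Mistaken character: length is preserved (text-aligned).
x_mistaken = list("我在公圆散步")   # "圆" should be "园"
y_mistaken = list("我在公园散步")
assert len(x_mistaken) == len(y_mistaken)

# Missing character: one character must be inserted (text non-aligned).
x_missing = list("我在公散步")
y_missing = list("我在公园散步")
assert len(x_missing) == len(y_missing) - 1

# Redundant character: one character must be deleted (text non-aligned).
x_redundant = list("我在在公园散步")
y_redundant = list("我在公园散步")
assert len(x_redundant) == len(y_redundant) + 1
```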

Model
As illustrated in Figure 1, the basic structure of our model includes a detection network evolved from the ELECTRA discriminator (Clark et al., 2020) and a correction network based on the BERT MLM (Devlin et al., 2019). The two networks are connected through a modification logic unit and are trained separately. The detection network locates errors and identifies their types, the modification logic unit decides where and how to correct, and the correction network performs the detailed correction.
The detection network is composed of an ELECTRA discriminator and a token-level error-type classifier. The architecture of the ELECTRA discriminator is described in Clark et al. (2020). Here we replace the original classifier with a new token-level classifier with four categories: label_keep, label_mistaken, label_missing, and label_redundant. label_keep means the character is correct and should not change. label_mistaken indicates the character is mistaken and needs to be replaced. label_missing denotes that we should insert characters before the current character. label_redundant means the character is useless and needs to be deleted. We get the label probability of each token with the 4-class token-level classifier:

P(z_i = k | X) = softmax(W h_D(X)_i + b)_k,

where P(z_i = k | X) denotes the conditional probability of character x_i being tagged with label k, h_D(X)_i is the last hidden state of the ELECTRA discriminator at position i, W and b are the classifier parameters, and k ranges over {label_keep, label_mistaken, label_missing, label_redundant}. The loss function of the detection network is the token-level cross-entropy

L_D = -Σ_{i=1}^{n} log P(z_i = z_i* | X),

where z_i* is the gold label of x_i.

The modification logic unit, denoted by M(X, Z), rewrites the input sequence X according to the detection network's output Z: M(x_i, z_i) maps x_i to itself if z_i = label_keep, to [MASK] if z_i = label_mistaken, to the pair [MASK] x_i if z_i = label_missing, and to the empty string ε if z_i = label_redundant. Based on the above mapping, we get a new sequence X′ = (x′_1, x′_2, x′_3, ..., x′_n). Each token mapped to ε is deleted directly from X′, and each token mapped to [MASK] x_i is expanded into two characters, yielding the final modified sequence Y′ = (y′_1, y′_2, y′_3, ..., y′_m), whose length might differ from that of X.
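The rewrite rules of the modification logic unit can be sketched in a few lines of Python. This is a minimal re-implementation under our own naming assumptions (plain label strings and a literal `[MASK]` token); the actual unit operates on tokenizer ids:

```python
MASK = "[MASK]"

def modify(chars, labels):
    """Rewrite input characters according to the detection labels.

    chars  : list of input characters (x_1 ... x_n)
    labels : one of 'keep', 'mistaken', 'missing', 'redundant' per character
    Returns the modified sequence, whose length may differ from the input.
    """
    out = []
    for ch, lab in zip(chars, labels):
        if lab == "keep":
            out.append(ch)            # leave the character unchanged
        elif lab == "mistaken":
            out.append(MASK)          # replace; to be filled by the MLM
        elif lab == "missing":
            out.extend([MASK, ch])    # insert a mask before the character
        elif lab == "redundant":
            pass                      # delete the character
        else:
            raise ValueError(f"unknown label: {lab}")
    return out

# One redundant and one missing error in the same sequence:
y_prime = modify(list("abcd"), ["keep", "redundant", "missing", "keep"])
# → ["a", "[MASK]", "c", "d"]
```

Note that deletions and insertions change the sequence length here, which is exactly why the downstream correction network must be length-agnostic at the masked positions.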
The correction network is a BERT MLM. We predict the characters at the positions marked with the [MASK] symbol in the sequence Y′.
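The mask-filling step can be sketched as follows; `predict_fn` is a hypothetical stand-in for the BERT MLM (not the paper's actual interface) that returns one character per masked position, in order:

```python
MASK = "[MASK]"

def fill_masks(tokens, predict_fn):
    """Replace every [MASK] token with the MLM's prediction for that position."""
    masked_positions = [i for i, t in enumerate(tokens) if t == MASK]
    predictions = predict_fn(tokens, masked_positions)  # one char per position
    filled = list(tokens)
    for pos, ch in zip(masked_positions, predictions):
        filled[pos] = ch
    return "".join(filled)

# Toy stand-in for the MLM, for illustration only: always predict "园".
toy_mlm = lambda toks, positions: ["园"] * len(positions)
print(fill_masks(["我", "在", "公", MASK, "散", "步"], toy_mlm))  # → 我在公园散步
```

In practice `predict_fn` would run a pretrained masked language model and take the highest-probability character at each masked position.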

Datasets and Metrics
Chinese text error correction has two main public datasets: the SIGHAN 2015 benchmark (Tseng et al., 2015), which contains only text-aligned data, and the CGED 2020 competition (Rao et al., 2020), which contains text non-aligned data. To better verify our model's effectiveness in the text non-aligned scenario, we also synthesized non-aligned data based on the SIGHAN 2015 dataset. Next, we introduce how we utilize the three datasets.

For the SIGHAN 2015 dataset, to stay consistent with the models we compare against, we incorporated the SIGHAN 2013 and 2014 datasets in the training phase, as well as the SIGHAN 2013 confusion set. The test set contains 1100 passages and the train set contains 8738 passages. To ensure comparability, we also trained another model on a considerably larger train set of 281379 passages, consistent with SpellGCN (Cheng et al., 2020). We used the evaluation tool provided by SIGHAN, with metrics of precision (Prec.), recall (Rec.), and F1, all at the sentence level.

The CGED 2020 dataset is composed of writing by foreign learners of Chinese and contains an additional error type beyond the three mentioned above: reversed order. As this type occurs less frequently in native Chinese writing and is also beyond the scope of our model setting, we removed the 575 relevant samples from a total of 2586, obtaining 846 training samples and 1165 testing samples. Consequently, due to the inconsistency of the test set, we re-ran the published models ourselves instead of comparing directly with their published benchmarks.
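Sentence-level scoring can be sketched as below. This is our own minimal formulation, not the official SIGHAN evaluation tool: a sentence the model changed counts as a true positive only when the full prediction exactly matches the gold sentence.

```python
def sentence_level_prf(sources, predictions, golds):
    """Sentence-level precision/recall/F1 for the correction task.

    A sentence is 'flagged' if the model changed it; a flagged sentence is a
    true positive only when the prediction equals the gold sentence.
    """
    tp = flagged = with_errors = 0
    for src, pred, gold in zip(sources, predictions, golds):
        if pred != src:                 # the model proposed a correction
            flagged += 1
            if pred == gold:            # ... and it is exactly right
                tp += 1
        if gold != src:                 # the sentence truly contains errors
            with_errors += 1
    prec = tp / flagged if flagged else 0.0
    rec = tp / with_errors if with_errors else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1
```

For example, if the model corrects two of three sentences and gets one of them exactly right, both precision and recall over the two truly erroneous sentences come out at 0.5.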
To better verify our model's effectiveness in the text non-aligned scenario, we synthesized non-aligned data based on the SIGHAN 2015 dataset (SIGHAN-synthesized). For the mistaken characters error type, we kept the original errors unchanged. For the missing characters error type, we randomly selected 50% of the samples and deleted one character from each. For the redundant characters error type, we randomly selected 50% of the samples and inserted characters into each of them in four ways:
(1) We inserted repeated characters in 35% of the selected samples.
(2) We inserted confusing characters in 30% of the selected samples.
(3) We inserted characters from high-frequency words in 30% of the selected samples.
(4) We inserted random characters in 5% of the selected samples.
For the CGED 2020 and SIGHAN-synthesized datasets, we adopted the M2 score (Dahlmeier and Ng, 2012) and ERRANT (Bryant et al., 2017) to evaluate model performance; these are two commonly used evaluation tools for text non-aligned situations.
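The corruption procedure above can be sketched as follows. The candidate pools (`confusion_set`, `frequent_chars`) and the exact sampling mechanics are our assumptions; the paper only specifies the proportions.

```python
import random

def synthesize_missing(sentence, rng):
    """Delete one randomly chosen character (missing-character error)."""
    pos = rng.randrange(len(sentence))
    return sentence[:pos] + sentence[pos + 1:]

def synthesize_redundant(sentence, rng, confusion_set, frequent_chars):
    """Insert one redundant character, choosing the strategy with the stated
    proportions: 35% repeated, 30% confusing, 30% high-frequency, 5% random."""
    pos = rng.randrange(len(sentence))
    r = rng.random()
    if r < 0.35:                          # repeated character
        ch = sentence[pos]
    elif r < 0.65:                        # confusing character
        ch = rng.choice(confusion_set.get(sentence[pos], [sentence[pos]]))
    elif r < 0.95:                        # character from a high-frequency word
        ch = rng.choice(frequent_chars)
    else:                                 # random character
        ch = chr(rng.randrange(0x4E00, 0x9FA6))  # CJK Unified Ideographs
    return sentence[:pos] + ch + sentence[pos:]

rng = random.Random(0)
corrupted = synthesize_missing("我在公园散步", rng)
assert len(corrupted) == 5
```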

Experiment Settings
The pre-trained ELECTRA discriminator and BERT models adopted in our experiments are from https://github.com/huggingface/transformers. We use the large-size ELECTRA and the base-size BERT. We train the detection network and the correction network separately on each of the three datasets, using the Adam optimizer with default hyperparameters. All experiments are conducted on 2 GPUs (Nvidia Tesla P100).
For SIGHAN 2015, since it only contains one error type, we kept the default binary classifier of the ELECTRA discriminator during finetuning. We applied two methods to retrain BERT: an unsupervised method that continues pretraining BERT with its original MLM objective, and a supervised method that masks the mistaken characters and predicts them. For the CGED 2020 and SIGHAN-synthesized datasets, we added a 4-class classifier on top of the ELECTRA discriminator's last hidden layer to recognize error types and finetuned it. We applied the same methods as for SIGHAN 2015 to retrain BERT.

Results
Table 1 shows the results on the SIGHAN 2015 dataset. The first 5 lines show that our method outperforms Soft-Masked BERT by 1.8% in F1 score in the correction phase. With a larger train set, our model achieved higher F1 scores in both the detection and correction phases. Compared with the previous SOTA method SpellGCN (Cheng et al., 2020), our model showed higher precision and a comparable F1 score. Table 2 shows the comparison results on the CGED 2020 and SIGHAN-synthesized datasets. Our model performs best at the correction level, exceeding the second-best model by 9.87% on CGED 2020 and 7.2% on SIGHAN-synthesized in F0.5 M2 score. Since Copy-augmented (Zhao et al., 2019), as a Seq2Seq model, requires a large amount of training data to reach acceptable results, it underperforms the Lasertagger (Malmi et al., 2019) and PIE (Awasthi et al., 2019) models on both datasets given the small training sample size. As analyzed before, sequence tagging models like Lasertagger and PIE do not work well on Chinese due to the huge search space.

Ablation Study
We carried out an ablation study of our model on the three datasets. Table 3 shows the results at the correction level. For SIGHAN 2015, finetuning ELECTRA brings a large improvement of 27.5% in F1 score, while finetuning BERT only yields a relatively small rise of 2%, and continued pretraining of BERT leads to a decrease of 24.2%. A possible reason is that finetuning can incorporate confusion-set knowledge about easily confused characters, while unsupervised pretraining may distort the originally learned word distribution when the training data differ greatly from the original corpus. Besides, our model achieves 38.7% F1 with no training data and can thus serve as a good baseline in cold-start conditions. For the CGED 2020 and SIGHAN-synthesized datasets, the two ways of retraining BERT did not improve much. Compared with the results of other SOTA models, the modification and finetuning of ELECTRA is the most effective part.

Conclusion
We proposed a new detect-correct model for Chinese text error correction. It can handle both text-aligned and non-aligned situations, and can serve as a good baseline even in cold-start situations. Experimental results on three datasets show that our model performs better than existing methods. Furthermore, it can be easily reproduced and achieves good results even with a small amount of training data, which is key to rapid application in industry.