Tail-to-Tail Non-Autoregressive Sequence Prediction for Chinese Grammatical Error Correction

We investigate the problem of Chinese Grammatical Error Correction (CGEC) and present a new framework named Tail-to-Tail (TtT) non-autoregressive sequence prediction to address the deep issues hidden in CGEC. Considering that most tokens are correct and can be conveyed directly from source to target, and that the error positions can be estimated and corrected based on bidirectional context information, we employ a BERT-initialized Transformer encoder as the backbone model to conduct information modeling and conveying. Since relying on same-position substitution alone cannot handle variable-length correction cases, various operations such as substitution, deletion, insertion, and local paraphrasing are required jointly. Therefore, a Conditional Random Fields (CRF) layer is stacked on the up tail to conduct non-autoregressive sequence prediction by modeling the token dependencies. Because most tokens are correct and easy to predict or convey to the target, the model may suffer from a severe class imbalance issue. To alleviate this problem, focal loss penalty strategies are integrated into the loss functions. Moreover, besides the typical fixed-length error correction datasets, we also construct a variable-length corpus for our experiments. Experimental results on standard datasets, especially on the variable-length datasets, demonstrate the effectiveness of TtT in terms of sentence-level Accuracy, Precision, Recall, and F1-Measure on the tasks of error Detection and Correction.


Introduction
Grammatical Error Correction (GEC) aims to automatically detect and correct the grammatical errors in a sentence (Wang et al., 2020c). It is a crucial and essential application task.

Figure 1: Illustration of the three types of operations to correct grammatical errors: Type I, substitution; Type II, deletion and insertion; Type III, local paraphrasing.
We carefully investigate the problem of CGEC and the related corpora from SIGHAN (Tseng et al., 2015) and NLPCC (Zhao et al., 2018), and conclude that the grammatical error types, as well as the corresponding correction operations, fall into three categories, as shown in Figure 1: (1) Substitution. In practice, Pinyin is the most popular input method for Chinese writing. Thus, homophonous character confusion (for example, in the Type I case, the wrong and correct words are both pronounced "FeiChang") is the fundamental cause of grammatical errors (or spelling errors); such errors can be corrected by substitution operations without changing the overall sequence structure (e.g., its length). Substitution is therefore a fixed-length (FixLen) operation.
(2) Deletion and Insertion. These two operations handle the cases of word redundancy and word omission, respectively.

Figure 2: Illustration of the token information flows from the bottom tail to the up tail.
(3) Local paraphrasing. Sometimes, light operations such as substitution, deletion, and insertion cannot correct the errors directly; in that case, a slight paraphrasing of a subsequence is required to reorder some words of the sentence, as shown in Type III of Figure 1. Deletion, insertion, and local paraphrasing can be regarded as variable-length (VarLen) operations because they may change the sentence length.
Although a number of methods have been developed over the past few years to deal with the problem of CGEC, some crucial and essential aspects remain unaddressed. Generally, sequence translation and sequence tagging are the two most typical technical paradigms for tackling CGEC. Benefiting from the development of neural machine translation (Bahdanau et al., 2015; Vaswani et al., 2017), attention-based seq2seq encoder-decoder frameworks have been introduced to address the CGEC problem in a sequence translation manner (Wang et al., 2018; Ge et al., 2018; Wang et al., 2020b; Kaneko et al., 2020). Seq2seq-based translation models are easy to train and can handle all of the correcting operations mentioned above. However, owing to the exposure bias issue (Ranzato et al., 2016; Zhang et al., 2019), the generated results often suffer from hallucination (Nie et al., 2019; Maynez et al., 2020) and fail to stay faithful to the source text, even when copy mechanisms (Gu et al., 2016) are incorporated. Therefore, Omelianchuk et al. (2020) and Liang et al. (2020) propose to employ pure tagging, instead of generation, for GEC. All correcting operations such as deletion, insertion, and substitution can be guided by the predicted tags. Nevertheless, the pure tagging strategy requires extending the vocabulary V to about three times its size by adding "insertion-" and "substitution-" prefixes to the original tokens (e.g., "insertion-good", "substitution-paper"), which decreases computing efficiency dramatically. Moreover, the pure tagging framework needs to conduct multi-pass prediction until no more operations are predicted, which is inefficient and less elegant. Recently, many researchers have fine-tuned pre-trained language models such as BERT on the task of CGEC and obtained reasonable results (Zhao et al., 2019; Zhang et al., 2020b).
However, limited by the BERT framework, most of them can only address the fixed-length correcting scenarios and cannot conduct deletion, insertion, and local paraphrasing operations flexibly.
Moreover, during our investigations, we also observe an obvious but crucial phenomenon in CGEC: most words in a sentence are correct and need not be changed. This phenomenon is depicted in Figure 2, where the operation flow goes from the bottom tail to the up tail. Grey dashed lines represent the "Keep" operations and red solid lines indicate the three types of correcting operations mentioned above. On the one hand, intuitively, the target CGEC model should be able to move correct tokens directly from the bottom tail to the up tail, so a Transformer-based (Vaswani et al., 2017) encoder (say, BERT) seems preferable. On the other hand, considering that almost all typical CGEC models are built on the paradigms of sequence tagging or sequence translation, Maximum Likelihood Estimation (MLE) (Myung, 2003) is usually used for parameter learning, which, in the CGEC scenario, suffers from a severe class/tag imbalance issue. However, no previous work has investigated this problem thoroughly on the task of CGEC.
To conquer all of the above-mentioned challenges, we propose a new framework named tail-to-tail non-autoregressive sequence prediction, abbreviated as TtT, for the problem of CGEC. Specifically, to directly move the token information from the bottom tail to the up tail, a BERT-based sequence encoder is introduced to conduct bidirectional representation learning. In order to conduct substitution, deletion, insertion, and local paraphrasing simultaneously, inspired by (Sun et al., 2019), a Conditional Random Fields (CRF) (Lafferty et al., 2001) layer is stacked on the up tail to conduct non-autoregressive sequence prediction by modeling the dependencies among neighbouring tokens. A focal loss penalty strategy (Lin et al., 2020) is adopted to alleviate the class imbalance problem, considering that most of the tokens in a sentence are not changed. In summary, our contributions are as follows:
• A new framework named tail-to-tail non-autoregressive sequence prediction (TtT) is proposed to tackle the problem of CGEC.
• A BERT encoder with a CRF layer is employed as the backbone, which can conduct substitution, deletion, insertion, and local paraphrasing simultaneously.
• A focal loss penalty strategy is adopted to alleviate the class imbalance problem, considering that most of the tokens in a sentence are not changed.
• Extensive experiments on several benchmark datasets, especially on the variable-length grammatical correction datasets, demonstrate the effectiveness of the proposed approach.
The Proposed TtT Framework

Overview

Figure 3 depicts the basic components of our proposed framework TtT. The input is an incorrect sentence X = (x_1, x_2, ..., x_T) which contains grammatical errors, where x_i denotes a token (Chinese character) in the sentence and T is the length of X. The objective of grammatical error correction is to correct all errors in X and generate a new sentence Y = (y_1, y_2, ..., y_{T'}). Here, it is important to emphasize that T' is not necessarily equal to T: T' can be =, >, or < T. Bidirectional semantic modeling and direct bottom-to-up token information conveying are conducted by several Transformer (Vaswani et al., 2017) layers. A Conditional Random Fields (CRF) (Lafferty et al., 2001) layer is stacked on the up tail to conduct non-autoregressive sequence generation by modeling the dependencies among neighboring tokens. Low-rank decomposition and a beamed Viterbi algorithm are introduced to accelerate the computations. A focal loss penalty strategy (Lin et al., 2020) is adopted to alleviate the class imbalance problem during the training stage.

Variable-Length Input
The length T' of the target sentence Y is not necessarily equal to the length T of the input sequence X. In the training and inference stages, this length difference affects the completeness of the predicted sentence, especially when T < T'. To handle this issue, several simple tricks are designed to pre-process the samples. Assuming X = (x_1, x_2, x_3, <eos>): (1) when T' = T, i.e., Y = (y_1, y_2, y_3, <eos>), we do nothing; (2) when T' < T, say Y = (y_1, y_2, <eos>), some tokens in X will be deleted during correcting; in the training stage, we pad T − T' special tokens <pad> to the tail of Y to make the lengths equal, giving Y = (y_1, y_2, <eos>, <pad>); (3) when T' > T, say Y = (y_1, y_2, y_3, y_4, y_5, <eos>), more information should be inserted into the original sentence X; we then pad the special symbol <mask> to the tail of X to indicate that these positions may be translated into new real tokens, e.g., X = (x_1, x_2, x_3, <eos>, <mask>, <mask>).
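The three alignment cases above can be sketched as a small preprocessing helper. This is a hypothetical illustration: the function name `align_lengths` is ours, and only the <eos>/<pad>/<mask> conventions come from the text.

```python
# Hypothetical preprocessing sketch for the three length cases described
# above; the <eos>/<pad>/<mask> symbols follow the paper's notation.

EOS, PAD, MASK = "<eos>", "<pad>", "<mask>"

def align_lengths(x_tokens, y_tokens):
    """Pad X or Y so both sequences share one length.

    - len(Y) == len(X): nothing to do.
    - len(Y) <  len(X): pad Y with <pad> (deletion case).
    - len(Y) >  len(X): pad X with <mask> (insertion case).
    """
    diff = len(x_tokens) - len(y_tokens)
    if diff > 0:                      # target shorter: deletion
        y_tokens = y_tokens + [PAD] * diff
    elif diff < 0:                    # target longer: insertion
        x_tokens = x_tokens + [MASK] * (-diff)
    return x_tokens, y_tokens
```

For instance, `align_lengths(["x1", "x2", "x3", "<eos>"], ["y1", "y2", "<eos>"])` pads the target to `["y1", "y2", "<eos>", "<pad>"]`, matching case (2) above.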

Bidirectional Semantic Modeling
Transformer layers (Vaswani et al., 2017) are particularly well suited to conducting bidirectional semantic modeling and bottom-to-up information conveying. As shown in Figure 3, after preparing the input samples, an embedding layer and a stack of Transformer layers initialized with a pre-trained Chinese BERT (Devlin et al., 2019) conduct the semantic modeling. Specifically, for the input, we first obtain the representations by summing the word embeddings with the positional embeddings:

H^0_t = E_w(x_t) + E_p(t),

where 0 is the layer index and t is the state index. E_w and E_p are the embedding vectors for tokens and positions, respectively. The obtained embedding vectors H^0 are then fed into several Transformer layers. Multi-head self-attention is used to conduct bidirectional representation learning:

H^l = LN(H^{l-1} + SLF-ATT(Q^{l-1}, K^{l-1}, V^{l-1})),
H^l = LN(H^l + FFN(H^l)),

where SLF-ATT(·), LN(·), and FFN(·) represent the self-attention mechanism, layer normalization, and feed-forward network, respectively (Vaswani et al., 2017). Note that our model is a non-autoregressive sequence prediction framework, so we use all the sequence states K and V as the attention context; each node thus absorbs context information bidirectionally. After L Transformer layers, we obtain the final output representation vectors H^L ∈ R^{max(T, T') × d}.
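To make the bidirectional attention concrete, here is a minimal single-head sketch in NumPy. It omits layer normalization, the feed-forward network, and multiple heads; the weight matrices `Wq`, `Wk`, `Wv` are illustrative stand-ins for learned projections. The key point is that no causal mask is applied, so every position attends to the whole sequence.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(H, Wq, Wk, Wv):
    """Single-head bidirectional self-attention over states H (T x d).

    Every position attends to all T positions (no causal mask), which is
    what lets each token absorb context from both directions.
    """
    Q, K, V = H @ Wq, H @ Wk, H @ Wv
    d_k = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d_k))   # T x T attention weights
    return A @ V

rng = np.random.default_rng(0)
T, d = 5, 8
H0 = rng.normal(size=(T, d))              # embeddings + positions
W = [rng.normal(size=(d, d)) * 0.1 for _ in range(3)]
H1 = self_attention(H0, *W)               # one attention step
```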

Non-Autoregressive Sequence Prediction
Direct Prediction. The objective of our model is to translate the input sentence X, which contains grammatical errors, into a correct sentence Y. Since we have obtained the sequence representation vectors H^L, we can directly add a softmax layer to predict the results, similar to the methods used in non-autoregressive neural machine translation (Gu and Kong, 2020) and BERT-based fine-tuning frameworks for grammatical error correction (Zhao et al., 2019; Zhang et al., 2020b). Specifically, a linear transformation layer is plugged in and a softmax operation is applied to generate a probability distribution P_dp(y_t) over the target vocabulary V:

s_t = W H^L_t + b,  P_dp(y_t | X) = softmax(s_t),

where W and b are learnable parameters. We then obtain the result for each state based on the predicted distribution:

y_t = argmax_{y ∈ V} P_dp(y | X).

However, although this direct prediction method is effective for fixed-length grammatical error correction, it can only conduct same-position substitution operations. For complex correcting cases that require deletion, insertion, and local paraphrasing, its performance is unacceptable. This inferior performance is also discussed in non-autoregressive neural machine translation (Gu and Kong, 2020). One essential cause is that the dependency information among neighbouring tokens is missing. Therefore, dependency modeling should be brought back to improve the generation quality. Naturally, a linear-chain CRF (Lafferty et al., 2001) is introduced to fix this issue; Sun et al. (2019) also employ a CRF to address non-autoregressive sequence generation, which inspired our design.
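The linear-plus-softmax head described above can be sketched as follows (NumPy; `W` and `b` stand in for the learned projection, and the shapes are toy values):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def direct_predict(H_L, W, b):
    """Project encoder states to vocabulary logits and take the argmax.

    H_L: T x d final encoder states; W: d x |V|; b: |V|.
    Returns (per-position probabilities, predicted token indices).
    """
    logits = H_L @ W + b            # the logit vectors s_t
    probs = softmax(logits)
    return probs, probs.argmax(axis=-1)

rng = np.random.default_rng(1)
T, d, V = 4, 8, 10                  # toy sequence length, width, vocab
probs, preds = direct_predict(rng.normal(size=(T, d)),
                              rng.normal(size=(d, V)), np.zeros(V))
```

Because each position is decoded independently here, the sketch also makes the limitation visible: nothing ties `preds[t]` to `preds[t-1]`, which is exactly the missing dependency the CRF layer restores.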
Dependency Modeling via CRF. Given the input sequence X, under the CRF framework the likelihood of the target sequence Y with length T' is constructed as:

P_crf(Y | X) = (1 / Z(X)) exp( Σ_{t=1}^{T'} s(y_t) + Σ_{t=2}^{T'} t(y_{t-1}, y_t) ),

where Z(X) is the normalizing factor and s(y_t) represents the label score of y at position t, which can be obtained from the predicted logit vector s_t ∈ R^{|V|}, i.e., s_t(V_{y_t}), where V_{y_t} is the vocabulary index of token y_t. The value t(y_{t-1}, y_t) = M_{y_{t-1}, y_t} denotes the transition score from token y_{t-1} to y_t, where M ∈ R^{|V|×|V|} is the transition matrix, the core term for dependency modeling. Usually, M can be learnt as a neural network parameter during end-to-end training. However, |V| is typically very large in text generation scenarios (more than 32k), so it is infeasible to obtain M and Z(X) efficiently in practice. To overcome this obstacle, following (Sun et al., 2019), we introduce two low-rank neural parameter matrices E_1, E_2 ∈ R^{|V|×d_m} to approximate the full-rank transition matrix M:

M ≈ E_1 E_2^T,

where d_m ≪ |V|. To compute the normalizing factor Z(X), the original Viterbi algorithm (Forney, 1973; Lafferty et al., 2001) needs to search all paths. To improve efficiency, we only visit the truncated top-k nodes at each time step, as an approximation (Sun et al., 2019).
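The two tricks in this subsection, low-rank transitions and top-k truncation, can be combined in a toy decoding sketch. Everything here is illustrative: the scores are random, `k` is tiny, and the beam keeps the top-k labels per step by emission score, a simplification of the procedure in Sun et al. (2019). The point is that a transition score is a d_m-dimensional dot product `E1[prev] @ E2[cur]`, so the |V| × |V| matrix M is never materialized.

```python
import numpy as np

rng = np.random.default_rng(2)
V, d_m, T = 50, 8, 4                      # toy vocab size, rank, length

# Low-rank factors approximating the transition matrix M ≈ E1 @ E2.T.
E1 = rng.normal(size=(V, d_m)) * 0.1
E2 = rng.normal(size=(V, d_m)) * 0.1
label_scores = rng.normal(size=(T, V))    # emission scores s_t(y)

def truncated_viterbi(label_scores, E1, E2, k=5):
    """Approximate Viterbi decoding over the top-k labels per step."""
    T = label_scores.shape[0]
    cand = np.argsort(label_scores[0])[-k:]
    scores = {int(y): float(label_scores[0, y]) for y in cand}
    paths = {int(y): [int(y)] for y in cand}
    for t in range(1, T):
        cand = np.argsort(label_scores[t])[-k:]
        new_scores, new_paths = {}, {}
        for y in map(int, cand):
            # transition t(y_prev, y) = E1[y_prev] . E2[y]
            best_prev = max(scores,
                            key=lambda p: scores[p] + E1[p] @ E2[y])
            new_scores[y] = (scores[best_prev] + E1[best_prev] @ E2[y]
                             + label_scores[t, y])
            new_paths[y] = paths[best_prev] + [y]
        scores, paths = new_scores, new_paths
    best = max(scores, key=scores.get)
    return paths[best], scores[best]

path, score = truncated_viterbi(label_scores, E1, E2, k=5)
```

Setting k = |V| recovers an exact search over this beam structure; smaller k trades a little path quality for speed.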

Training with Focal Penalty
Considering the direct bottom-to-up information conveying characteristic of CGEC, the two tasks, direct prediction and CRF-based dependency modeling, can be incorporated jointly into a unified framework during training. Intuitively, direct prediction focuses on fine-grained predictions at each position, while the CRF layer pays more attention to the high-level quality of the whole sequence. We employ Maximum Likelihood Estimation (MLE) for parameter learning and treat the negative log-likelihood (NLL) as the loss function. Thus, the optimization objective for direct prediction L_dp is:

L_dp = − Σ_{t=1}^{T'} log P_dp(y_t | X),

and the loss function L_crf for CRF-based dependency modeling is:

L_crf = − log P_crf(Y | X).

The final optimization objective is:

L = L_dp + L_crf.

As mentioned in Section 1, one obvious but crucial phenomenon in CGEC is that most words in a sentence are correct and need not be changed. Since maximum likelihood estimation is used for parameter learning in both tasks, a simple copy strategy can already lead to a sharp decline in the loss functions. Intuitively, the grammatically erroneous tokens, which are the ones that actually need to be fixed, thus attract less attention during training. Instead, these tokens should be regarded as the focal points and contribute more to the optimization objectives. However, no previous work has investigated this problem thoroughly on the task of CGEC.
To alleviate this issue, we introduce a useful trick, focal loss (Lin et al., 2020), into our loss functions for direct prediction and the CRF:

L^fl_dp = − Σ_{t=1}^{T'} (1 − P_dp(y_t | X))^γ log P_dp(y_t | X),
L^fl_crf = − (1 − P_crf(Y | X))^γ log P_crf(Y | X),

where γ is a hyperparameter controlling the penalty weight. Note that L^fl_dp is penalized at the token level, while L^fl_crf is weighted at the sample level and takes effect under batch training. The final optimization objective with the focal penalty strategy is:

L^fl = L^fl_dp + L^fl_crf.
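A minimal sketch of the focal weighting (pure NumPy; `p_correct` stands for the model probability of the gold token, i.e., P_dp(y_t | X) at the token level or P_crf(Y | X) at the sample level):

```python
import numpy as np

def focal_nll(p_correct, gamma=0.5):
    """Focal-weighted negative log-likelihood for one prediction.

    The factor (1 - p)^gamma down-weights easy, high-probability
    predictions (the many "Keep" tokens), so the rare erroneous tokens
    dominate the objective. With gamma = 0 this is plain NLL.
    """
    return -((1.0 - p_correct) ** gamma) * np.log(p_correct)

# An "easy" correct token (p = 0.99) contributes far less than a hard
# one (p = 0.2); plain NLL shrinks this gap much less aggressively.
easy = focal_nll(0.99)
hard = focal_nll(0.2)
```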

Inference
During the inference stage, for the input source sentence X, we can employ the original |V|-node Viterbi algorithm to obtain the globally optimal result. We can also utilize the truncated top-k Viterbi algorithm for higher computing efficiency (Sun et al., 2019).

Experimental Setup

Settings
The core technical components of our proposed TtT are the Transformer (Vaswani et al., 2017) and the CRF (Lafferty et al., 2001). The pre-trained Chinese BERT-base model (Devlin et al., 2019) is employed to initialize the model. To approximate the transition matrix in the CRF layer, we set the dimension d_m of the matrices E_1 and E_2 to 32. For the normalizing factor Z(X), we set the predefined beam size k to 64. The hyperparameter γ, which weights the focal penalty term, is set to 0.5 after tuning. The training batch size is 100, the learning rate is 1e−5, and the dropout rate is 0.1. The Adam optimizer (Kingma and Ba, 2015) is used for parameter learning.
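For reference, the stated hyperparameters can be collected into a single configuration sketch; the key names are our own choices, and only the values come from the text above.

```python
# Hyperparameters as reported in this section; key names are illustrative.
TTT_CONFIG = {
    "init": "bert-base-chinese",  # pre-trained Chinese BERT-base
    "d_m": 32,           # rank of E1/E2 approximating CRF transitions
    "beam_k": 64,        # truncated top-k beam for Z(X)
    "gamma": 0.5,        # focal penalty weight
    "batch_size": 100,
    "learning_rate": 1e-5,
    "dropout": 0.1,
    "optimizer": "adam",
}
```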

Datasets
The overall statistics of the datasets used in our experiments are depicted in Table 1. SIGHAN15 (Tseng et al., 2015): This is a benchmark dataset for the evaluation of CGEC, containing 2,339 samples for training and 1,100 samples for testing. As in typical previous works (Zhang et al., 2020b), we use the SIGHAN15 test set as the benchmark to evaluate the performance of our models and the baseline methods in the fixed-length (FixLen) error correction setting.
HybirdSet (Wang et al., 2018): This is a newly released dataset constructed from a prepared confusion set based on the results of ASR (Yu and Deng, 2014) and OCR (Tong and Evans, 1996). It contains about 270k paired samples and is also a FixLen dataset.

Comparison Methods
We compare the performance of TtT with several strong baseline methods under both the FixLen and VarLen settings. NTOU employs an n-gram language model with a reranking strategy (Tseng et al., 2015). NCTU-NTUT also uses a CRF to conduct label dependency modeling (Tseng et al., 2015). HanSpeller++ employs a Hidden Markov Model with a reranking strategy (Zhang et al., 2015).
Hybrid utilizes an LSTM-based seq2seq framework to conduct generation (Wang et al., 2018), and Confusionset introduces a copy mechanism into the seq2seq framework.
FASPell incorporates BERT into the seq2seq framework for better performance. SoftMask-BERT first conducts error detection using a GRU-based model and then incorporates the predicted results into the BERT model using a soft-masked strategy (Zhang et al., 2020b). Note that the best results of SoftMask-BERT are obtained after pre-training on a large-scale dataset with 500M paired samples. SpellGCN incorporates phonological and visual similarity knowledge into language models via a specialized graph convolutional network (Cheng et al., 2020).
Chunk proposes a chunk-based decoding method with global optimization to correct single-character and multi-character word typos in a unified framework (Bao et al., 2020). We also implement some classical methods for comparison and ablation analysis, especially for the VarLen correction problem. Transformer-s2s is the typical Transformer-based seq2seq framework for sequence prediction (Vaswani et al., 2017). GPT2-finetune is a sequence generation framework fine-tuned from a pre-trained Chinese GPT2 model (Radford et al., 2019; Li, 2020). BERT-finetune simply fine-tunes the Chinese BERT model on the CGEC corpus directly. A beam search decoding strategy with beam size 5 is employed for generation with Transformer-s2s and GPT2-finetune. Note that some of the methods mentioned above can only work in the FixLen setting, such as SoftMask-BERT and BERT-finetune.

Evaluation Metrics
Following typical previous works (Zhang et al., 2020b), we employ sentence-level Accuracy, Precision, Recall, and F1-Measure as automatic metrics to evaluate the performance of all systems. We also report detailed results for error Detection (all locations of incorrect characters in a given sentence must be completely identical to the gold standard) and Correction (all locations and corresponding corrections of incorrect characters must be completely identical to the gold standard), respectively (Tseng et al., 2015).

Results in VarLen Scenario
Benefiting from the CRF-based dependency modeling component, TtT can conduct deletion, insertion, and local paraphrasing operations jointly to address the Variable-Length (VarLen) error correction problem. The experimental results are described in Table 3. Considering that sequence generation methods such as Transformer-s2s and GPT2-finetune can also conduct VarLen correction operations, we report their results as well. From the results, we observe that TtT also achieves superior performance in the VarLen scenario. The reasons are clear: BERT-finetune and related methods are not appropriate in the VarLen scenario, especially when the target is longer than the input, while text generation models such as Transformer-s2s and GPT2-finetune suffer from hallucination (Maynez et al., 2020) and repetition, making them unsteady on the problem of CGEC.

Ablation Analysis
Different Training Datasets. Recall that we introduced several groups of training datasets at different scales, as depicted in Table 1. It is interesting to investigate the performance on these different-size datasets. We therefore train on each of them and report the results, still on the SIGHAN15 test set, in Table 4. Regardless of the dataset scale, TtT always obtains the best performance.
Impact of L_dp and L_crf. Table 5 describes the performance of our model TtT and its variants without L_dp (TtT w/o L_dp) and without L_crf (TtT w/o L_crf). We can conclude that the fusion of the two tasks, direct prediction and CRF-based dependency modeling, indeed improves performance.
Parameter Tuning for the Focal Loss. The focal penalty hyperparameter γ is crucial for the loss function and should be adjusted for the specific task (Lin et al., 2020). We conduct a grid search over γ ∈ {0, 0.1, 0.5, 1, 2, 5}; the corresponding results are provided in Table 6. Finally, we select γ = 0.5 for TtT on the CGEC task.

Computing Efficiency Analysis
Practically, CGEC is an essential and useful task whose techniques can be used in many real applications such as writing assistants, post-processing of ASR and OCR, and search engines. Therefore, the time cost of a model is a key point that needs to be taken into account. We compare the inference time cost of TtT with some baseline approaches. The results demonstrate that TtT is a cost-effective method with superior prediction performance and low computing time complexity, and can be deployed online directly.

Conclusion
We propose a new framework named tail-to-tail non-autoregressive sequence prediction, abbreviated as TtT, for the problem of CGEC. A BERT-based sequence encoder is introduced to conduct bidirectional representation learning. In order to conduct substitution, deletion, insertion, and local paraphrasing simultaneously, a CRF layer is stacked on the up tail to conduct non-autoregressive sequence prediction by modeling the dependencies among neighbouring tokens. Low-rank decomposition and a truncated Viterbi algorithm are introduced to accelerate the computations. A focal loss penalty strategy is adopted to alleviate the class imbalance problem, considering that most of the tokens in a sentence are not changed. Experimental results on standard datasets demonstrate the effectiveness of TtT in terms of sentence-level Accuracy, Precision, Recall, and F1-Measure on the tasks of error Detection and Correction. TtT has low computing complexity and can be deployed online directly.
In the future, we plan to introduce more lexical analysis knowledge such as word segmentation and fine-grained named entity recognition (Zhang et al., 2020a) to further improve the performance.