Instantaneous Grammatical Error Correction with Shallow Aggressive Decoding

In this paper, we propose Shallow Aggressive Decoding (SAD) to improve the online inference efficiency of the Transformer for instantaneous Grammatical Error Correction (GEC). SAD optimizes online inference efficiency for GEC via two innovations: 1) it aggressively decodes as many tokens as possible in parallel, instead of always decoding only one token per step, to improve computational parallelism; 2) it uses a shallow decoder, instead of the conventional Transformer architecture with balanced encoder-decoder depth, to reduce the computational cost during inference. Experiments on both English and Chinese GEC benchmarks show that aggressive decoding yields predictions identical to greedy decoding but with a significant speedup for online inference. Its combination with the shallow decoder offers an even higher online inference speedup over the powerful Transformer baseline without quality loss. Not only does our approach allow a single model to achieve state-of-the-art results on English GEC benchmarks (66.4 F0.5 on the CoNLL-14 test set and 72.9 F0.5 on the BEA-19 test set with an almost 10x online inference speedup over the Transformer-big model), but it is also easily adapted to other languages. Our code is available at https://github.com/AutoTemp/Shallow-Aggressive-Decoding.

2020; Omelianchuk et al., 2020) for its poor inference efficiency in modern writing assistance applications (e.g., Microsoft Office Word, Google Docs and Grammarly), where a GEC model usually performs online inference, instead of batch inference, to proactively and incrementally check a user's latest completed sentence and offer instantaneous feedback.
To better exploit the Transformer for instantaneous GEC in practice, we propose a novel approach, Shallow Aggressive Decoding (SAD), to improve the model's online inference efficiency. The core innovation of SAD is aggressive decoding: instead of sequentially decoding only one token at each step, aggressive decoding tries to decode as many tokens as possible in parallel, with the assumption that the output sequence should be almost the same as the input. As shown in Figure 1, if the output prediction at each step perfectly matches its counterpart in the input sentence, the inference finishes, meaning that the model keeps the input untouched without editing; if the output token at some step does not match its corresponding token in the input, we discard all the predictions after the bifurcation position and re-decode them in the original autoregressive manner until we find a new opportunity for aggressive decoding. In this way, we can decode most of the text in parallel with the same prediction quality as autoregressive greedy decoding, while largely improving inference efficiency.
In addition to aggressive decoding, SAD proposes to use a shallow decoder, instead of the conventional Transformer with balanced encoder-decoder depth, to reduce the computational cost for further accelerating inference.

[Figure 1: The overview of aggressive decoding. Aggressive decoding tries to decode as many tokens as possible in parallel, with the assumption that the input and output should be almost the same in GEC. When we find a bifurcation between the input and the output of aggressive decoding, we accept the predictions before (and including) the bifurcation, discard all the predictions after the bifurcation, and re-decode them using the original one-by-one autoregressive decoding. If we find a suffix match (i.e., some advice highlighted with the blue dotted lines) between the output and the input during one-by-one re-decoding, we switch back to aggressive decoding by copying the tokens (highlighted with the orange dashed lines) following the matched tokens in the input to the decoder input, assuming they are likely to be the same.]

The experimental results on both English and Chinese GEC benchmarks show that both aggressive decoding and the shallow decoder can significantly improve online inference efficiency. By combining these two techniques, our approach shows a 9× ∼ 12× online inference speedup over the powerful Transformer baseline without sacrificing quality. The contributions of this paper are two-fold:

• We propose a novel aggressive decoding approach, allowing us to decode as many tokens as possible in parallel, which yields the same predictions as greedy decoding but with a substantial improvement in computational parallelism and online inference efficiency.
• We propose to combine aggressive decoding with a Transformer with a shallow decoder. Our final approach not only advances the state-of-the-art in English GEC benchmarks with an almost 10× online inference speedup but is also easily adapted to other languages.

Background: Transformer
The Transformer is a seq2seq neural network architecture based on the multi-head attention mechanism, which has become the most successful and widely used seq2seq model in various generation tasks such as machine translation, abstractive summarization, and GEC.
The original Transformer follows a balanced encoder-decoder architecture: its encoder, consisting of a stack of identical encoder layers, maps an input sentence x = (x_1, ..., x_n) to a sequence of continuous representations z = (z_1, ..., z_n); and its decoder, composed of a stack of the same number of identical decoder layers as the encoder, generates an output sequence o = (o_1, ..., o_m) given z.
In the training phase, the model learns an autoregressive scoring model P(y | x; Φ), implemented with teacher forcing:

P(y | x; Φ) = ∏_{i=0}^{l−1} P(y_{i+1} | y_{≤i}, x; Φ)    (1)

where y = (y_1, ..., y_l) is the ground-truth target sequence and y_{≤i} = (y_0, ..., y_i). As ground truth is available during training, Eq (1) can be efficiently obtained because the probability P(y_{i+1} | y_{≤i}, x; Φ) at each step can be computed in parallel.
During inference, the output sequence o = (o_1, ..., o_m) is derived by maximizing the following equation:

o_{j+1} = argmax_{o_{j+1}} P(o_{j+1} | o_{≤j}, x; Φ)    (2)

Since no ground truth is available in the inference phase, the model has to decode only one token at each step, conditioning on the previously decoded tokens o_{≤j}, instead of decoding in parallel as in the training phase.
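To make the serial bottleneck concrete, here is a minimal sketch of autoregressive greedy decoding (Eq (2)) in Python. The `score` function is a toy stand-in for the Transformer's greedy next-token choice (it simply copies the input through) and is not the paper's model; a real implementation would call the decoder and take an argmax over the vocabulary.

```python
BOS, EOS = "<bos>", "<eos>"

def score(prefix, x):
    """Toy stand-in for argmax P(o_{j+1} | o_<=j, x): copy the input through."""
    j = len(prefix) - 1            # prefix includes BOS at position 0
    return x[j] if j < len(x) else EOS

def greedy_decode(x, max_len=100):
    """Decode one token per step, each conditioned on all previous tokens."""
    o = [BOS]
    while len(o) <= max_len:
        nxt = score(o, x)          # one full decoder call per output token
        if nxt == EOS:
            break
        o.append(nxt)
    return o[1:]                   # strip BOS

print(greedy_decode("she likes cats".split()))
```

Each loop iteration requires a separate decoder call, which is exactly the serial computation that aggressive decoding (Section 3) parallelizes.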

Aggressive Decoding
As introduced in Section 2, the Transformer decodes only one token at each step during inference. The autoregressive decoding style is the main bottleneck of inference efficiency because it largely reduces computational parallelism.
For GEC, fortunately, the output sequence is usually very similar to the input with only a few edits if any. This special characteristic of the task makes it unnecessary to follow the original autoregressive decoding style; instead, we propose a novel decoding approach -aggressive decoding which tries to decode as many tokens as possible during inference. The overview of aggressive decoding is shown in Figure 1, and we will discuss it in detail in the following sections.

Initial Aggressive Decoding
The core motivation of aggressive decoding is the assumption that the output sequence o = (o_1, ..., o_m) should be almost the same as the input sequence x = (x_1, ..., x_n) in GEC. At the initial step, instead of only decoding the first token o_1 conditioning on the special [BOS] token o_0, aggressive decoding decodes o_{1...n} in parallel, conditioning on the pseudo previously decoded tokens ô_{0...n−1}, with the assumption that ô_{0...n−1} = x_{0...n−1}. Specifically, for j ∈ {0, 1, ..., n−2, n−1}, o_{j+1} is decoded as follows:

o_{j+1} = argmax_{o_{j+1}} P(o_{j+1} | ô_{≤j}, x; Φ)    (3)

where ô_{≤j} is the pseudo previously decoded prefix at step j+1, which is assumed to be the same as x_{≤j}. After we obtain o_{1...n}, we verify whether it is actually identical to x_{1...n}. If o_{1...n} is fortunately exactly the same as x_{1...n}, the inference finishes, meaning that the model finds no grammatical errors in the input sequence x_{1...n} and keeps the input untouched. More often, however, o_{1...n} will not be exactly the same as x_{1...n}. In such a case, we have to stop aggressive decoding and find the first bifurcation position k, such that the predictions o_{1...k} can be accepted as they would be no different even if they were decoded through the original autoregressive greedy decoding. The predictions o_{k+1...n}, however, have to be discarded and re-decoded, because o_k ≠ ô_k.
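The verification step above reduces to finding the first position where the parallel predictions diverge from the input. A minimal sketch (the token lists here are hypothetical examples, not from the paper's data):

```python
def find_bifurcation(o, x):
    """Return the first 0-based position k where o and x differ,
    or None if o is identical to x."""
    for k, (o_tok, x_tok) in enumerate(zip(o, x)):
        if o_tok != x_tok:
            return k
    return None if len(o) == len(x) else min(len(o), len(x))

x = ["he", "go", "to", "school"]       # input tokens x_{1..n}
o = ["he", "goes", "to", "school"]     # parallel predictions o_{1..n}
k = find_bifurcation(o, x)             # "goes" vs "go"
# The bifurcation token itself can be kept: it was conditioned only on the
# (matching) prefix, so greedy decoding would produce the same token.
accepted = o[:k + 1]
assert accepted == ["he", "goes"]      # the rest must be re-decoded
```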

Re-decoding
As o_k ≠ ô_k = x_k, we have to re-decode o_{j+1} (j ≥ k) one by one, following the original autoregressive decoding:

o_{j+1} = argmax_{o_{j+1}} P(o_{j+1} | o_{≤j}, x; Φ)    (4)

After we obtain o_{≤j} (j > k), we try to match its suffix to the input sequence x for further aggressive decoding. If we find that its suffix o_{j−q...j} (q ≥ 0) is a unique substring of x such that o_{j−q...j} = x_{i−q...i}, then we can assume that o_{j+1...} is very likely to be the same as x_{i+1...}, because of the special characteristic of the task of GEC.
If we fortunately find such a suffix match, we can switch back to aggressive decoding, decoding in parallel with the assumption that ô_{j+1...} = x_{i+1...}. Specifically, the token o_{j+t} (t > 0) is decoded as follows:

o_{j+t} = argmax_{o_{j+t}} P(o_{j+t} | ô_{<j+t}, x; Φ)    (5)

In Eq (5), ô_{<j+t} is derived as follows:

ô_{<j+t} = CAT(o_{≤j}, x_{i+1...i+t−1})    (6)

where CAT(a, b) is the operation that concatenates two sequences a and b. Otherwise (i.e., we cannot find a suffix match at the step), we continue decoding with the original autoregressive greedy decoding approach until we find a suffix match.
We summarize the process of aggressive decoding in Algorithm 1. To simplify implementation, we make two minor changes in Algorithm 1: 1) we set o_0 = x_0 = [BOS], which enables us to regard the initial aggressive decoding as the result of a suffix match of o_0 = x_0; 2) we append a special token [PAD] to the end of x so that a bifurcation (line 5 in Algorithm 1) must exist (see the bottom example in Figure 1). Since we discard all the computations and predictions after the bifurcation for re-decoding, aggressive decoding guarantees that its generation results are exactly the same as those of greedy decoding (i.e., beam=1). However, as aggressive decoding decodes many tokens in parallel, it largely improves computational parallelism during inference, greatly benefiting inference efficiency.
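Putting the pieces together, the following is an illustrative, runnable sketch of the overall procedure, not the authors' fairseq implementation. The toy `parallel_predict` stands in for one batched decoder call; its greedy choice depends only on prefix length, so plain greedy decoding would output the target `y` exactly, which lets us check that aggressive decoding reproduces greedy decoding's output while the bifurcation and suffix-match machinery is exercised.

```python
BOS, EOS, PAD = "<bos>", "<eos>", "<pad>"

def make_toy_model(y):
    """Toy scorer: the greedy prediction at position j+1 is y[j] regardless of
    the prefix content, so greedy decoding would output y exactly. A real model
    conditions on the prefix; this stand-in just exercises the algorithm."""
    def parallel_predict(dec_input):
        # one greedy prediction per position, computed "in parallel"
        return [y[j] if j < len(y) else EOS for j in range(len(dec_input))]
    return parallel_predict

def aggressive_decode(x, parallel_predict, q=1):
    """Sketch of shallow aggressive decoding's loop: o_0 = x_0 = BOS, and PAD
    is appended to x so that a bifurcation always exists."""
    x = [BOS] + x + [PAD]
    o = [BOS]
    i = 0                                  # suffix-match position in x
    while EOS not in o:
        # aggressive phase: assume the output continues as x_{i+1...} and
        # decode all remaining positions in one parallel call
        assumed = x[i + 1:]
        preds = parallel_predict(o + assumed)[len(o) - 1:]
        k = next(t for t in range(len(assumed)) if preds[t] != assumed[t])
        o += preds[:k + 1]                 # accept up to and incl. the bifurcation
        # re-decode one token at a time until a unique suffix match with x
        while EOS not in o:
            matches = [j for j in range(q, len(x))
                       if x[j - q:j + 1] == o[-(q + 1):]]
            if len(matches) == 1:
                i = matches[0]             # switch back to aggressive decoding
                break
            o.append(parallel_predict(o)[-1])
    return o[1:o.index(EOS)]

x = "he go to school .".split()
y = "he goes to school .".split()
out = aggressive_decode(x, make_toy_model(y))
assert out == y                            # identical to greedy decoding's output
```

With an error-free input (`y == x`), the function returns after a single parallel call, which is where the speedup for mostly-correct sentences comes from.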

Shallow Decoder
Even though aggressive decoding can significantly improve computational parallelism during inference, it still involves intensive computation and may even introduce additional computation due to re-decoding the discarded predictions.
To reduce the computational cost of decoding, we propose to use a shallow decoder, which has proven to be an effective strategy in neural machine translation (NMT) (Kasai et al., 2020; Li et al., 2021), instead of the Transformer with balanced encoder-decoder depth used in previous state-of-the-art Transformer-based GEC models. By combining aggressive decoding with the shallow decoder, we can further improve inference efficiency.
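As a concrete illustration, an imbalanced encoder-decoder depth can be configured in fairseq by overriding the layer counts. The flags below are fairseq's standard Transformer options, but the data path and hyperparameters are placeholders, not the paper's training recipe:

```shell
# Hypothetical sketch: a Transformer-big variant with a deep 9-layer encoder
# and a shallow 3-layer decoder. "data-bin/gec" is a placeholder path.
fairseq-train data-bin/gec \
    --arch transformer_vaswani_wmt_en_de_big \
    --encoder-layers 9 \
    --decoder-layers 3 \
    --share-all-embeddings \
    --optimizer adam --lr 0.0005 \
    --max-tokens 4096
```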

Data and Model Configuration
We follow recent work in English GEC and conduct experiments in the restricted training setting of the BEA-2019 GEC shared task (Bryant et al., 2019): we use the Lang-8 Corpus of Learner English (Mizumoto et al., 2011), NUCLE (Dahlmeier et al., 2013), FCE (Yannakoudakis et al., 2011) and W&I+LOCNESS (Granger; Bryant et al., 2019) as our GEC training data. To facilitate fair comparison in the efficiency evaluation, we follow previous studies that conduct GEC efficiency evaluation (Omelianchuk et al., 2020; Chen et al., 2020) and use the CoNLL-2014 (Ng et al., 2014) dataset, which contains 1,312 sentences, as our main test set, evaluating the speedup as well as Max-Match (Dahlmeier and Ng, 2012) precision, recall and F_0.5 with the official evaluation scripts. For validation, we use CoNLL-2013, which contains 1,381 sentences. We also test our approach on the NLPCC-18 Chinese GEC shared task (Zhao et al., 2018), following its training and evaluation setting, to verify the effectiveness of our approach in other languages. To compare with the state-of-the-art approaches in English GEC that pretrain with synthetic data, we also synthesize 300M error-corrected sentence pairs for pretraining the English GEC model, following the approaches of Grundkiewicz et al. (2019) and Zhang et al. (2019). Note that in the following evaluation sections, the evaluated models are by default trained without the synthetic data unless explicitly mentioned. We use the most popular GEC model architecture, the Transformer (big) model (Vaswani et al., 2017), as our baseline, which has a 6-layer encoder and a 6-layer decoder with 1,024 hidden units. We train the English GEC model with an encoder-decoder shared vocabulary of 32K Byte Pair Encoding (Sennrich et al., 2016) tokens and the Chinese GEC model with 8.4K Chinese characters. We include more training details in the supplementary notes. For inference, we use greedy decoding by default.
All the efficiency evaluations are conducted in the online inference setting (i.e., batch size = 1), as we focus on instantaneous GEC. We perform model inference with the fairseq implementation, using PyTorch 1.5.1 and CUDA 10.2 on a single Nvidia Tesla V100-PCIe GPU with 16GB memory.

Evaluation for Aggressive Decoding
We evaluate aggressive decoding on our validation set (CoNLL-13), which contains 1,381 validation examples. As shown in Table 1, aggressive decoding achieves a 7× ∼ 8× speedup over the original autoregressive beam search (beam=5) and generates exactly the same predictions as greedy decoding, as discussed in Section 3.1.2. Note that our implementation of greedy decoding is simplified for higher efficiency, yielding a 1.3× ∼ 1.4× speedup over beam=5, compared with around 1.1× for the beam=1 decoding implemented in fairseq. Since greedy decoding achieves overall performance (i.e., F_0.5) comparable to beam search, while tending to make more edits and thus obtaining higher recall but lower precision, the advantage of aggressive decoding in practical GEC applications is obvious given its strong performance and superior efficiency.
We further look into the efficiency improvement brought by aggressive decoding. Figure 2 shows the speedup distribution of the 1,381 examples in CoNLL-13 with respect to their edit ratio, which is defined as the edit distance between the input and the output, normalized by the input length. It is obvious that sentences with fewer edits tend to achieve higher speedup, which is consistent with our intuition that most tokens in such sentences can be decoded in parallel through aggressive decoding; on the other hand, for sentences that are heavily edited, the speedup is limited because of frequent re-decoding. To give a more intuitive analysis, we also present concrete examples with various speedups from our validation set in Table 2 to understand how aggressive decoding improves inference efficiency.

[Table 2: Concrete examples from the validation set with various speedups and edit ratios.]
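The edit ratio used in this analysis can be computed with a token-level Levenshtein distance normalized by the input length. The paper does not specify its exact implementation, so the following is an illustrative one:

```python
def edit_distance(a, b):
    """Token-level Levenshtein distance via one-row dynamic programming."""
    dp = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        prev, dp[0] = dp[0], i             # prev holds dp[i-1][j-1]
        for j in range(1, len(b) + 1):
            cur = dp[j]                    # dp[i-1][j] before overwrite
            dp[j] = min(dp[j] + 1,         # deletion
                        dp[j - 1] + 1,     # insertion
                        prev + (a[i - 1] != b[j - 1]))  # (mis)match
            prev = cur
    return dp[-1]

def edit_ratio(src, hyp):
    """Edit distance normalized by the input length."""
    return edit_distance(src, hyp) / len(src)

src = "he go to school yesterday".split()
hyp = "he went to school yesterday".split()
print(edit_ratio(src, hyp))    # 1 substitution over 5 tokens = 0.2
```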

Moreover, we conduct an ablation study to investigate whether it is necessary to constrain the maximal aggressive decoding length, because it might be highly risky to waste a large amount of computation on re-decoding a number of steps after the bifurcation if we aggressively decode a very long sequence in parallel. We find that constraining the maximal aggressive decoding length does not help improve efficiency; instead, it slows down inference if the maximal aggressive decoding length is set to a small number. We think the reason is that sentences in GEC datasets are rarely very long. For example, the average length of the sentences in CoNLL-13 is 21 tokens, and 96% of them are shorter than 40 tokens. Therefore, it is unnecessary to constrain the maximal aggressive decoding length in GEC.

[Table 5: The performance and online inference efficiency evaluation of efficient GEC models in CoNLL-14. For the marked models, the performance and speedup numbers are from Chen et al. (2020), who evaluate online efficiency in the same runtime setting (e.g., GPU and runtime libraries) as ours. The underlines indicate speedup numbers evaluated with Tensorflow based on the corresponding released code, which are not strictly comparable here. Note that for GECToR, we re-implement the inference process of GECToR (RoBERTa) using fairseq to test its speedup in our setting. "-" means the speedup cannot be tested in our runtime environment because the model has not been released or is not implemented in fairseq.]

Evaluation for Shallow Decoder
We study the effects of changing the number of encoder and decoder layers in the Transformer-big on both performance and online inference efficiency. By comparing 6+6 with 3+6 and 9+6 in Table 4, we observe that performance improves as the encoder becomes deeper, demonstrating the importance of the encoder in GEC. In contrast, by comparing 6+6 with 6+3 and 6+9, we do not see a substantial fluctuation in performance, indicating no necessity for a deep decoder. Moreover, we observe that a deeper encoder does not significantly slow down inference, whereas a shallower decoder can greatly improve inference efficiency. This is because Transformer encoders can be parallelized efficiently on GPUs, whereas Transformer decoders are autoregressive, so the number of decoder layers greatly affects decoding speed, as discussed in Section 3.2. These observations motivate us to make the encoder deeper and the decoder shallower.
As shown in the bottom group of Table 4, we try different combinations of encoder and decoder layer counts given approximately the same parameterization budget as the Transformer-big. It is interesting to observe that 7+5, 8+4 and 9+3 achieve comparable or even better performance than the Transformer-big baseline with much less computational cost. When we further increase the encoder layers and decrease the decoder layers, we see a drop in the performance of 10+2 and 11+1 despite the improved efficiency, because it becomes difficult to train a Transformer with an extremely imbalanced encoder and decoder well, as indicated by previous work (Kasai et al., 2020; Li et al., 2021; Gu and Kong, 2020).
Since the 9+3 model achieves the best result with around a 2× speedup on the validation set with almost the same parameterization budget, we choose it as the model architecture to combine with aggressive decoding for the final evaluation.

Results
We evaluate our final approach, shallow aggressive decoding, which combines aggressive decoding with the shallow decoder. Table 5 shows the performance and efficiency of our approach and recently proposed efficient GEC models, all of which are faster than the Transformer-big baseline, on the CoNLL-14 test set. Our approach (the 9+3 model with aggressive decoding), pretrained with synthetic data, achieves 63.5 F_0.5 with a 10.3× speedup over the Transformer-big baseline, which outperforms the majority of the efficient GEC models in terms of either quality or speed. (Note that PIE is not strictly comparable here because its training data is different from ours: PIE does not use the W&I+LOCNESS corpus.) The only model that shows advantages over our 9+3 model is GECToR, which is developed on top of powerful pretrained models (e.g., RoBERTa and XLNet (Yang et al., 2019)) with its multi-stage training strategy. Following GECToR's recipe, we leverage the pretrained model BART to initialize a 12+2 model, which has proven to work well in NMT (Li et al., 2021) despite more parameters, and apply the multi-stage fine-tuning strategy used in Stahlberg and Kumar (2020). The final single model with aggressive decoding achieves the state-of-the-art result of 66.4 F_0.5 on the CoNLL-14 test set with a 9.6× speedup over the Transformer-big baseline; the same model checkpoint also achieves the state-of-the-art result of 72.9 F_0.5 with a 9.3× speedup on the BEA-19 test set. (Previous work (Kasai et al., 2020; Li et al., 2021; Gu and Kong, 2020) shows that sequence-level knowledge distillation (KD) may benefit training the extremely imbalanced Transformer in NMT; however, we do not conduct KD, for fair comparison with other GEC models in previous work.)

[Table 6: The performance and online inference efficiency evaluation of the language-independent efficient GEC models in the NLPCC-18 Chinese GEC benchmark.]
Unlike GECToR and PIE, which are difficult to adapt to other languages despite their competitive speed because they are specially designed for English GEC with many manually designed language-specific operations, such as transformations of verb forms (e.g., VBD→VBZ) and prepositions (e.g., in→at), our approach is data-driven without depending on language-specific features, and thus can be easily adapted to other languages (e.g., Chinese). As shown in Table 6, our approach consistently performs well in Chinese GEC, showing an around 12.0× online inference speedup over the Transformer-big baseline with comparable performance.

Related Work
The state of the art in GEC has been significantly advanced owing to the tremendous success of seq2seq learning (Sutskever et al., 2014) and the Transformer (Vaswani et al., 2017). Most recent work on GEC focuses on improving the performance of Transformer-based GEC models. However, except for the approaches that add synthetic erroneous data for pretraining (Ge et al., 2018a; Grundkiewicz et al., 2019; Zhang et al., 2019; Lichtarge et al., 2019; Wan et al., 2020), most methods that improve performance (Ge et al., 2018b; Kaneko et al., 2020) introduce additional computational cost and thus slow down inference despite the performance improvement.
To make the Transformer-based GEC model more efficient during inference for practical application scenarios, some recent studies have started exploring approaches based on edit operations. Among them, PIE (Awasthi et al., 2019) and GECToR (Omelianchuk et al., 2020) propose to accelerate inference by simplifying GEC from sequence generation to iterative edit-operation tagging. However, as they rely on many language-dependent edit operations, such as the conversion of singular nouns to plurals, it is difficult for them to adapt to other languages. LaserTagger (Malmi et al., 2019) uses a similar method but is data-driven and language-independent, learning operations from training data; however, its performance is not as desirable as that of its seq2seq counterparts despite its high efficiency. The only two previous efficient approaches that are both language-independent and good-performing are Stahlberg and Kumar (2020), which uses span-based edit operations to correct sentences, saving the time spent copying unchanged tokens, and Chen et al. (2020), which first identifies incorrect spans with a tagging model and then only corrects those spans with a generator. However, all these approaches have to extract edit operations, and even conduct token alignment, in advance from the error-corrected sentence pairs in order to train the model. In contrast, our proposed shallow aggressive decoding accelerates model inference through parallel autoregressive decoding, which is related to some previous work (Stern et al., 2018) in neural machine translation (NMT), and through the imbalanced encoder-decoder architecture recently explored by Kasai et al. (2020) and Li et al. (2021) for NMT.
Our approach is not only language-independent and efficient, guaranteeing predictions exactly the same as those of greedy decoding, but it also requires no change to the way of training, making it much easier to train without the complicated data preparation required by the edit-operation-based approaches.

Conclusion and Future Work
In this paper, we propose Shallow Aggressive Decoding (SAD) to improve the online inference efficiency of the Transformer for instantaneous GEC. Aggressive decoding yields the same prediction quality as autoregressive greedy decoding but with much lower latency. Its combination with a Transformer with a shallow decoder achieves state-of-the-art performance with a 9× ∼ 12× online inference speedup over the Transformer-big baseline for GEC.
Based on this preliminary study of SAD in GEC, we plan to further explore the technique for accelerating the Transformer in other sentence rewriting tasks where the input is similar to the output, such as style transfer and text simplification. We believe SAD is promising to become a general acceleration methodology for writing intelligence models in modern writing assistance applications that require fast online inference.

B CPU Efficiency

Table 8 shows the total latency and speedup of the Transformer with different encoder-decoder depths on an Intel Xeon E5-2690 v4 processor (2.60GHz) with 8 and 2 threads, respectively. Our approach achieves a 7× ∼ 8× online inference speedup over the Transformer-big baseline on CPU.