Bidirectional Transformer Reranker for Grammatical Error Correction

Pre-trained seq2seq models have achieved state-of-the-art results in the grammatical error correction task. However, these models still suffer from a prediction bias due to their unidirectional decoding. Thus, we propose a bidirectional Transformer reranker (BTR) that re-estimates the probability of each candidate sentence generated by the pre-trained seq2seq model. The BTR preserves the seq2seq-style Transformer architecture but utilizes a BERT-style self-attention mechanism in the decoder to compute the probability of each target token by masked language modeling, capturing bidirectional representations from the target context. To guide the reranking, the BTR adopts negative sampling in its objective function to minimize the unlikelihood. During inference, the BTR gives the final results after comparing the reranked top-1 results with the original ones by an acceptance threshold. Experimental results show that, in reranking candidates from a pre-trained seq2seq model, T5-base, the BTR on top of T5-base yields F0.5 scores of 65.47 and 71.27 on the CoNLL-14 and BEA test sets, respectively, and a GLEU score of 59.52 on the JFLEG corpus, improvements of 0.36, 0.76, and 0.48 points over the original T5-base. Furthermore, when reranking candidates from T5-large, the BTR on top of T5-base improved on the original T5-large by 0.26 points on the BEA test set.


Introduction
Grammatical error correction (GEC) is a sequence-to-sequence task in which a model corrects an ungrammatical sentence. An example is presented in Table 1. Various neural models for GEC have emerged (Junczys-Dowmunt et al., 2018; Kiyono et al., 2019; Kaneko et al., 2020; Rothe et al., 2021) due to the importance of this task for language learners, who tend to produce ungrammatical sentences.
Previous studies have shown that GEC can be approached as machine translation by using a seq2seq model (Luong et al., 2015) with a Transformer (Vaswani et al., 2017) architecture (Junczys-Dowmunt et al., 2018; Zhao et al., 2019; Kiyono et al., 2019; Kaneko et al., 2020; Rothe et al., 2021). As a neural model consisting of an encoder and a decoder, the seq2seq architecture typically requires a large amount of training data. Because GEC suffers from limited training data, applying a seq2seq model to GEC is a low-resource setting, which can be handled by introducing synthetic data for training (Kiyono et al., 2019; Omelianchuk et al., 2020; Stahlberg and Kumar, 2021). However, as pointed out by Rothe et al. (2021), the use of synthetic data in GEC may result in a distributional shift and require language-specific tuning, which can be time-consuming and resource-intensive.
Considering the limitations of synthetic data, the current trend is to utilize the learned, general representations of a pre-trained model, such as BERT (Devlin et al., 2019), XLNet (Yang et al., 2019), BART (Lewis et al., 2020), or T5 (Raffel et al., 2020), which have been trained on large corpora and shown to be effective for various downstream tasks. According to Kaneko et al. (2020), incorporating a pre-trained masked language model (MLM) into a seq2seq model can facilitate correction. In addition, as reported by Rothe et al. (2021), the pre-trained T5 model achieved state-of-the-art results on GEC benchmarks for four languages after a single fine-tuning step with the cleaned LANG-8 corpus (cLang-8) (Rothe et al., 2021).
Although the seq2seq model with pre-trained representations has been shown to be effective for GEC, its performance is still constrained by its unidirectional decoding. As suggested by Liu et al. (2021), for an ungrammatical sentence, a fully pre-trained seq2seq GEC model (Kiyono et al., 2019) can generate several high-quality grammatical sentences by beam search. However, even among these candidates, there may still be a gap between the selected hypothesis and the most grammatical one. Our experimental results, listed in Table 5, also support this observation. To address this decoding problem, given the hypotheses of a seq2seq GEC model, Kaneko et al. (2019) used BERT to classify hypotheses as ungrammatical or grammatical and reranked them on the basis of the classification results. Previous studies (Kiyono et al., 2019; Kaneko et al., 2020) also showed that a seq2seq GEC model decoding in the opposite direction, i.e., right-to-left, is effective as a reranker for a left-to-right GEC model.

[Table 1: An example from the (Napoles et al., 2017) test set. The 3 candidate sentences were generated by T5GEC (§5.1). Blue indicates the range of corrections. "Candidate 1 (T5GEC)" denotes that T5GEC regards "Candidate 1" as the most grammatical correction.]
Therefore, to further improve the performance of the pre-trained seq2seq model for GEC, it is essential to leverage bidirectional representations of the target context. In this study, on the basis of the seq2seq-style Transformer model, we propose a bidirectional Transformer reranker (BTR) to handle the interaction between the source sentence and the bidirectional target context. The BTR utilizes a BERT-style self-attention mechanism in the decoder to predict each target token by masked language modeling (Devlin et al., 2019). Given several candidate target sentences from a base model, the BTR can re-estimate the sentence probability of each candidate from its bidirectional representation, which differs from the conventional seq2seq model. During training, to guide the reranking, we adopt negative sampling (Mikolov et al., 2013) in the objective function to minimize the unlikelihood while maximizing the likelihood. During inference, considering the robustness of pre-trained models, we compare the reranked top-1 results with the original ones by an acceptance threshold λ to decide whether to accept the suggestion from the BTR.
We regard the state-of-the-art model for GEC (Rothe et al., 2021), a pre-trained Transformer model, T5 (either T5-base or T5-large), as our base model and rerank its generated candidates. Because the BTR can inherit learned representations from a pre-trained Transformer model, we construct the BTR on top of T5-base. Our experimental results showed that, by reranking candidates from a fully pre-trained and fine-tuned T5-base model, the BTR on top of T5-base can achieve an F0.5 score of 65.47 on the CoNLL-14 benchmark. The BTR on top of T5-base also outperformed T5-base on the BEA test set by 0.76 points, achieving an F0.5 score of 71.27. Adopting negative sampling for the BTR also generated a peaked probability distribution for ranking, so grammatical suggestions could be selected by using λ. Furthermore, the BTR on top of T5-base was robust even when reranking candidates from T5-large and improved the performance by 0.26 points on the BEA test set.

Related Work
To directly predict the target corrections from the corresponding input tokens, Omelianchuk et al. (2020) and Malmi et al. (2022) regarded the encoder of the Transformer model as a non-autoregressive GEC sequence tagger. The experimental results of Omelianchuk et al. (2020) showed that, compared with a randomly initialized LSTM (Hochreiter and Schmidhuber, 1997), pre-trained models such as RoBERTa (Liu et al., 2019), GPT-2 (Radford et al., 2019), and ALBERT (Lan et al., 2020) can achieve higher F0.5 scores as taggers. Sun et al. (2021) considered GEC as a seq2seq task and introduced Shallow Aggressive Decoding (SAD) for the decoder of the Transformer. With SAD, the performance of a pre-trained seq2seq model, BART, surpassed the sequence taggers of Omelianchuk et al. (2020). The T5 xxl model is a pre-trained seq2seq model with 11B parameters (Raffel et al., 2020). After fine-tuning with the cLang-8 corpus, T5 xxl and mT5 xxl (Xue et al., 2021), a multilingual version of T5, achieved state-of-the-art results on GEC benchmarks in four languages: English, Czech, German, and Russian (Rothe et al., 2021). This demonstrated that a single fine-tuning step for a fully pre-trained seq2seq model is a simple and effective method for GEC without incorporating a copy mechanism (Zhao et al., 2019), SAD, or the output of a pre-trained MLM (Kaneko et al., 2020). Despite the improvements brought about by pre-trained representations, the conventional seq2seq structure suffers from a prediction bias due to its unidirectional decoding. According to Liu et al. (2021), by using beam search, a fully pre-trained seq2seq GEC model (Kiyono et al., 2019) can generate several high-quality grammatical hypotheses, including one that is more grammatical than the selected one.
To address the shortcoming of unidirectional decoding, previous studies (Kiyono et al., 2019; Kaneko et al., 2019, 2020) introduced reversed representations to rerank the hypotheses. Kiyono et al. (2019) and Kaneko et al. (2020) utilized a seq2seq GEC model that decodes in the opposite direction (right-to-left) to rerank candidates, which was effective for selecting a more grammatical sentence than the original one. This finding motivated us to use a bidirectional decoding method for our model. Instead of using a seq2seq model, Kaneko et al. (2019) fine-tuned BERT as a reranker to evaluate the grammatical quality of a sentence. By using masked language modeling, BERT learned deep bidirectional representations to distinguish between grammatical and ungrammatical sentences. However, BERT did not account for the positions of corrections, as it discarded the source sentence and considered only the target sentence. This made it difficult for BERT, as a reranker, to recognize the most suitable corrected sentence for an ungrammatical sentence. Salazar et al. (2020) proposed the use of pseudo-log-likelihood scores (PLL) for reranking. They demonstrated that RoBERTa with the PLL for reranking outperformed the conventional language model GPT-2 when reranking candidates in speech recognition and machine translation tasks. Zhang et al.
(2021) also claimed that the pre-trained model MPNet (Song et al., 2020) was more effective than GPT-2 when using PLL for reranking in discourse segmentation and parsing. Zhang and van Genabith (2021) proposed a bidirectional Transformer-based alignment (BTBA) model, which aims to assess the alignment between source and target tokens in machine translation. To achieve this, BTBA masks and predicts the current token with attention to both left and right sides of the target context to produce alignments for the current token. Specifically, to assess alignments from the attention scores in all cross-attention layers, the decoder in BTBA discards the last feed-forward layer of the Transformer model and directly predicts masked tokens from the output of the last cross-attention layer. Even though the target context on both sides is taken into consideration, one limitation of BTBA is that the computed alignments ignore the representation of the current token. To produce more accurate alignments, Zhang and van Genabith (2021) introduced full context based optimization (FCBO) for fine-tuning, in which BTBA no longer masks the target sentence and uses the full target context.
In our research, to determine the most appropriate correction for a given erroneous sentence, we model the BTR as a seq2seq reranker, which encodes the erroneous sentence with an encoder and decodes a corrected sentence with a decoder. In contrast to the conventional seq2seq model, we use masked language modeling to mask and predict each target token in the decoder and estimate the sentence probability of each candidate using PLL. Unlike BTBA, the BTR preserves the last feed-forward layer in the decoder to predict masked tokens more accurately. Because the original data of the masked tokens should be invisible during prediction, the FCBO fine-tuning step is not used in the BTR. Compared with BTBA, the BTR keeps the structure of the Transformer model and can easily inherit parameters from pre-trained models.

Preliminary
Because the decoder of the BTR uses masked language modeling to rerank candidates on the basis of the PLL, in this section we explain how a Transformer-based GEC model generates candidates, the masked language modeling used in BERT, and how the PLL is computed.

Transformer-based GEC Model
Given an ungrammatical sentence x = (x_1, . . ., x_n), a GEC model corrects x into its grammatical version y = (y_1, . . ., y_m), where x_i is the i-th token of x and y_j is the j-th token of y. As an auto-regressive model, a Transformer-based GEC model with parameters θ decomposes the conditional probability of y as

p(y | x; θ) = ∏_{j=1}^{m} p(y_j | x, y_{<j}; θ),   (1)

p(y_j | x, y_{<j}; θ) = softmax(W s̄_j + b),   (2)

where s̄_j is the final hidden state from the decoder at the j-th decoding step, W is a weight matrix, b is a bias term, and y_{<j} denotes (y_1, . . ., y_{j−1}). s̄_j is computed as described in Appendix A.

Decoding Method
The pre-trained T5 model with the Transformer architecture achieved state-of-the-art results in GEC by using beam search for decoding (Rothe et al., 2021). However, previous studies (Li and Jurafsky, 2016; Vijayakumar et al., 2018) have suggested that beam search tends to generate sequences with only slight differences. This can constrain the upper-bound score when reranking candidates (Ippolito et al., 2019). To select the optimal decoding method for a Transformer-based GEC model, T5GEC, we compared beam search with diverse beam search (Vijayakumar et al., 2018), top-k sampling (Fan et al., 2018), and nucleus sampling (Holtzman et al., 2020). For each pair of data in the CoNLL-13 corpus (Ng et al., 2013), we required all decoding methods to generate 5 candidate sequences with a maximum sequence length of 200. When using diverse beam search, we fixed the number of beam groups and the diversity penalty to 5 and 0.4, respectively. Meanwhile, we set k to 50 for top-k sampling and p to 0.95 for nucleus sampling. Table 2 presents the results of comparing the decoding methods. Oracle indicates the upper-bound score that can be achieved with the generated candidates: if the candidates include the correct answer, we assume the prediction is correct. Unique (%) indicates the rate of unique sequences among all candidates. Gold (%) indicates the rate of data pairs whose candidates include the correct answer. The results show that beam search generates more diverse sentences with the highest Oracle score compared with nucleus sampling, top-k sampling, and diverse beam search. This may be because, in the GEC task, most of the tokens in the target are the same as in the source, which causes a peaked probability distribution focused on one or a few tokens. Thus, a top-k filtering method like beam search generates more diverse sentences than sampling or applying a probability-based diversity penalty. Based on these results, we chose beam search as the decoding method for
T5GEC during inference. For evaluation, T5GEC generates the top-ranked hypothesis with a beam size of 5. To generate the top-a candidates Y_a = {y^1, . . ., y^a} for reranking, it generates hypotheses with a beam size of a and a maximum sequence length of 128 for training and 200 for prediction.
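The top-k filtering behavior of beam search discussed above can be illustrated with a minimal, self-contained sketch; the toy bigram table and its probabilities below are invented for illustration and are unrelated to T5GEC.

```python
import math

# Toy bigram model: log-probabilities of the next token given the last
# token. "</s>" ends a hypothesis. All values are invented for illustration.
LOGPROBS = {
    "<s>": {"the": math.log(0.6), "a": math.log(0.4)},
    "the": {"cat": math.log(0.7), "dog": math.log(0.3)},
    "a":   {"cat": math.log(0.5), "dog": math.log(0.5)},
    "cat": {"</s>": math.log(1.0)},
    "dog": {"</s>": math.log(1.0)},
}

def beam_search(beam_size, max_len=5):
    """Keep the beam_size highest-scoring prefixes at every step
    (deterministic top-k filtering, unlike sampling)."""
    beams = [(["<s>"], 0.0)]
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == "</s>":  # finished hypotheses stay in the beam
                candidates.append((tokens, score))
                continue
            for tok, lp in LOGPROBS[tokens[-1]].items():
                candidates.append((tokens + [tok], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
        if all(t[-1] == "</s>" for t, _ in beams):
            break
    return [b for b in beams if b[0][-1] == "</s>"]

hyps = beam_search(beam_size=4)
# With beam_size = 4 the search returns 4 distinct hypotheses, the
# top-ranked one being "the cat" with probability 0.6 * 0.7 = 0.42.
```

A sampling method run on the same peaked toy distribution could draw the same sentence several times, which matches the lower Unique (%) of sampling-based decoding observed in Table 2.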

Masked Language Modeling
Masked language modeling, used in BERT, was introduced to learn bidirectional representations of a given sentence x through self-supervised learning (Devlin et al., 2019). Before pre-training, several tokens in x are randomly replaced with the mask token <M>. Let κ denote the set of masked positions, x_κ the set of masked tokens, and x_{\κ} the sentence after masking. The model parameters θ are optimized by maximizing the objective

∑_{k ∈ κ} log p(x_k | x_{\κ}; θ).   (3)

The pseudo-log-likelihood (PLL) of x is then estimated by masking each position in turn:

PLL(x; θ) = ∑_{i=1}^{|x|} log p(x_i | x_{\{i\}}; θ),   (4)

where |x| is the length of x. As suggested by Salazar et al. (2020), when using PLL to estimate the cross-entropy loss, the loss of x_i given x_{\{i\}} from BERT is flatter across positions than that of GPT-2, which uses the chain rule. Because the candidate sentences might have different lengths, PLL is well suited to reranking.
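To make the PLL computation concrete, the sketch below scores a sentence by masking one position at a time. Here `mlm_prob` is a hypothetical stand-in for a masked LM such as BERT (a real implementation would mask position i and query the model), and its probability table is invented.

```python
import math

def mlm_prob(tokens, i):
    """Stand-in for p(tokens[i] | tokens with position i masked).
    A real reranker would run a masked LM here; this lookup table
    is purely illustrative."""
    table = {"the": 0.9, "cat": 0.6, "cats": 0.2, "sat": 0.5}
    return table.get(tokens[i], 0.1)

def pll(tokens):
    """Pseudo-log-likelihood: sum log p(x_i | x without position i)."""
    return sum(math.log(mlm_prob(tokens, i)) for i in range(len(tokens)))

def normalized_pll(tokens):
    # Length normalization lets candidates of different lengths compete fairly.
    return pll(tokens) / len(tokens)

good = ["the", "cat", "sat"]
bad = ["the", "cats", "sat"]
# The grammatical candidate receives the higher normalized PLL.
```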

Bidirectional Transformer Reranker (BTR)
The BTR uses masked language modeling in the decoder to estimate the probability of a corrected sentence. Given an ungrammatical sentence x, a base GEC model first generates the top-a corrected sentences Y_a, as described in Section 3.1. Let y_base ∈ Y_a be the top-ranked hypothesis from the base GEC model. The BTR selects and accepts the optimal corrected sentence y_BTR from Y_a on the basis of the estimated sentence probability, as described in the following. Figure 2 shows an overview of the whole BTR procedure.

Target Sentence Probability
As PLL has been shown to be effective in estimating sequence probabilities for reranking, we decompose the conditional sentence probability of y as

PLL(y | x; θ) = ∑_{j=1}^{m} log p(y_j | x, y_{\{j\}}; θ),   (5)

where y_{\{j\}} denotes y with the j-th token masked. As in Eq. (2), a linear transformation with the softmax function is applied to the final hidden state s̄_j to predict p(y_j | x, y_{\κ}; θ). As in the Transformer architecture, s̄_j is the result of s_j after the cross-attention and feed-forward layers. We assume the decoder consists of L layers. To capture the bidirectional representation, for ℓ ∈ {1, . . ., L}, we compute s_j^ℓ as

s_j^ℓ = Attn_s(s_j^{ℓ−1}, S^{ℓ−1}, S^{ℓ−1}),   (6)

where s_j^0 is the embedding of the (j−1)-th token of y_{\κ}, s_1^0 is the state of the start token <s>, and S^{ℓ−1} denotes the set of hidden states for the joint sequence of <s> and y_{\κ}. Attn_s indicates the self-attention layer. Figure 1b shows our fully-visible attention mask for computing S^ℓ in parallel. The procedure of using the BTR to predict p(y_j | x, y_{\κ}; θ) is shown in Appendix C.
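The difference between the causal pattern (Figure 1a) and the fully-visible pattern (Figure 1b) can be written out directly. The Boolean matrices below mark which key positions each query position may attend to; this is a pure-Python illustration, not the paper's implementation.

```python
def causal_mask(n):
    """Figure 1a style: position j may attend only to positions k <= j,
    as in a conventional left-to-right decoder."""
    return [[k <= j for k in range(n)] for j in range(n)]

def fully_visible_mask(n):
    """Figure 1b style: every position attends to the whole (masked)
    target sequence, as in the BTR's BERT-style decoder self-attention."""
    return [[True for _ in range(n)] for _ in range(n)]

# Under the causal mask the first position cannot see the last one;
# under the fully-visible mask it can.
```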

Objective Function
As a reranker, for a given ungrammatical sentence x, the BTR should compare all corresponding corrected sentences Y and select the most grammatical one. However, considering all possible corrected sentences for x is intractable, as suggested by Stahlberg and Byrne (2019), so we instead consider a subset of sequences Y_a based on the top-a results from the base GEC model.
Let y_gold ∈ Y denote the gold correction for x. For y ∈ Y_a ∪ {y_gold}, we follow the setting of BERT and randomly mask 15% of y, denoting by κ the set of masked positions; the distribution of the masked tokens satisfies the 8:1:1 masking strategy. Following previous research (Welleck et al., 2019; Zhang et al., 2021; Song et al., 2021), given the masked sentence y_{\κ}, the model parameters θ of the BTR are optimized by maximizing the likelihood and minimizing the unlikelihood as

∑_{y ∈ Y_a ∪ {y_gold}} ∑_{k ∈ κ} ( 1_y log p(y_k | x, y_{\κ}; θ) + (1 − 1_y) log(1 − p(y_k | x, y_{\κ}; θ)) ),   (7)

where p(y_k | x, y_{\κ}; θ) is computed as in Section 4.1 and 1_y is an indicator function defined as

1_y = 1 if y = y_gold, and 0 otherwise.   (8)
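The likelihood/unlikelihood trade-off in the objective can be sketched as follows. Here `probs` holds the model probabilities of one candidate's masked tokens; the values used below are invented for illustration.

```python
import math

def masked_objective(probs, is_gold):
    """Contribution of one candidate to the training objective:
    the gold correction is pushed toward high probability (likelihood),
    while a negative candidate is pushed toward low probability
    (unlikelihood, cf. Welleck et al., 2019)."""
    if is_gold:
        return sum(math.log(p) for p in probs)
    return sum(math.log(1.0 - p) for p in probs)

# Maximizing the objective rewards confident predictions on the gold
# candidate and low probabilities on negative candidates.
assert masked_objective([0.9, 0.8], True) > masked_objective([0.5, 0.5], True)
assert masked_objective([0.1, 0.2], False) > masked_objective([0.9, 0.8], False)
```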

Inference
During inference, for y ∈ Y_a, the BTR scores y by the length-normalized PLL:

f(y | x) = PLL(y | x; θ) / m,   (9)

where m is the length of y. Hereafter, we denote by y_BTR ∈ Y_a the candidate with the highest score f(y_BTR | x) for a given x. f(y | x) can also be regarded as the confidence of the BTR. Because the BTR is optimized with Eq. (7), a high score for y_BTR indicates a confident prediction, while a low score indicates an unconfident prediction.
Considering that we build the base GEC model from a fully pre-trained seq2seq model and the BTR from an insufficiently pre-trained model, we introduce an acceptance threshold λ to decide whether to accept the suggestion from the BTR. We accept y_BTR only when it satisfies the following inequality; otherwise, y_base remains the final result:

exp(f(y_BTR | x)) / ∑_{y′ ∈ Y_a} exp(f(y′ | x)) − exp(f(y_base | x)) / ∑_{y′ ∈ Y_a} exp(f(y′ | x)) > λ,   (10)

where the scores f(· | x) are normalized over Y_a with a softmax and λ is a hyperparameter tuned on the validation data.
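One way to realize this acceptance test is to softmax-normalize the candidate scores and compare the BTR's top candidate against the base model's choice; the exact form of the comparison below is an illustrative assumption, not necessarily the paper's exact Eq. (10).

```python
import math

def accept_suggestion(scores, idx_btr, idx_base, lam):
    """Hedged sketch of the acceptance test: normalize per-candidate
    scores f(y|x) over Y_a with a softmax, then accept the BTR's top
    candidate only if its probability exceeds that of the base model's
    choice by more than the threshold lambda. (This comparison form is
    an assumption for illustration.)"""
    z = sum(math.exp(s) for s in scores)
    probs = [math.exp(s) / z for s in scores]
    return probs[idx_btr] - probs[idx_base] > lam

# A peaked distribution (as produced by training with negative sampling)
# makes the gap large enough to clear lambda = 0.4; a flat one does not.
assert accept_suggestion([2.0, -1.0, -1.5], idx_btr=0, idx_base=1, lam=0.4)
assert not accept_suggestion([0.1, 0.0, -0.1], idx_btr=0, idx_base=1, lam=0.4)
```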

Compared Methods
We evaluated the BTR as a reranker for two versions of candidates, normal and high-quality ones, generated by two seq2seq GEC models, T5GEC and T5GEC (large). We compared the BTR with three other rerankers: R2L, BERT, and RoBERTa.
T5GEC: We used the state-of-the-art model (Rothe et al., 2021) as our base model for GEC. This base model inherited the pre-trained T5 version 1.1 model (T5-base) (Raffel et al., 2020) and was fine-tuned as described in Section 3.1. We denote this base model as T5GEC hereafter. Although the T5 xxl model yielded the most grammatical sentences in Rothe et al. (2021), it contains 11B parameters and was not suitable for our experimental environment. Thus, we modeled T5GEC on top of a 248M-parameter T5-base model. To reproduce the experimental results of Rothe et al. (2021), we followed their setting and fine-tuned T5GEC once with the cLang-8 dataset.
T5GEC (large): To investigate the potential of the BTR for reranking high-quality candidates, we also fine-tuned one larger T5GEC model with a 738M-parameter T5-large structure.We denote this model as T5GEC (large).
R2L: The decoder of a conventional seq2seq model can generate a target sentence in either a left-to-right or right-to-left direction. Because T5GEC uses the left-to-right direction, and previous research (Sennrich et al., 2016; Kiyono et al., 2019; Kaneko et al., 2020) showed the effectiveness of reranking with a right-to-left model, we followed Kaneko et al. (2020) and constructed four right-to-left T5GEC models, which we denote as R2L. R2L reranks candidates based on the sum of the scores from the base model (L2R) and the ensembled R2L.
BERT: We followed Kaneko et al. (2019) to fine-tune four BERT models with 334M parameters. During fine-tuning, both source and target sentences were annotated with either a <0> (ungrammatical) or <1> (grammatical) label for BERT to classify. During inference, the ensembled BERT reranks candidates based on the predicted score for the <1> label.
RoBERTa: We fine-tuned four 125M-parameter RoBERTa models to compare our bidirectional Transformer structure with an encoder-only one. During fine-tuning, the source and target sentences were concatenated, and RoBERTa masked and predicted only the target sentence, as in the BTR. During prediction, the ensembled RoBERTa reranks candidates with the acceptance threshold λ, as in the BTR.

Setup for the BTR
Because there was no pre-trained seq2seq model with a self-attention mechanism for masked language modeling in the decoder, we constructed the BTR using the 248M-parameter T5 model (T5-base) and pre-trained it with the Realnewslike corpus (Zellers et al., 2019). To compare the BTR with R2L, we also constructed R2L using T5-base and pre-trained both models as follows. To speed up pre-training, we initialized the BTR and R2L model parameters with the fine-tuned parameters θ of T5GEC. During pre-training, we followed Raffel et al. (2020) for self-supervised learning with a span masking strategy. Specifically, 15% of the tokens in a given sentence were randomly sampled and removed. The input sequence was constructed from the remaining tokens, while the target sequence was the concatenation of the dropped-out tokens. An example is provided in Table 3. We pre-trained the BTR and R2L for 65536 = 2^16 and 10000 steps, respectively. Because the BTR masks and predicts only 15% of the tokens in Eq. (7), the effective number of steps for the BTR was 2^16 × 0.15 ≈ 10000. We used a maximum sequence length of 512 and a batch size of 2^20 = 1048576 tokens. In total, we pre-trained on 10000 × 2^20 ≈ 10.5B tokens, fewer than the 34B tokens used to pre-train T5. The pre-training of R2L and the BTR took 2 and 13 days, respectively, with 2 NVIDIA A100 80GB GPUs. This indicates the BTR requires more training time and resources than R2L. We provide a plot of the pre-training loss in Appendix D.
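The span masking strategy exemplified in Table 3 can be sketched as follows. The sentence and dropped positions are chosen arbitrarily here, and the sentinel names `<X>`, `<Y>` mirror T5's sentinel tokens; this is a simplified illustration, not the exact pre-processing pipeline.

```python
SENTINELS = ["<X>", "<Y>", "<Z>"]

def span_corrupt(tokens, drop_positions):
    """Simplified T5-style span corruption: contiguous dropped positions
    collapse into one sentinel in the input; the target lists each
    sentinel followed by the tokens it replaced."""
    inp, tgt, sid, prev = [], [], -1, None
    for i, tok in enumerate(tokens):
        if i in drop_positions:
            if prev != i - 1:  # start of a new dropped span
                sid += 1
                inp.append(SENTINELS[sid])
                tgt.append(SENTINELS[sid])
            tgt.append(tok)
            prev = i
        else:
            inp.append(tok)
    return inp, tgt

tokens = "Thank you for inviting me to your party last week".split()
inp, tgt = span_corrupt(tokens, drop_positions={3, 8})
# inp == "Thank you for <X> me to your party <Y> week".split()
# tgt == ["<X>", "inviting", "<Y>", "last"]
```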
After pre-training, we successively fine-tuned the BTR with the cLang-8 dataset. Like R2L, BERT, and RoBERTa, our fine-tuned BTR is an ensemble of four models with different random seeds.

Datasets
For a fair comparison, we pre-trained R2L and the BTR with the Realnewslike corpus. This corpus contains 37 GB of text data and is a subset of the C4 corpus (Raffel et al., 2020). To shorten the input and target sequences, we split each text into paragraphs. During fine-tuning, we followed Rothe et al. (2021) and used the cLang-8 corpus as the training dataset.
While the CoNLL-13 dataset was used for validation, the standard benchmarks from JFLEG, CoNLL-14, and the BEA test set (Bryant et al., 2019) were used for evaluation. While the CoNLL-14 corpus considers minimal edits for corrections, JFLEG evaluates the fluency of a sentence. The BEA corpus covers much more diverse English language levels and domains than the CoNLL-14 corpus. We used a cleaned version of CoNLL-13 with consistent punctuation tokenization styles. Appendix E lists our cleaning steps and the experimental results on the cleaned CoNLL-14 set. Table 4 summarizes the data statistics.

Evaluation Metrics
The evaluation on the BEA test set was automatically executed by the official BEA-19 competition in terms of span-based correction F0.5 using the ERRANT (Bryant et al., 2017) scorer. For the CoNLL-13 and 14 benchmarks, we evaluated the correction F0.5 using the official M2 (Dahlmeier and Ng, 2012) scorer. For the JFLEG corpus, we evaluated GLEU (Napoles et al., 2015).
We report significance only on the CoNLL-14 set, because the gold data for the BEA test set is unavailable and the GLEU metric for the JFLEG test set requires a sampling strategy over multiple references. We used the paired t-test to evaluate whether the difference between y_BTR and y_base on the CoNLL-14 set is significant, as only a limited number of y_BTR differed from y_base among the suggestions from the BTR.
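For reference, the paired t-test reduces to a t-statistic over per-sentence score differences; the sketch below computes that statistic from scratch (the sample numbers are arbitrary), while in practice one would use a statistics package to obtain the p-value.

```python
import math

def paired_t_statistic(scores_a, scores_b):
    """t-statistic of the paired t-test: the mean of the per-item
    differences divided by its standard error."""
    d = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)

# Arbitrary per-sentence scores for two systems:
t = paired_t_statistic([2.0, 3.0, 5.0], [1.0, 2.0, 3.0])
```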

Hyperparameters
Appendix F lists our hyperparameter settings for pre-training and fine-tuning each model.
We followed the setting of Zhang et al. (2021) to separately tune a for training and prediction, based on the model performance on the validation dataset with candidates generated by T5GEC. We denote a for training and prediction as a_train and a_pred, respectively. The threshold λ for the BTR and RoBERTa was tuned together with a. We set a_train to 20 and 0 for the BTR and RoBERTa, respectively, and a_pred was set to 5 for all rerankers. When a_train = 20, λ was set to 0.4 and 0.8 with respect to the candidates generated by T5GEC and T5GEC (large), respectively. When a_train = 0, λ for RoBERTa was set to 0.1 for the two versions of candidates. The experimental results for tuning a_train, a_pred, and λ are listed in Appendix G.

[...] than the BTR (λ = 0.4) on the CoNLL-14 and JFLEG test sets. Meanwhile, the improvements brought by R2L depended on the beam search score from L2R, suggesting that the unidirectional representation offers fewer gains than the bidirectional representation from the BTR. Reranking candidates by BERT resulted in lower F0.5 and GLEU scores than T5GEC. This may be because BERT considers only the target sentence and ignores the relationship between the source and the target. The BTR (λ = 0.4) achieved an F0.5 score of 71.27 on the BEA test set (more detailed experimental results for different CEFR levels and error types can be found in Appendix I). On the CoNLL-14 test set, the BTR (λ = 0.4) attained the highest
F0.5 score of 65.47, an improvement of 0.36 points over T5GEC. The use of the threshold and negative candidates played an important role in the BTR. Without these two mechanisms, the BTR achieved only 59.48 and 63.60 F0.5 scores on the CoNLL-14 and BEA test sets, respectively, which were lower than those of the original selection. Meanwhile, the BTR without the threshold could achieve the highest GLEU score of 59.52 on the JFLEG corpus, which indicates that λ = 0.4 is too high for the JFLEG corpus. This is because of the different distributions and evaluation metrics of the CoNLL-13 and JFLEG corpora, as shown in Appendix J. Compared with RoBERTa (λ = 0.1) w/o a_train, which has an encoder-only structure, the BTR (λ = 0.4) achieved higher F0.5 scores on the CoNLL-13, CoNLL-14, and BEA test sets, and a competitive GLEU score on the JFLEG corpus. These results show the benefit of the Transformer encoder-decoder architecture used in the BTR.

Results
Table 6 shows the effect of using λ. Equal denotes that the suggestion y_BTR is exactly y_base. Accept denotes that y_BTR satisfies Eq. (10) and becomes the final selection, while Reject denotes that y_BTR does not satisfy the equation and y_base remains the final selection. Most of the final selections belonged to Equal and achieved the highest F0.5 score of 68.78. This indicates that the sentences in Equal can be corrected easily by both the BTR (λ = 0.4) and T5GEC. Around 1/3 of the new suggestions proposed by the BTR (λ = 0.4) were accepted and achieved an F0.5 score of 63.97, a 2.3-point improvement over y_base. However, around 2/3 of the new suggestions were not accepted, and the original selection by T5GEC resulted in a higher F0.5 score than these rejected suggestions. These results show that, among the new suggestions, the BTR was confident only in some. The confident suggestions tended to be more grammatical, whereas the unconfident suggestions tended to be less grammatical than the original selections. Appendix J shows the analysis.
Table 7 lists the performance when reranking high-quality candidates. While R2L still achieved the highest F0.5 score on the BEA test set, it was less effective than the BTR on the JFLEG corpus. Although the BTR (λ = 0.8) used only 248M parameters and was trained with candidates generated by T5GEC, it could rerank candidates from T5GEC (large) and achieved 61.97 GLEU and 72.41 F0.5 scores on the JFLEG and BEA test sets, respectively. This finding indicates that the sizes of the BTR and the base model do not need to match, and a smaller BTR can also work as a reranker for a larger base model. RoBERTa (λ = 0.1) w/o a_train achieved the highest F0.5 score of 66.85 on the CoNLL-14 corpus, with only a 0.02-point improvement over T5GEC (large), which reflects the difficulty of correcting uncleaned sentences.
To investigate the differences among R2L, RoBERTa (λ = 0.1) w/o a_train, and the BTR (λ = 0.4), we compared the precision and recall of the three rerankers in Table 5. In most cases, R2L tended to improve precision but lower recall compared with T5GEC. The improvements in both precision and recall brought by RoBERTa over T5GEC are limited. Meanwhile, the BTR improved both precision and recall over the original ranking. Because T5GEC already achieved relatively high precision and low recall, there was more room to improve recall, which the BTR demonstrated. Figure 3 shows that both T5GEC and R2L have a relatively high cross-entropy loss for tokens at the beginning positions and a low loss for tokens at the ending positions, even though the loss of R2L is the sum of two opposite decoding directions. This may be because the auto-regressive models overfit the latest token and underfit the global context, as Qi et al. (2020) indicated. RoBERTa has a flatter loss with fewer sharp positions than T5GEC and R2L. Meanwhile, the BTR has a flat loss, which is ideal for reranking candidate sentences with length normalization, as suggested by Salazar et al. (2020). Figure 4 shows the probability distribution of reranking. When a_train > 0, the probability distribution of the BTR becomes peaked, which indicates that using Eq. (7) to minimize the unlikelihood can increase the probability gap between the 1st-ranked candidate and the rest. Compared with the BTR, when a_train > 0, the probability distribution of RoBERTa is as flat as when a_train = 0, which suggests the effectiveness of the encoder-decoder structure over the encoder-only one when minimizing the unlikelihood.

Conclusion
We proposed a bidirectional Transformer reranker (BTR) to rerank the top candidates generated by a pre-trained seq2seq model for GEC. For a fully pre-trained model, T5-base, the BTR achieved 65.47 and 71.27 F0.5 scores on the CoNLL-14 and BEA test sets. Our experimental results showed that the BTR on top of T5-base, even with limited pre-training steps, could improve both precision and recall for candidates from T5-base. Since using negative sampling for the BTR generates a peaked probability distribution for ranking, introducing a threshold λ benefits the acceptance of suggestions from the BTR. Furthermore, the BTR on top of T5-base could rerank candidates generated by T5-large and yielded better performance. This finding suggests the effectiveness of the BTR even in experiments with limited GPU resources. While the BTR in our experiments lacked sufficient pre-training, full pre-training should further improve its reranking performance in the future.

Limitations
As mentioned in the previous section, there has so far been no fully pre-trained seq2seq model with a BERT-style self-attention mechanism in the decoder; the vanilla seq2seq model instead uses a left-to-right or right-to-left unidirectional self-attention. Therefore, utilizing our proposed bidirectional Transformer reranker (BTR) to rerank candidates from a pre-trained vanilla seq2seq model requires additional pre-training steps, which cost both time and GPU resources. Because the BTR masks and predicts only 15% of the tokens in Eq. (7), it requires more training steps than a vanilla seq2seq model. In addition, during fine-tuning, the BTR requires a_train additional negative samples, which makes fine-tuning longer. Furthermore, tuning a_train is inefficient if training is slow. In other words, training an effective BTR requires much more time than training a vanilla seq2seq model.
As a reranker, the performance of the BTR depends on the quality of candidates.There is no room for improvement by the BTR if no candidate is more grammatical than the original selection.

A Computation for s_j in Transformer
Let FNN denote a feed-forward layer and Attn(q, K, V) an attention layer, where q, K, and V indicate the query, key, and value, respectively. We assume the decoder consists of L layers. To compute s_j, the encoder first encodes x into its representation H. Then, for ℓ ∈ {1, . . ., L}, the hidden state s^ℓ_j of the ℓ-th layer in the decoder is computed as

s̄^ℓ_j = Attn_s(s^{ℓ−1}_j, S^{ℓ−1}_{≤j}, S^{ℓ−1}_{≤j}),
s̃^ℓ_j = Attn_c(s̄^ℓ_j, H, H),
s^ℓ_j = FNN(s̃^ℓ_j),

where s^0_j is the embedding of the token y_{j−1} and s^0_1 is the state for the special token <s>, which indicates the start of a sequence. S^{ℓ−1}_{≤j} denotes the set of hidden states (s^{ℓ−1}_1, . . ., s^{ℓ−1}_j). Attn_s and Attn_c indicate the self-attention and cross-attention layers, respectively. A causal attention mask can be used to compute S^ℓ in parallel, as in Figure 1a.

B Computation for h^ℓ_k in BERT
We assume the model consists of L layers. Without cross-attention, h^ℓ_k is the feed-forward result of the self-attention output h̄^ℓ_k:

h̄^ℓ_k = Attn_s(h^{ℓ−1}_k, H^{ℓ−1}_{\κ}, H^{ℓ−1}_{\κ}),
h^ℓ_k = FNN(h̄^ℓ_k),

where h^0_k is the embedding of the k-th token in x\κ and H^{ℓ−1}_{\κ} = (h^{ℓ−1}_1, . . ., h^{ℓ−1}_m) denotes the set of hidden states for x\κ. Compared with s^ℓ_j, h^ℓ_k utilizes the context on both the left and right sides of the masked token x_k to capture deeper representations.
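The attention patterns in Appendices A and B differ only in their masks: the decoder of Figure 1a uses a causal (lower-triangular) mask, while the BTR's BERT-style decoder uses a fully-visible one. A minimal, framework-independent sketch of the two mask shapes:

```python
def causal_mask(n):
    # mask[j][k] is True when query position j may attend to key position k.
    # Lower-triangular: position j attends only to positions k <= j.
    return [[k <= j for k in range(n)] for j in range(n)]

def bidirectional_mask(n):
    # The BTR's BERT-style decoder lets every position attend everywhere,
    # so each masked token sees both its left and right context.
    return [[True] * n for _ in range(n)]

m = causal_mask(4)
# Row 0 attends only to position 0; row 3 attends to all four positions.
```

In a real implementation these booleans are applied as additive −∞ biases to the attention logits before the softmax.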

E Cleaning for CoNLL Corpus
The original texts of CoNLL-13 and 14 contain several styles of punctuation tokenization, such as "DementiaToday,2012" and "known , a". While these punctuation styles with/without spaces are not considered grammatical errors by a human, they are often identified as errors by automatic GEC scorers. Moreover, while most of the sequences in CoNLL-14 are at the sentence level, several sequences are at the paragraph level due to punctuation without spaces. In this research, we cleaned the texts of CoNLL-13 and 14 using the "en_core_web_sm" model in spaCy (Honnibal et al., 2020) so that all punctuation was surrounded by spaces. The paragraph-level sequences were split into sentences at the positions of full stops. The cleaned CoNLL-14 corpus contains 1326 pairs of data. Tables 8, 9, and 10 show the experimental results on the cleaned CoNLL-14 corpus.
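A self-contained sketch of this kind of punctuation normalization (a hypothetical regex helper for illustration, not the spaCy-based script actually used for the cleaning):

```python
import re

def space_punctuation(text):
    # Surround common punctuation with spaces and collapse repeated
    # whitespace, approximating tokenized GEC-scorer input
    # (e.g. "known,a" -> "known , a").
    text = re.sub(r'\s*([.,!?;:])\s*', r' \1 ', text)
    return re.sub(r'\s+', ' ', text).strip()

cleaned = space_punctuation("DementiaToday,2012")
```

A full pipeline would additionally split paragraph-level sequences into sentences at full stops, which spaCy's sentence segmenter handles more robustly than a regex.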

F Hyperparameters
Table 11 lists the hyperparameter settings used for each model, and Table 12 lists the artifacts used. The setting for T5GEC (large) was the same as for T5GEC. We followed the setting of Kaneko et al. (2019) and used a learning rate of 0.0005 for the BERT reranker. We used a learning rate of 0.0001 for the RoBERTa reranker. For both BERT and RoBERTa, we utilized the Adam optimizer, the "inverse square root" learning rate schedule, and warmup steps of 1.2 epochs. For the other models based on a T5 structure, we used a learning rate of 0.001 and the Adafactor optimizer. The batch size was 1048576 tokens for all models. We used Fairseq (Ott et al., 2019) and HuggingFace (Wolf et al., 2020) to reproduce all models and run the BTR.
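A minimal sketch of the "inverse square root" schedule with linear warmup (the step counts below are illustrative assumptions, not our exact configuration):

```python
def inverse_sqrt_lr(step, base_lr, warmup_steps):
    # Linear warmup from 0 to base_lr over warmup_steps,
    # then decay proportional to 1/sqrt(step).
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * (warmup_steps ** 0.5) / (step ** 0.5)

lr_end_warmup = inverse_sqrt_lr(100, 0.0005, 100)  # base_lr at end of warmup
lr_late = inverse_sqrt_lr(400, 0.0005, 100)        # halved at 4x warmup steps
```

With this schedule the learning rate at step 4·warmup is exactly half the peak, since sqrt(4) = 2.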

G Candidate and Threshold Tuning
Following Zhang et al. (2021), we tuned a separately for training and prediction on the validation dataset, with candidates generated by T5GEC.
Table 13 lists the size of the training data with candidates generated by T5GEC. When tuning a_train ∈ {0, 5, 10, 20} for the BTR, a_pred was fixed to 5.
Because the BTR with λ = 0.4 and a_train = 20 achieved the highest score, as shown in Table 14, a_train was fixed to 20, and this BTR was also used to tune a_pred ∈ {5, 10, 15, 20}. When tuning a_train ∈ {0, 5, 10, 20} for RoBERTa, a_pred was fixed to 5. The results in Tables 14 and 15 indicate different distributions of the F0.5 score between RoBERTa and the BTR. To investigate the reason, we compared the training loss and F0.5 score of RoBERTa with those of the BTR; Figure 7 shows the comparison. Different from the BTR, when using negative sampling (a_train > 0) for training RoBERTa, the F0.5 score on the CoNLL-13 corpus decreased as the number of epochs increased. The training loss of RoBERTa also dropped suddenly after the first epoch. This result suggests that, in the GEC task, negative sampling for an encoder-only structure leads learning in the wrong direction when learning representations from the concatenated source and target. We therefore fixed a_train to 0 for RoBERTa, and this RoBERTa was also used to tune a_pred ∈ {5, 10, 15, 20}. The results in Tables 16, 17, and 18 show that when a_pred was set to 5, the BTR, R2L, RoBERTa, and BERT attained their highest scores on the CoNLL-13 corpus. Thus, a_pred was fixed to 5 in our experiments. Tables 14 and 16 also show the performance of the BTR with respect to λ on the CoNLL-13 corpus with candidates generated by T5GEC. Without using any candidate for training, the BTR (λ = 0) achieved the highest F0.5 score. When using 20 candidates for training, the BTR (λ = 0.4) achieved the highest F0.5 score of 50.22. Table 19 shows that the BTR (a_train = 20, λ = 0.8) achieved the highest F0.5 score on the CoNLL-13 dataset with the candidates generated by T5GEC (large). Thus, our tuned λ for the BTR was set to 0.2 when a_train = 0. When a_train = 20, λ was set to 0.4 and 0.8 for the candidates generated by T5GEC and T5GEC (large), respectively. Similarly, when a_train = 0, our tuned λ for RoBERTa was set to 0.1 for both versions of candidates.
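This tuning procedure amounts to a grid search over (a_train, a_pred, λ) scored on the validation set; a schematic sketch (the scores below are placeholders apart from the 50.22 reported above):

```python
def grid_search(scores):
    # scores: dict mapping (a_train, a_pred, lam) -> validation F0.5.
    # Returns the hyperparameter triple with the highest score.
    return max(scores, key=scores.get)

best = grid_search({
    (20, 5, 0.4): 50.22,  # reported best with 20 negative samples
    (0, 5, 0.2): 49.90,   # placeholder
    (20, 5, 0.8): 49.50,  # placeholder
})
```

In practice each entry of the dict costs one fine-tuning or reranking run, which is why a_train and a_pred are tuned sequentially rather than jointly.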

H Mean and Standard Deviation
We list the mean and standard deviation of R2L, RoBERTa, and the BTR over the four trials on each dataset in Table 20.

I Detailed Results on BEA Test
The distribution of the BEA test set with respect to the CEFR level is shown in Table 21.
The BTR (λ = 0.4) achieved an F0.5 score of 71.27 on the BEA test set, as shown in Table 22. Compared with A (beginner) level sentences, the BTR was more effective for B (intermediate), C (advanced), and N (native) level sentences. As shown in Table 23, the BTR (λ = 0.4) improved T5GEC for all top-5 error types. Furthermore, the BTR (λ = 0.4) could effectively handle Missing and Unnecessary tokens, but not Replacement, for the native sentences. For both models, it was more difficult to correct the Replacement and Unnecessary operations in the native sentences than in the advanced sentences. This may be because the writing style of native speakers is more natural and difficult to correct with limited training data, whereas language learners may tend to use formal language, which makes correction easier.
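For reference, the F0.5 metric used throughout weights precision twice as heavily as recall; a minimal sketch of the general F_beta formula:

```python
def f_beta(precision, recall, beta=0.5):
    # F_beta = (1 + beta^2) * P * R / (beta^2 * P + R).
    # beta = 0.5 emphasizes precision, the convention in GEC evaluation.
    b2 = beta * beta
    if precision == 0.0 and recall == 0.0:
        return 0.0
    return (1 + b2) * precision * recall / (b2 * precision + recall)

score = f_beta(0.8, 0.4)  # precision-heavy: closer to P than to R
```

This is why a reranker that trades a little precision for recall can still lower F0.5: precision errors are penalized four times as strongly as recall errors under beta = 0.5.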

J Relation Between a, λ, and BTR Performance
The BEA and JFLEG corpora also provide dev sets, with 4384 and 754 sentences for validation, respectively. To determine the optimal a_train, a_pred, and λ for the BTR on these two datasets, we re-evaluated the performances of the BTR on the corresponding dev sets. Tables 24 and 25 show the results on the BEA and JFLEG dev sets, respectively. On the BEA dev set, the highest F0.5 score of 54.04 was achieved with a_train = 10, a_pred = 5, and λ = 0.2. On the JFLEG dev set, the highest GLEU score of 54.46 was achieved with a_train = 5, a_pred = 15, and λ = 0. These results demonstrate the differences in evaluating minimal edits and fluency for grammar corrections. Given the above a_train, a_pred, and λ, we re-evaluated the BTR on the BEA and JFLEG test sets. Table 26 lists the results. Tuning hyperparameters on the JFLEG dev set led to a higher GLEU score of 60.14 on the JFLEG test set compared with the hyperparameters tuned on the CoNLL-13 set. However, tuning hyperparameters on the BEA dev set resulted in a lower F0.5 score of 71.12 on the BEA test set compared with the hyperparameters tuned on the CoNLL-13 set.
To investigate the effectiveness of λ, i.e., the parameter that balances the trade-off between the acceptance rate and the quality of grammatical corrections, we analyzed the relationship between λ and the corresponding precision, recall, and GLEU scores. Figures 8 and 9 show the performance of the BTR (a_train = 20, a_pred = 5) on the CoNLL-13 and 14 corpora, respectively. As λ increases, the acceptance rate, i.e., the percentage of suggestions that the BTR accepts, decreases, while the precision and recall for the Accept suggestions increase. This supports our assumption in Section 4.3 that the value of f(y|x) indicates the confidence of the BTR: the confident suggestions tended to be more grammatical, while the unconfident ones tended to be less grammatical than the original selections. As for the whole corpus, when λ = 0.7, this BTR achieved lower precision and recall than with λ = 0.4 due to the limited number of Accept suggestions. Figures 10 and 11 show the performance of the BTR (a_train = 10, a_pred = 5) on the BEA dev and test corpora, respectively. In Figure 10, the BTR shows performance similar to that on CoNLL-13 and 14, where a larger λ leads to higher precision and recall for Accept suggestions. However, the performance over the whole corpus also depends on the acceptance rate. In contrast, as shown in Figures 13 and 14, the BTR (a_train = 5, a_pred = 15) on the JFLEG corpus achieved the highest GLEU score for the whole corpus when λ ≤ 0.1. This may be because a_pred = 15 yields a flatter probability distribution than a_pred = 5, as shown in Figures 12 and 15. Besides, recognizing the fluency of a sentence may be easier for the BTR than recognizing the minimal edit of corrections.
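One plausible simplification of the acceptance rule (illustrative only; the exact form of f(y|x) and the comparison in Section 4.3 may differ): softmax-normalize the reranker's candidate scores and accept its top-1 only when that probability clears the threshold λ, otherwise keep the base model's selection.

```python
import math

def choose(candidates, scores, base_index, lam):
    # candidates: candidate sentences; scores: reranker scores (higher is
    # better); base_index: the base model's original top-1.
    # Softmax-normalize the scores into a distribution over candidates,
    # then accept the reranker's top-1 only if its probability exceeds lam.
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    top = max(range(len(candidates)), key=lambda i: probs[i])
    return candidates[top] if probs[top] > lam else candidates[base_index]
```

With a peaked distribution (a confident reranker) the suggestion is accepted; with a flat distribution the original selection survives, which matches the observed trade-off between acceptance rate and quality.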

K Precision and Recall With T5GEC (large) Candidates
Given the top-5 candidate sentences generated by T5GEC (large), we compared the precision and recall of the BTR with those of R2L and RoBERTa in Table 27.

L Example of Reranked Outputs
Table 28 provides examples of outputs ranked by T5GEC, R2L, RoBERTa w/o a_train (λ = 0.1), and the BTR (λ = 0.4). The first block of output results demonstrates the difficulty of correcting spelling errors. In this block, the BTR outputs the token "insensitively" with the correct spelling but a mismatched meaning, whereas the other rerankers tend to keep the original token "intesively" with its spelling error. The examples in the second block show that both the BTR and R2L are capable of correctly addressing verb tense errors. The examples in the last block show that even though the BTR recognizes the missing determiner "the" for the word "Disadvantage", it still misses the that-clause.

Figure 1 :
Figure 1: Mask patterns in the Transformer model (Vaswani et al., 2017) (a) and in the BTR (b) for the self-attention mechanism in the decoder. Light cells indicate no attention.

Figure 2 :
Figure 2: Overview of the reranking procedure by using the bidirectional Transformer reranker (BTR).

Figure 4 :
Figure 4: Average probability for each rank on the CoNLL-14 test set. The top-5 candidate sentences were generated by T5GEC.

Figure 5
Figure 5 shows our procedure for prediction.

Figure 5 :
Figure 5: Bidirectional Transformer architecture. The left and right columns indicate the encoder and decoder, respectively. The self-attention mechanism in the decoder utilizes the fully-visible mask (Figure 1b), unlike the conventional Transformer (Vaswani et al., 2017).

Figure 6
Figure 6 shows the pre-training loss for R2L and the BTR on the Realnewslike corpus. The training loss of R2L suddenly dropped from 1.48 to 1.3 after the first epoch (7957 steps).

Figure 7 :
Figure 7: Performance of the BTR and RoBERTa with various a_train without λ during fine-tuning. a_pred was fixed to 5 with candidates from T5GEC. Both the F0.5 score and the training loss were averaged over the four trials.

Figure 8 :
Figure 8: Precision and recall of the BTR (a_train = 20, a_pred = 5) with respect to different λ on the CoNLL-13 set.

Figure 9 :
Figure 9: Precision and recall of the BTR (a_train = 20, a_pred = 5) with respect to different λ on the CoNLL-14 set.

Table 1 :
Examples of reranked outputs from the JFLEG corpus.

Table 3 :
Examples of data pairs for self-supervised and supervised learning used by each model. The grammatical text is "Thank you for inviting me to your party last week ." <M> denotes a mask token. <X>, <Y>, and <Z> denote sentinel tokens that are assigned unique token IDs. <1> denotes that the input sentence is classified as a grammatical sentence. Red indicates an error in the source sentence, while Blue indicates a token randomly replaced by the BERT-style masking strategy.

Table 5 :
Table 5 presents our main results. While reranking by R2L yielded the highest F0.5 score of 71.42 on the BEA test set, it yielded only a lower score on the other datasets. Results for the models on each dataset with candidates from T5GEC. * indicates the score presented in Rothe et al. (2021). Bold scores represent the highest (p)recision, (r)ecall, F0.5, and GLEU for each dataset.

Table 6 :
Results for the BTR (λ = 0.4) on CoNLL-14 with candidates from T5GEC. y_base and y_BTR denote the selections by T5GEC and the suggestions by the BTR, respectively. † indicates that the difference between y_BTR and y_base is significant with a p-value < 0.05. Bold scores represent the highest F0.5 for each case.

Table 7 :
Results for the models on each dataset with candidates generated by T5GEC (large). * indicates the score presented in Rothe et al. (2021). The precision and recall can be found in Appendix K.
Ying Zhang, Hidetaka Kamigaito, and Manabu Okumura. 2021. A language model-based generative classifier for sentence-level discourse parsing. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 2432-2446, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Shun Kiyono, Jun Suzuki, Masato Mita, Tomoya Mizumoto, and Kentaro Inui. 2019. An empirical study of incorporating pseudo data into grammatical error correction. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics.

Table 8 :
Results for the models on the cleaned CoNLL-14 corpus with candidates from T5GEC. Bold scores represent the highest precision, recall, and F0.5.

Table 9 :
The mean ± std results on the cleaned CoNLL-14 corpus with candidates from T5GEC. Bold scores represent the highest mean.

Table 10 :
Results for the models on the cleaned CoNLL-14 corpus with candidates from T5GEC (large). Bold scores represent the highest precision, recall, and F0.5.

Table 13 :
Number of sentence pairs for the cLang-8 dataset with candidates. All pairs of data that satisfy the length constraint of 128 are listed.
Table 14 :
Results of tuning a_train for the BTR. a_pred was fixed to 5. The highest F0.5 score on the CoNLL-13 corpus for each a_train among different thresholds is shown in bold. The scores that were the same as those of the base model (λ = 1) were ignored and greyed out.

Table 15 :
Results of tuning a_train for RoBERTa. a_pred was fixed to 5. The highest F0.5 score on the CoNLL-13 corpus for each a_train among different thresholds is shown in bold. The scores that were the same as those of the base model (λ = 1) were ignored and greyed out.

Table 16 :
Results of tuning a_pred for the BTR. The highest F0.5 score on the CoNLL-13 corpus for each a_pred among different thresholds is shown in bold. The scores that were the same as those of the base model (λ = 1) were ignored and greyed out.

Table 17 :
Results of tuning a_pred for RoBERTa. The highest F0.5 score on the CoNLL-13 corpus for each a_pred among different thresholds is shown in bold. The scores that were the same as those of the base model (λ = 1) were ignored and greyed out.

Table 18 :
Results of tuning a_pred for R2L and BERT. The highest F0.5 score on the CoNLL-13 corpus for each reranker among different a_pred is shown in bold.

Table 26 :
Results for the BTR on the BEA and JFLEG test sets with tuned hyperparameters. Columns list the corpus tuned on, a_train, a_pred, and λ.