A Simple Recipe for Multilingual Grammatical Error Correction

This paper presents a simple recipe to train state-of-the-art multilingual Grammatical Error Correction (GEC) models. We achieve this by first proposing a language-agnostic method to generate a large number of synthetic examples. The second ingredient is to use large-scale multilingual language models (up to 11B parameters). Once fine-tuned on language-specific supervised sets, we surpass the previous state-of-the-art results on GEC benchmarks in four languages: English, Czech, German and Russian. Having established a new set of baselines for GEC, we make our results easily reproducible and accessible by releasing the CLANG-8 dataset, produced by using our best model, which we call gT5, to clean the targets of the widely used yet noisy LANG-8 dataset. CLANG-8 greatly simplifies typical GEC training pipelines composed of multiple fine-tuning stages: we demonstrate that performing a single fine-tuning step on CLANG-8 with off-the-shelf language models yields further accuracy improvements over the already top-performing gT5 model for English.


Introduction
Grammatical Error Correction (GEC) is the task of correcting grammatical and other related errors in text. It has been the subject of several modeling efforts in recent years due to its potential to improve the grammaticality and readability of user-generated texts. This is of particular importance to non-native speakers, children, and individuals with language impairments, who may be more prone to producing texts with grammatical errors.
Modern approaches often view the GEC task as monolingual text-to-text rewriting (Náplava and Straka, 2019; Katsumata and Komachi, 2020; Grundkiewicz et al., 2019) and employ encoder-decoder neural architectures (Sutskever et al., 2014; Bahdanau et al., 2015). These methods typically require large training sets to work well (Malmi et al., 2019), which are scarce, especially for languages other than English. One of the largest and most widely used datasets for GEC is the LANG-8 Learner Corpus, which covers 80 languages and has been created by language learners correcting each other's texts. However, the distribution of languages is heavily skewed: Japanese and English are the most prevalent languages, with over a million ungrammatical-grammatical sentence pairs each, while only ten languages have more than 10,000 sentence pairs each. Additionally, given the uncontrolled nature of the data collection, many of the examples contain unnecessary paraphrasing and erroneous or incomplete corrections.
Limited amounts of suitable training data have led to multiple approaches that generate synthetic training data for GEC (Madnani et al., 2012; Grundkiewicz and Junczys-Dowmunt, 2014; Grundkiewicz et al., 2019; Lichtarge et al., 2019; Awasthi et al., 2019). Although using synthetic data as a first fine-tuning step has been shown to improve model accuracy, it introduces practical challenges that make the development and fair comparison of GEC models difficult: (i) the synthetic methods often require language-specific tuning, e.g. language-specific hyperparameters and spelling dictionaries (Náplava and Straka, 2019); and (ii) because synthetic data cannot capture the complete error distribution of the target evaluation sets, the final model is obtained via a multi-stage fine-tuning process (Lichtarge et al., 2019, 2020; Omelianchuk et al., 2020). This requires carefully picking the learning rates and number of training steps for each fine-tuning stage, making it difficult to replicate and build on the best previously reported models.
The ideas of leveraging self-supervised pre-training and increasing the model size have yielded significant improvements on numerous seq2seq tasks in recent years (Raffel et al., 2019; Xue et al., 2020; Lewis et al., 2020; Song et al., 2019; Chan et al., 2019; Rothe et al., 2020), but these approaches have been applied to GEC only to a limited extent.
In this paper we adopt mT5 (Xue et al., 2020) as our base model, which has already been pre-trained on a corpus covering 101 languages. To adapt the model to the GEC task, we design a fully unsupervised, language-agnostic pre-training objective that mimics corrections typically contained in labeled data. We generate synthetic training data by automatically corrupting grammatical sentences, but in contrast to the previous state-of-the-art by Náplava and Straka (2019) for low-resource languages, we use our synthetic pre-training to train a single model on all 101 languages, employing no language-specific priors so as to remain fully language-agnostic. After pre-training, we further fine-tune our model on supervised GEC data for the available languages (with data conditions ranging from tens of thousands to millions of examples). Additionally, we explore the effect of scaling the model size from 60M to 11B parameters. We surpass the previous state-of-the-art results on four evaluated languages: English, Czech, German and Russian.
Fine-tuning and running inference with our largest and most accurate models requires multi-GPU/TPU infrastructure. To make the results of our research widely accessible, we release the CLANG-8 dataset, obtained by using our largest gT5 model to clean up the targets of the frequently used yet noisy LANG-8 dataset. We show that off-the-shelf variants of T5 (Raffel et al., 2019), when fine-tuned only on CLANG-8, outperform the same models trained on the original LANG-8 data, both with and without additional fine-tuning data. Thus CLANG-8 not only allows others to easily train highly competitive GEC models, but also greatly simplifies the GEC training pipeline, reducing a multi-step fine-tuning process to a single fine-tuning step.
Our contributions in this paper are three-fold: (1) we show that a simple language-agnostic pre-training objective can achieve state-of-the-art GEC results when models are scaled up in size; (2) we quantify the effect model size has on GEC; and (3) we release a large multilingual GEC dataset based on LANG-8, which allows for state-of-the-art results without additional fine-tuning steps, significantly simplifying the training setup.

Model
Our model builds on mT5 (Xue et al., 2020), a multilingual version of T5 (Raffel et al., 2019), a Transformer encoder-decoder model that has been shown to achieve state-of-the-art results on a wide range of NLG tasks. mT5 comes in different sizes; for this work we use base (600M parameters) and xxl (13B parameters).

mT5 Pre-training
mT5 has been pre-trained on the mC4 corpus, a subset of Common Crawl covering 101 languages and comprising about 50 billion documents. For details on mC4, we refer the reader to the original paper (Xue et al., 2020). The pre-training objective is based on a span-prediction task, an adaptation of the masked-language objective for autoregressive seq2seq models. All mT5 models were trained for 1M steps on batches of 1024 input sequences with a maximum sequence length of 1024, corresponding to roughly 1T seen tokens. For all of our experiments we use the publicly available mT5 and, in Section 4 only, T5 checkpoints.
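For illustration, the span-prediction format can be sketched as follows. This is our own minimal construction, not code from the paper; the `<extra_id_*>` sentinel naming follows the public T5 vocabulary.

```python
def span_corrupt(tokens, spans):
    """Replace the given (start, end) token spans with sentinel tokens.

    Returns the corrupted input and the target that restores the removed
    spans, mimicking the T5/mT5 span-prediction format. Illustrative
    sketch only; the real pipeline samples spans over subword tokens.
    """
    source, target = [], []
    prev = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        source.extend(tokens[prev:start])   # keep text before the span
        source.append(sentinel)             # mask the span in the input
        target.append(sentinel)             # target lists each sentinel...
        target.extend(tokens[start:end])    # ...followed by the removed span
        prev = end
    source.extend(tokens[prev:])
    target.append(f"<extra_id_{len(spans)}>")  # end-of-target sentinel
    return " ".join(source), " ".join(target)

src, tgt = span_corrupt(
    "Thank you for inviting me to your party last week".split(),
    [(2, 4), (7, 8)])
# src: "Thank you <extra_id_0> me to your <extra_id_1> last week"
# tgt: "<extra_id_0> for inviting <extra_id_1> party <extra_id_2>"
```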

GEC Pre-training
The span-prediction objective of mT5 does not enable the model to perform GEC without further fine-tuning, as the span-prediction task uses special tokens to indicate where text should be inserted. Another limiting constraint is that mT5 has been trained on paragraphs, not sentences. We therefore split all paragraphs in the mC4 corpus into sentences. We corrupt each sentence using a combination of the following operations: a) drop spans of tokens, b) swap tokens, c) drop spans of characters, d) swap characters, e) insert characters, f) lower-case a word, and g) upper-case the first character of a word.

Training Regime. We experimented with several training setups, all building on the mT5 pre-trained models (Section 2.1): a) mixing GEC pre-training data (Section 2.2) with fine-tuning data (Section 3); b) mixing pre-training and fine-tuning examples but annotating them with different prefixes; and c) first using GEC pre-training until convergence and then fine-tuning. While c) is the most computationally expensive approach, it also gave us the best results. GEC pre-training as well as fine-tuning uses a constant learning rate.

Results. For English, we evaluate on the standard benchmarks from CoNLL-14 and the BEA test set (Bryant et al., 2019a), while we use CoNLL-13 as the development set (Table 1). For the other languages we use the test and development sets associated with their training data. Table 2 shows the results for all languages. We first see that the base model size is inferior to the current state-of-the-art models. This is expected, as the model capacity is not enough to cover all 101 languages. We therefore use the larger xxl (11B) model, which produces new state-of-the-art results on all languages except English.
When looking at the development-set performance for English, we observed that it had high variance and that training over-fitted very quickly. This suggests that the train and dev/test set domains are not well aligned for English. In Section 4 we further refine our approach, achieving state-of-the-art results for English as well.
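A minimal sketch of the sentence-corruption procedure from Section 2.2. The choice of sampling one operation per sentence, and the helper implementations, are our own assumptions for illustration; the paper does not pin down these details here.

```python
import random

def corrupt(sentence: str, rng: random.Random) -> str:
    """Corrupt a grammatical sentence with one randomly chosen operation.

    Sketch of a subset of the operations in Section 2.2; which operations
    are combined, and with what rates, is assumed, not taken from the paper.
    """
    tokens = sentence.split()

    def drop_token(t):                       # a) drop tokens (span of 1 here)
        if len(t) < 2:
            return t
        i = rng.randrange(len(t))
        return t[:i] + t[i + 1:]

    def swap_tokens(t):                      # b) swap adjacent tokens
        if len(t) < 2:
            return t
        i = rng.randrange(len(t) - 1)
        t = t[:]
        t[i], t[i + 1] = t[i + 1], t[i]
        return t

    def drop_char(t):                        # c) drop a character from a word
        i = rng.randrange(len(t))
        if len(t[i]) > 1:
            j = rng.randrange(len(t[i]))
            t = t[:]
            t[i] = t[i][:j] + t[i][j + 1:]
        return t

    def lower_case_word(t):                  # f) lower-case a word
        i = rng.randrange(len(t))
        t = t[:]
        t[i] = t[i].lower()
        return t

    def upper_case_first(t):                 # g) upper-case a word's first char
        i = rng.randrange(len(t))
        t = t[:]
        t[i] = t[i][:1].upper() + t[i][1:]
        return t

    op = rng.choice([drop_token, swap_tokens, drop_char,
                     lower_case_word, upper_case_first])
    return " ".join(op(tokens))

rng = random.Random(0)
print(corrupt("She is reading a book in the garden .", rng))
```

The original grammatical sentence serves as the target and its corrupted version as the source, so no labeled data is needed.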

CLANG-8: Cleaned LANG-8 Corpus
To be able to distill the knowledge learned by gT5 xxl into smaller, more practical models, we create and release CLANG-8, a cleaned version of the popular LANG-8 corpus. As discussed earlier, LANG-8 is a large corpus of texts written by language learners, together with user-annotated corrections of these texts. However, the corrected texts frequently contain unnecessary paraphrasing and erroneous or incomplete corrections, phenomena that hurt the performance of a GEC model trained on this data. For instance, the following source-target pair is taken from LANG-8: "It is cloudy or rainy recently ." → "It 's been cloudy and rainy recently ." We experiment with two approaches for cleaning the data. First, to create CLANG-8, we generate new targets for LANG-8, disregarding the original targets. We tried using both the unsupervised model, which was trained using the GEC pre-training objective (Section 2.2), and the supervised model (gT5 xxl; Section 3), but the former did not yield comparable results, so all reported numbers use the supervised model. Second, to create CLANG-8-S, we used the unsupervised and the supervised models to score the original targets, disregarding the lowest-scoring 20%, 50%, 70%, or 90% of targets. Disregarding 50% was the best-performing setup, and there was no significant difference between the supervised and unsupervised models. We therefore report numbers using the unsupervised model, disregarding the worst 50% of the targets. Table 3 shows that CLANG-8 moderately reduces the Word Error Rate (WER) between the source and target, with deletions receiving the largest relative reduction, which may suggest that less information from the source sentence is removed. In contrast, CLANG-8-S has a significantly lower WER, indicating that the unsupervised model has only kept corrections that stay close to the source sentence.
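The CLANG-8-S filtering step described above can be sketched as follows. The scoring function here is a hypothetical stand-in for the model's score of a target given its source; the names and example scores are ours.

```python
def filter_by_score(pairs, score_fn, drop_fraction=0.5):
    """Keep the source-target pairs whose targets score highest.

    Sketch of the CLANG-8-S construction: `score_fn(source, target)` stands
    in for the (un)supervised model's score of a target given its source.
    """
    scored = sorted(pairs, key=lambda p: score_fn(*p))  # ascending by score
    cut = int(len(scored) * drop_fraction)
    return scored[cut:]  # drop the lowest-scoring fraction of pairs

# Toy usage with made-up scores standing in for model log-likelihoods.
pairs = [("src1", "tgt1"), ("src2", "tgt2"), ("src3", "tgt3"), ("src4", "tgt4")]
scores = {"tgt1": 0.9, "tgt2": 0.1, "tgt3": 0.7, "tgt4": 0.3}
kept = filter_by_score(pairs, lambda s, t: scores[t], drop_fraction=0.5)
# keeps the two highest-scoring pairs: ("src3", "tgt3") and ("src1", "tgt1")
```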
Experiments. To evaluate the effect cleaning LANG-8 has for English, we train two distinct models on this data: T5 (Raffel et al., 2019), a monolingual sequence-to-sequence model, and FELIX (Mallinson et al., 2020), a non-autoregressive text-editing model (we use the FELIXINSERT variant, which does not employ re-ordering). We also tried fine-tuning these models on BEA (i.e. FCE and W&I) after fine-tuning them on CLANG-8, but this did not improve the scores and in fact slightly decreased them, e.g. a 0.43 absolute decrease on BEA test when using T5 base. This can be explained by the fact that the model used to clean the target texts has already been trained on BEA. It suggests that the typical GEC training pipeline, where a model is first fine-tuned on LANG-8 and then on BEA, can be both simplified and made more accurate by fine-tuning on CLANG-8 only. Finally, we train mT5 models on the German and Russian portions of the CLANG-8 dataset and evaluate them on the test sets from Table 1.
Results & Analysis. The results for the CoNLL-14 and BEA test benchmarks can be seen in Table 4. For both models and both test datasets, CLANG-8 improves the F0.5 score compared to using the original LANG-8 corpus. While CLANG-8-S performs significantly worse than CLANG-8, it still improves over LANG-8. In terms of model size, larger models are consistently better than their smaller siblings. This is true even when comparing xl and xxl, suggesting that there might still be headroom in using models larger than xxl.
Table 5 shows a breakdown by error type on BEA test for T5 base and T5 xxl, trained on either LANG-8 or CLANG-8. We see that for both data conditions, increasing the model size leads to an increase in performance. Comparing CLANG-8 and LANG-8 shows that CLANG-8 improves on all error types apart from orthographic (ORTH) and punctuation (PUNCT) errors.
In Table 6, we evaluate mT5 trained on the German and Russian portions of the CLANG-8 dataset, which contain 114K and 45K training examples, respectively. We see that for both languages performance increases with model size, with no indication of slowing, suggesting further headroom for improvement. For German, the xxl model achieves a better score than the previous state-of-the-art, though it is worse than gT5 xxl. For Russian, mT5 trained on CLANG-8 does not match state-of-the-art performance; we believe this is in part due to the small size of the Russian portion of CLANG-8. Additionally, the training data for Russian and German comes from the same dataset as the test data, which is not the case for English, making the training data significantly more relevant. For German and Russian GEC tasks where in-domain training data is unavailable, CLANG-8 could have a greater impact.
We release the re-labeled CLANG-8 dataset.

Conclusion
In this paper we report new state-of-the-art results on GEC benchmarks in the four languages we studied. Our simple setup relies on a language-agnostic approach to pre-train large multilingual language models. To enable the distillation of our largest model into smaller, more efficient models, we release a cleaned version of the LANG-8 dataset, enabling easier and even more accurate training of GEC models.