TransGEC: Improving Grammatical Error Correction with Translationese

Data augmentation is an effective way to improve model performance of grammatical error correction (GEC). This paper identifies a critical side-effect of GEC data augmentation, which is due to the style discrepancy between the data used in GEC tasks (i.e., texts produced by non-native speakers) and data augmentation (i.e., native texts). To alleviate this issue, we propose to use an alternative data source, translationese (i.e., human-translated texts), as input for GEC data augmentation, which 1) is easier to obtain and usually has better quality than non-native texts, and 2) has a more similar style to non-native texts. Experimental results on the CoNLL14 and BEA19 English, NLPCC18 Chinese, Falko-MERLIN German, and RULEC-GEC Russian GEC benchmarks show that our approach consistently improves correction accuracy over strong baselines. Further analyses reveal that our approach is helpful for overcoming mainstream correction difficulties such as the corrections of frequent words, missing words, and substitution errors. Data, code, models and scripts are freely available at https://github.com/NLP2CT


Introduction
Grammatical error correction (GEC) is the task of automatically correcting an ungrammatical sentence into a grammatical one. Training GEC models relies heavily on labeled data (i.e., pairs of ungrammatical sentences and their grammatical versions), but such resources are scarce and expensive to construct. Data augmentation, which exploits a large amount of unlabeled data for performance improvement, is a popular research line of GEC (Rozovskaya and Roth, 2010; Felice et al., 2014; Rei et al., 2017; Kasewa et al., 2018). However, there is a stylistic discrepancy between the data used for GEC tasks and the data used for augmentation. For most GEC tasks (Ng et al., 2014; Zhang et al., 2022a), the training and testing instances are produced by non-native speakers, whereas the data used for augmentation are mainly native language resources (Kiyono et al., 2019; Zhao et al., 2019; Grundkiewicz et al., 2019; Kaneko et al., 2020). Rabinovich et al. (2016) have shown a large difference between non-native and native texts, which suggests that style mismatch might be a side-effect limiting further gains from GEC data augmentation. A more direct solution would be to use non-native texts as input for data augmentation. However, such resources are scarce, and their quality is hard to guarantee.
* Co-corresponding author
In this paper, we propose the TransGEC method, which uses human-translated texts (also known as translationese) as input for augmentation. Improving GEC with translationese has the following advantages: 1) it is easy to obtain, since the training corpora of machine translation tasks contain abundant translationese, and its identification has been well studied (Riley et al., 2020); 2) it has a similar style, since non-native texts and translationese are closer to each other than to native texts (Rabinovich et al., 2016); and 3) it is of high quality, since most translationese is produced by bilingual experts, so its quality can be better guaranteed than that of most non-native texts.
Preliminary experiments comparing different kinds of texts confirm our assumption that translationese indeed has a similar style to GEC data. This enables us to further explore translationese for GEC in two steps: 1) obtaining translationese: we propose to fine-tune BERT-based classifiers to identify translationese in the parallel corpora (e.g., the WMT corpus) of machine translation tasks; and 2) improving GEC with translationese: we propose to add artificial noise to the identified translationese and treat the noisy/corrected versions as the input/output for training GEC models.
Experimental results on the widely-used CoNLL14 and BEA19 English, NLPCC18 Chinese, Falko-MERLIN German, and RULEC-GEC Russian GEC benchmarks show that TransGEC outperforms the strong (m)T5-large pre-trained models (Raffel et al., 2019; Xue et al., 2020), the LRGEC baselines (Náplava and Straka, 2019), and existing data augmentation methods (Zhao et al., 2019). Further analyses show that TransGEC improves the correction accuracy of major difficulties (e.g., correction of frequent words, missing words, and substitution errors), but still has room for improvement in minor issues (e.g., correction of rare words, word order, and deletion errors).
Our main contributions are summarized as follows:
• We empirically show that translationese has a similar style to the original GEC data in different languages (i.e., English and Chinese).
• We introduce a simple way to obtain translationese and propose a novel method, TransGEC, to improve GEC with translationese.
• We confirm the effectiveness of exploiting translationese as input for GEC data augmentation with and without pre-trained models.
• We reveal the linguistic properties enhanced and diminished after exploiting translationese, providing some clues for future studies.

Related Work
Grammatical Error Correction (GEC) can be viewed as a kind of sequence-to-sequence learning task (Sutskever et al., 2014; Chollampatt and Ng, 2018; Junczys-Dowmunt et al., 2018; Liu et al., 2021; Li et al., 2022; Zhang et al., 2022b; Gong et al., 2022; Li et al., 2023; Zhang et al., 2023; Fang et al., 2023a,b). Since labeled training data is scarce and hard to collect, various data augmentation methods have been proposed to enhance GEC performance. Another research line uses pre-trained language models to improve GEC performance. Kaneko et al. (2020) extract external knowledge from language models for GEC training, while Rothe et al. (2021) further treat the language models as a part of the network for GEC training. All the above work has a potential limitation: while the training and test data of GEC tasks are produced by non-native speakers, the data used for augmentation or pre-training are mainly native texts. This style discrepancy is a threat to GEC data augmentation. Madnani et al. (2012) and Zhou et al. (2020) propose to use machine-translated text for GEC data augmentation, but their intuition is not to use text with a similar style but to produce noisy text through machine translation. Our approach focuses on the style mismatch problem by introducing translationese (human-translated texts) as input for data augmentation, which also offers a plausible explanation for the improvements they report.
Translationese refers to the unusual properties of human-translated texts and has thus become an alternative name for such texts. A reason might be that translators are affected by the style of the source language and ignore the rules of the target language during translation (Gellerstam, 1986). Translationese tends to show less lexical diversity compared to native texts (Stubbs, 1996). Britt et al. (2015) point out that there are many common idioms unconsciously used in native texts. Baker et al. (1993) and Toury (1995) report that translationese has some unique characteristics, e.g., simplification, explicitation and normalization. Rabinovich et al. (2016) provide a systematic study and find that non-native texts and translationese are closer to each other than to native texts.
One research line discusses the effect of translationese in machine translation tasks, since translationese widely exists in parallel corpora. Graham et al. (2020) reveal the side-effect of using translationese in machine translation evaluation and recommend evaluating only on native texts. Riley et al. (2020) demonstrate that translationese hinders the model from generating more adequate and fluent translations. Another line focuses on identifying translationese in parallel sentences to control the training of downstream tasks. Kurokawa et al. (2009) propose a support vector machine-based classifier to identify translationese, while Riley et al. (2020) use a convolutional neural network-based classifier. Wang et al. (2021) train a classifier that distinguishes between native texts and translationese based on significant differences in their text content.
To the best of our knowledge, the discussion and application of translationese have not yet been introduced to GEC tasks. This paper takes the first step towards using translationese to improve GEC.

Why Translationese?
We first explain why GEC models need other kinds of alternatives as input for data augmentation, and then give preliminary experiments and results to show that translationese can be a decent alternative.
Motivation The performance of GEC systems highly depends on the quality and quantity of annotated training data (i.e., ungrammatical sentences and their grammatical versions). Due to the high cost of collecting such data, research on data augmentation techniques (i.e., utilizing unlabeled data) for GEC has become a popular topic. In the most widely used GEC benchmarks, the CoNLL14 (Ng et al., 2014) and BEA19 (Bryant et al., 2019) shared tasks, the training corpora include the NUS Corpus of Learner English (NUCLE) (Dahlmeier et al., 2013), the Lang-8 Corpus (Tajiri et al., 2012), FCE v2.1 (Yannakoudakis et al., 2011) and W&I (Yannakoudakis et al., 2018), all of which are produced by non-native language learners. However, existing methods directly use native texts as input for data augmentation for GEC tasks. For example, Kiyono et al. (2019) and Kaneko et al. (2020) use Wikipedia data, while Zhao et al. (2019) and Grundkiewicz et al. (2019) utilize the One Billion Word Benchmark (Chelba et al., 2013) data.
Previous studies have validated that there exists a style gap between native and non-native texts (Rabinovich et al., 2016). We argue that such a gap brings a side-effect to model performance, limiting further improvements of GEC data augmentation. Utilizing non-native texts might be a better choice. However, few non-native text resources exist, and it is not easy to collect such text from scratch and guarantee its quality. This motivates us to find alternatives that are easy to obtain, of high quality, and closer in style to the non-native text of GEC tasks.
Preliminary Experiments Rabinovich et al. (2016) have shown that non-native texts and translationese are closer to each other than either of them is to native texts. Motivated by this finding, we explore the similarities between GEC data and translationese on the English and Chinese GEC tasks. We compare our collected native texts and GEC data on the properties of lexical richness, cohesive markers, collocations, pronouns, content words, and function words. To make a fair comparison, we directly use the same data provided by Rabinovich et al. (2016) and Su and Li (2016) to reproduce the results of native texts and translationese. The settings are shown in Appendix A.1.
As shown in Figure 1, the trend of our collected native texts and GEC data is consistent with that of the native texts and translationese provided by existing work. For example, both the translationese and the GEC data have lower lexical richness and contain more cohesive markers and function words than the native texts. One outlier is the result for English pronouns, which is caused by the overuse of personal pronouns such as 'I' and 'you' in the GEC data. The result for Chinese pronouns, however, follows the same trend. The above results confirm our assumption that translationese and GEC data are closer in style to each other than to native texts.
Figure 2: Overall framework of TransGEC. Specifically, the native source monolingual text is translated to machine-translated text through a trained machine translation system. The translationese is identified via the BERT classifier, which is fine-tuned with the same amount of machine-translated text and native target monolingual text. The obtained translationese is injected with specific noise to produce a synthetic GEC corpus, which is merged with the original GEC corpus to train a GEC system.

TransGEC
The observations made above enable us to further improve GEC with translationese. Figure 2 shows the overall framework of TransGEC, which contains two parts: obtaining translationese and improving GEC with translationese.
Obtaining Translationese Existing parallel corpora of machine translation (MT) tasks (Bojar et al., 2017) have a huge amount of translationese on both sides. However, most parallel corpora do not annotate whether an instance is native or translated. Therefore, previous studies (Kurokawa et al., 2009; Riley et al., 2020) have had to train a classifier to identify and obtain translationese from parallel corpora. In this paper, to obtain translationese from existing parallel training corpora of MT, we propose to fine-tune BERT-based classifiers (Devlin et al., 2019) using a small amount of machine-translated text, which alleviates the limitation of Riley et al. (2020), who rely on a large amount of machine-translated text to train a convolutional neural network-based classifier from scratch. Specifically, given a parallel corpus $D_{mt}$, we first train a machine translation model $f_{x \to y}$ that translates a source sentence $x$ to a target sentence $y$. Then, the machine-translated texts $Y_{mt}$ can be obtained by translating the native source sentences:
$$Y_{mt} = f_{x \to y}(X_{native}),$$
where $X_{native}$ denotes native source texts, which can be easily collected (e.g., WMT News Crawl).
Given the generated $Y_{mt}$ and the collected $Y_{native}$, we fine-tune the BERT-based pre-trained language model as a classifier to distinguish whether a sentence is native or not. After that, we use the fine-tuned BERT-based classifier to label the target side of the parallel corpus $D_{mt}$, and identify the sentences that have low classification probabilities of being native texts as translationese $Y_{trans}$.
Improving GEC with Translationese This part exploits the obtained translationese $Y_{trans}$ as input for GEC data augmentation. Motivated by Zhao et al. (2019), artificial noise is added to $Y_{trans}$, and the synthetic GEC corpus $D_{syn}$ can be viewed as:
$$D_{syn} = \{(\delta(y), y) \mid y \in Y_{trans}\},$$
where $\delta(\cdot)$ denotes the noise operator with the following four types of noise: 1) deletion, randomly delete a token in the sentence; 2) insertion, randomly add a token into the sentence; 3) replacement, randomly select a token from the vocabulary to replace a token in the sentence; 4) word order, shuffle the words in the sentence by adding a Gaussian bias to their positions and then reordering the sentence accordingly.
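The noise operator described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the function names `add_noise` and `make_synthetic_pair`, the per-token application of each noise type, the whitespace tokenization, and the default probabilities and standard deviation are all assumptions made for the example.

```python
import random


def add_noise(tokens, vocab, p_del=0.05, p_ins=0.1, p_rep=0.2, sigma=0.5, rng=None):
    """Corrupt a clean token list with four noise types (hypothetical defaults)."""
    rng = rng or random.Random()
    noisy = []
    for tok in tokens:
        r = rng.random()
        if r < p_del:
            continue  # deletion: drop this token
        if r < p_del + p_rep:
            noisy.append(rng.choice(vocab))  # replacement: random vocabulary token
        else:
            noisy.append(tok)  # keep the token unchanged
        if rng.random() < p_ins:
            noisy.append(rng.choice(vocab))  # insertion: extra random token
    # word order: add a Gaussian bias to each position, then sort by biased position
    keys = [i + rng.gauss(0.0, sigma) for i in range(len(noisy))]
    order = sorted(range(len(noisy)), key=keys.__getitem__)
    return [noisy[i] for i in order]


def make_synthetic_pair(sentence, vocab, **kwargs):
    """Return a (noisy source, clean target) pair for one translationese sentence."""
    clean = sentence.split()  # assumption: whitespace tokenization
    return " ".join(add_noise(clean, vocab, **kwargs)), sentence
```

Applying `make_synthetic_pair` to every sentence in the identified translationese yields (noisy, clean) pairs in the same source/target orientation as ordinary GEC training data. A small standard deviation keeps the reordering local, so most tokens stay near their original positions.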
After that, we can train a GEC model with the original corpus $D_{gec}$ and the synthetic corpus $D_{syn}$:
$$\hat{\theta} = \arg\max_{\theta} \sum_{(s, t) \in D_{gec} \cup D_{syn}} \log P(t \mid s; \theta),$$
where $s$ denotes a noisy (ungrammatical) sentence and $t$ denotes its corresponding corrected (grammatical) version. The model parameters $\theta$ can be randomly initialized or initialized from large-scale pre-trained language models.

Obtaining Translationese
Setup We conduct experiments on English, German, Russian and Chinese. We treat WMT17 News Crawl data in English, German and Russian as their native texts, and use Chinese News as Chinese native texts. We deduplicate the data and remove sentences longer than 70 tokens. The pre-trained Chinese⇒English translation model (Wu et al., 2019) is used to generate English machine-translated texts from native Chinese News. To obtain German, Russian and Chinese machine-translated texts, we translate the native English texts using the pre-trained English⇒German (Ott et al., 2018) and English⇒Russian (Ng et al., 2019) models, and our own English⇒Chinese translation model (37.7 BLEU (Papineni et al., 2002) on newstest17). We use 1M native texts and 1M machine-translated texts to fine-tune the BERT-based translationese classifiers (Devlin et al., 2019) for each language. The settings for fine-tuning the BERT-based classifiers are listed in Appendix A.2. We use the classifiers to identify translationese and native texts from the target side of the UN Chinese⇔English and UN English⇒Russian (Ziemski et al., 2016) corpora, and the WMT16 English⇒German corpora.

Results
The confidence threshold for identifying translationese (native texts) is set to >0.9 (<0.1). We evaluate the fine-tuned BERT-based classifiers by F1 score on WMT test sets, which consist of native texts and translationese in equal numbers (Zhang and Toral, 2019). Compared with the F1 score of 0.85 reported by Riley et al. (2020) on the English⇒German newstest15, our classifier achieves 0.91 F1 on the same test set. For English, Chinese and Russian, our classifiers score 0.94, 0.80, and 0.85 F1 on the Chinese⇒English newstest17, English⇒Chinese newstest17, and English⇒Russian newstest17, respectively.
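The thresholding and evaluation steps can be illustrated with a short Python sketch. It assumes that the classifier outputs, for each sentence, a probability of belonging to the translationese class; the function names `split_by_confidence` and `f1_score` and the label strings are hypothetical and chosen only for this example.

```python
def split_by_confidence(sentences, p_translationese, hi=0.9, lo=0.1):
    """Keep only confident predictions; sentences in between are discarded."""
    pairs = list(zip(sentences, p_translationese))
    translationese = [s for s, p in pairs if p > hi]
    native = [s for s, p in pairs if p < lo]
    return translationese, native


def f1_score(gold, pred, positive="translationese"):
    """F1 of the positive class from gold and predicted label lists."""
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Using two thresholds rather than a single cut-off at 0.5 trades corpus size for label purity: sentences on which the classifier is uncertain are simply not used.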
Finally, 6.9M English and 5.8M Chinese translationese sentences are selected from the UN corpus. Due to the small amount of training data for the German and Russian GEC tasks, we sample 50K Russian and 120K German translationese sentences from the UN Russian and WMT16 German corpora, respectively. We present classified examples in Appendix A.3.
For the main English experiments, we use the distilled cLang-8 corpus as the training data, which is a clean version of the Lang-8 data (Rothe et al., 2021). The CoNLL13 (Ng et al., 2013) and the widely used official-2014.combined.m2 version of CoNLL14 (Ng et al., 2014) are used as validation and test sets, respectively. For Chinese, we use the official training and test data of NLPCC18 (Zhao et al., 2018), which are also produced by second language learners. We follow Zhao and Wang (2020) to randomly select a subset from the training data as the development set. For German and Russian, we use the same 10M synthetic dataset as Náplava and Straka (2019) for pretraining and then follow them by finetuning on the Falko-MERLIN (Boyd et al., 2014) German dataset and the RULEC-GEC (Rozovskaya and Roth, 2019) Russian dataset. These datasets are also learner corpora. Table 1 presents the statistics of the data we used.
For generating synthetic data, we corrupt the translationese with four rules: deletion, insertion, replacement, and word order. For the first three rules, we test six groups of translationese corruption probabilities. As presented in Table 2, the choice of corruption probabilities does not make a big difference in the results. We choose the probabilities of 0.05, 0.1, 0.2 in our experiments, as this group works best of the six. For word order, we shuffle the words by adding a Gaussian bias with a standard deviation of 0.5 to their positions and then reordering the words.

Models and Training
For preliminary English experiments, the GEC models are based on the Transformer architecture and implemented using the open-source toolkit fairseq (Ott et al., 2019).
We follow the default TRANSFORMER-BASE settings to initialize our model with a shared embedding. The other settings are listed in Appendix A.4.
The main experiments for English and Chinese are based on the large variants of the T5 (Raffel et al., 2019) and mT5 (Xue et al., 2020) models. We follow Rothe et al. (2021) to fine-tune the pre-trained models on the English cLang-8 GEC data. In addition, we fine-tune the pre-trained models on Chinese GEC data. The details of the fine-tuning settings for T5 and mT5 are listed in Appendix A.5. For German and Russian, we follow Náplava and Straka (2019) in using the TRANSFORMER-BIG architecture, implemented with the tensor2tensor toolkit (Vaswani et al., 2018). For the pretraining and finetuning procedure and the parameters, we use the settings in their repository.
The M2 scorer (Dahlmeier and Ng, 2012) is used for evaluating our models on the CoNLL14 English, Falko-MERLIN German, RULEC-GEC Russian, and NLPCC18 Chinese GEC tasks. The ERRANT scorer (Bryant et al., 2019) is used for evaluating on the BEA19 English task. We run experiments with three different random seeds and report the averaged scores. To test the significance of the results, we adopt the t-test implementation in the SciPy toolkit.
Augmentation ratio Before conducting the experiments, we first investigate the effect of the proportion of synthetic data on model performance. As shown in Figure 3, there are three types of data: Native, Translationese and Mix (a mixture of native texts and translationese). We combine them with the original cLang-8 GEC data using different ratio settings (i.e., 1:0, 2:1, 1:1, 1:2). When the ratio is set to 1:1, the best performance is achieved in all data groups. The experiments in the subsequent sections therefore use the augmentation ratio of 1:1.
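The ratio settings above can be made concrete with a small Python sketch. It assumes that each ratio is read as original GEC data to synthetic data, and that the synthetic side is subsampled to match; the function name `mix_corpora` and its parameters are hypothetical and are not taken from the released code.

```python
import random


def mix_corpora(gec_pairs, syn_pairs, gec_parts=1, syn_parts=1, seed=0):
    """Combine original and synthetic pairs at a gec_parts:syn_parts ratio."""
    rng = random.Random(seed)
    # number of synthetic pairs implied by the ratio, capped by what is available
    n_syn = min(len(syn_pairs), len(gec_pairs) * syn_parts // gec_parts)
    mixed = list(gec_pairs) + rng.sample(list(syn_pairs), n_syn)
    rng.shuffle(mixed)  # interleave the two sources before training
    return mixed
```

Under this reading, the 1:0 setting corresponds to training on the original GEC data alone, and 1:1 adds as many synthetic pairs as there are original pairs.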

Preliminary Results
Table 3 presents the F0.5 results on the BEA19 English GEC task. The Transformer model trained with translationese (i.e., +TRANS.) achieves the best result on the BEA19 non-native W&I and ALL test sets, with improvements of 2.3 and 1.7 F0.5 points over the BASE. model, respectively. On the BEA19 native LOCNESS test set, in contrast, the model trained with native texts (i.e., +NATIVE) achieves the best F0.5 score. This supports our assumption that using texts with a similar style for GEC data augmentation is beneficial for GEC tasks.

Main Results
Table 4 presents the results obtained on the CoNLL14 English, NLPCC18 Chinese, Falko-MERLIN German, and RULEC-GEC Russian GEC tasks. For the Transformer (i.e., TRANSFORMER) models, all three types of synthetic data surpass the baseline (i.e., BASE.), thus confirming the effectiveness of GEC data augmentation. The model trained with translationese (i.e., +TRANS.) achieves the highest precision and F0.5 scores when compared to the BASE. and +NATIVE models across the English, Chinese, German, and Russian GEC tasks.
We also employ pre-trained GEC models (PRE-TRAINED) and fine-tune the T5-Large model for English, as well as the mT5-Large model for Chinese. For German and Russian, however, we build upon the strong LRGEC baseline (Náplava and Straka, 2019) to conduct further experiments, as the mT5 LARGE baselines exhibit slightly lower performance (see Appendix A.6). Table 4 also shows that our +TRANS. method achieves the best results compared to the BASE. and +NATIVE models across the English, Chinese, German, and Russian GEC tasks. To ensure comparability, we randomly select half of the native texts and half of the translationese (i.e., +MIX) for training the GEC models. The results indicate that the F0.5 scores of the +MIX models are higher than those of the +NATIVE models but lower than those of the +TRANS. models. Notably, the models trained with translationese (i.e., +TRANS.) outperform all other models in terms of precision and F0.5 for all languages; the only exception is recall for Chinese. While the recall score of the +TRANS. model may not be the highest, the evaluation of GEC tasks typically places greater emphasis on precision and F0.5 scores, since neglecting a correction is not as bad as proposing a wrong correction (Ng et al., 2014). Appendix A.7 shows examples produced by the Native and Translationese English GEC models, providing further insights. We also include the results on the BEA19 test set in Appendix A.8, which show the same trend. We attribute these gains to the fact that translationese maintains stylistic consistency with the original GEC training data, which facilitates the GEC models' acquisition of knowledge.
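To make the emphasis on precision concrete, the standard F-beta formula can be written as a short Python function. The function name `f_beta` is chosen for this illustration, and the precision and recall values in the comments are hypothetical numbers, not results from this paper.

```python
def f_beta(precision, recall, beta=0.5):
    """F-beta score; beta < 1 weights precision more heavily than recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical systems with the same F1 but opposite strengths:
#   f_beta(0.6, 0.3) is about 0.50 (precise but conservative)
#   f_beta(0.3, 0.6) is about 0.33 (many corrections, half as precise)
```

With beta set to 0.5, a system that proposes fewer but more reliable corrections is scored higher than one with the same F1 that proposes many wrong corrections, which matches the rationale given above.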
Compared to Existing Methods The MASKGEC (Zhao and Wang, 2020) model dynamically inserts noise into the source sentences for GEC. It is a strong baseline for the Chinese NLPCC18 benchmark. MUCGEC (Zhang et al., 2022a) is also included in the comparison. TAGGEC (Stahlberg and Kumar, 2021) uses an error-tagged corruption model to produce synthetic data for the GEC task. LRGEC (Náplava and Straka, 2019) focuses on GEC in low-resource scenarios and utilizes synthetic parallel data to improve them. ESCGEC (Qorib et al., 2022) combines different strong GEC systems and is the SOTA model for English GEC. (M)T5 LARGE/XXL (Rothe et al., 2021) is fine-tuned from the (m)T5 large/xxl pre-trained models with the same cLang-8 data used in our experiments (i.e., BASE.). gT5 XXL (Rothe et al., 2021) is the largest GEC teacher model used for distilling Lang-8 data for different languages, and it is the SOTA model for multilingual GEC. As shown in Table 4, our proposed method (i.e., +TRANS.), built on the strong (M)T5 LARGE and LRGEC baselines, consistently improves correction accuracy for the English, Chinese, German and Russian GEC tasks.

Analysis
In this section, we analyze our results from two perspectives: error types and linguistic properties.
Error Types We investigate the performance on different error types for the English and Chinese GEC tasks. We use the ERRANT toolkit (Bryant et al., 2017) for English. For Chinese, we use the adapted ERRANT released by Hinson et al. (2020). As shown in Table 5, for English, the GEC system augmented with translationese performs well in correcting all types of errors. For Chinese, the GEC system augmented with translationese is good at correcting missing words and substitution errors. The performance gap between Chinese and English might be caused by their different sentence structures. Our approach is more effective at improving the correction accuracy of the major difficulties, i.e., missing words (17.9%/38.0%) and substitution errors (64.3%/54.0%) on the English/Chinese GEC benchmarks. However, there is still some room for improvement in minor issues (e.g., correction of word order and deletion errors).
Linguistic Properties We study two linguistic properties: word frequency and position. The detailed settings are presented in Appendix A.9. As shown in Figure 4, the +NATIVE and +MIX methods are better than the +TRANS. method at correcting rare words, but fail to correct words with higher frequency. The reason might be that the lexical diversity of native texts is higher than that of translationese. Furthermore, we count the proportions of frequent/medium/rare tokens in the training data, which are 90.3%/6.1%/3.6% for English and 91.7%/5.3%/3.0% for Chinese. Since frequent tokens dominate the training data, this means our method can mitigate the primary challenge in GEC tasks. In terms of position, the improvement at the left position is lower than those at the middle and right positions in the English/Chinese GEC tasks. This might be because English and Chinese are right-branching languages, which usually describe the main subject first and provide the key information at the tail of the sentence to explain the subject (Payne, 2006).
It may also be that the middle and right parts of the sentences benefit from more preceding context. The result of the +TRANS. GEC system is consistently superior to that of the +NATIVE GEC system. This confirms that using augmentation data with a similar style to GEC data is beneficial to GEC models.

Conclusion
This paper introduces TransGEC, a method that uses translationese as input for data augmentation of GEC. Preliminary experiments on native texts, translationese, and GEC data confirm that translationese and GEC data are closer in style to each other than to native texts. Based on this evidence, we propose a simple and effective method to mine translationese from parallel corpora with classifiers and to construct a synthetic GEC corpus by adding artificial noise to the translationese. Experimental results on the CoNLL14 and BEA19 English, NLPCC18 Chinese, Falko-MERLIN German, and RULEC-GEC Russian benchmarks show that the models augmented with translationese can outperform strong baselines. Further analyses show that our approach performs well in solving major difficulties (e.g., correction of frequent words, missing words, and substitution errors), but still has some room for improvement in minor issues (e.g., correction of rare words, word order, and deletion errors).

Limitations
There are two limitations of this work. The first is that our approach is trained and evaluated only on sequence-to-sequence models; we have not verified it on the sequence-to-edit architecture.
In future work, we will verify our approach on sequence-to-edit models. The other limitation is that using translationese as input for data augmentation does not improve every aspect of grammatical error correction. Specifically, our approach still has room for improvement in correcting rare words, word order, and deletion errors.

Ethics Statement
We utilize various datasets in our experimental analysis, including the UN v1.0 corpora (Ziemski et al., 2016), the Chinese News, and the WMT dataset (Bojar et al., 2017) in the classification experiments, as well as the cLang8 (Rothe et al., 2021), CoNLL14 (Ng et al., 2014), BEA19 datasets (Bryant et al., 2019), NLPCC18 (Zhao et al., 2018), Falko-MERLIN (Boyd et al., 2014) and RULEC-GEC (Rozovskaya and Roth, 2019) in the GEC experiments.All of these datasets are publicly available resources and acquired for research purposes.We affirm our commitment to the responsible and ethical use of data throughout this research paper.
The utilization of data in this study strictly adhered to relevant legal and ethical guidelines.

A Appendix
A.1 Details of Quantifying Data Properties
One hypothesis is that the distribution of GEC data is similar to that of translationese. To verify our hypothesis, we follow the quantifying method proposed by Rabinovich et al. (2016) and Su and Li (2016) to explore the linguistic properties of the English and Chinese GEC data. If the statistical results are close, the data are similar in terms of different linguistic properties.
Data For the English data, we use the native texts and translationese released by the European Parliament Proceedings (Koehn, 2005).Additionally, we combine the native texts with WMT17 News Crawl monolingual data as the final native data.
For the Chinese data, we use Lancaster Corpus of Mandarin Chinese (LCMC) (McEnery and Xiao, 2004) and People's Daily data as native language data.The ZJU Corpus of Translational Chinese (ZCTC) (Xiao et al., 2008) is used as the translationese.The English and Chinese GEC training data keep the same setting as mentioned in Section 5.For all the data types, we report normalized statistical results measured on 780k and 800k tokens for English and Chinese language, respectively.
Lexical richness Lexical richness is measured by the type-token ratio (TTR). Stubbs (1996) and Xiao (2010) point out that the lexical richness of native texts is larger than that of translationese in both English and Chinese. Our results show the same TTR trend as the results reported by Rabinovich et al. (2016) and Su and Li (2016).
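As a minimal illustration of the measurement, the type-token ratio can be computed as follows. The function name `type_token_ratio`, the lowercasing step, and the optional truncation parameter are assumptions made for this sketch; truncating every corpus to the same number of tokens is one simple way to keep the comparison fair, since TTR tends to decrease as a text gets longer.

```python
def type_token_ratio(tokens, max_tokens=None):
    """Number of distinct tokens divided by the total number of tokens."""
    if max_tokens is not None:
        tokens = tokens[:max_tokens]  # compare corpora on equal-sized samples
    if not tokens:
        return 0.0
    types = {t.lower() for t in tokens}  # assumption: case-insensitive types
    return len(types) / len(tokens)
```

A lower value indicates that the same words are reused more often, which is the pattern reported here for translationese and GEC data relative to native texts.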
Cohesive markers Connectives, which illustrate the logical relationships in sentence structure (Koppel and Ordan, 2011; Su and Li, 2016), are more commonly used in translationese than in native texts. To verify this property, we collect about 116 cohesive markers for English and 150 for Chinese. We measure the frequency with which these cohesive markers appear in the four data types. The results show that the connective frequencies of translationese and GEC data are higher than those of the native texts in both English and Chinese.
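The frequency measurement can be sketched in Python as below. This is an illustrative simplification: the function name `marker_frequency`, the normalization per 1,000 tokens, and the restriction to single-word markers are assumptions, and the marker set in any real run would be the collected list rather than the toy set used in a quick check.

```python
from collections import Counter


def marker_frequency(tokens, markers, per=1000):
    """Occurrences of the given single-word markers per `per` tokens."""
    if not tokens:
        return 0.0
    counts = Counter(t.lower() for t in tokens)
    hits = sum(counts[m] for m in markers)  # missing markers count as zero
    return per * hits / len(tokens)
```

Normalizing by corpus length makes the values comparable across data types of different sizes. Multi-word connectives would need phrase matching instead of token counting.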
Collocations Native language speakers tend to use common and frequent collocations (Britt et al., 2015). We collect about 8,300 commonly used collocations for English and 6,100 for Chinese. We measure the frequency with which these collocations are used in the four data types. The results show that our native language data and GEC data have a frequency distribution similar to the results reported by previous studies (Rabinovich et al., 2016; Su and Li, 2016).
Pronouns The usage of pronouns differs between Chinese and English. For English, translators prefer to write the actual nouns rather than pronouns, which reflects the principle of explicitation (Olohan, 2002). However, Chinese translators are often influenced by the source text and directly translate the pronoun (Su and Li, 2016). We measure the frequency of pronouns in the four data types. The results show that the trends in Chinese are consistent with the results reported by Su and Li (2016). For English, the GEC data contains more pronouns, such as "I" and "you", than our own native data.

Content Words and Function Words
We use the Stanford POS tagger to annotate content and function words for both English and Chinese.
For content words, we calculate the frequency of adjectives, pronouns, nouns, and verbs in the four data types. For function words, we calculate the frequency of conjunctions, adverbs, determiners, and prepositions. The results show that translationese tends to use more function words to make the sentences simple and explicit (Su and Li, 2016). In addition, the frequency distribution in translationese is similar to that in GEC data.

A.2 Settings of BERT Classifier
The hyper-parameter settings for fine-tuning the BERT classifiers are listed in Table 6.

Native
She urged States to bear in mind the importance of ensuring and maintaining the contextual space for the activities of human rights defenders, including the right to peaceful assembly, in combination with the rights entailed in relation to freedom of expression and association.

Translationese
Ms. Andersen (Denmark) said that sexual harassment in the workplace was strictly prohibited and that protection was available through the Gender Equality Board and the courts.

Translationese
An appropriate legal framework would ensure the validity and enforceability of electronic transactions in all circumstances and create certainty in such an important area of law.

Non-native
Because some of my classmates make great progress in the exam and they catch up with me and some of them even surpass me.

Non-native
The students are so nice and obedient, which is very good for me because I am a beginner.

The native and translationese examples listed in Table 7 were distinguished by our proposed BERT-based classifier.
It can be seen that the native texts contain collocations (idioms) such as "with a view to" and "bear in mind", whereas the translationese and second language learners' (non-native) data hardly contain them. The translationese and non-native texts contain more cohesive markers, such as "and" and "because", than the native texts. In addition, native texts tend to use pronouns, whereas translationese and second language learners' data tend to give specific content, which reflects the characteristic of explicitation.
Overall, the examples show that translationese resembles the second language learners' data in many aspects.

A.4 Settings of GEC Models Training
The hyper-parameter settings for training the Transformer GEC models are listed in Table 8.
A.5 Settings of (m)T5 Fine-tuning
Table 10 shows that the model augmented with translationese (i.e., +TRANS) outperforms the BASE and +NATIVE methods on the German and Russian GEC benchmarks with mT5 LARGE models. Although our results do not reach the strong LRGEC baseline (Náplava and Straka, 2019), they still confirm the effectiveness of our approach compared with GEC models fine-tuned on the same training data and model settings (Rothe et al., 2021). The training settings for the aforementioned models are presented in A.5.

A.7 Case Study for GEC Models Outputs
Table 12 shows some outputs generated by the native and translationese GEC models. Taking English as an example, the translationese GEC model corrects ungrammatical sentences better than the native GEC model, which indicates that using translationese as input for GEC data augmentation can improve performance.
A.8 Results on the BEA19 English Test
Table 13 shows that the model augmented with translationese (i.e., +TRANS) outperforms the other settings on the BEA19 W&I non-native test set and the BEA19-ALL test set. However, the +NATIVE method performs best on the BEA19 LOCNESS native test set. After borrowing knowledge from the T5 pre-trained model, the trend remains consistent and the results are promising. Overall, the results confirm the effectiveness of using texts of a similar style as input for data augmentation.
A.9 Details of Linguistic Properties Settings
Word frequency and word position reflect the performance of GEC systems from the perspective of word-level accuracy and sentence structure, respectively. We use the compare-MT toolkit to compare the outputs of the BASE, NATIVE, MIX, and TRANS GEC models by F-measure. Taking the result of the BASE model as a baseline, we report the improvements of each GEC model.
Word Frequency: We count word frequencies on the target side of the English and Chinese GEC training sets and divide the tokens into three categories according to their frequency. Following Wang et al. (2020), for both English and Chinese we place the 3,000 most frequent tokens in the Frequent bucket, the tokens ranked 3,001-12,000 in the Medium bucket, and all others in the Rare bucket.
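The bucketing rule for word frequency can be sketched as below. This is a minimal illustration under stated assumptions, not the toolkit's own implementation: the function names are hypothetical, ties in frequency are broken by first occurrence, and tokens unseen in training are assumed to fall into the Rare bucket.

```python
from collections import Counter


def build_frequency_buckets(tokenized_sentences, frequent_cutoff=3000, medium_cutoff=12000):
    """Map each training token to 'Frequent', 'Medium', or 'Rare' by frequency rank."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    buckets = {}
    # most_common() yields tokens from most to least frequent.
    for rank, (tok, _) in enumerate(counts.most_common(), start=1):
        if rank <= frequent_cutoff:
            buckets[tok] = "Frequent"
        elif rank <= medium_cutoff:
            buckets[tok] = "Medium"
        else:
            buckets[tok] = "Rare"
    return buckets


def bucket_of(token, buckets):
    """Look up a token's bucket; unseen tokens are treated as Rare (an assumption)."""
    return buckets.get(token, "Rare")
```

The default cutoffs mirror the 3,000 and 12,000 rank thresholds stated above; smaller values are convenient for quick checks on toy data.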
Position: From the perspective of sentence structure, the behavior of GEC models may be different at different positions of the sentence.We divide the sentences into three buckets that have equal length and categorize the token into three types based on

Figure 1: Four kinds of texts in English and Chinese. Native (Others) and Translationese (Others) represent our reproduced results based on the released English data by Rabinovich et al. (2016) and the Chinese data by McEnery and Xiao (2004) and Xiao et al. (2008). Native (Ours) refers to the results based on our collected native text (i.e., the WMT News Crawl data for English and the People's Daily data for Chinese), and GEC refers to the results of the original GEC data (i.e., CoNLL14 English and NLPCC18 Chinese benchmarks). The vertical axis represents the normalized statistical results for each linguistic property, where a higher value indicates a greater proportion of that linguistic property. The style of translationese is similar to that of original GEC data.

Figure 2: The overall framework of TransGEC. The left half obtains translationese from the target side of the parallel corpus, and the right half uses the obtained translationese as input for GEC data augmentation. Specifically, the native source monolingual text is translated into machine-translated text by a trained machine translation system. The translationese is identified by the BERT classifier, which is fine-tuned with equal amounts of machine-translated text and native target monolingual text. The obtained translationese is injected with specific noise to produce a synthetic GEC corpus, which is merged with the original GEC corpus to train a GEC system.

Figure 3: Results of the different types of synthetic data combined with original cLang-8 GEC data with different combination ratios on the CoNLL14 test set.

Figure 4: Improvements from exploiting different types of texts for augmentation in terms of word frequency and position on the English and Chinese GEC tasks. Overall, the translationese method (i.e., TransGEC) brings more benefits to GEC in terms of linguistic properties. We discuss the outlier of correcting rare words in the main text.

Table 1: Statistics of the data sets used. Data marked with * is native, while the others are non-native data.
... et al., 2012), NUCLE (Dahlmeier et al., 2013), and W&I (Yannakoudakis et al., 2018). The development and test sets of BEA19 consist of W&I and LOCNESS (Granger, 1998): W&I consists of three levels of non-native texts, and LOCNESS is native text. Specifically, we use W&I dev and LOCNESS dev as the validation sets when testing the performance on the W&I test set and LOCNESS test set, respectively.

Table 2: F0.5 scores for different groups of deletion, insertion, and substitution probabilities used to corrupt translationese. Bold values indicate the best result.

Table 3: F0.5 scores on the BEA19 English benchmark. BASE uses the original BEA19 training data. ALL is the full BEA19 test set. +NATIVE combines native texts with the base GEC data, +TRANS (the TransGEC method) uses translationese, and +MIX uses half native texts and half translationese. Bold values indicate the best results.

Table 4: Results on the CoNLL14 English, NLPCC18 Chinese, Falko-MERLIN German, and RULEC-GEC Russian GEC tasks. BASE refers to the method using the GEC training data. The PRE-TRAINED models for our BASE methods are based on (m)T5-large models for English and Chinese, and are built upon the strong baseline LRGEC (Náplava and Straka, 2019) models for German and Russian. Native texts and translationese are identified from the same domain. +NATIVE can be seen as the method proposed by Zhao et al. (2019), who use native texts for augmentation. +TRANS refers to the synthetic data generated from translationese. +MIX means the synthetic data is made up of half native texts and half translationese. (m)T5 LARGE/XXL results indicate the models fine-tuned on cLang-8 GEC data, as reported by Rothe et al. (2021). Statistically significant improvements over the +NATIVE method are marked by p-value: † p < 0.05 and ‡ p < 0.01.

Table 5: Performance by error type when using different kinds of texts for augmentation. We give the ratio of each type. Bold values indicate the best F0.5 score in each row. The model augmented with translationese is better at correcting missing words and substitution errors.
Yvette Graham, Barry Haddow, and Philipp Koehn. 2020. Statistical power and translationese in machine translation evaluation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 72-81, Online. Association for Computational Linguistics.
Native
He would continue consultations in 2008 with a view to holding the next Conference session in a new region, to reinforce Member States' ownership of the Organization.
A.3 Case Study for the Identified Texts
We present examples of English native texts and translationese distinguished from the UN corpus.

Table 7: Examples of the native texts and translationese distinguished by the BERT-based pre-trained classifier. Native (Translationese) refers to the examples of native (translationese) texts. Non-native refers to the examples of GEC training data. Words in red represent the characteristics of native texts. Words in blue resemble the characteristics of second language learners.
Betas: β1 = 0.9, β2 = 0.98 | β1 = 0.9, β2 = 0.998 | β1 = 0.9, β2 = 0.98 | β1 = 0.9, β2 = 0.98

Table 8: Hyper-parameters for training English, Chinese, German, and Russian GEC models. Model Arch. refers to model architecture, LR is learning rate, Att. Drop. means attention dropout, and Act. Drop. means activation dropout.

Table 9: Hyper-parameters for fine-tuning English, Chinese, German, and Russian GEC models. Model Arch. refers to model architecture. LR denotes the learning rate.

Table 11: Statistics of the data sets for training and fine-tuning the German and Russian GEC models.

A.6 Results for German and Russian Trained on cLang-8 Datasets
Table 11 presents the statistics of the cLang-8 data used for fine-tuning the German and Russian GEC tasks based on the mT5 large pre-trained model.