Beyond Grammatical Error Correction: Improving L1-influenced research writing in English using pre-trained encoder-decoder models

In this paper, we present a new method for training a writing improvement model adapted to the writer's first language (L1) that goes beyond grammatical error correction (GEC). Without using annotated training data, we rely solely on pre-trained language models fine-tuned with parallel corpora of reference translations aligned with machine translations. We evaluate our model with corpora of academic papers written in English by L1 Portuguese and L1 Spanish scholars and a reference corpus of expert academic English. We show that our model is able to address specific L1-influenced writing and more complex linguistic phenomena than existing methods, outperforming what a state-of-the-art GEC system can achieve in this regard. Our code and data are open to other researchers 1.


Introduction
Writing for international readerships can be challenging for scholars whose working language is not English (Flowerdew, 2019), even when they are reasonably proficient in the language. However, existing research on tools for improving writing focuses mostly on Grammatical Error Correction (GEC), and often relies on college-level corpora as a benchmark for lower English proficiency levels. While GEC can address more straightforward grammar mistakes, it does not adequately address fluency and more complex linguistic issues (Napoles et al., 2017). Moreover, few studies address the effects of first-language (L1) transfer on writing (Nadejde and Tetreault, 2019).
Another limitation is that state-of-the-art methods in GEC, such as neural machine translation (NMT)-based approaches (Sennrich et al., 2016a) and transformer-based sequence-to-sequence models (Vaswani et al., 2017), typically require a large amount of pseudo-errors generated from monolingual data using rule-based corruption (Zhao et al., 2019), back-translation (Kiyono et al., 2019), or round-trip translation (Lichtarge et al., 2019). These methods tend to introduce errors with limited diversity, most of which involve spelling and grammar.
1 https://github.com/gzomer/BeyondGEC
There have been recent attempts to eliminate the time-consuming pre-training step by employing pre-trained transformer models. Alikaniotis and Raheja (2019) used pre-trained transformer models in a language-model setting, and  fine-tuned BART (Lewis et al., 2020) with a small corpus of annotated sentences. Both approaches achieved results comparable to models trained with millions of sentences.
This study proposes a method for improving L1-influenced English texts beyond GEC through the use of pre-trained encoder-decoder models. Our approach uses parallel corpora of English aligned with English that has been machine-translated from Portuguese and Spanish as training data that emulates L1-influenced writing. The models generate sentences that have a higher level of acceptability and are more linguistically diverse than what a state-of-the-art GEC system can achieve. We also propose new metrics for evaluating improvement beyond correctness.

Related work
Chollampatt et al. (2016) adapted a neural language model for three different L1s and Nadejde and Tetreault (2019) adapted a general-purpose neural GEC system to the writer's proficiency level and L1. Both studies were conducted on student corpora. Takahashi et al. (2020) proposed a method to generate more realistic pseudo-errors by considering learners' tendency for errors, but only adjusted the probability of grammatical and spelling mistakes.
More recently,  proposed a novel data-synthesis method to generate error-corrected sentence pairs based on a pair of machine-translation models of different qualities. However, even good-quality NMT output may not be good enough to serve as a reference.

Method
Our method consists of fine-tuning a pre-trained encoder-decoder transformer model with parallel corpora of English and machine-translated English.

Parallel corpora compilation
We create the parallel corpora by machine-translating into English the non-English sentences of a parallel corpus containing English. Our assumption is that machine-translated text can be used as a proxy for L1-influenced English, as it tends to render more literal translations, preserving many lexical and syntactical features of the source language. This can mimic the output of writers using English as a second language (L2) who transfer words and grammar from their L1 when writing in English. The example below illustrates this. Sentence (a), from the Brazilian Academic Corpus of English (BrACE) (Pinto et al., 2021), exemplifies a common problem among Brazilians writing in English: using realization to mean undertaking. This is likely due to the direct translation of the Portuguese false cognate 'realização', as shown in sentence (b), a Portuguese sentence from the EurLex corpus (Baisa et al., 2016). Sentence (c) shows how the same problem occurs when (b) is submitted to MT, but not in the reference translation in (d).
a. BrACE: For the realization of the initial stage of the project, nationwide semi-presential extension courses were proposed [...]
b. PT Source Text: O próprio presidente do governo regional afirmou que estão garantidos os fundos de Bruxelas para a realização das referidas obras.
c. PT-EN MT: The president himself of the regional government said that Brussels funds are guaranteed for the realization of these works.
d. PT-EN Reference Human Translation: The President of the Regional Government himself has said that funds from Brussels are guaranteed to enable the projects to be started.
This study used Spanish and Portuguese as L1. We selected the EUR-Lex Portuguese, Spanish and English subcorpora as a starting point. We corrected alignment issues using the length-based alignment algorithm from Gale and Church (1993).
Next, we cleaned the corpora by removing duplicated sentences and sentences in other languages using the langid 2 package. We also removed sentences with fewer than 10 words to filter out titles. Empty alignment units were discarded.
Finally, we machine-translated the Spanish and Portuguese corpora into English using the OPUS-MT roa-en model 3 (Tiedemann and Thottingal, 2020). We obtained 260k sentences 4 for each of two corpora: (a) the EN_PT-EN corpus of Portuguese machine-translated into English aligned with English and (b) the EN_ES-EN corpus of Spanish machine-translated into English aligned with English.
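The cleaning steps described above (dropping empty alignment units, short sentences, sentences in unexpected languages, and duplicates) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `clean_parallel_corpus`, the `detect_lang` callback, and the choice to apply the 10-word minimum to both sides of each pair are assumptions.

```python
from typing import Callable, Iterable, List, Tuple


def clean_parallel_corpus(
    pairs: Iterable[Tuple[str, str]],
    detect_lang: Callable[[str], str],
    source_lang: str,
    min_words: int = 10,
) -> List[Tuple[str, str]]:
    """Filter (source sentence, English reference) pairs.

    `detect_lang` is a hypothetical callback returning a language code;
    it stands in for whatever language identifier is used in practice.
    """
    seen = set()
    kept = []
    for src, ref in pairs:
        src, ref = src.strip(), ref.strip()
        # Discard empty alignment units.
        if not src or not ref:
            continue
        # Drop short sentences, which are likely titles (assumption: both sides).
        if len(src.split()) < min_words or len(ref.split()) < min_words:
            continue
        # Drop pairs whose sides are not in the expected languages.
        if detect_lang(src) != source_lang or detect_lang(ref) != "en":
            continue
        # Drop exact duplicates, keeping the first occurrence.
        if (src, ref) in seen:
            continue
        seen.add((src, ref))
        kept.append((src, ref))
    return kept
```

With the langid package, the callback could plausibly be `lambda s: langid.classify(s)[0]`, since `classify` returns a (language, score) pair; the surviving source sentences would then be passed to the MT model.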

Model
We used the Text-to-Text Transfer Transformer (T5) model (Raffel et al., 2020) as our encoder-decoder. In T5, each task is treated as a language-generation task where the model is conditioned to generate the correct output based on a textual prompt included in the input sequence (Raffel et al., 2020). We fine-tuned T5 using the parallel corpora we created, with the machine-translated sentence as the input and the English reference as the output. We conditioned the input sentence on the writer's L1 by prepending the task definition improve_english <L1>: to each input, where <L1> is replaced by the writer's L1.
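As a small illustration of the input format described above, the sketch below builds one training example from a machine-translated sentence and its reference. The function name, the dictionary keys, and the way the L1 is spelled inside the prefix (here a language code such as `pt`) are assumptions; the source only specifies the `improve_english <L1>:` pattern.

```python
def build_example(mt_sentence: str, reference: str, l1: str) -> dict:
    """Create one text-to-text training pair conditioned on the writer's L1."""
    # Task definition prepended to every input, following the pattern in the text.
    prefix = f"improve_english {l1}: "
    return {
        "input_text": prefix + mt_sentence,  # machine-translated (L1-like) English
        "target_text": reference,            # human reference English
    }


example = build_example(
    "Brussels funds are guaranteed for the realization of these works.",
    "Funds from Brussels are guaranteed to enable the projects to be started.",
    "pt",
)
```

At inference time the same prefix would be prepended to the sentence to be improved, so that the model sees the format it was fine-tuned on.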

Pre-and post-processing
We added a pre-encoding and a post-decoding step to our model to preserve out-of-vocabulary (OOV) tokens, as academic and scientific texts include a wide range of mathematical and other non-alphanumeric symbols. Adding these symbols to the tokenizer would significantly increase the model vocabulary and decrease its performance. Before encoding, OOV tokens were replaced with special placeholder tokens carrying an id ([KEEP_ID]); after decoding, each placeholder was replaced with the original token.
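The placeholder mechanism can be sketched as below. This is an illustrative reading of the description, with several assumptions: tokens are split on whitespace, vocabulary membership is decided by a hypothetical `is_in_vocab` callback, and placeholders are numbered as `[KEEP_0]`, `[KEEP_1]`, and so on.

```python
import re
from typing import Callable, Dict, Tuple


def mask_oov(sentence: str, is_in_vocab: Callable[[str], bool]) -> Tuple[str, Dict[str, str]]:
    """Replace OOV tokens with numbered placeholders before encoding."""
    mapping: Dict[str, str] = {}
    masked = []
    for token in sentence.split():
        if is_in_vocab(token):
            masked.append(token)
        else:
            placeholder = f"[KEEP_{len(mapping)}]"  # assumed placeholder format
            mapping[placeholder] = token
            masked.append(placeholder)
    return " ".join(masked), mapping


def restore_oov(decoded: str, mapping: Dict[str, str]) -> str:
    """Put the original tokens back after decoding."""
    # Placeholders unknown to the mapping are left untouched.
    return re.sub(r"\[KEEP_\d+\]", lambda m: mapping.get(m.group(0), m.group(0)), decoded)


masked, mapping = mask_oov("the value of α is ≤ 5", lambda t: t.isascii())
restored = restore_oov(masked.replace("value of", "value for"), mapping)
```

Because restoration is keyed on the placeholder id rather than on position, the symbols survive even if the model reorders or rewrites the surrounding words.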

Grammaticality
We define our grammaticality metric as the accuracy of a RoBERTa 5 classifier trained on the CoLA corpus (Warstadt et al., 2019), which contains sentences paired with grammatical acceptability judgments, following a similar approach to that of Krishna et al. (2020).

Acceptability
We measure acceptability based on the assumption that probability is related to naturalness, as shown in Lau et al. (2017). We used SLOR (Pauls and Klein, 2012), as it is particularly effective in neutralizing both sentence length and word frequency, and has yielded the best results in a comparative study of different metrics (Lau et al., 2017). SLOR is calculated as the normalized difference between the sentence log-probability and the unigram sentence log-probability:

\[ \mathrm{SLOR}(S) = \frac{\log p_M(S) - \log p_u(S)}{|S|} \]

where p_M(S) is the probability of sentence S under the language model, p_u(S) is its unigram probability, and |S| is the sentence length.

To calculate the probabilities, we built the Expert Academic Corpus of English (ExpACE), which consists of 10 million words from over 1200 highly cited papers published in high-impact journals (based on h5 index) in eight different subject domains. We removed references and tables, and replaced citations with (CIT). To calculate sentence probabilities, we fine-tuned GPT-2 with ExpACE on a Tesla T4 GPU for 3 epochs with a batch size of 8, using the Adam optimizer with a learning rate of 3e-5. We also built a subword unigram model using byte-pair encoding (BPE) (Sennrich et al., 2016b) as the tokenizer. We multiplied the BPE token probabilities to approximate the word probability.

5 We fine-tuned Hugging Face RoBERTa-base on a Tesla T4 GPU for 3 epochs with a batch size of 32, using the Adam optimizer with a learning rate of 2e-5.
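To make the SLOR computation concrete, the sketch below takes per-token log-probabilities from a language model and from a unigram model and returns the length-normalized difference. It assumes both lists are aligned token by token (same tokenization) and that length is measured in those tokens; how the log-probabilities are obtained from GPT-2 and the unigram model is left out.

```python
from typing import Sequence


def slor(lm_token_logprobs: Sequence[float], unigram_token_logprobs: Sequence[float]) -> float:
    """Length-normalized difference between LM and unigram sentence log-probabilities."""
    if not lm_token_logprobs or len(lm_token_logprobs) != len(unigram_token_logprobs):
        raise ValueError("both sequences must be non-empty and of equal length")
    sentence_logprob = sum(lm_token_logprobs)      # log-probability under the language model
    unigram_logprob = sum(unigram_token_logprobs)  # log-probability under the unigram model
    return (sentence_logprob - unigram_logprob) / len(lm_token_logprobs)


# Toy values: the LM finds the sentence more likely than word frequency alone predicts.
score = slor([-1.0, -2.0], [-3.0, -4.0])
```

Subtracting the unigram term removes the penalty for rare words, and dividing by length keeps long sentences from scoring lower simply because they contain more tokens.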

Lexical and syntactic diversity
As discussed in the introduction, most GEC systems focus on correctness and minor edits (Napoles et al., 2017). However, L1-influenced texts may underuse and overuse certain vocabulary (Pinto et al., 2021), and may require more complex syntactical transformations to read more fluently. Therefore, we propose two new measures for evaluating linguistic diversity at the lexical and syntactical level.
The lexical diversity metric is measured by the sum of lexical changes with different lemmas (e.g. changing provoked to led) normalized by the sentence length. Models focusing mostly on grammaticality (e.g. fixing verb tense from show to shown) will score lower on this metric. We calculated lexical diversity by first word-aligning the original and improved sentence using SimAlign (Jalili Sabet et al., 2020). Lemmas were extracted using WordNet from the nltk package.
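A minimal sketch of the lexical diversity metric, under stated assumptions: the word alignment is given as a list of (source index, target index) pairs, lemmatization is delegated to a hypothetical `lemmatize` callback, and the count is normalized by the length of the original sentence in tokens (the source does not say which sentence length is used).

```python
from typing import Callable, List, Sequence, Tuple


def lexical_diversity(
    src_tokens: Sequence[str],
    tgt_tokens: Sequence[str],
    alignment: List[Tuple[int, int]],
    lemmatize: Callable[[str], str],
) -> float:
    """Share of aligned word pairs whose lemmas differ, relative to source length."""
    if not src_tokens:
        return 0.0
    changed = sum(
        1
        for i, j in alignment
        if lemmatize(src_tokens[i].lower()) != lemmatize(tgt_tokens[j].lower())
    )
    return changed / len(src_tokens)


# Toy lemmatizer standing in for a real one.
toy_lemmas = {"shown": "show", "provoked": "provoke", "led": "lead"}
lemma = lambda w: toy_lemmas.get(w, w)
lexical = lexical_diversity(["this", "provoked", "changes"], ["this", "led", "changes"],
                            [(0, 0), (1, 1), (2, 2)], lemma)
inflection_only = lexical_diversity(["as", "show", "above"], ["as", "shown", "above"],
                                    [(0, 0), (1, 1), (2, 2)], lemma)
```

In the toy example, replacing provoked with led counts as a lexical change, whereas the show/shown fix shares a lemma and contributes nothing, which matches the intent that purely grammatical edits score low.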
We also introduce a syntactical diversity metric, defined as the normalized difference between the word alignments of the original and improved sentence. The idea behind this metric is to capture reordering of words and phrases in a sentence, as non-conventional order can decrease readability (Wallwork, 2016).
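The source does not give a formula for the syntactical diversity metric, so the following is only one possible reading: count how many pairs of alignment links cross (that is, two aligned words appear in opposite order in the original and improved sentence) and normalize by the total number of link pairs. The function name and this crossing-based definition are assumptions, chosen because a pure insertion or deletion then scores zero while reordering scores above zero.

```python
from itertools import combinations
from typing import List, Tuple


def syntactic_diversity(alignment: List[Tuple[int, int]]) -> float:
    """Fraction of alignment-link pairs whose relative order is inverted."""
    links = sorted(set(alignment))
    pairs = list(combinations(links, 2))
    if not pairs:
        return 0.0
    # Two links cross when source order and target order disagree.
    crossings = sum(1 for (i1, j1), (i2, j2) in pairs if (i1 - i2) * (j1 - j2) < 0)
    return crossings / len(pairs)


deletion_only = syntactic_diversity([(0, 0), (1, 1), (3, 2)])  # one source word dropped
full_swap = syntactic_diversity([(0, 1), (1, 0)])              # two words swapped
partial = syntactic_diversity([(0, 2), (1, 0), (2, 1)])        # first word moved to the end
```

Other normalizations (for example, dividing by sentence length) would fit the textual description equally well; the point of the sketch is only that the measure responds to reordering and not to local substitutions.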

Model | Sentence | SxD | LxD
Source | A company that has a good planning, with well-defined action plans, indicators and responsible persons, will have a favorable condition. | |
GECToR | A company that has good planning, with well-defined action plans, indicators and responsible persons, will have a favorable condition. | 0.000 | 0.000
Our model | A well-planned company with well-defined action plans, indicators and responsible persons will have a favorable condition. | |

Experiments
In this section, we describe the two corpora compiled for evaluating our model and the experimental setup used in the evaluation.

Evaluation corpora
The first corpus is a larger version of the 1-million-word Brazilian Academic Corpus of English (BrACE) (Pinto et al., 2021), consisting of 14 million words of journal articles published in English in Brazilian journals. The second corpus is the Latin-American Academic Corpus of English (LACE), containing 13 million words of research articles published in English in journals from Spanish-speaking Latin America. Both corpora were compiled using a balanced sample of seven broad subject areas downloaded from the Scientific Electronic Library Online (SciELO). We cleaned both corpora by removing headers, references, footnotes, figures and tables. We also replaced citations with (CIT).

Experimental setup
We experimented with a small T5 model version with 60M trainable parameters and a larger version with 220M parameters 6. We fine-tuned each model for 3 epochs using the Adam optimizer with a learning rate of 5e-5 and a batch size of 16. We compared our results with GECToR (Omelianchuk et al., 2020), a current state-of-the-art GEC model. During inference, we performed beam search with a beam size of five. We also added a threshold τ above which we accept the candidate sentence, comparing the probabilities of the original and candidate sentences as evaluated by the GPT-2 language model fine-tuned with ExpACE (see Section 4.2), similar to the approach used in Krishna et al. (2020). As this step is applied post-inference, we also applied τ filtering to GECToR so that the results are comparable. In our experiments, we empirically found that a τ of 0.05 offers a good balance between precision and recall.
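The τ filter can be pictured with the sketch below. The source does not state exactly how the two probabilities are compared, so this assumes the candidate is accepted when its score exceeds the original's score by more than τ, with `score` being a hypothetical callback (for instance, a length-normalized log-probability or SLOR from the fine-tuned language model).

```python
from typing import Callable


def accept_candidate(
    original: str,
    candidate: str,
    score: Callable[[str], float],
    tau: float = 0.05,
) -> str:
    """Return the candidate only if it improves on the original by more than tau."""
    improvement = score(candidate) - score(original)  # assumed comparison: a difference
    return candidate if improvement > tau else original


# Toy scores standing in for language-model scores.
toy_scores = {"orig": 1.00, "slightly better": 1.03, "clearly better": 1.20}
kept_original = accept_candidate("orig", "slightly better", toy_scores.__getitem__)
kept_candidate = accept_candidate("orig", "clearly better", toy_scores.__getitem__)
```

Raising τ makes the filter more conservative: fewer rewrites pass, which reduces unwanted changes but also discards some useful suggestions.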

Results and Discussion
The results for BrACE are presented in Table 2 and for LACE in Table 3, where we present the average change in grammaticality (GRM), average change in acceptability (ACP), lexical diversity (LxD), and syntactical diversity (SxD) on a sample of 20k sentences from each corpus.
Our results show that both the small and the large model generate sentences that have a greater degree of acceptability and are more lexically and syntactically diverse by a large margin in comparison with the GECToR baseline. On the other hand, GECToR performed better on grammaticality. This was not unexpected, as our model was not optimized for grammaticality and was trained with fewer sentences.
6 Hugging Face t5-small and t5-base
Table 1 exemplifies the differences between the minimal edits from GECToR and the more substantial changes from our model. As shown in the first sentence, our model is able to improve that has a good planning to well-planned, whereas GECToR only deleted an article.
In addition, we analysed the ability of the models to capture L1 influence. We extracted the 1000 most frequent 2-, 3- and 4-grams of the CoPEP corpus of journal articles written in Portuguese (Kuhn, 2017), translated them into English and compared their frequency in BrACE and ExpACE, to test whether the CoPEP n-grams were overused in BrACE. We found that 74% of the n-grams were indeed overused in BrACE when compared with ExpACE. Our model revised 23% of these, while GECToR only tackled 14%. In doing so, overall n-gram overuse was reduced by 34% with our model against 19% with GECToR.
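The overuse comparison can be sketched as follows, assuming that an n-gram counts as overused when its relative frequency in the learner corpus exceeds its relative frequency in the reference corpus. Whitespace tokenization, lowercasing, and the function names are illustrative assumptions, not details from the source.

```python
from collections import Counter
from typing import Iterable, List, Tuple

NGram = Tuple[str, ...]


def ngram_counts(sentences: Iterable[str], n: int) -> Counter:
    """Count n-grams over lowercased, whitespace-tokenized sentences."""
    counts: Counter = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts


def overused_ngrams(targets: List[NGram], learner: List[str], reference: List[str], n: int) -> List[NGram]:
    """Return the target n-grams that are relatively more frequent in the learner corpus."""
    learner_counts, reference_counts = ngram_counts(learner, n), ngram_counts(reference, n)
    learner_total = sum(learner_counts.values()) or 1
    reference_total = sum(reference_counts.values()) or 1
    return [
        g for g in targets
        if learner_counts[g] / learner_total > reference_counts[g] / reference_total
    ]


learner = ["for the realization of the project", "the realization of the works"]
reference = ["the project was started", "the realization of goals"]
found = overused_ngrams([("the", "realization"), ("the", "project")], learner, reference, 2)
```

Running the same comparison on the system output instead of the original learner sentences would show how much of the overuse a model removes.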
A preliminary error analysis of the output of our model was conducted using a sample of 100 sentences from BrACE that were submitted to our system. We annotated each sentence pair with ERRANT (Bryant et al., 2017) and manually classified 284 changes into one or more categories of errors. Table 4 shows that the most common type of issue is overcorrection, especially among cohesive devices (for example, changing furthermore to moreover). Removing symbols or adding unnecessary ones was the second most common type of problem (e.g., removing punctuation). Although changes in terminology were uncommon, further research should explore approaches for keeping terms unchanged.
Although these issues could be mitigated by increasing the τ threshold, doing so would also filter out relevant suggestions. Instead, we suggest further research on incorporating local biases on specific tokens, which would either increase or reduce the likelihood of changing those tokens. The GECToR approach of tagging sentences instead of rewriting them allows for the use of bias to keep tokens unchanged. However, this restricts the scope for more complex and linguistically diverse changes, as seen in our results. We argue that future research should incorporate the benefits of both approaches in a single method, achieving high flexibility with controlled results.

Final Remarks
We introduced a new method for improving L1-influenced academic English by using a pre-trained encoder-decoder transformer. We showed that our model generates sentences with higher acceptability and greater linguistic diversity than a state-of-the-art GEC model. However, further research is needed to assess the extent to which our model over-corrects or introduces errors.
The approach taken in this study can be extended to other L1s by modifying existing parallel corpora with machine translation. It can also be extended to other domains, beyond academic, as our method does not rely on annotated corpora. It is evident that more research is needed beyond grammatical correction, including the development of corpora and models addressing more linguistically diverse issues.