CoEdIT: Text Editing by Task-Specific Instruction Tuning

We introduce CoEdIT, a state-of-the-art text editing system for writing assistance. CoEdIT takes instructions from the user specifying the attributes of the desired text, such as "Make the sentence simpler" or "Write it in a more neutral style," and outputs the edited text. We present a large language model fine-tuned on a diverse collection of task-specific instructions for text editing (a total of 82K instructions). Our model (1) achieves state-of-the-art performance on various text editing benchmarks, (2) is competitive with publicly available largest-sized LLMs trained on instructions while being nearly 60x smaller, (3) is capable of generalizing to unseen edit instructions, and (4) exhibits abilities to generalize to composite instructions containing different combinations of edit actions. Through extensive qualitative and quantitative analysis, we show that writers prefer the edits suggested by CoEdIT relative to other state-of-the-art text editing models. Our code, data, and models are publicly available at https://github.com/vipulraheja/coedit.


Introduction
Large language models (LLMs) have made remarkable progress toward generating coherent text in a wide variety of tasks and domains to support writing assistance (Du et al., 2022a; Mallinson et al., 2022; Schick et al., 2023), such as grammatical error correction (Wu et al., 2023), text simplification (Štajner et al., 2022), paraphrasing (Chowdhury et al., 2022), and style transfer (Reif et al., 2022). One of the emergent abilities of LLMs is the capability to generalize to unseen tasks by following new or composed instructions. Instruction-tuning, where LLMs are fine-tuned on a collection of tasks phrased as instructions, makes the models more adept at interpreting and following instructions, reducing the need for few-shot exemplars (Sanh et al., 2022; Ouyang et al., 2022b; Wei et al., 2022; Chung et al., 2022b).
Text editing is a complex task because human writers cannot simultaneously grasp multiple demands and constraints of the task and tend to iterate and revise their work multiple times (Flower, 1980; Collins and Gentner, 1980; Vaughan and McDonald, 1986). This poses a significant challenge for intelligent writing assistants.
In this work, we aim to improve the capabilities of instruction-tuned models for text editing by leveraging instruction-tuning from diverse tasks of text editing benchmarks. While multiple previous works have attempted to develop general-purpose text editing models using LLMs, they are either not trained with instruction-tuning (Du et al., 2022c; Kim et al., 2022), trained on much smaller models or not trained on task-specific datasets (Mallinson et al., 2022; Schick et al., 2023), or are not publicly available (Schick et al., 2023), which limits their effectiveness, performance, or usability.
We introduce COEDIT, a text editing system designed to provide writing assistance with a natural language interface. A user can employ COEDIT by providing natural language instructions such as "Paraphrase the sentence" or "Fix the grammar". Our experiments demonstrate that fine-tuning on task-specific instructions is more effective than multi-task learning and general-purpose instruction tuning. We conjecture that task-specific instructions increase the density of the instruction space, reinforcing the complementary effects of multiple tasks and facilitating their generalization to composite and new text editing tasks, as shown in Fig. 2.
To build COEDIT, we fine-tune a pre-trained sequence-to-sequence model on a parallel corpus of 82K instruction-based input-output pairs. The inputs and outputs are sourced from publicly available corpora for different text editing tasks, and the instructions are constructed based on rules that introduce lexical and semantic variations.
Our main contributions are as follows:
• We achieve state-of-the-art performance on multiple text editing tasks: grammatical error correction, text simplification, sentence fusion, iterative text editing, and three stylistic editing tasks (formality style transfer, neutralization, and paraphrasing).
• We find that even our smallest instruction-tuned model outperforms other supervised text editing models, instruction-tuned models, and general-purpose LLMs with nearly 60x more parameters, on both manual and automatic evaluations.
• COEDIT generalizes well to new, adjacent tasks not seen during fine-tuning, as well as to composite instructions with multiple task specifications.
• Our data and models will be publicly available.

Related Work
Large Language Models for Text Editing In general, our work is related to many prior works that leverage LLMs; for instance, fine-tuning T5 (Raffel et al., 2020a) on pairs of original and edited text (Faltings et al., 2021; Reid and Neubig, 2022; Mallinson et al., 2022; Du et al., 2022a,b; Kim et al., 2022). However, these aforementioned works are either not based on instruction tuning, use different modeling techniques such as tag-based sequence labeling, or are not general enough to work on multiple text editing tasks. Moreover, several LLMs are trained to solve specific tasks only, such as grammatical error correction (Mallinson et al., 2022; Fang et al., 2023), text simplification (Štajner et al., 2022), paraphrase generation (Chowdhury et al., 2022), or style transfer (Reif et al., 2022), which limits their generalizability.
Instruction Tuning for Writing Assistance Explicitly teaching models how to follow natural language instructions is closely related to recent work on fine-tuning models using large datasets of human-written instructions (Wei et al., 2022; Mishra et al., 2022; Sanh et al., 2022; Ouyang et al., 2022a; Wang et al., 2022; Iyer et al., 2022; Bach et al., 2022; Longpre et al., 2023). Recently, advanced data augmentation and instruction tuning, starting with the Flan models (Chung et al., 2022b), have shown that strong results stem from both a larger and a more diverse set of tasks. Enriching task diversity and balancing task sources (Sanh et al., 2022) have also been shown to be critical to performance, suggesting that instruction-tuned models offer a more computationally efficient starting checkpoint for downstream applications, corroborating Liu et al. (2022) and Aribandi et al. (2022).
On instruction tuning for writing assistance, our work is closely related to PEER (Schick et al., 2023), which fine-tuned T5-based LLMs to follow user-provided text-editing plans and perform the said edits. There are a few significant differences in our approach compared to PEER. While PEER attempts to either create or leverage a user-provided plan, realize the edits conditioned on the plan, and try to explain the plan, we focus only on the plan and edit parts of the pipeline. Even when it comes to handling editing plans in the form of natural language instructions, our work focuses on edits that do not add new information. Therefore, we compare our models only against PEER-Edit models.
Finally, no prior works, to the best of our knowledge, have investigated the ability of instruction-tuned LLMs for text editing to generalize to composite instructions.

Training Dataset
Our dataset creation is based on the ITERATER+ dataset proposed by Kim et al. (2022), who combined datasets from various text editing tasks (see Table 1). Their work, in turn, is based on Du et al. (2022). Our work focuses on non-MEANING-CHANGED edits. We consider those edits to be ones that do not add new information or perform fact updates. Since the STYLE edits are quite subjective in nature, we allow for the possibility of meaning change so as to fulfill the needs of making stylistic edits, but we constrain the editing tasks to ensure the edited texts are semantically similar to the sources, without adding new information or fact updates. With this in mind, we expand the STYLE edit intention category from ITERATER+ to include three new sub-intentions: Paraphrasing, Formality Style Transfer (or Formalization), and Neutralization.
The aforementioned ITERATER dataset taxonomy lends itself conveniently to being articulated as natural language instructions and allows us to naturally formulate them into instructional prompts (see Table 1). We rewrite each edit intention as a set of natural language instruction prompts to create the COEDIT dataset. To allow models to adapt to linguistic variations of the instructions, we also include paraphrases of the instruction templates, e.g., instead of "Write" we also use "Generate" or "Rewrite," or instead of "Paraphrase the text" we use "Rewrite the text with different wording," and so on. For each task, we develop a variety of such diverse instructional prompts and randomly sample an instruction from the aforementioned group of task-specific instruction candidates to be prepended to the source in order to form an <instruction: source, target> data pair. We provide the full list of our instructional prompts in §C. In total, our training dataset consists of around 82K <instruction: source, target> pairs. We keep the original train-validation-test splits consistent with the original datasets but diversify the train and validation splits with the paraphrasing augmentations. The details of datasets and instructions used to train our models are described in §A.
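For concreteness, a minimal sketch of this construction in Python (the verbalizer lists and field names below are illustrative placeholders rather than the exact prompts released with CoEdIT; the full lists are in §C):

```python
import random

# Illustrative task-specific verbalizers; the full lists are given in Appendix C.
VERBALIZERS = {
    "gec": ["Fix the grammar", "Remove all grammatical errors from this text"],
    "simplification": ["Make the sentence simpler", "Simplify this sentence"],
    "paraphrasing": ["Paraphrase the text", "Rewrite the text with different wording"],
}

def make_instance(task, source, target):
    """Prepend a randomly sampled task-specific instruction to the source text,
    turning a <source, target> pair into an <instruction: source, target> pair."""
    instruction = random.choice(VERBALIZERS[task])
    return {"input": f"{instruction}: {source}", "output": target}

example = make_instance("gec", "She go to school every day.",
                        "She goes to school every day.")
```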

Experimental Setup
We conduct experiments to determine if a standard instruction-tuned language model fine-tuned using task-specific data can improve text editing performance, and if it can further generalize into a general-purpose text editing model capable of following human-written instructions and handling a wider array of editing tasks, such as unseen and composite instructions. Specifically, we aim to answer the following research questions:
• RQ1: Can COEDIT follow text editing instructions and perform high-quality edits across a wide variety of tasks?
• RQ2: Is COEDIT generalizable to perform high-quality edits for new types of text editing instructions?
• RQ3: Does COEDIT make the writing process more efficient and effective for human writers?
We answer these questions via quantitative analyses of model outputs (Section 5) and via qualitative analyses and human evaluations of model outputs (Section 6). Further, we investigate RQ2 along two dimensions: (1) generalization to composite instructions containing combinations of multiple different kinds of edits and (2) out-of-domain generalization to instructions with new task requirements on previously unseen data.

Models
No-Edits Baseline We first evaluate a no-edits baseline, where the output is simply a copy of the source input without the instruction.This strategy performs reasonably well on tasks where the target output largely overlaps with the input (e.g., GEC).

Supervised Text Editing Models
We also evaluate existing LLMs for text editing that are not fine-tuned with instruction-specific data. Specifically, to understand the effect of task-specific fine-tuning, we evaluate against T5 (Raffel et al., 2020b) models as primary alternatives to our FLAN-T5 models.
We also compare our models against ITERATER (Du et al., 2022b) and DELITERATER (Kim et al., 2022), which have shown strong performance on a variety of text editing tasks.

Instruction-tuned LLMs
A major group of our comparisons is against instruction-tuned LLMs:
• Our main comparison is against PEER (Schick et al., 2023), which is primarily based on the LM Adapted variant of T5. As the focus of our work is on improving revision quality (Section 2), we compare against PEER-EDIT (both 3B and 11B versions).
• T0, T0++ (Sanh et al., 2022) and Tk-Instruct (Wang et al., 2022), which are all initialized from the LM Adapted variant of T5, and fine-tuned using the PromptSource (Bach et al., 2022) and Super-NaturalInstructions (Wang et al., 2022) datasets, respectively.
• Alpaca (Taori et al., 2023) is an instruction-tuned version of the LLaMA-7B model (Touvron et al., 2023) trained on 52K instruction-following demonstrations generated by GPT3.
• We also compare against InstructGPT (Ouyang et al., 2022a), a variant of GPT3 fine-tuned via reinforcement learning on a large dataset of instructions and human-written outputs.

• GPT3.5 (henceforth referred to as ChatGPT) is an improved version of InstructGPT optimized for chat. We utilize OpenAI's API for all inference tasks.
• GPT3 also offers a text editing API (which we refer to as GPT3-Edit), which is usable for editing tasks rather than completion, making it directly comparable to the tasks we train COEDIT on.
Large-Pretrained Decoder-only Models
We compare against LLMs with no instruction tuning in two settings, zero-shot and few-shot (details in Section 5.1):
• The 175B GPT3 (Brown et al., 2020) model, which is not instruction-tuned, demonstrates strong general-purpose text revision capabilities.
• LLaMA (Touvron et al., 2023) is Meta AI's general-purpose language model trained only on publicly available data. We utilize the 7B model due to computing constraints.
Outputs of all models were generated using greedy decoding unless specified otherwise.

Test Datasets
To assess the editing capabilities of COEDIT, we perform evaluations on standard test sets sourced from a variety of text editing task benchmarks, most notably EDITEVAL (Dwivedi-Yu et al., 2022). Owing to the overlap of our work with PEER, we keep our evaluation datasets and evaluation metrics as close to theirs as possible for consistency: we used JFLEG (Napoles et al., 2017) for grammatical error correction, TurkCorpus (Xu et al., 2016) and ASSET (Alva-Manchego et al., 2020) for text simplification, the Coherence split of ITERATER (Du et al., 2022b) and the DISCOFUSE dataset (Geva et al., 2019) for coherence, and ITERATER (Du et al., 2022b) for iterative text revision. For Style-related edits, we used GYAFC (Rao and Tetreault, 2018) for formality style, WNC (Pryzant et al., 2020) for neutralization, and MRPC (Dolan and Brockett, 2005), STS (Cer et al., 2017), and QQP for paraphrasing. Detailed descriptions of each dataset and its evaluation metrics are in §B.
Quantitative Results

Text Editing Performance
Table 2 helps us answer RQ1 by comparing the performance of COEDIT to other models across various text editing tasks. We first present results from the more well-known evaluation sets here and present additional results (i.e., sub-tasks and additional datasets) in Table 11.
We segregate the models into seven groups. The first group (a) consists of the copy baseline and a T5-LARGE baseline fine-tuned with prefix-tuning (each data point is prefixed with task-specific tags rather than instructions), while the second group (b) consists of T5-based models instruction-tuned on non-text-editing tasks. We find that COEDIT substantially outperforms these models across all tasks.
The next two groups (c, d) show different LLMs varying from 7B to 176B parameters in size, evaluated in a zero-shot setting. Those in group (c) are decoder-only models, while those in group (d) are instruction-tuned. We find that COEDIT outperforms all LLMs comparable to its model size (e.g., Alpaca and LLaMA) across all tasks, as well as on most tasks compared to models several times larger, such as ChatGPT and InstructGPT. This indicates that current general-purpose and instruction-tuned models are underfitted, and it is beneficial to densify the task/instruction space rather than to scale model size.
Although models such as Alpaca and T5-based models (Tk-Instruct, T0, T0++) have previously shown strong capabilities for zero-shot tasks, they show weaker performance compared to COEDIT. We also see that the decoder-only models (e.g., GPT3 and LLaMA) often repeat the input for more complex tasks, such as ones under the Style intent group. This can be attributed to difficulty understanding the prompted task, resulting in the models either repeating the input sentence or generating a continuation unrelated to the task.
Next, in the fifth group (e), we evaluate the LLMs under a few-shot setting. As mentioned in Section 4.1, we conduct these experiments in a 4-shot evaluation setting, where example inputs were constructed by randomly sampling four inputs for each task from the COEDIT dataset such that all examples chosen would fit in the input window for all models, as in Brown et al. (2020). The input sentence and its corresponding revised reference were prepended to the instructional prompt. We conduct few-shot evaluations for decoder-only LLMs (GPT3) and three instruction-tuned LLMs (InstructGPT, ChatGPT, and Alpaca).
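A rough sketch of how such a 4-shot prompt can be assembled (the demonstration separator and exact formatting are assumptions; only the overall recipe mirrors the setup described above):

```python
import random

def build_few_shot_prompt(task_pool, test_instruction, test_source, k=4):
    """Prepend k sampled (instructional input, reference output) demonstrations
    from the CoEdIT data to the instructional prompt for the test input."""
    demos = random.sample(task_pool, k)  # task_pool: list of {"input": ..., "output": ...}
    blocks = [f"{d['input']}\n{d['output']}" for d in demos]
    blocks.append(f"{test_instruction}: {test_source}")
    return "\n\n".join(blocks)
```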
We observe that giving specific examples improves performance in all models for all tasks except MRPC for GPT3. This may be because GPT3 still exhibits some similar behavior in repeating its generations continuously, resulting in a low BLEU score as well as low semantic similarity. We don't present any experiments for GPT3-Edit under the few-shot setting, as scores tended to stay the same across all tasks, implying that GPT3-Edit may not have strong in-context learning capabilities. Overall, we find that even our smallest 770M parameter model is competitive against LLMs evaluated in a few-shot setting on most tasks.
In the final group (f), we compare our models against task-specific text editing models such as ITERATER, DELITERATER, and PEER. ITERATER and DELITERATER perform comparatively worse than the scores reported in the original papers as we present different and more difficult inputs, only prepending instructions to the inputs, while ITERATER and DELITERATER were trained with task-specific tags. Furthermore, they were trained using BART and Pegasus, respectively, both of which have a summarization pre-training objective, and were not trained to follow instructions. On average, COEDIT beats PEER across all reported evaluations except the ITERATER benchmark. This can primarily be attributed to the difference in task-specific fine-tuning, since PEER uses Wikipedia as the source of instructional edit data.

Ablation Studies
Table 3 shows the performance of various baselines, which we discuss in detail in this section.
Instruction Tuning. To understand the effectiveness of instruction-tuning, we fine-tune the 3B pa- [...] between the two for all datasets and model sizes, thus confirming prior findings.
Quality of Instructions. While we developed a limited set of task-specific instructional prompts, there has been widespread work on the prompt sensitivity of LLMs, especially with growing model capacity (Lu et al., 2022). To assess the robustness of COEDIT models to instructional prompts, we train another baseline COEDIT-XL model with randomized task-specific instructions (henceforth referred to as COEDIT-XL-R). Specifically, the entire training dataset was randomized, where an instruction from one task was replaced randomly by an instruction from another task. Table 3(c) shows the results for this experiment. We observe that while COEDIT-XL-R achieves scores that are higher than the non-task-specific tuned FLAN-T5-XL (especially on edit-based metrics such as SARI), it falls significantly behind COEDIT-XL on those metrics, as well as on style accuracy metrics such as formality transfer accuracy and paraphrasing semantic similarity. This indicates that while the instructional structure of the inputs and task-specific training teach the model how to make edits (which drives up the SARI scores), the accuracy of those edits suffers because the model is trained with the wrong instructions most of the time. Overall, the improvements highlight the positive impact of task-specific training, and the gaps in performance highlight the negative impact of the lack of proper instruction tuning.
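As a minimal sketch of this randomization (assuming each training instance carries a task label and an instruction that is prepended to the source with a colon, which is an illustrative format rather than the exact one used):

```python
import random
from collections import defaultdict

def randomize_instructions(dataset):
    """Replace each instance's task-specific instruction with one sampled from a
    different task, keeping the source and target fixed (the COEDIT-XL-R setup)."""
    by_task = defaultdict(list)
    for ex in dataset:  # ex: {"task": ..., "instruction": ..., "source": ..., "target": ...}
        by_task[ex["task"]].append(ex["instruction"])
    randomized = []
    for ex in dataset:
        other_tasks = [t for t in by_task if t != ex["task"]]
        wrong = random.choice(by_task[random.choice(other_tasks)])
        randomized.append({**ex, "input": f"{wrong}: {ex['source']}"})
    return randomized
```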

Qualitative Results
We now address RQ2 and RQ3 (Section 4). We show that COEDIT generalizes to adjacent tasks not seen during fine-tuning and to composite instructions containing a combination of tasks. Further, our human evaluation studies show that expert human evaluators find the text generated by COEDIT to be of higher quality than that of a much larger instruction-tuned LLM.

Text Editing Quality
Since text editing is often subjective, and automatic metrics are not always accurate in measuring whether an instruction is satisfied, we conduct human evaluations of our model outputs by linguistic experts on 50 test inputs to ensure they meet the instructional constraints. Given the automatic evaluation results in Section 5, we compare our 3B-parameter COEDIT-XL model against GPT3-EDIT, the largest comparable (175B-parameter) instruction-tuned LLM for text editing. Specifically, we conducted a pairwise comparison: each annotator was shown an instructional input and outputs from both models (they were not aware which output was generated by which model). They were then asked to evaluate the fluency, accuracy, and meaning preservation of the edited texts and choose the higher-quality output ("neither" and "tie" are also valid options). We collect three annotations for each question and use the majority vote as the final judgment. Table 4 shows the results of the evaluation. The annotators prefer our COEDIT model for 64% of the inputs, whereas GPT3-EDIT's output is preferred for 10% of the inputs. In 4% of cases, both models produce equally good outputs, whereas, for 22% of the inputs, both models generate unacceptable outputs. Table 12 provides a side-by-side comparison of the outputs generated by the two models.
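The aggregation of the three annotations per item amounts to a simple majority vote; a minimal sketch (the label names and tie handling below are illustrative):

```python
from collections import Counter

def majority_vote(annotations):
    """Collapse the three pairwise judgments for one item into a final label;
    with three annotators a strict majority exists unless all three differ."""
    label, count = Counter(annotations).most_common(1)[0]
    return label if count >= 2 else "no-majority"

print(majority_vote(["coedit", "coedit", "tie"]))  # -> coedit
```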

Generalizability to Adjacent Tasks
We analyze the generalization capabilities of our models by evaluating them on a few related tasks that do not exist in the fine-tuning data. Specifically, we chose two standard NLP tasks: sentence compression (SC) (Filippova and Altun, 2013) and politeness transfer (PT) (Madaan and Yang, 2021). It is noteworthy that while our models were not fine-tuned on these exact tasks, we chose them so that the models could still comprehend them based on other tasks they were fine-tuned on. We define them as adjacent tasks, which still exist within the scope of existing tasks but have not been seen during fine-tuning (blue lines in Fig. 2). Similar to the previous experiment, in addition to GPT3-EDIT, we compare COEDIT-XL against the similarly sized prefix-tuned (T5-XL) model and the non-task-specific trained FLAN-T5-XL model (the same models as the ones used in Table 3 (a) and (b)). For evaluation, we curated a set of new instructional prompts geared towards both new tasks (details in Appendix C). We evaluated the models on the respective test datasets from Filippova and Altun (2013) and Madaan and Yang (2021).
Table 5 shows the results of COEDIT-XL against various models on the sentence compression and politeness transfer tasks. For SC, we report the SARI metric for rewrite quality and compression ratio (CR) for task-specific quality. For PT, we report Self-BLEU (Zhu et al., 2018) for rewrite quality and Transfer Accuracy (TA) for task-specific quality. We observe that COEDIT consistently outperforms other models on both tasks, which indicates its generalization abilities on these new and unseen adjacent tasks. It is noteworthy that GPT3-EDIT performs quite well out of the box on PT, but not so much on the SC task.

Generalizability to Composite Instructions
Finally, we also explore the capability of our model to understand composite natural language instructions. Composite instructions are made up of a combination of tasks. For example, for the composite instruction "Make the text simpler, paraphrase it, and make it formal", the model needs to simultaneously perform simplification, paraphrasing, and formalization of the input sentence.
Since there is no publicly available dataset for composite instructions, we create the COEDIT-COMPOSITE dataset by expanding the COEDIT dataset to a total of 90K pairs. In addition to the single-task instructions, we use seven new combinations of instructions as part of our training set, with each composite instruction having either two or three tasks. Specifically, these are GEC-Paraphrasing, GEC-Simplification, GEC-Paraphrasing-Simplification, Formality-Paraphrasing, Formality-Simplification, Formality-Paraphrasing-Simplification, and Paraphrasing-Simplification (more details in §A). We then fine-tune the FLAN-T5-XL model on COEDIT-COMPOSITE (referred to as COEDIT-XL-C). The training details are summarized in §D.
We evaluate COEDIT-XL-C on both single and composite instructions. For the single instructions, we use the same evaluation setup as in Table 2 and find that the overall performance of COEDIT-XL-C is on par with that of COEDIT-XL (Table 6). This shows that training the model additionally on composite prompts has no negative impact on single-task performance.
For composite instructions, we conduct human evaluations since there is no standard test dataset available. We use three new task combinations in addition to the seven seen during training to evaluate the model's generalizability. These are Coherence-Paraphrase, Coherence-Simplify, and Coherence-Simplify-Paraphrase. Specifically, we conduct two sets of pairwise annotations (similar setup as the one in Section 6.1) comparing COEDIT-XL-C with GPT3-EDIT and COEDIT-XL (shown in Table 7) on 30 composite instructions. For a fair comparison against COEDIT-XL, we prepare a chaining pipeline, sketched below, by decomposing composite instructions into a sequence of multiple single instructions and executing them one by one. In 38% of cases, experts show a preference for COEDIT-XL-C, compared to 34% for GPT3-EDIT. In 3% of cases, both models are preferred equally, whereas, for 25% of the cases, neither output is preferred.
The experts prefer COEDIT-XL-C for 34% of the cases versus 21% for the chaining baseline. Both outputs are preferred equally in 14% of cases, whereas, for 31% of the cases, both models generate unacceptable predictions. Table 13 provides a side-by-side comparison of outputs generated by these models.
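A minimal sketch of the chaining baseline described above, assuming composite instructions can be split on commas and conjunctions and that `edit(instruction, text)` stands in for a single COEDIT-XL inference call (both are illustrative assumptions):

```python
import re

def chain_edits(composite_instruction, text, edit):
    """Decompose a composite instruction into single instructions and apply them
    one by one, feeding each intermediate output into the next edit step."""
    steps = [s.strip() for s in
             re.split(r",\s*(?:and\s+)?|\s+and\s+", composite_instruction) if s.strip()]
    for step in steps:
        text = edit(step, text)  # edit() stands in for one COEDIT-XL inference call
    return text

# e.g. chain_edits("Make the text simpler, paraphrase it, and make it formal", sentence, edit_fn)
```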

Conclusions
We present COEDIT, an open-sourced dataset and set of instruction-tuned large language models that can act as a writing assistant by following natural language instructions to perform various textual edits by removing, updating, or adding words, phrases, and sentences. COEDIT achieves state-of-the-art performance on multiple text editing benchmarks, spanning syntactic, semantic, and stylistic edit requirements. Through extensive experiments, we have shown that COEDIT is capable of further generalizing to unseen, adjacent, and composite instructions to perform edits along multiple dimensions in a single turn. In our human evaluations, we observe that COEDIT can assist writers with various aspects of the text revision process at scale by following natural language instructions.

Limitations
Although COEDIT achieves state-of-the-art performance on multiple text editing benchmarks, we acknowledge some limitations to our approach and evaluation methods. Our task-specific fine-tuning (like most other works) mainly focuses on sentence-level editing tasks, and its effectiveness on much longer sequences of text, which are more appropriate to real-world editing settings, remains to be seen. Additionally, our system mainly focuses on non-meaning-changing text edits, which could potentially limit the utility of our model in real-world scenarios where fact-based editing or corrections are needed. Another limitation of our work involves prompt sensitivity. While we construct our inputs by randomly choosing from a pool of verbalizers for every task, we acknowledge that different prompts may induce better or worse edits, and as we evaluate each input with a random verbalizer, a fully controlled comparison across all available prompts and all models is not performed. Furthermore, the prompting format was kept uniform across all evaluated models, whereas some models may perform better with a different prompting format. We plan to address this in future work. Finally, computing resource requirements could pose some difficulty in replicating the results (which we try to address by sharing our models publicly).

Ethics Statement
Since our work mainly focuses on non-meaning-changing text edits, we are able to avoid many issues involving generating harmful text. Although there is still a possibility of small meaning changes for stylistic tasks, we try to reduce the chance of hallucinations by constraining the generation strictly to edit tasks, in order to reduce the chance of adding any new information or perpetuating biases.
A Training Dataset Description

• Neutralization: We use WNC (Pryzant et al., 2020), a dataset from the Subjective Bias Neutralization task, where the objective is to remove or mitigate biased words to make sentences more neutral;
• Paraphrasing: For paraphrase generation, we used the PARABANKV2 corpus (Hu et al., 2019), since it is a large-scale corpus that contains multiple diverse sentential paraphrases.
Once the raw datasets were collected, we randomly sampled them down to the quantities mentioned in Table 1 based on a few heuristics such as old word retention, complexity ratios, dependency tree depth ratio, and character length ratio. The sampled pairs were then modified by prefixing the source texts with task-specific verbalizers (Appendix C) to convert a <source, target> pair into an <instruction: source, target> pair. All our models were then fine-tuned on the verbalized dataset.
Composite instructions: Table 8 shows the composition of the COEDIT-COMPOSITE dataset, in addition to the details about datasets and prompts. We use seven such composite instructions during model training. For the first three composite prompts (GEC-Paraphrasing, GEC-Simplification, GEC-Paraphrasing-Simplification), we use GEC datasets to extract data points that show simplification and paraphrasing edits in addition to GEC. For the next three prompts (Formality-Paraphrasing, Formality-Simplification, Formality-Paraphrasing-Simplification), we use the formality dataset (GYAFC) to extract pairs which exhibit paraphrasing and simplification edits in addition to formality. Lastly, for the last prompt (Paraphrasing-Simplification), we use the ParabankV2 paraphrasing dataset to extract data points which show a simplification of the source text in addition to paraphrasing.
To select the appropriate source-target pairs for a composite instruction, we use similar heuristics as with single-task instructions, i.e., old word retention, complexity ratios, dependency tree depth ratio, and character length ratio. For example, a source-target pair from a GEC dataset can be used for the composite instruction involving GEC, paraphrasing, and simplification if the target and source sentences have a high edit distance and low complexity ratio, character length ratio, and word retention scores. The exact details can be found in the code.
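For illustration, a filter along these lines might look as follows (the specific thresholds and ratio definitions are assumptions; the exact heuristics and values are in the released code):

```python
def keep_for_paraphrase_and_simplify(source, target,
                                     max_word_retention=0.7,
                                     max_length_ratio=0.9):
    """Heuristically keep a <source, target> pair for a composite instruction if the
    target reuses few source words (paraphrasing) and is shorter (simplification)."""
    src_words, tgt_words = set(source.lower().split()), set(target.lower().split())
    word_retention = len(src_words & tgt_words) / max(len(src_words), 1)
    length_ratio = len(target) / max(len(source), 1)
    return word_retention <= max_word_retention and length_ratio <= max_length_ratio
```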
Finally, for building the prompts for the composite instructions, we randomly sample from the task-specific verbalizers and concatenate them.The ordering of the single tasks in a composite instruction is also chosen randomly to ensure better generalization.
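A minimal sketch of this composite prompt construction, reusing single-task verbalizers such as those sketched earlier (the joining template is an assumption):

```python
import random

def build_composite_instruction(tasks, verbalizers):
    """Sample one verbalizer per task, shuffle the task order, and join the parts
    into a single composite instruction."""
    tasks = list(tasks)
    random.shuffle(tasks)  # random ordering of the single tasks
    parts = [random.choice(verbalizers[t]) for t in tasks]
    parts = [p if i == 0 else p[0].lower() + p[1:] for i, p in enumerate(parts)]
    if len(parts) == 1:
        return parts[0]
    return ", ".join(parts[:-1]) + ", and " + parts[-1]

# e.g. build_composite_instruction(["gec", "simplification"], VERBALIZERS)
# might yield "Fix the grammar, and make the sentence simpler"
```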

B Testing Dataset Description
Specifically, we consider the following datasets:

Grammatical Error Correction We use the JFLEG (Napoles et al., 2017) corpus of English sentences that represents a range of language proficiency levels and comprehensive fluency edits. For evaluation, we use the GLEU (Napoles et al., 2015) score as the primary metric and also report results using the SARI (Xu et al., 2016) metric.
Text Simplification We use the TurkCorpus (Xu et al., 2016) and ASSET (Alva-Manchego et al., 2020) datasets, which were both created from WikiLarge data (Zhang and Lapata, 2017), and where each complex sentence is paired with multiple crowdsourced reference simplifications. We report results using the SARI metric.
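For reference, one convenient way to compute SARI is the Hugging Face `evaluate` package; a minimal usage sketch (the example sentences are illustrative):

```python
import evaluate

sari = evaluate.load("sari")
score = sari.compute(
    sources=["About 95 species are currently accepted."],
    predictions=["About 95 species are currently known."],
    references=[["About 95 species are currently agreed upon.",
                 "About 95 species are currently accepted."]],
)
print(score)  # e.g. {'sari': ...}
```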
Coherence We use the Coherence split of ITERATER (Du et al., 2022b) and the DISCOFUSE dataset (Geva et al., 2019), as the task involves linking two given sentences as coherently as possible using edit operations such as inserting discourse connectives. We report results using the SARI metric.
Iterative Text Editing We use ITERATER (Du et al., 2022b), an iterative text revision dataset spanning five edit intentions (Section 3) across three different domains (ArXiv, News, Wikipedia). We evaluate our models using the SARI metric. We report the performance on individual intentions (Fluency, Clarity, and Coherence), and also aggregated scores on the full dataset, which includes Style edits.
The rest of the section describes the evaluation setups for Style-related edits:

Formality Style Transfer We use Grammarly's Yahoo Answers Formality Corpus (GYAFC) (Rao and Tetreault, 2018), a parallel corpus of informal and formal sentence pairs from two different domains. Similar to prior works, we evaluate the quality of rewriting using SARI, and the accuracy of style transfer using a formality classification model.

Neutralization We use WNC (Pryzant et al., 2020), a dataset from the Subjective Bias Neutralization task. Based on prior works, we use Exact Match (EM) for evaluations, which is the percentage of examples for which the edited text exactly matches the reference(s).
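Exact Match itself is straightforward to compute; a minimal sketch (the whitespace normalization is an assumption of this sketch):

```python
def exact_match(predictions, references):
    """Percentage of predictions that exactly match any of their references,
    after simple whitespace normalization."""
    def norm(s):
        return " ".join(s.split())
    hits = sum(any(norm(p) == norm(r) for r in refs)
               for p, refs in zip(predictions, references))
    return 100.0 * hits / max(len(predictions), 1)
```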

C Task Verbalizers
We manually curated a variety of task-specific verbalizers to construct the instructional inputs. Table 9 shows the full list of the verbalizers used for training and evaluations. Table 10 shows the verbalizers used for the experiments conducted in Section 6.2.

D Training Details
We used the Adam optimizer with a learning rate of 1e-4. Each model is trained for 5 epochs.

Table 12: A sample of text revisions generated by GPT3-EDIT and COEDIT-XL (ours).

GPT3-EDIT: Fives is a British racquet sport.
COEDIT-XL: Fives is a British sport. It is thought to have come from the same place as many racquet sports.

Input: Fix coherence in this text: Phantom Budo is an incredibly powerful ability. Spike has yet to fully develop this awesome fighting technique.
GPT3-EDIT: Phantom Budo is an incredibly powerful fighting technique. Spike is yet to fully develop it.
COEDIT-XL: Phantom Budo is an incredibly powerful ability, but Spike has yet to fully develop it.

Input: Use different wording: Judge Gerald W. Heaney, in dissent, said the authorities should have allowed the prisoner to be medicated without the consequence of execution.
GPT3-EDIT: Judge Gerald W. Heaney, who disagreed with the majority decision, said the prisoner should be medicated after important testimony is received.
COEDIT-XL: In his dissent, Gerald W. Heaney said the authorities should have allowed the prisoner to be treated without execution.

Table 13: A sample of text revisions generated by GPT3-EDIT, COEDIT-XL (chaining) and COEDIT-XL-C for composite instructions.

Input: Remove all grammatical errors from this text, and make this text less complex: They follow the way the Sun goes from the east to the west.
GPT3-EDIT: The Sun rises in the east and sets in the west.
COEDIT-XL (chaining): They move with the sun as it goes from east to west.
COEDIT-XL-C: They follow the sun as it goes from east to west.

Input: Make the sentence grammatical, rewrite the sentence with different wording, and make this text less complex: Fives is a British sport believed to derive from the same origins as many racquet sports.
GPT3-EDIT: Fives is a British sport that is believed to derive from the same origins as many racquet sports.
COEDIT-XL (chaining): It is thought to have come from the same sources as other racquet sports.
COEDIT-XL-C: Fives is a British sport that is believed to have originated from racquet sports.

Input: Make the text coherent, and rewrite this text: Nebulous is initially unconcerned when he is unable to finish his cheese sandwich. He soon discovers he is dealing with an infinite sandwich and immeasurable garnish.
GPT3-EDIT: Nebulous is initially unconcerned when he is unable to finish his cheese sandwich. He soon discovers he is dealing with an infinite sandwich and immeasurable garnish. He is then unable to finish his cheese sandwich.
COEDIT-XL (chaining): Nebulous wasn't too worried about the fact that he couldn't finish his cheese sandwich, but soon realized that this was an endless sandwich and an endless amount of condiments.
COEDIT-XL-C: Nebulous is initially unconcerned when he is unable to finish his cheese sandwich, but soon realizes he is dealing with an endless sandwich and an endless supply of condiments.

Input: Make the text more coherent, simplification, and rewrite this sentence: Phantom Budo is an incredibly powerful ability. Spike has yet to fully develop this awesome fighting technique.
GPT3-EDIT: Phantom Budo is an incredibly powerful ability. It is an awesome fighting technique that Spike has yet to fully develop.
COEDIT-XL (chaining): This fighting technique of Spike is still in development.
COEDIT-XL-C: Spike has yet to fully develop his Phantom Budo ability.

Input: Rewrite the sentence to be simpler, and paraphrase this text: The jury asked for transcripts of Quattrone's testimony about his role in the IPO allocation process.
GPT3-EDIT: The jury asked for transcripts of Frank Quattrone's testimony about his role in the initial public offering allocation process.
COEDIT-XL (chaining): The jury requested a copy of Quattrone's testimony.
COEDIT-XL-C: Transcriptions of Quattrone's testimony were requested by the jury.

Input: Rewrite more formally, and make this easier to understand: brian, he asked the coolest questions, they suspended him though, i dont know what's up with that!
GPT3-EDIT: Brian, he asked the coolest questions, they suspended him though, I don't know what's up with that! He was a cool guy, and I liked his questions.
COEDIT-XL (chaining): I do not understand why Brian was suspended. He had some great questions.
COEDIT-XL-C: I do not know why they suspended Brian.

Figure 1: Model comparison according to training parameters vs. average performance across all text editing benchmarks reported in Tables 2 and 11. Publicly available models are denoted with (*).


Table 1: Example data instances in the COEDIT dataset. Instructions in the inputs are italicized.

Table 2: Comparison of COEDIT against various baselines: (a) copy baseline and T5-LARGE baseline with task-specific prefixes (i.e., <gec>, <clarity>, etc.), (b) T5-based models, (c) decoder-only LLMs (zero-shot), (d) instruction-tuned LLMs (zero-shot), (e) few-shot evaluations of pre-trained LLMs, (f) SOTA text editing models, and (g) variants of COEDIT models (our work). The first score for each task (excluding the MRPC style task) is SARI. The second scores for Fluency, GYAFC, and WNC are GLEU, Formality Transfer accuracy (%), and EM, respectively. For MRPC, the first score is Self-BLEU, while the second score is semantic similarity. The best-performing models for each dataset are highlighted in boxes. Results with (*) are ones reported in prior works. (FS) denotes few-shot evaluation. Results on other datasets are in Table 11.

Table 3: Ablation results for COEDIT to evaluate the impact of (a) instruction tuning, (b) task-specific training, and (c) quality of instructions. The scores from left to right follow exactly as in Table 2.

Table 4: Human evaluation results: pair-wise comparison of COEDIT-XL against the best-performing 175B-parameter instruction-tuned LLM for text editing (GPT3-EDIT). Scores indicate the % of test inputs for which the human annotators preferred the said model.

Table 6: Results for composite prompt training on single-task performance. Scores follow exactly as in Table 2.

Table 7: Human evaluation results: pair-wise comparison of COEDIT-XL-C against GPT3-EDIT and the equivalent COEDIT-XL (with chaining pipeline). Scores indicate the % of test inputs for which human annotators preferred the said model.

Table 8: Example data instances with composite instructions in the COEDIT-COMPOSITE dataset (90K <instruction: source, target> pairs). Instructional prompts in the inputs are italicized.