EdiT5: Semi-Autoregressive Text Editing with T5 Warm-Start

We present EdiT5 – a novel semi-autoregressive text-editing model designed to combine the strengths of non-autoregressive text-editing and autoregressive decoding. EdiT5 is faster during inference than conventional sequence-to-sequence (seq2seq) models, while being capable of modeling flexible input-output transformations. This is achieved by decomposing the generation process into three sub-tasks: (1) tagging to decide on the subset of input tokens to be preserved in the output, (2) re-ordering to define their order in the output text, and (3) insertion to infill the missing tokens that are not present in the input. The tagging and re-ordering steps, which are responsible for generating the largest portion of the output, are non-autoregressive, while the insertion step uses an autoregressive decoder. Depending on the task, EdiT5 on average requires significantly fewer autoregressive steps, demonstrating speedups of up to 25x when compared to seq2seq models. Quality-wise, EdiT5 is initialized with a pre-trained T5 checkpoint, yielding comparable performance to T5 in high-resource settings when evaluated on three NLG tasks – Sentence Fusion, Grammatical Error Correction, and Decontextualization – while clearly outperforming T5 in low-resource settings.


Introduction
Pre-trained seq2seq models such as T5 (Raffel et al., 2020), BART (Lewis et al., 2020a), and MASS (Song et al., 2019) have established strong baselines for the majority of text-to-text transduction tasks. A recent trend to massively scale up model sizes, e.g., all the way up to 540B parameters (Chowdhery et al., 2022), as well as the sizes of pre-training corpora, has further pushed the state of the art without signs of reaching a plateau. From a practical point of view, running inference with such models is prohibitively expensive for most applications, which motivates work on finding efficient recipes for model distillation, e.g., (Kim and Rush, 2016), and on choosing a model architecture that can provide a better trade-off between performance on a given task and inference speed. A typical choice is to distill a large language model into a smaller seq2seq model, e.g., a Transformer (Vaswani et al., 2017). In this paper we propose a novel model architecture, EDIT5, which blends ideas from a seq2seq T5 (Raffel et al., 2020) and text-editing to provide faster inference without sacrificing task performance.
Seq2seq-based models output text token-by-token from scratch, allowing them to model any kind of input-output relationship. However, for many real-world tasks this degree of generality is unnecessary, especially for monolingual tasks where the input and output texts have relatively high degrees of overlap. In such cases a natural approach is to cast conditional text generation as a text-editing task, where the model learns to construct target texts by applying a set of edit operations to the inputs (Malmi et al., 2022). Typically the set of edit operations is defined ahead of time (Omelianchuk et al., 2020; Malmi et al., 2019; Awasthi et al., 2019), which on the one hand limits the flexibility of the model to reconstruct arbitrary output texts from the inputs, but on the other leads to latency improvements, as the limited set of allowed operations significantly reduces the output vocabulary of the decoder. In this paper, we propose an approach which is both fast at inference time and flexible, able to model arbitrary rewrites.
Faster inference. A common method for achieving low latency when serving models is to reduce their size, thus reducing their computational cost. Doing so naively, however, often leads to inferior model quality, and much work has gone into finding better methods for model size reduction, such as distillation (Kim and Rush, 2016).
Regardless of model size, one of the major contributors to the total inference time of seq2seq models is the decoder, which generates the output sequence step by step. EDIT5 also relies on an autoregressive decoder, but generates the majority of the output sequence with its tagging and pointing networks, and as such the decoder takes far fewer steps.
Flexible text-editing. Recent text-editing approaches, e.g., (Awasthi et al., 2019; Malmi et al., 2019), are not as powerful as general-purpose seq2seq approaches when it comes to modeling arbitrary input-output text transductions. EDIT5 supports open-vocabulary generation by relying on an autoregressive decoder. In the extreme case, where there is no overlap between the source and the target texts, it reduces to a vanilla seq2seq model generating the entire output from scratch. However, when the input and output overlap, it can benefit from the tagging and pointer networks to reconstruct the bulk of the output text, which is further infilled (refined) by the autoregressive decoder.

Warm start.
Training a high-precision text generation model typically requires large amounts of high-quality supervised data. Self-supervised techniques based on text infilling (Rothe et al., 2020a; Lewis et al., 2020b; Raffel et al., 2020) have been shown to provide a crucial advantage over non-pre-trained models, especially in low-resource settings. Hence, we design EDIT5 to be able to benefit from already existing pre-trained language models (specifically T5), where the final model is directly fine-tuned on the downstream task.
EDIT5 decomposes the generation task into three steps: tagging, pointing and insertion (see Fig. 1). The tagger and pointer networks decide which source tokens to preserve and in which order they should appear in the output, thus allowing for arbitrary word dropping and reordering. The tagger is implemented using a non-autoregressive feedforward network, and pointing is implemented using a novel non-autoregressive pointing mechanism (Vinyals et al., 2015) combined with Sinkhorn layers (Mena et al., 2018). The insertion network inserts/infills words which are present in the target sequence but do not appear in the source sequence. The network is implemented using an autoregressive transformer decoder, which attends to the tagged, reordered source sequence. The decoder predicts both the locations where the token spans should be infilled, as well as the spans themselves.
We evaluate EDIT5 on three distinct text generation tasks: Sentence Fusion, Grammatical Error Correction (GEC), and Decontextualization, comparing to recent text-editing approaches and T5. Each task is unique in the editing operations required and the amount of training data available, which helps to better quantify the value of the modeling decisions we have integrated into EDIT5.
Additionally, we explore the impact of training data size and model size on EDIT5. Finally, we quantify the latency of EDIT5, providing a detailed analysis and comparison to T5.

Model description
The model architecture of EDIT5 resembles a vanilla Transformer (Vaswani et al., 2017) composed of an encoder and a decoder. EDIT5 decomposes the generation of a text y from an input x into three parts: predicting a sequence of edit tags y^t (indicating whether a token from the input should be copied to the output), a permutation of the input tokens π (indicating the order in which copied tokens should appear in the output), and a sequence of tokens y^d (indicating additional tokens that should be in the output, and where in the permuted input they should be inserted). y^t and π are modeled by the encoder, and y^d by the decoder.
There are multiple ways to choose the triple (y^t, π, y^d) for a given (x, y) pair. During dataset creation we choose a single such triple for each training pair (see Section 2.1 for details), in which case the probability of y can be expressed as:

p(y | x) = p(y^d | π, y^t, x) · p(π | y^t, x) · p(y^t | x)

During inference, we first greedily set y^t to maximize the third term, then π to maximize the second term, and finally y^d to maximize the first term. The output text y is realized by applying the tags y^t and the permutation π to the input sequence x and then inserting the tokens y^d.
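The realization step can be sketched as follows. This is a minimal illustration, not the paper's implementation; the 〈pos_i〉 convention (pos_0 inserts at the very start, pos_i after the i-th kept token) is assumed from Figure 1.

```python
def realize(tokens, tags, order, insertions):
    """Apply EdiT5-style edits: KEEP/DELETE tags, a reordering of the
    kept tokens, and span insertions keyed by position in the reordered
    sequence (0 = insert at the start, i = insert after kept token i)."""
    kept = [tokens[i] for i in order if tags[i] == "K"]
    out = list(insertions.get(0, []))
    for i, tok in enumerate(kept, start=1):
        out.append(tok)
        out.extend(insertions.get(i, []))
    return out

# Figure 1 example: "A long user query" -> "The user query is very long"
tokens = ["A", "long", "user", "query"]
tags = ["D", "K", "K", "K"]                  # delete "A", keep the rest
order = [2, 3, 1]                            # user, query, long
insertions = {0: ["The"], 2: ["is", "very"]}
print(realize(tokens, tags, order, insertions))
# -> ['The', 'user', 'query', 'is', 'very', 'long']
```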

Text-editing encoder
The EDIT5 encoder consists of three steps: encoding, tagging, and pointing.
Encoder. The source sentence x is first encoded using N transformer layers into the hidden representations h.
Tagging. The tag sequence y^t is constructed as follows: source tokens that must be copied are assigned the KEEP tag, and tokens not present in the output are marked with the DELETE tag. Tags are predicted by applying a single transformer layer followed by a classification layer to the output of the encoder h, which is trained using cross-entropy:

L_tag = Σ_j CE(y^t_j, f^t(h)_j),

where y^t are the gold tags, j is the index of the source token, and f^t is a transformer layer followed by a classification layer. During inference we use argmax to determine the tags, whereas during training we use the gold tags. The encoder hidden state is then updated to take these tags into account:

h^t_j = f^{te}([h_j; TE(y^t_j)]),

where TE is a tag embedding layer, whose output is concatenated to the original hidden representation of the source sequence, before a feed-forward layer f^{te} is applied.
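A minimal numpy sketch of the non-autoregressive tagging loss (for illustration only; the two-class KEEP/DELETE output and the toy logit values are assumptions, not the paper's shapes):

```python
import numpy as np

def tagging_loss(logits, gold_tags):
    """Mean token-level cross-entropy for KEEP/DELETE tag prediction.
    logits: (seq_len, 2) per-token scores; gold_tags: (seq_len,) in {0, 1}."""
    # numerically stable log-softmax over the two tag classes
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(gold_tags)), gold_tags].mean()

logits = np.array([[2.0, -1.0], [0.5, 0.1], [-1.0, 3.0]])
gold = np.array([0, 0, 1])  # KEEP, KEEP, DELETE
loss = tagging_loss(logits, gold)
```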
Pointing. In many tasks it is helpful for the model to be able to rearrange the kept input tokens. For example, we can grammatically correct the sentence Who you are? to Who are you? purely by reordering tokens from the input. In EDIT5 this is made possible thanks to its pointing mechanism. In contrast, in text-editing approaches such as Malmi et al. (2019); Dong et al. (2019), correcting this sentence involves first deleting the words you are and then recreating them in the right order. Given a sequence x and the predicted tags y^t, the re-ordering model generates a permutation π. Our implementation is based on a pointer network (Vinyals et al., 2015), where an attention mechanism points to the next token. We follow Mallinson et al. (2020) which, unlike previous approaches where a decoder state attends over an encoder sequence, applies intra-attention, where source tokens attend to all other source tokens. As such the output of this model is a series of predicted pointers, where each source token predicts the token that comes after it. π can easily be constructed by daisy-chaining these predicted pointers together, as seen in Fig. 2. We calculate attention using key-query attention, where we include an additional transformer layer prior to the key network:

α_{m,j} = f^q(h^t_m) · f^k(h^t_j),

where α_{m,j} is the unnormalized attention, f^q is the query network, a single feed-forward layer, and f^k is the key network, a transformer layer followed by a single feed-forward layer.
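Daisy-chaining the per-token next-pointers into a permutation can be sketched as follows (illustrative only; the sentinel start token and dict representation are assumptions):

```python
def chain_pointers(next_of, start, n_kept):
    """Follow next-token pointers to recover the output order.
    next_of[i] is the index each token points to as its successor;
    `start` is a sentinel (e.g. a [CLS]/root token) pointing at the
    first output token."""
    order, cur = [], next_of[start]
    while cur is not None and len(order) < n_kept:
        order.append(cur)
        cur = next_of.get(cur)
    return order

# Figure 2 example: "a long user query" reordered to "user query long"
# token indices: 0=a 1=long 2=user 3=query; "root" is the sentinel start
next_of = {"root": 2, 2: 3, 3: 1, 1: None}
print(chain_pointers(next_of, "root", 3))  # -> [2, 3, 1]
```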
Unlike Mallinson et al. (2020), we ensure a valid permutation is formed, i.e., no token is pointed to twice, by using Sinkhorn layers (Mena et al., 2018), which normalize over both the rows and the columns of the intra-pointer attention α. Sinkhorn layers are defined as:

S^0(α) = exp(α),
S^i(α) = T_c(T_r(S^{i-1}(α))),

where T_r and T_c perform row-wise and column-wise normalization, respectively: T^r_{j,m}(X) = X_{j,m} / Σ_{m'} X_{j,m'} and T^c_{j,m}(X) = X_{j,m} / Σ_{j'} X_{j',m}. The loss for the pointing network is defined as:

L_pointer = Σ_j CE(π_j, S(α)_j),

where CE is the cross-entropy loss. During inference we use argmax to determine π.
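The Sinkhorn normalization can be sketched numerically as follows. This is a standalone numpy illustration of the operator, not the paper's TensorFlow implementation; the iteration count is arbitrary.

```python
import numpy as np

def sinkhorn(alpha, n_iters=20):
    """Alternately normalize rows and columns of exp(alpha) so the
    result approaches a doubly-stochastic (soft permutation) matrix."""
    s = np.exp(alpha - alpha.max())  # shift for numerical stability
    for _ in range(n_iters):
        s = s / s.sum(axis=1, keepdims=True)  # row normalization T_r
        s = s / s.sum(axis=0, keepdims=True)  # column normalization T_c
    return s

rng = np.random.default_rng(0)
s = sinkhorn(rng.normal(size=(4, 4)))
# rows and columns both sum to ~1, so argmax tends toward a valid permutation
```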
We use additional positional embeddings to update the hidden states with their new positions (offset from 0). For example, if Who you are? was reordered into Who are you?, the position information would be updated as 0 Who 2 you 1 are 3 ?.
h^p_j = h^t_j + PE(pos_j),

where PE are learnt absolute positional embeddings (Devlin et al., 2019) and pos_j is the new position of token j in the output order. These additional positional embeddings are masked out for those source words which do not appear in the target sequence.
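Computing the new position ids from the predicted output order amounts to inverting the permutation; a small illustrative sketch:

```python
def new_positions(order, n_tokens):
    """Invert the output order to get, for each source token index, its
    position in the output (tokens absent from `order` get None, mirroring
    the masking of positional embeddings for deleted tokens)."""
    pos = {tok: rank for rank, tok in enumerate(order)}
    return [pos.get(i) for i in range(n_tokens)]

# "Who you are ?" reordered to "Who are you ?": output order of source
# indices is [0, 2, 1, 3], giving positions "0 Who 2 you 1 are 3 ?"
print(new_positions([0, 2, 1, 3], 4))  # -> [0, 2, 1, 3]
```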
Finally, we apply a transformer encoder layer to h^p, forming the final encoded representation of the sequence, h^f. h^f captures the edits as well as the original sequence x, and the decoder attends to this representation.
Decoder. We use a standard transformer decoder, which is tasked with inserting tokens which are in the output sequence but do not appear in the input sequence. As such, the decoder first decodes a special position token and then decodes the inserted tokens which should appear after this token. For example, to insert the cat after the first token, the decoder generates: 〈pos_1〉 the cat. The decoder is trained with a standard cross-entropy loss:

L_ins = Σ_i CE(y^d_i, f^d(y^d_{<i}, h^f)),

where i is the decoder index and h^f is the encoder output. The loss for the entire model is defined as the sum of the three individual losses:

L = λ_1 L_tag + λ_2 L_pointer + λ_3 L_ins,

where λ_1, λ_2 and λ_3 are hyper-parameters determining the relative importance of the tagging, pointing and insertion losses in the final loss.
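Parsing the decoder's flat output of position tokens and inserted spans into per-position insertions can be sketched as follows (illustrative; the `<pos_i>` spelling of the special tokens is an assumption):

```python
def parse_insertions(decoded):
    """Group a decoded sequence like ['<pos_0>', 'The', '<pos_2>', 'is',
    'very'] into {position: [tokens to insert after that position]}."""
    insertions, current = {}, None
    for tok in decoded:
        if tok.startswith("<pos_") and tok.endswith(">"):
            current = int(tok[5:-1])
            insertions.setdefault(current, [])
        elif current is not None:
            insertions[current].append(tok)
    return insertions

print(parse_insertions(["<pos_0>", "The", "<pos_2>", "is", "very"]))
# -> {0: ['The'], 2: ['is', 'very']}
```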
Pre-training. While we initialize EDIT5 from T5 base, T5 was pre-trained with 12 decoder layers, whereas for EDIT5 we use a single decoder layer. To account for this change in the decoder, we perform additional pre-training. We use a pre-training objective which combines a T5-style span insertion task with a generic text-editing denoising task, as used in BART (Lewis et al., 2020b). A source sentence is corrupted by dropping, swapping and adding spans (an example can be seen in Figure 3), and we task our model to reconstruct the original sentence. By introducing noise we are able to train the tagger to detect incorrect spans, and the pointer to reorder the sentence. The decoder then behaves like the T5 pre-training objective, inserting the content of missing spans. Unlike BART's pre-training, our approach is computationally cheap, as we do not decode the entire sequence during training, instead decoding only the missing spans.
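The span corruption can be sketched as follows. This is a toy illustration with assumed span sizes and corruption choices; the paper's actual noising schedule is not specified here.

```python
import random

def corrupt(tokens, seed=0):
    """Corrupt a sentence by dropping one two-token span and swapping two
    adjacent tokens, so the model must re-tag, re-order and re-insert
    to reconstruct the original."""
    rng = random.Random(seed)
    toks = list(tokens)
    # drop a short span (the decoder must later re-insert it)
    i = rng.randrange(len(toks) - 1)
    dropped = toks[i:i + 2]
    del toks[i:i + 2]
    # swap two adjacent tokens (the pointer must undo the reordering)
    if len(toks) >= 2:
        j = rng.randrange(len(toks) - 1)
        toks[j], toks[j + 1] = toks[j + 1], toks[j]
    return toks, dropped

corrupted, dropped = corrupt("the quick brown fox jumps over".split())
```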
Dataset construction. When constructing the training dataset, there are many possible combinations of y^t, π and y^d which could produce y. For instance, all source tokens could be deleted and the decoder could then produce all the target tokens.
However, to minimize latency, we wish to make the number of inserted tokens (i.e., the number of decoder steps) as small as possible, and to maximize the number of kept tokens.
To produce alignments from a target sequence to a source sequence, we iterate left-to-right through characters in the target sequence, trying to find spans of target characters which appear in the sequence of source tokens, as described in Algorithm 1 (see Appendix A). Each source token can only be aligned to a single target span. Those target spans that can't be aligned are instead inserted after the closest previous aligned source token. In cases where there are multiple possible alignments, e.g., when the same token appears multiple times in the source, we align the target character span so as to produce the longest contiguous span of source tokens aligned with the target, i.e., where source tokens appear one after another in the target sequence. To find the longest contiguous span we compare the contiguous overlap between source and target for each possible alignment.
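A simplified token-level version of this greedy alignment can be sketched as follows (the paper operates on characters and applies longest-contiguous-span tie-breaking; this sketch assumes token granularity and first-match tie-breaking):

```python
def align(source, target):
    """Greedily align target tokens to unused source tokens; unaligned
    target tokens become insertions after the last aligned source token
    (key 0 means insert before everything)."""
    used, kept_order, insertions = set(), [], {}
    for tok in target:
        match = next((i for i, s in enumerate(source)
                      if s == tok and i not in used), None)
        if match is not None:
            used.add(match)
            kept_order.append(match)
        else:
            insertions.setdefault(len(kept_order), []).append(tok)
    return kept_order, insertions

src = "a long user query".split()
tgt = "the user query is very long".split()
print(align(src, tgt))
# -> ([2, 3, 1], {0: ['the'], 2: ['is', 'very']})
```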

Experiments
We evaluate EDIT5 on three distinct text-editing tasks: Sentence Fusion, Grammatical Error Correction, and Decontextualization. In addition to reporting previously published results for each task, we also compare to FELIX (Mallinson et al., 2020), a recent non-autoregressive text-editing model, and a strong pre-trained T5 baseline implemented in the T5X framework (Roberts et al., 2022).
Modeling. For EDIT5 we initialize with a T5 base model with a 12-layer Transformer encoder and a single-layer Transformer decoder. Our code is based on the TensorFlow Model Garden's (Hongkun Yu and Li, 2020) TF2 version of T5. After initializing with the T5 checkpoint, we further pre-train on the denoising objective (see Section 2.1) using the C4 corpus (Raffel et al., 2020), training for 100k steps.
For all experiments EDIT5 is trained using AdamW (Loshchilov and Hutter, 2019); additionally, the learning rate is decayed using the validation set, and exact match is used for checkpoint selection. Tokenization is based on T5's SentencePiece vocabulary (Kudo and Richardson, 2018), with a vocabulary size of 32k. We, however, modify the vocabulary, removing tokens which have punctuation as a suffix and replacing them with additional span insertion special tokens, giving EDIT5 512 span insertion special tokens. Unless otherwise stated, we use an input sequence length of 128. We performed minimal hyper-parameter selection, which is discussed in the Appendix.
Task Analysis. The chosen tasks cover a diverse set of edit operations and a wide range of dataset sizes, varying from under 11 thousand data points to over 4.5 million. Table 1 provides dataset statistics including: the size, input sequence length, output sequence length for seq2seq models, output sequence length for EDIT5, and the translation error rate (TER) (Snover et al., 2006) between the source and target sentences. We use TER to highlight unique properties of each task.
From Table 1 we see that for all tasks EDIT5 requires significantly fewer decoder steps than a seq2seq model, which results in significant latency savings. We also see that Decontextualization has the longest input and output sequences, with a maximum input length of 512 tokens. Decontextualization has the highest TER, with the major contribution being deletion, which is due to the input sequence consisting of a paragraph, whereas the output is a single sentence. In contrast, GEC has the shortest input and output sequences, with the majority of the dataset consisting of a single input and a single output sentence. GEC has the lowest TER; however, it has the highest insertion TER. Sentence Fusion consists of two sentences being rewritten into a single sentence, and has a middling TER and sequence lengths. It also has the fewest substitutions.

Sentence Fusion
Sentence Fusion is the task of fusing independent sentences into a coherent output sentence(s) (Geva et al., 2019). It requires operations such as inferring the appropriate discourse connective, pronominalization, reordering the text to introduce relative clauses, and changing the order of the input sentences.
Setup. Following Geva et al. (2019), we report Exact match, which is the percentage of exactly correctly predicted fusions. The results in Table 2 additionally demonstrate that the significant improvements of EDIT5 over FELIX in high/medium-resource settings do not stem from EDIT5 pre-training. With 450 datapoints, pre-training is critical, since there is a larger mismatch between EDIT5 and T5 checkpoints than there is between FELIX and BERT checkpoints. We additionally ablated the impact of Sinkhorn layers, and found that under the 100% data condition there was a modest decrease in performance (0.5 exact match points).

Decontextualization
The sentence decontextualization task was introduced by Choi et al. (2021). The goal is to rewrite an input sentence to make it stand alone without the original context.
Data. We use the train, dev and test data from Choi et al. (2021), where sentences were selected from Wikipedia passages. Human annotators were asked to rewrite them, if possible, to be interpretable and grammatical without the context. We compare against T5 base, T5 xxl, FELIX, and a copy baseline. All models use a sequence length of 512.
Metrics. Following Choi et al. (2021), we report exact match, exact match when a sentence needs to be rewritten, and SARI F1 (deletion and addition) on unigrams (Xu et al., 2016). The results in Table 3 show that EDIT5 achieves higher exact match scores and a higher SARI delete score when compared to T5 base, with a significant drop in latency and using fewer parameters. T5 base achieves significantly higher SARI add, suggesting it is better at inserting new tokens, which is unsurprising as EDIT5 is primarily focused on copying the source sequence. Both T5 and EDIT5 achieve significantly higher numbers than FELIX. EDIT5 and T5 base, however, still achieve significantly lower scores than T5 xxl, which can be explained by the difference in model size.

Grammatical Error Correction
GEC requires systems to identify and fix grammatical errors in a given input text.

Data. We evaluate on the standard GEC test set BEA (Bryant et al., 2019), and use BEA-DEV for checkpoint selection. For pre-training we use an artificial GEC dataset, C4_200M, of 200M sentences (Stahlberg and Kumar, 2021). We then fine-tune on cLang-8 (Rothe et al., 2021), a distilled version of the Lang-8 learner corpus (Mizumoto et al., 2011).
Setup. We report ERRANT F0.5 scores for BEA. We report additional gT5/gFelix baseline numbers from Rothe et al. (2020b), where T5/FELIX models were trained only on cLang-8. For pre-training we sampled 0.2% of examples from the training set to use as a development set, and train until convergence as measured on this development set. We additionally measure the impact that model size has on quality and latency, training T5 and EDIT5 small, base, and large models. To make the latency comparison fairer, we also train single-decoder-layer variants of the T5 models, which we call T5 slim. To further ensure a fair latency comparison between EDIT5 and T5, we use the same framework for both models. Additionally, we do not perform EDIT5-specific pre-training.
Results. From Table 4, we see that all models outperform their equivalent gT5/gFelix models, which is not surprising as the latter models were trained on less data. A surprising result is that the T5 slim variants achieve comparable scores to the full T5 models while having significantly lower latency. Comparing EDIT5 against T5 models, we see up to ∼1 point differences in F0.5 scores between models of the same size (small/base/large); however, EDIT5 produces speed-ups between 10x and 25x.
In Figure 4, we study the latency-quality trade-offs of T5, T5 slim, and EDIT5 models. We omit FELIX from this analysis, because FELIX achieves a significantly lower score. We focus on the 95th-percentile latency, as it is often the case that users require that a model returns a result within a fixed latency budget. We see that EDIT5 drops less than 0.25 F0.5 points comparing across model sizes, whilst being significantly faster. Additionally, for a given latency budget of 5ms, no full T5 model would fit, and only T5 slim small would fit, whereas both EDIT5 small and base fit. Comparing EDIT5 base against T5 slim small, we see that EDIT5 scores 3 F0.5 points higher, whilst being faster. For any latency budget under 20ms, EDIT5 is quicker and offers better results than T5 and T5 slim. For latency budgets above 20ms, T5 slim large scores slightly (<0.25 F0.5) higher than EDIT5, and if latency is not a factor then gT5 xxl should be used.

Latency analysis
The tasks on which EDIT5 outperforms seq2seq models in latency are those that have overlap between sources and targets, but it is unclear how much overlap is required for EDIT5 to produce latency savings. To answer this question, we split EDIT5 base, T5 base and T5 slim base into components whose latencies we measure separately and compare. Details on how latencies are measured can be found in Appendix C. A seq2seq model decomposes into two parts: the encoder (we include the input embedding here, so we refer to this as encoder* below), and the decoder. EDIT5 has both of these parts, but also includes a third part (which we call its overhead), comprising the pointer realization and additional transformer layers. To make our analysis simpler and more task-agnostic, we make two simplifying assumptions. First, we assume the worst case that no tokens are deleted by EDIT5 and that there are no padding tokens in the input; in practice this is not the case, and deletion provides significant latency savings for EDIT5. Second, we assume that decoder latency is linear in the number of decoder steps. Both of these assumptions benefit the latency of seq2seq models more than EDIT5.
Results. In Table 5 we present latencies of encoder*, the worst-case EDIT5 overhead, and the per-step latency of a decoder under various input-length conditions. We see that the overhead added by EDIT5, even in the worst case, is small.
From these results we can derive a simple rule for when EDIT5 will provide a net latency benefit. Compared to T5 slim base, EDIT5 base must save on average 4 steps with an input length of 128, and 7 steps with an input length of 512.
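This break-even rule amounts to dividing EDIT5's fixed overhead by the decoder's per-step latency. The numbers below are purely illustrative placeholders, not the values from Table 5:

```python
import math

def break_even_steps(edit5_overhead_us, per_step_decoder_us):
    """Number of decoder steps EDIT5 must save, on average, for its
    fixed overhead to pay for itself against a seq2seq baseline."""
    return math.ceil(edit5_overhead_us / per_step_decoder_us)

# hypothetical values: 1200 microseconds of overhead, 300 per decoder step
print(break_even_steps(1200, 300))  # -> 4
```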
Finally, collating the results in Table 5 with the number of decoder steps performed by EDIT5 and T5 in Table 1, we see that whereas in T5 the decoder latency dominates the latency of encoder*, in EDIT5 this is no longer the case. For instance, for GEC, at 24.7 decoder steps on average required to construct the output, T5 slim spends 3.7x more time in its decoder than in encoder*. EDIT5, however, spends less time in its decoder than in encoder*; as such, the encoder* is now the latency bottleneck.

Related work
T5 (Raffel et al., 2020) is a pre-trained, Transformer-based (Vaswani et al., 2017) encoder-decoder model which has become a general-purpose tool for a variety of sequence-transduction tasks, establishing many new state-of-the-art results (Raffel et al., 2020; Rothe et al., 2021). However, two considerable challenges hindering the productionization of T5-based models are the high latency caused by autoregressive decoding and the need for a relatively large number of training examples, despite the fact that pre-training makes T5 more sample-efficient. Recently, it has been found that the sample efficiency problem can be mitigated by performing in-context few-shot learning, but this typically requires scaling up the model size even further (Brown et al., 2020; Chowdhery et al., 2022), increasing the latency.
To reduce latency, a number of non-autoregressive (NAT) seq2seq methods have been proposed for neural machine translation (Gu et al., 2018, 2019; Du et al., 2021), but a quality gap compared to autoregressive methods still exists. To decrease the gap, it is common to run the NAT methods iteratively, which, however, limits the inference speed advantage over autoregressive methods (Lee et al., 2018). In contrast, we show that for tasks where inputs and outputs overlap, we can maintain an order-of-magnitude speed-up without compromising on model quality by treating the problem as a text-editing task and producing the output in a single pass.
A number of text-editing models have been proposed as a faster and more sample-efficient alternative to seq2seq models like T5 (Awasthi et al., 2019; Malmi et al., 2019; Omelianchuk et al., 2020; Mallinson et al., 2020). Another recently proposed approach to speed up the inference time of Transformer models is called aggressive decoding (Sun et al., 2021; Ge et al., 2022).
Closest to our work, Mallinson et al. (2020) show that adding a pointing mechanism for reordering and a separate insertion model allows their text-editing model, FELIX, to produce an arbitrary output in a flexible manner. FELIX is a non-autoregressive model which first predicts the tokens to keep, their order, and the locations at which to insert new tokens. It then runs a separate model based on a BERT masked language model for inserting new tokens. In contrast, EDIT5 employs a single, end-to-end model which has an autoregressive insertion component. This enables more accurate insertions, while keeping the latency low, given that most of the tokens can be copied from the source non-autoregressively. Other text-editing models that employ autoregressive insertion include EditNTS (Dong et al., 2019), the text-normalization model by Zhang et al. (2019), Seq2Edits (Stahlberg and Kumar, 2020), ESC (Chen et al., 2020) and LEWIS (Reid and Zhong, 2021). However, unlike EDIT5, these models also perform the edit operation prediction autoregressively, making them potentially slower at inference time.

Conclusions
In this paper we have proposed EDIT5, a low-latency solution to text generation that achieves comparable or better results, across three distinct tasks, than a strong T5 baseline, whilst achieving inference latencies that are up to 25x quicker than the baseline model.
In the future we wish to explore the following ideas: 1) The impact of distillation on EDIT5. Distillation has previously been shown to be particularly advantageous for non-autoregressive models. 2) The impact that quantization has on both latency and quality. 3) Applying EDIT5 to additional languages. EDIT5 makes no language-specific assumptions, and we plan to apply it to languages other than English.

Limitations
A limitation of EDIT5, and text-editing models in general, is the assumption of overlapping text between the input and output sequences.For instance, in machine translation the overlap between source and target is minimal to none.As such EDIT5 would decode the entire target sequence, thus offering no latency saving.
An additional limitation is that all of our experiments were done on English tasks.It is unclear how EDIT5's pointing mechanism would behave with languages which have a less strict word-order, such as Czech.
Finally, we have measured latency only on V4 TPUs.

A Alignment Algorithm

B Training Details
All models were trained on 4x4 or 8x8 TPUs; all EDIT5 models completed training (including EDIT5 pre-training) in under a day. T5 large pre-training took 2 days to complete and was done using a 4x4 TPU.

B.1 Hyper-Parameters Selection
For T5 we compared the T5 1.0 and T5 1.1 versions using the base model on the validation sets and found that T5 1.1 performed better; as such we used T5 1.1. For EDIT5 we used the BEA dev set, finding that T5 1.0 base performed better than T5 1.1, and selected 1.0 for all experiments. For T5 we used the recommended fine-tuning settings, including using the Adafactor optimizer (Shazeer and Stern, 2018), with a learning rate of 0.001. For EDIT5 we used AdamW with default settings and the default learning rate of 3e-4.
DiscoFuse. For both EDIT5 and T5 we experimented with 3 different batch sizes: 128, 256, 1024. For 100% and 10% of the data, there was not a noticeable difference in the dev set exact match performance, so we chose 1024 as it converged the quickest. For 1% and lower, we found that a batch size of 128 performed the best on the dev set.
Decontextualization. For EDIT5 we experimented with batch sizes 128, 256, 1024 and found that 256 offered the best exact match, so we used this. We also slightly modified the preprocessing code, bracketing the target sequence with [CLS] and [SEP], which helped the alignment code.
GEC. For both EDIT5 and T5 we used the T5-recommended number of tokens per batch: batch size = 512, maximum sequence length = 128. We note, however, that T5 used the inverse: batch size = 128, maximum sequence length = 512. For T5 and EDIT5 we disabled learning rate warmup when fine-tuning on cLang-8. Two additional hyper-parameters were set for EDIT5: during pre-training on C4_200M, we noted that EDIT5 train set performance was lower than T5's, so we disabled dropout on the additional EDIT5-specific transformer layers. We additionally used the dev set to set the values of λ in Equation 10. We experimented with the tagging/pointing λ being 1, 2, 10, or equal to the number of tokens, where λ equal to the number of tokens produced the best results.

C Latency measurement
To report latency for a model, we run inference on a Cloud TPU V4 chip with batch size 1 and report the time spent in computations on the device. This approach ignores some practical contributors to latency, such as memory transfers between the host and device, but we found it also reduced noise significantly, while focusing on the main performance differences between EDIT5, T5 and T5 slim (the amount of computation they each perform).
To further minimize spurious latency differences, both EDIT5 and the baseline models are based on the same T5 implementation, found in TensorFlow Model Garden (Hongkun Yu and Li, 2020).

Figure 1: EdiT5 transforms the input text A long user query into the output The user query is very long by first generating a sequence of edit tags D K K K (where K stands for keeping and D for deleting the input token), re-ordering the input tokens with the pointer network, and infilling missing tokens into the source sequence with an autoregressive decoder which jointly predicts the text spans (The and is very) and the positions where to insert them (pos0 and pos2). The blue arrow shows how the token pos2 is predicted conditioned on the prefix <s> pos0 The generated thus far. The dotted arrow lines depict the encoder-decoder cross attention over the re-ordered input tokens and edit tags.

Figure 2: Pointing mechanism to transform "a long user query" into "user query long".
EDIT5 takes advantage of the pre-training of a T5 model, where T5 was pre-trained to infill missing spans. When pre-training, T5 uses special tokens 〈pos_i〉 to indicate where missing spans should be inserted, as demonstrated in Figure 3. EDIT5 re-purposes these special tokens, using them to indicate at which position new tokens should be infilled; i.e., 〈pos_1〉 indicates that the tokens should be inserted after the first token.

Figure 3: Example pre-training noise for T5 and EDIT5. K and D indicate keep and delete tags respectively, and [0] indicates pos0.

Table 1: Statistics across tasks: size of the dataset (Size), source length in tokens (L_src), target length in tokens (L_tgt), EdiT5 insertion tokens (E5-Ins), and TER scores, including number of insertions (Ins), deletions (Del), substitutions (Sub), and shifts (Shft). Token counts are measured using a SentencePiece tokenizer and averaged over the development set.

Table 4
Measuring latency only on V4 TPUs means it is unclear how the performance would behave on different graphics cards or on CPUs. As such, to determine if EDIT5 offers a good trade-off between quality and latency, one must measure latency on the target device.