Interactive Text Generation

Users interact with text, image, code, or other editors on a daily basis. However, machine learning models are rarely trained in the settings that reflect the interactivity between users and their editor. This is understandable as training AI models with real users is not only slow and costly, but what these models learn may be specific to user interface design choices. Unfortunately, this means most of the research on text, code, and image generation has focused on non-interactive settings, whereby the model is expected to get everything right without accounting for any input from a user who may be willing to help. We introduce a new Interactive Text Generation task that allows training generation models interactively without the costs of involving real users, by using user simulators that provide edits that guide the model towards a given target text. We train our interactive models using Imitation Learning, and our experiments against competitive non-interactive generation models show that models trained interactively are superior to their non-interactive counterparts, even when all models are given the same budget of user inputs or edits.


Introduction
Advances in generative modeling have made it possible to automatically generate high-quality texts (Brown et al., 2020), code (Chen et al., 2021), and images (Ramesh et al., 2021), but these creations can be unsatisfactory in many respects. For example, they often suffer from content errors, e.g., hallucinations (Ji et al., 2022), that may require help from the user. Even if the generation is of good quality, it may not be the kind of text, code, or image that the user was hoping to obtain. Indeed, open-ended and complex generation tasks are often underspecified (e.g., from a simple prompt), which makes it almost impossible for the model to satisfy user needs without additional information. Distinguishing the real user need from the need as initially presented to the system has been the focus of decades of research (Taylor, 1968), and usually requires interacting with users to clarify their needs (Gao et al., 2022). Unfortunately, much of the research in generative models has been in "one-shot" settings, which don't allow any kind of iterative refinement to steer the generation towards what the user really wants.
This paper introduces a new end-to-end generation task, Interactive Text Generation, that accounts for interactivity without requiring real users during training. The framework is illustrated in Figure 1, where the model and a "user" take turns editing the text until a given stopping criterion is met. As this setup would be impractical to train with real users, we rely on a user simulator that provides a few high-quality edits that guide the model towards a given target text. This interspersion of user edits in the generation process allows text generation models to make more efficient use of the inferential capabilities of large language models (LLMs). Contrast interactive text generation in Figure 1 with conventional text generation, where both systems are given exactly three user input words. Interactive text generation leverages the LLM's capability to infer, e.g., "spacecraft" from "NASA", and allows the user to focus on parts of the text (e.g., "Monday") that are more difficult to predict; this allows the interactive approach to generate text that better meets the user's needs.
Our work makes the following contributions:
• We propose a task and framework for interactive text generation, and release models, a dataset, and user simulators. Crucially, this task evaluates generation models under the same budget of user inputs or edits, to ensure the comparison between interactive and non-interactive models is fair.
• We present methods to train Transformer-based (Vaswani et al., 2017) interactive text editing models using imitation learning, where our models learn to imitate an expert that dynamically constructs a trajectory from the current document state to the goal state (the target document). Our editing models include both autoregressive and non-autoregressive versions, with the non-autoregressive one achieving the best results.
• We show that interactivity indeed helps when models trained with imitation learning are compared against their non-interactive counterparts, across different evaluation metrics, model types, and model sizes, and with the same amount of user input. This finding is consistent with prior work showing that user needs are often best handled interactively (Oddy, 1977; Belkin et al., 1995; Radlinski and Craswell, 2017; Gao et al., 2022), and confirms that our benchmark helps quantify the benefit of interactivity in text generation.
• As user simulation is a key component of this new task, we contribute different user simulators. Our experiments show that the performance of our models remains consistent across different user simulators, which highlights the robustness of our new benchmark.
A further contribution is that text generation in this work is not merely aimed at producing well-formed text, but also at creating text that is tied to user needs. We release this framework with the hope that it will help researchers in NLP, Imitation Learning, Reinforcement Learning, and AI in general, as it provides an environment to train AI agents that directly interact with users and user simulators.

Task: Interactive Text Generation
The task introduced in this paper considers a simple setting in which the system and user collaboratively write a document. As our goal is to train generation models that can do most of the heavy lifting, this task gives a more directional role to the user, while the bulk of the text is generated by the system. Our task is therefore similar to the instructor-executor frameworks (Hu et al., 2019; Kiseleva et al., 2022) seen in other tasks. In the case of text generation, motivating examples of such instructor-executor interactions include a student writing a paper while getting occasional feedback from an advisor, or a freelance writer getting instructions from a client.
Our task models the interaction between a user and a system, where the two parties successively take turns making changes to a draft document. As this interaction applies to both training and testing, it would be unrealistic to assume we have real users in both cases, and we therefore rely on user simulators. Although building effective user simulators is notoriously hard in tasks such as dialog (Schatzmann et al., 2007; Li et al., 2016; Lin et al., 2022; Gao et al., 2019), it is less so in our work given how we frame the interactive generation task: we assume that there is a goal text, which one can think of as representing the user's desires or needs. The system does not know the goal, but the user simulator does. The objective of the agent is to get as close to the goal text as possible, while minimizing the number of edits the user simulator needs to make.
This framing makes it much easier to design effective user simulators, as a comparison between the current draft and the goal text can help infer useful text edits. While agents in this setup are given an advantage by interacting with an oracle that has knowledge of the goal document, the number of oracle edits is generally set to be small, and we ensure comparisons between all models (including non-interactive models) are fair by giving each model the same budget of edits or inputs derived from the goal document.

Task Formulation
We can formalize our task by the following protocol. There are two players in the game, the agent and the user. Starting from a blank document, the players take turns producing drafts over H rounds. The user has a goal document G, and the objective of the game is to get as close to the goal as possible. Closeness is quantified by a scoring function s(·; G) and a tolerance δ > 0. We denote the agent draft at step h by A_h, and the user draft at step h by U_h. Fixing a goal document G, the protocol is as follows. For h = 1, ..., H:

1. If A_{h−1} is close enough to the goal, i.e., s(A_{h−1}; G) is within the tolerance δ of the maximum score, the interaction stops. Otherwise, the user produces U_h according to G and A_{h−1}.
2. The agent observes U_h and produces A_h.

Here A_0 is a blank document. The tolerance δ represents how close the agent needs to get to the target for the user to be satisfied. Let T ≤ H be the time at which the interaction stopped. We evaluate the task by looking at s(A_T; G) and T. The higher s(A_T; G), the better the produced document, and the lower T, the more time the user saved.
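The protocol above can be sketched as a simple loop. This is an illustrative reading of the task, not the authors' implementation: the stopping rule (score within δ of a maximum of 1), the function names, and the representation of documents as word lists are all assumptions.

```python
def run_episode(agent, user, score, goal, H, delta):
    # One instance of the interactive generation protocol: starting from a
    # blank document, the user and agent take turns over at most H rounds.
    # The interaction stops once the agent draft scores within tolerance
    # delta of the maximum score of 1.
    draft = []                                # A_0 is a blank document
    for h in range(1, H + 1):
        if score(draft, goal) >= 1.0 - delta:
            return draft, h - 1               # user satisfied at time T = h - 1
        user_draft = user(draft, goal)        # U_h: user edits toward G
        draft = agent(user_draft)             # A_h: agent revises the draft
    return draft, H
```

The returned pair corresponds to (A_T, T): the final draft and the time at which the interaction stopped.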
We assumed here that each instance of the task is parameterized by a goal document G. However, one could instead use multiple goal documents, G_1, ..., G_M, or even replace s(·; G) with some general score function s(·) that is independent of any goal document, which one could think of as a utility function for the user.
In this work, we assume that the user behavior is fixed and we will learn an agent policy.

User Simulator
This section describes the user simulator used in our work. Throughout this section and the next, we refer the reader to the Appendix for more details. Where appropriate, we indicate references to the appendix in parentheses.

Edits
Both our simulated user described below and our model described in Section 4 operate by making edits to a document. We consider single-word edits of three types: inserting a word, deleting a word, or substituting one word for another. If we denote an edit by e and a document by x, then y = [e](x) denotes the action of e on document x, producing a new document y. Note that y will differ from x by a single word. If we have multiple single-word edits e_1, ..., e_N, we can apply them one after another, as in y = [e_1...e_N](x).
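The single-word edit operations above can be made concrete as follows; the class and function names are hypothetical, chosen only to illustrate the notation y = [e_1...e_N](x):

```python
from dataclasses import dataclass

@dataclass
class Edit:
    op: str         # "insert", "delete", or "substitute"
    loc: int        # word position the edit applies to
    word: str = ""  # new word, used by insert/substitute

def apply_edit(doc, e):
    # y = [e](x): apply a single-word edit to a document (a list of words)
    doc = list(doc)
    if e.op == "insert":
        doc.insert(e.loc, e.word)
    elif e.op == "delete":
        del doc[e.loc]
    elif e.op == "substitute":
        doc[e.loc] = e.word
    return doc

def apply_edits(doc, edits):
    # y = [e1 ... eN](x): apply edits left to right
    for e in edits:
        doc = apply_edit(doc, e)
    return doc
```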

Simulated User
Because the user has knowledge of the goal document G, we can easily define a sensible user simulator. Given a current draft A_h produced by the agent, we can compute sequences of edits e_1, ..., e_N such that G = [e_1...e_N](A_h), as detailed in Section 3.3 below. The user then produces a draft U_{h+1} = [e_1...e_n](A_h) by applying only a small number n ≤ N of these edits, where the edits are chosen according to a heuristic (App. B). An example heuristic is to pick edits e_1, ..., e_n that insert informative words first. This way, the user provides information to the agent by producing a draft U_{h+1} that is a few steps closer to G than A_h, much like providing a gradient in the direction of G. Note that while this represents one way to build a user simulator, our task formulation is completely agnostic to the user behavior.
In our experiments we ensure that comparisons between non-interactive and interactive systems are fair by fixing the total number of single-word edits made by the user at N. In the non-interactive case, all N edits are made at the start. In the interactive setting, the N edits are spread over M rounds of editing, where at each round the user only makes N/M edits.
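One way to realize the "insert informative words first" heuristic is sketched below. The informativeness function and the representation of candidate edits as (position, word) insertions are assumptions for illustration; the paper's actual heuristics are described in App. B.

```python
def user_turn(draft, candidate_inserts, n, informativeness):
    # Apply only the n most informative of the insertions that would move
    # the current draft toward the goal. Positions are relative to `draft`.
    chosen = sorted(candidate_inserts,
                    key=lambda pw: informativeness(pw[1]), reverse=True)[:n]
    new, offset = list(draft), 0
    for pos, word in sorted(chosen):   # apply left to right
        new.insert(pos + offset, word)
        offset += 1                    # earlier inserts shift later positions
    return new
```

For example, with `informativeness=len`, longer (often rarer) words are inserted before short function words.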

Alignments
For any x, y there may be many sequences e_1, ..., e_N such that y = [e_1...e_N](x). Such a sequence always exists, since we can delete x completely and then generate y. Practically, we can compute a set of such edit sequences by finding an alignment between x and y, as illustrated in Figure 2. The alignment defines which words in x map to which words in y, and which words are deleted or inserted. Words matched to identical words are conserved, whereas mismatches correspond to substitutions. From there it is easy to obtain the edits e_1, ..., e_N that transform x into y. We denote this set by A(x, y). We use contextual word embeddings from BERT to compute these alignments, so that words in x map to words in y with similar meaning, which we found tends to produce more natural alignments (App. A.3).
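A lightweight stand-in for this alignment-to-edits step can be built with Python's difflib. Note that the paper aligns words with BERT contextual embeddings, whereas difflib matches surface forms, so this is only a rough sketch of how an edit set like A(x, y) can be derived:

```python
import difflib

def alignment_edits(x, y):
    # Derive word edits turning x into y from a (surface-level) alignment.
    # Each edit is a tuple (op, position, word); positions refer to x for
    # deletions/substitutions and to y for insertions.
    edits = []
    for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(a=x, b=y).get_opcodes():
        if tag == "replace":
            k = min(i2 - i1, j2 - j1)
            for t in range(k):
                edits.append(("substitute", i1 + t, y[j1 + t]))
            for j in range(j1 + k, j2):          # leftover words to insert
                edits.append(("insert", j, y[j]))
            for i in range(i1 + k, i2):          # leftover words to delete
                edits.append(("delete", i, None))
        elif tag == "delete":
            edits.extend(("delete", i, None) for i in range(i1, i2))
        elif tag == "insert":
            edits.extend(("insert", j, y[j]) for j in range(j1, j2))
    return edits
```

With embedding-based scores, "red" aligned to "blue" would similarly surface as a single substitution rather than a delete plus insert.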

Imitation Learning Model
The agent policy π_θ parameterizes a distribution over documents, so that at each step of the interaction we draw A_h ∼ π_θ(·|S_h). Here S_h is the history of drafts (A_0, U_1), ..., (A_{h−1}, U_h) exchanged so far. This information is necessary for the agent to make inferences about the goal G. However, this history can become very long, especially since each element is an entire document. Therefore, in practice we only give the policy π_θ access to the last pair of drafts, (A_{h−1}, U_h), which we represent as a diff (App. B), as illustrated in Figure 2.

Left to Right Autoregressive Model
The first agent model that we consider is a simple sequence-to-sequence (S2S) model that generates outputs in a left-to-right autoregressive manner:

π_θ(a | S_h) = ∏_{i=1}^{L} π_θ(a_i | a_{<i}, S_h),

where a = (a_1, ..., a_L) and a_{<i} = (a_1, ..., a_{i−1}). Note that this model autoregressively generates the new draft from scratch.

Token Editing Model
While our S2S model forms a reasonable baseline, we note that it is not particularly adapted to the task. We therefore also propose a token editing model that directly edits the previous draft U_h to produce its revised draft A_h by making a series of edits. This way, both the agent and the user operate by making edits. Additionally, we add a stopping action to allow the model to decide when to stop editing. Concretely, we parameterize the model as a distribution π_θ(e_{k+1} | e_1, ..., e_k, S_h) over edits (and the stopping action). In other words, based on the previous edits it made and the current state, the model decides what edit to make next. The probability of producing a particular draft a is then

π_θ(a | S_h) = Σ_τ ∏_{k=1}^{M} π_θ(e_k | e_1, ..., e_{k−1}, S_h),

where the sum is over all sequences of edits τ = (e_1, ..., e_M) such that [e_1...e_M](U_h) = a. See Figure 2 for an illustration.

Training
We train our model to imitate an expert policy that would perform well on our task. Because this expert is only used at training time, it is very simple: it produces the goal G. See Figure 3 for an illustration. We follow the DAgger (Ross et al., 2011) algorithm as outlined in Algorithm 1, which minimizes an objective of the form

ℓ(θ) = E_{G∼ν} E_{S_h∼π} [ −log π_θ(G | S_h) ],

where π is the roll-in, or sampling, policy, ν is a distribution over goals, and the second expectation is over histories produced by running our task protocol with π as the agent (App. A.4). Setting π to π_θ would allow us to train on-policy, but since π_θ starts off as a poor policy that visits many unnecessary states, DAgger uses a mixture of the agent and expert policies, annealing from the expert toward the agent over the course of training.
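The data-collection side of this DAgger-style procedure can be sketched as follows. The trivial expert always outputs G, and the mixing/annealing schedule here (β multiplied by 0.9 per iteration) is an assumption standing in for the paper's sampling annealing rate:

```python
import random

def dagger_rollin(policy, user, goals, H, beta, steps, rng=random):
    # Collect (state, expert label) pairs by rolling in with a mixture of
    # the expert and the learned policy, as in DAgger. Every visited state
    # is labeled with the expert's action, i.e., the goal document G;
    # training then minimizes -log pi_theta(G | S_h) on this data.
    data = []
    for _ in range(steps):
        G = rng.choice(goals)
        draft = []                          # A_0: blank document
        for h in range(H):
            user_draft = user(draft, G)     # U_{h+1}: user edits toward G
            data.append((user_draft, G))    # DAgger label at this state
            if rng.random() < beta:
                draft = G                   # expert action: produce G
            else:
                draft = policy(user_draft)  # agent action
        beta *= 0.9                         # anneal toward the learned policy
    return data
```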

Likelihood Estimation
The training objective from Section 4.3 requires computing the negative log-likelihood −log π_θ(G|S_h). For the autoregressive model, this can be computed directly as a simple product, as described in Section 4.1. The likelihood for the token editing model, on the other hand, involves a sum over sequences of edits that quickly becomes intractable. Instead, we optimize an upper bound on it. Using the alignments defined previously, we can specify a set of edit trajectories, denoted A(U_h, G), indexed by permutations σ ∈ S_M, where M is the length of the edit trajectories (App. A.3). An edit trajectory in A(U_h, G) is written e_σ1, ..., e_σM, and [e_σ1, ..., e_σM](U_h) = G for all σ. We then assume that most alternative trajectories (for example, the infinitely many trajectories that insert and delete the same word) have low probability, and restrict the sum to the trajectories in A(U_h, G). For convenience, we write X_σk = [e_σ1, ..., e_σk](U_h). This gives an upper bound of the form (App. A.5)

−log π_θ(G | S_h) ≤ E_σ,k [ (1/n_k) Σ_{e_σ(k+1)} −log π_θ(e_σ(k+1) | X_σk, S_h) ],

where the two expectations are over uniform distributions and n_k is the number of terms in the sum. The first expectation ranges over prefixes of edit sequences e_σ1, ..., e_σk for σ ∈ S_M, and the sum ranges over distinct edits e_σ(k+1). In words, this objective says that we select a random number k of edits, in a random order σ, that move us toward G, giving us the intermediate draft X_σk. We then try to predict the next edit that brings us closer to G. We evaluate this objective stochastically as described in Algorithm 2 (App. A.5).
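The stochastic evaluation of this objective can be sketched as follows, with edits modeled as plain callables on documents (an assumption for illustration; the real model predicts edits conditioned on X_σk and S_h):

```python
import random

def sample_training_example(U_h, aligned_edits, rng=random):
    # Sample a random prefix of a random edit ordering: pick an order sigma
    # over the M aligned edits and a prefix length k, form the intermediate
    # draft X_k = [e_s1 ... e_sk](U_h), and return it with the remaining
    # edits -- the targets the model should still predict from X_k.
    M = len(aligned_edits)
    sigma = list(range(M))
    rng.shuffle(sigma)            # random order over the M aligned edits
    k = rng.randrange(M + 1)      # random prefix length k in 0..M
    X_k = U_h
    for i in sigma[:k]:
        X_k = aligned_edits[i](X_k)
    remaining = [aligned_edits[i] for i in sigma[k:]]
    return X_k, remaining
```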

Decoding
When decoding from the autoregressive model, we use standard algorithms such as beam search. For the token editing model, we sequentially sample edits from the model until the model's probability of stopping reaches a given threshold, or until we exceed a maximum number of decoding steps. If we hit the timeout, we return the draft that had the highest stopping probability according to the model. This ensures that the agent returns a draft it is highly confident is a finished document. For example, if the model were editing "the man" into "the man and the dog", we would not want to stop and return the draft "the man the dog", where the model has not yet added the word "and". Algorithm 4 outlines this procedure (App. A.7).
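This decoding loop can be sketched as below; the model interface (`stop_prob`, `sample_edit`) is assumed for illustration:

```python
def edit_decode(model, draft, stop_threshold=0.9, max_steps=32):
    # Sample edits until the model's stop probability crosses a threshold;
    # on timeout, return the intermediate draft the model was most
    # confident is a finished document.
    best_draft, best_stop = draft, model.stop_prob(draft)
    for _ in range(max_steps):
        if model.stop_prob(draft) >= stop_threshold:
            return draft
        draft = model.sample_edit(draft)       # apply one sampled edit
        if model.stop_prob(draft) > best_stop:
            best_draft, best_stop = draft, model.stop_prob(draft)
    return best_draft                          # timeout: most finished draft
```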

Experiments
Because of the novelty of our task, the main goal of our experiments is to assess the benefit of interactivity in text generation. We also provide example interactions between the learned agent and the user simulator. Ablations and additional details on experimental settings are in the appendix.

Setup
Data We consider single sentences from CNN/DailyMail article summaries (Nallapati et al., 2016). While we only consider single sentences for ease of implementation, our models extend to full documents in a straightforward way. Using news articles gives complex and factually grounded sentences, while restricting our attention to the summaries keeps the sentences self-contained. This dataset forms the distribution over goal documents G.

[Figure 4 caption fragment: #episodes=1 means the entire budget of 6 oracle edits is given to the model all at once (i.e., no interactivity); for #episodes=2, the model receives 3 oracle edits per episode (2×3=6); for #episodes=3, the model receives 2 oracle edits per episode (3×2=6).]
Model We implement our models using pretrained transformer models, with additional MLP heads to predict the edit actions in the case of the token editing model.

Metrics
We evaluated generation using BLEU (Papineni et al., 2002), CHRF (Popović, 2015), BLEURT (Sellam et al., 2020), BARTSCORE (Yuan et al., 2021), and ROUGE (Lin, 2004). As BARTSCORE returns a score interpretable as a log-probability, we report the natural exponent of that score. For both BLEU and ROUGE, we evaluate with their unigram versions (BLEU-1 and ROUGE-1), as they provide interpretability of the results at the token level.

Interactivity
To probe whether interactivity helps, we compared the performance of our model given different levels of interactivity. Concretely, we take a fixed number N of edits provided by the user, and compare the performance of the model when those edits are provided over a varying number of episodes M. Thus, for a given M, the user and agent interact over M episodes, with the user making N/M edits in each episode. Note that the total amount of information provided by the user is thus the same in every setting. The only difference is that in interactive settings the agent is able to make changes in between receiving edits from the user. While this setup cannot probe the advantages that interactivity provides in terms of human-computer interaction, we still expect to see better performance in the interactive case. For example, the model may be able to preempt some of the edits the user would make, allowing the user to spend the budget on other edits.

Results
Automatic evaluation Figure 4 presents our main results on interactivity. For our main model, splitting the same number of user edits over more episodes leads to better scores across BLEU, BLEURT, and BERTSCORE. For example, comparing the setting where the model receives all 6 user edits at the start in a single episode against the setting where the edits are spread across 3 episodes, we see absolute improvements of about 7%, 4%, and 5% in BLEU, BLEURT, and BERTSCORE respectively. While these differences are not large in absolute terms, we emphasize that this gain comes exclusively from interactivity: the amount of information given to the model is the same in both settings. This suggests that, even in this narrow sense, interactivity helps. Note that there may also be many more benefits from a human-computer interaction standpoint that we cannot quantify here. We also note that our token editing model (Editor) outperforms the left-to-right, autoregressive sequence-to-sequence (S2S) model. While the difference is not staggering, it is notable given the success of standard S2S models across a variety of text generation tasks. As motivated in Section 4, the token editing model is more naturally suited to the task, and we argue that it constitutes the starting point for more interesting interactive models. For example, one could foresee training the model with reinforcement learning, in which case the editing model's structure of explicit editing actions is better suited than an S2S model.

Human evaluation
We conducted a human evaluation (Table 1) to compare interactive and non-interactive generation, similar to the automatic evaluation in Figure 4. We selected the best-performing model (BART Large) and compared 3-episode generation against 1-episode generation on 1k test instances. Each instance was evaluated by 5 Mechanical Turk judges. The judges were asked to compare each pair of generations for semantic similarity to the gold text ("meaning") and linguistic quality ("fluency") using a 5-point Likert scale. Table 1 shows that interactivity improves performance, as the interactive system is generally semantically closer to the target (significant at p < 1e−7). The interactive system exhibits slightly lower fluency on average, although the level of significance is weaker here (p = .037). We hypothesize this slight decrease in fluency is due to the multiple rounds of generation.
Ablations We provide extensive ablations of model variants. As a benchmark for comparison, we look at the quality of the text produced by our models after interacting with the user over several episodes. For better comparisons we use a fixed number of episodes and a fixed number of user edits per episode: 3 edits and 4 episodes. Tables 2 and 3 present our ablations. Note that these results use 4 episodes with 3 user hints per episode (12 user hints total), compared to the 6 total user hints in Figure 4, so the overall results are higher. The baseline model is trained with a noise level of σ = 0.3, an unrestricted user, and a sampling annealing rate of 0.9; the sampling annealing rate indicates how quickly we anneal from the expert to the trained model while sampling (lower is faster). All models in Table 2 were evaluated with the adjacent and contiguous user heuristics. Table 3 compares different user heuristics at test time.
We note that adding noise during training improves results (e.g., noise level 0.1 vs. 0.0), while annealing too fast can hurt performance (annealing rate 0.8 vs. baseline). Interestingly, training with a user that better matches the inference-time user leads to worse performance (e.g., adj+contig vs. baseline). It seems that using the most informative user (which simply returns the most informative edits, without restriction) leads to the best model (baseline). Comparing different user simulators at inference time, we see that adding restrictions to the user leads to decreased scores, as expected. Interestingly, the most impactful restriction seems to be enforcing contiguous edits. We suspect that this is because contiguous edits are highly predictable: for example, predicting "Obama" after "Barack" is fairly obvious. Thus, if the user didn't provide contiguous edits and only inserted the word "Barack", the model could score an easy win by predicting "Obama".
Examples Table 4 provides two examples of interactions between the agent and the user. We emphasize that these examples were not cherry-picked. Note how the agent is able to revise its previous version of the text based on the new information provided by the user at each step. Qualitatively, these interactions could be much more natural for a user than the one-shot setting that is prevalent in the literature. However, a systematic evaluation of this claim requires a more comprehensive user study that lies outside the scope of this work.

Related Work
Text Generation Prior work on natural language generation (NLG) has largely focused on non-interactive settings that have become increasingly more challenging and open-ended, e.g., with generation from prompts (Fan et al., 2019), outlines (Rashkin et al., 2020), topics or keywords (Ghazvininejad et al., 2016; Yan, 2016), plots (Riedl and Young, 2010), descriptions (Jain et al., 2017), and events (Martin et al., 2017). This increase in difficulty can make NLG more prone to content quality issues, such as hallucinations (Wiseman et al., 2017; Filippova, 2020; Çelikyilmaz et al., 2020; Ji et al., 2022), that can require post-editing from the user. Several works explored ways for LLMs to improve their outputs by iteration and self-critiquing (Huang et al., 2022; Gou et al., 2023; Welleck et al., 2022). In particular, Welleck et al. (2022) presented models for text generation and self-correction that also incorporated external feedback. However, in their case the feedback is used at training time to learn a corrector. In our task, the user feedback comes at inference time, and the agent must use that feedback to guess what the user would like to generate.
Non-Autoregressive Generation Several works have considered non-autoregressive text generation (Gu et al., 2019; Shen et al., 2020; Xu and Carpuat, 2021; Stern et al., 2019; Zhang et al., 2020; Welleck et al., 2019), but these models all focus on one-shot text generation. While some models are able to edit text (Gu et al., 2019; Xu and Carpuat, 2021), editing is primarily used as a means to refine the model's own generations. We, on the other hand, consider editing text into a completely different version conditioned on a given set of user-provided edits.
Text Editing Text editing has previously been considered from two different angles. On the one hand, various works (Zhang et al., 2019; Du et al., 2022b) have studied the types of revisions made by humans. On the other hand, works have focused on modeling text edits (Guu et al., 2018; Yin et al., 2018; Faltings et al., 2021; Akoury et al., 2020), but they have generally been restricted to modeling a single episode of user edits at a time. In our framework, model edits and user edits are interleaved. We note that Du et al. (2022a) presented an interactive revision system, but their model was nevertheless trained on a corpus of edits, rather than in an interactive environment as in our case. The recent versions of GPT-3 (Brown et al., 2020; Ouyang et al., 2022) and ChatGPT also offer text editing capabilities. For example, ChatGPT can handle a prompt containing a piece of text and an instruction to improve it, and its ability to execute that command is often quite impressive. The ability of ChatGPT to interact with users was somewhat explored by Bang et al. (2023), although not in the context of text editing. We think our work is complementary to GPT-3 and ChatGPT, as we provide a framework for both modeling and evaluating edit models in a more end-to-end setting. Our model of interaction between user and system, where both parties edit the text, may also be more natural than dialogue. Ultimately, we think it will be beneficial to fine-tune very large language models such as GPT-3 in an environment that exposes them to interaction with a user or a user simulator (i.e., akin to user-in-the-loop training). This benefit is currently somewhat captured by using reinforcement learning (RL) to tune large language models from human feedback (Ouyang et al., 2022), except that our approach features an actual environment representative of the end-user experience, while Ouyang et al. (2022) is more akin to offline RL.
Interactivity Several works have explored interactivity between humans and models to complete tasks collaboratively. Lee et al. (2022) presented a dataset to reveal large language models' capabilities in assisting creative writing. Zhang et al. (2022) and Li et al. (2022) explored pre-trained language models for code editing based on human-written comments. Closer to our work, Lahiri et al. (2022) created an interactive framework to refine user intents through test case generation and user feedback, and Kiseleva et al. (2022) studied interactive agents that follow natural language instructions to achieve goals in Minecraft. Finally, interactivity has also been studied from a Human-Computer Interaction viewpoint (Clark et al., 2018; Shen and Wu, 2023).

Conclusions
We presented a new task and benchmark for text generation that operationalizes interactivity between an agent and a user, without the need to involve real users during training. Our framework compares interactive and non-interactive systems given the same amount of user input, and shows that interactive text generation leads to text of higher quality according to multiple automated evaluation metrics and human evaluations. We also presented a non-autoregressive editing model that outperforms a standard sequence-to-sequence Transformer in various settings. All baselines, data, and models of this paper will be made public.

Limitations
A long-term goal of this research is to enable interactive editing of full documents.For practical reasons and to facilitate adoption, we limited text length to 64 tokens, but we plan to extend our benchmark and released datasets to also include paragraph-level and multi-paragraph texts.Another limitation of our work is that training is done with simulated instead of real users, as training with users in the loop can be extremely slow and costly.
To make our approximation of user behavior realistic, our work relies on user inputs that real-world users perform routinely (i.e., word insertions, deletions, and substitutions), even in settings that are not assisted by AI. However, we recognize that the behavior of real users in an AI-user collaborative setting may differ from that of our user simulators, and we leave the study of such behavior for future work. Finally, while the need for interactivity applies to any kind of AI creation (e.g., also code and images), we leave the application to other generation tasks for future work. We note, however, that this paper treats text as simple sequences of symbols, and our work could readily be applied to other symbolic creations (e.g., code generation).

Ethics Statement
Text generation systems, including those relying on large language models, run the risk of generating text that is unsafe (Bender et al., 2021; Bommasani et al., 2021; Weidinger et al., 2021), and this also applies to the generation models developed in this work. While we have not observed generations that are overtly toxic or hateful, our models can generate texts that are biased and offensive to some readers. As our work focuses on the generation of non-fictional texts (in contrast to prior work on story generation), our models also run the risk of generating text that is factually incorrect. However, the focus of our research is to provide interactive capabilities to generation systems, and to put them more under the user's control. As illustrated in Figure 1, a back-and-forth between system and user can make the generated text more factual, and the same kind of interaction can also help increase its safety. Such sanitization of generated texts in our framework may still expose users to unsafe content, so it is still recommended to use current safeguards against hallucinations (Ji et al., 2022), biases (Dhamala et al., 2021; Weinberg, 2022), and other offensive content (Gehman et al., 2020; Welbl et al., 2021; Kiritchenko et al., 2021; Jones et al., 2022; Xu et al., 2022) before displaying text to real users. In that sense, we think our work is complementary to current NLG research on hallucination and safety.

A.3 Alignments
Given two documents x and y, an alignment gives us a convenient way to find a sequence of edits e_1, ..., e_N such that [e_1...e_N](x) = y. We use this to define the user simulator behavior, as well as to train our token edit model. Loosely speaking, an alignment between x and y is a partial mapping between words in x and words in y. More formally, it is an undirected bipartite graph with vertex sets V_x = {x_1, ..., x_L} and V_y = {y_1, ..., y_L} and edges E. The edges match words in x to words in y, where some words in x or y may remain unmatched. A monotonic alignment is one where the edges in the graph do not cross.
For our purposes it is convenient to represent alignments in a particular way. Rather than a graph, an alignment between x and y will be a pair (x̄, ȳ) of sequences of length 2L, where x̄ (resp. ȳ) contains x (resp. y) as a subsequence and otherwise contains only blank tokens. See Figure 2 for an example. Each pair (x̄_i, ȳ_i), i = 1, ..., 2L, is then interpreted as an operation: pairs of blanks (_, _) are ignored; a pair with a blank only in ȳ is a deletion; a pair with a blank only in x̄ is an insertion; and a pair with no blanks is a match or a substitution. Note that the positions where neither x̄_i nor ȳ_i is blank give a monotonic alignment. The converse is not true, because many pairs (x̄, ȳ) correspond to the same alignment: we can rearrange the order of the positions corresponding to insertions and deletions without affecting the alignment. We therefore require all insertions to come before deletions, so that monotonic alignments between x and y correspond one-to-one to pairs (x̄, ȳ).
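As a concrete illustration, reading off operations from a blank-padded pair can be sketched as follows (the helper name and token values are illustrative, not from the paper's implementation):

```python
BLANK = "_"

def operations(x_bar, y_bar):
    """Read off one operation per aligned position of (x_bar, y_bar)."""
    assert len(x_bar) == len(y_bar)
    ops = []
    for xi, yi in zip(x_bar, y_bar):
        if xi == BLANK and yi == BLANK:
            continue                        # pairs of blanks are ignored
        elif xi == BLANK:
            ops.append(("insert", yi))      # word only present in y
        elif yi == BLANK:
            ops.append(("delete", xi))      # word only present in x
        elif xi == yi:
            ops.append(("keep", xi))        # matched, unchanged word
        else:
            ops.append(("substitute", xi, yi))
    return ops

# Example: "the red car" -> "the blue big car"
print(operations(["the", "red", "_", "car"], ["the", "blue", "big", "car"]))
```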
Given x and y, there may be many possible alignments. We therefore define a score on alignments and choose an alignment that maximizes it. Given an alignment (x̄, ȳ), the score is

score(x̄, ȳ) = Σ_{i=1}^{2L} s(x̄_i, ȳ_i | x, y),

where s(x̄_i, ȳ_i | x, y) scores the pair (x̄_i, ȳ_i) given the contexts x and y. If both x̄_i and ȳ_i are blank, the score is 0. If only one is blank, we assign a baseline score b. If neither is blank, then x̄_i corresponds to some x_k in x, and similarly ȳ_i corresponds to some y_l in y.
We use the cosine similarity between the BERT embeddings of x_k and y_l as their score. We found that using BERT embeddings gives more natural alignments, where words of similar meaning or function are matched together. For example, consider the two single-word documents "red" and "blue". It makes more sense to treat the change from "red" to "blue" as a substitution rather than as a deletion followed by an insertion. On the other hand, changing "red" to "car" makes less sense as a substitution. This score can be efficiently optimized using dynamic programming (see, for example, Gale et al. (1994); Smith et al. (1981); Needleman and Wunsch (1970) for more details).
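A minimal sketch of this dynamic program, with a tiny lookup table standing in for BERT embeddings (the vectors, the baseline score b, and the function names are our own illustrative choices):

```python
import numpy as np

# Toy stand-in for BERT embeddings; a real implementation would embed in context.
VECS = {"the": [0.5, 0.5], "red": [1.0, 0.2], "blue": [1.0, 0.1], "car": [0.0, 1.0]}

def embed(w):
    return np.array(VECS.get(w, [0.3, 0.3]))

def align(x, y, embed, b=0.1):
    """Max-score monotonic alignment as a blank-padded pair (x_bar, y_bar)."""
    def s(u, v):  # cosine similarity score for a non-blank pair
        eu, ev = embed(u), embed(v)
        return float(eu @ ev / (np.linalg.norm(eu) * np.linalg.norm(ev)))
    n, m = len(x), len(y)
    S = np.full((n + 1, m + 1), -np.inf)
    S[0, 0] = 0.0
    back = {}
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and S[i - 1, j] + b > S[i, j]:
                S[i, j] = S[i - 1, j] + b
                back[(i, j)] = (i - 1, j)          # x[i-1] left unmatched
            if j > 0 and S[i, j - 1] + b > S[i, j]:
                S[i, j] = S[i, j - 1] + b
                back[(i, j)] = (i, j - 1)          # y[j-1] left unmatched
            if i > 0 and j > 0 and S[i - 1, j - 1] + s(x[i - 1], y[j - 1]) > S[i, j]:
                S[i, j] = S[i - 1, j - 1] + s(x[i - 1], y[j - 1])
                back[(i, j)] = (i - 1, j - 1)      # match / substitute
    x_bar, y_bar = [], []                          # trace back the best path
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[(i, j)]
        if (pi, pj) == (i - 1, j - 1):
            x_bar.append(x[i - 1]); y_bar.append(y[j - 1])
        elif (pi, pj) == (i - 1, j):
            x_bar.append(x[i - 1]); y_bar.append("_")
        else:
            x_bar.append("_"); y_bar.append(y[j - 1])
        i, j = pi, pj
    return x_bar[::-1], y_bar[::-1]
```

With these toy vectors, "red" and "blue" are similar enough that the aligner prefers a substitution over an unmatched deletion plus insertion, matching the intuition above.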
Given an alignment (x̄, ȳ), the set of indices A = {i : x̄_i ≠ ȳ_i} naturally gives a set of edits. We can split A = I ∪ D ∪ S into the union of indices corresponding respectively to insertions, deletions, and substitutions. These edits can be performed in any order, so the alignment gives a whole set of edit sequences, one for each permutation of A. For any permutation σ ∈ S_M of A, where M = |A| is the size of A, there is a sequence of corresponding edits e^σ_1, ..., e^σ_M. The delicate part about constructing these edits is determining their location parameters, since these depend on the order σ. We will need the following two quantities:

I^σ_i = |{j < i : A_σ(j) < A_σ(i), A_σ(j) ∈ I}|,
D^σ_i = |{j < i : A_σ(j) < A_σ(i), A_σ(j) ∈ D}|.

These correspond to the number of insertions and deletions that come before edit e^σ_i and which affect its location parameter. Let B^σ_i = |{j < A_σ(i) : x̄_j = _}| be the number of blanks before position A_σ(i) in x̄. Then, for i = 1, ..., M, the location parameter is

l^σ_i = A_σ(i) − B^σ_i + I^σ_i − D^σ_i.

Note that this just keeps track of where the edit needs to be made in the document [e^σ_1, ..., e^σ_{i−1}](x) after applying the first i − 1 edits. We then treat the three operations separately, reading the operation type and, for insertions and substitutions, the new word off the pair (x̄, ȳ) at position A_σ(i). In summary, for documents x and y we can compute a unique alignment (x̄, ȳ); we then denote the resulting set of edit sequences {e^σ_1, ..., e^σ_M : σ ∈ S_M} by A(x, y).
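One simple member of A(x, y), the left-to-right edit sequence, can be sketched as follows (the helper names and the (op, location, word) encoding are illustrative assumptions):

```python
BLANK = "_"

def edits_from_alignment(x_bar, y_bar):
    """Left-to-right edit sequence (op, location, word) taking x to y."""
    edits = []
    loc = 0  # current position in the partially edited document
    for xi, yi in zip(x_bar, y_bar):
        if xi == BLANK and yi == BLANK:
            continue
        if xi == BLANK:                              # insertion
            edits.append(("insert", loc, yi)); loc += 1
        elif yi == BLANK:                            # deletion
            edits.append(("delete", loc, xi))        # loc unchanged: word removed
        elif xi == yi:
            loc += 1                                 # matched word, no edit
        else:                                        # substitution
            edits.append(("substitute", loc, yi)); loc += 1
    return edits

def apply_edits(edits, doc):
    """Apply a sequence of edits to a document (list of tokens)."""
    doc = list(doc)
    for op, l, w in edits:
        if op == "insert":
            doc.insert(l, w)
        elif op == "delete":
            doc.pop(l)
        else:
            doc[l] = w
    return doc
```

Traversing left to right means no earlier edit ever shifts a later location to its left, which makes the bookkeeping above trivial; for an arbitrary permutation σ, the I^σ_i and D^σ_i corrections play this role.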
where the sum is over the set {e : e^σ_k = e for some σ ∈ S_M}.
We then fold the log π_θ(stop | a, S_h) term into the expectation over k. This gives the objective, where the expectation is now over k = 1, ..., M + 1, with e^σ_{M+1} = stop for all σ ∈ S_M.

A.6 Denoising
In practice, the lower bound derived above may be very loose because we only consider edits in A(U_h, a). As a consequence, the learned policy π_θ will inevitably find itself out of distribution at inference time when it makes mistakes.
To make the model robust, we leverage denoising training (Lee et al., 2018). To do so, we notice that in Algorithm 2 we are essentially sampling a trajectory of edits e_1, ..., e_M such that [e_1 ... e_M](U_h) = a, and then randomly sampling a point along this trajectory. Instead, we sample noisy trajectories of edits, in which we intersperse random edits, as in Algorithm 3. This simulates mistakes that the trained policy might make at inference time. Given a random prefix of edits e_1, ..., e_{k−1}, we get a noised, intermediate document x̃_{k−1}. We compute the same loss as before over the set of edits {e : τ_1 = e, τ ∈ A(x̃_{k−1}, a)}. This is the set of first elements of edit trajectories from x̃_{k−1} to a, and we denote it by A(x̃_{k−1}, a)_1. The objective is as before, where now n_k = |A(x̃_{k−1}, a)_1|. Intuitively, this is like using a noisy roll-in policy to get to x̃_{k−1}, and then matching an expert policy that can produce a.
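The noisy roll-in can be sketched as follows, treating edits as opaque objects and leaving the loss computation aside (the sampling details here are our own simplification of Algorithm 3):

```python
import random

def noisy_prefix(gold_edits, random_edit, noise=0.3, rng=random):
    """Intersperse random edits, then cut at a random point on the trajectory."""
    traj = []
    for e in gold_edits:
        if rng.random() < noise:
            traj.append(random_edit())    # simulated policy mistake
        traj.append(e)                    # gold edit from Algorithm 2
    k = rng.randrange(len(traj) + 1)      # random prefix length
    return traj[:k]
```

The resulting prefix yields the noised document x̃_{k−1}; the gold trajectory is then recomputed from x̃_{k−1} to the target a before taking the loss.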

Decoding
To decode from the token editing model we use ancestral sampling with a few modifications. First, when sampling an edit (or the stopping action) from π_θ, we only sample from the top-k edits. Just as for autoregressive models, we found that this improves generation quality, since low-probability edits are usually of poor quality and can derail the decoding process. For long sequences of edits, the probability of sampling a bad edit also becomes non-negligible.
The model also has a tendency to stop early. Again, this is because the probability of stopping erroneously increases as we sample more and more edits. In contrast to other types of mistakes, stopping early is especially bad because there is no way to recover, whereas other mistakes can be fixed with another edit later on. Therefore, we explicitly avoid sampling the stopping action, and instead decode edits until either the model's stopping probability exceeds a threshold or we reach a maximum number of edits. We then return the document that had the highest stopping probability. The whole decoding procedure is given in Algorithm 4.
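The decoding loop can be sketched as follows, where `policy` is a hypothetical callable returning a stopping probability and scored edits (Algorithm 4 may differ in its details):

```python
import random

def decode(doc, policy, apply_edit, k=5, stop_threshold=0.9, max_edits=32,
           rng=random):
    """Sample edits from the top-k until stopping; never sample `stop` itself."""
    best_doc, best_stop = doc, 0.0
    for _ in range(max_edits):
        stop_prob, edit_probs = policy(doc)
        if stop_prob > best_stop:                 # remember the best draft
            best_doc, best_stop = doc, stop_prob
        if stop_prob >= stop_threshold:           # confident enough to stop
            break
        top = sorted(edit_probs, key=lambda p: -p[1])[:k]   # top-k edits only
        total = sum(p for _, p in top)
        r, acc = rng.random() * total, 0.0
        for edit, p in top:                       # sample from renormalized top-k
            acc += p
            if acc >= r:
                doc = apply_edit(doc, edit)
                break
    return best_doc
```

Returning the draft with the highest stopping probability, rather than the last one, guards against the run ending on a low-quality intermediate state.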

B.1 Data
We use version 3.0.0 of the CNN/DailyMail dataset from the HuggingFace dataset hub (Lhoest et al., 2021). We take the article summaries, which we split into sentences and then filter to sentences with fewer than 64 tokens, as determined by the BART tokenizer from HuggingFace (Wolf et al., 2020). We then split the data into train, test, and validation sets with approximately 1M, 55K, and 45K instances, respectively.
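The filtering step can be sketched as follows, with a whitespace tokenizer and a naive sentence splitter standing in for the BART tokenizer and the actual splitter (both stand-ins are our assumptions):

```python
import re

def prepare(summaries, max_tokens=64):
    """Split summaries into sentences and keep those under max_tokens tokens."""
    sentences = []
    for summary in summaries:
        # naive sentence split; the paper does not specify the splitter used
        for sent in re.split(r"(?<=[.!?])\s+", summary.strip()):
            if sent and len(sent.split()) < max_tokens:
                sentences.append(sent)
    return sentences
```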

B.2 User Simulator Heuristics
The environment of our contextual MDP is determined by the behavior of the user simulator. We consider the following methods for generating user edits:

Ranking Given a set of edits, we rank them based on their informativeness. As a measure of informativeness, we use IDF scores.
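A minimal sketch of IDF-based ranking (the corpus, the edit encoding, and the function names are illustrative):

```python
import math
from collections import Counter

def idf_scores(corpus):
    """IDF of each word over a corpus of tokenized documents."""
    df = Counter(w for doc in corpus for w in set(doc))
    n = len(corpus)
    return {w: math.log(n / df[w]) for w in df}

def rank_edits(edits, idf, default=0.0):
    """Most informative (highest-IDF) edits first; edits end in their word."""
    return sorted(edits, key=lambda e: -idf.get(e[-1], default))

corpus = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "cat", "ran"]]
idf = idf_scores(corpus)
# "the" appears in every document (IDF 0), so "dog" ranks first
print(rank_edits([("insert", 0, "the"), ("insert", 1, "dog")], idf))
```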
Adjacent Edits If the user simulator simply returns the most informative edits, it will have a tendency to make edits in disparate parts of the text. For example, it might return keywords from the end of the document when the current draft only covers the start. In a realistic setting, users make edits related to the current draft; they do not preempt parts of the document that have not yet been drafted. Thus, we limit the user simulator to producing adjacent edits, where an edit is adjacent if it sits next to a match in the alignment to the target.
Contiguous Edits While adjacency keeps the user simulator's edits relevant to the current draft, the simulator may still produce a disconnected set of edits. Instead, we limit it to producing only contiguous edits.
Complete Words Finally, since the user simulator operates on tokenized text, and the tokens may break up words, we also constrain its edits to complete words.
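The contiguity constraint above can be sketched as follows (selecting the longest contiguous run of candidate edit positions is our own illustrative choice, not necessarily the paper's rule):

```python
def contiguous_block(positions):
    """Return the longest contiguous run among the given edit positions."""
    best, cur = [], []
    for p in sorted(positions):
        if cur and p == cur[-1] + 1:
            cur.append(p)          # extends the current run
        else:
            cur = [p]              # start a new run
        if len(cur) > len(best):
            best = list(cur)
    return best
```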

B.3 Models and Training
Implementation Models were implemented with Transformer architectures with additional MLP layers on top. Specifically, we used the BART-base and BART-large checkpoints made available through the HuggingFace transformers library (Wolf et al., 2020). The models have approximately 140M and 400M parameters, respectively.
For a fixed document x and state s, the token editing model parametrizes a distribution over edits along with a stopping action. We refer to their union as edit actions, which are 4-tuples (s, l, o, w), where s ∈ {0, 1} indicates the stopping action and l, o, w are as defined for edits. The probability of an edit action e = (s, l, o, w) is parameterized as a product of four probabilities, depending on three cases. If s = 1, then π_θ(e | x, s) = π^s_θ(1 | x, s).
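Under one reading of the three cases (stop; delete, which needs no word; insert or substitute, which do), the factorization can be sketched as follows (all names and the case split are our own assumptions):

```python
import math

def edit_action_logprob(action, p_stop, p_loc, p_op, p_word):
    """Log-probability of an edit action (s, l, o, w) as a product of factors."""
    s, l, o, w = action
    if s == 1:                                  # case 1: stopping action
        return math.log(p_stop)
    lp = math.log(1.0 - p_stop) + math.log(p_loc[l]) + math.log(p_op[o])
    if o == "delete":                           # case 2: deletion, no word factor
        return lp
    return lp + math.log(p_word[w])             # case 3: insert / substitute
```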

Model Inputs
The history of drafts S_h could become prohibitively large, so we keep track of only the last three drafts, denoted U_{h−1}, A_{h−1}, and U_h. This allows the agent to reason about the edits it made at the last step and how the user responded. Additionally, we track all words inserted by the user over the entire history. This is easy to do by marking the relevant words, and it allows the agent to know which words in the current draft came from the user. In practice, all these features are represented as a diff. See Figure 2 for an illustration, where the coloring and markings indicate which words were inserted or deleted by the agent or the user. Practically, we implement this by adding labels to the tokens in the draft U_h. We can easily track these labels for the user and token edit models because they operate by making edits: if the model or user deletes a word, we can mark it as deleted. For the Seq2Seq model this is not possible, because it returns an entirely new draft; for example, we could not tell whether a word was deleted or substituted for another. Instead, we compute the alignment between U_{h−1} and A_{h−1}, from which we can read off which tokens were inserted, deleted, or substituted. Because words inserted by the user should likely never be deleted by the agent, we keep the labels on all words inserted by the user so that they persist throughout the entire interaction.
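The persistence of user-inserted marks can be sketched as follows, representing tokens as (word, from_user) pairs (an illustrative encoding, not the paper's implementation):

```python
def mark_user_insertions(draft, user_edits):
    """Apply user edits to a draft of (word, from_user) pairs, marking them."""
    draft = list(draft)
    for op, loc, word in user_edits:
        if op == "insert":
            draft.insert(loc, (word, True))    # user-inserted: mark persists
        elif op == "substitute":
            draft[loc] = (word, True)
        elif op == "delete":
            draft.pop(loc)
    return draft
```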
Training Models were trained on a single 16GB V100 GPU for 600 iterations, with a sampling batch size of B = 10^4 and a sampling annealing rate of λ = 0.9. We also used 300 warmup iterations (during which λ was not annealed). For the token editing model we used a noise level of σ = 0.3.

Figure 1: The ITG framework of this paper allows generation models to be trained and evaluated by directly interacting with users or user simulators.

Figure 4: Results showing the benefit of interactive generation, where #episodes=1 means the entire budget of 6 oracle edits is given to the model all at once (i.e., no interactivity). For #episodes=2, the model receives 3 oracle edits per episode (2×3=6). For #episodes=3, the model receives 2 oracle edits per episode (3×2=6).

Figure 2: Model architecture: the current draft repeatedly goes through two phases. First, it is edited with the edit model (either a non-autoregressive token edit model or a standard seq2seq Transformer), and then it is annotated with edits coming from the user simulator. While the user simulator has access to the gold document, all our generation systems (interactive and non-interactive) are evaluated with the same budget of user inputs.

Table 2: Ablation results with different training hyperparameters. All ablation results are relative to the baseline. BART stands for BARTSCORE. The noise level is the amount of noise injected during training, and the user is the (sole) heuristic used for the user simulator during training.
Table 3: Ablation results with different oracles changed at test time.
oracle: Progress comes after Boeing 59 spacecraft will re-enter NASA's spacecraft, Russia space agency says.
agent: Progress comes as Boeing's after Boeing 59 spacecraft will re-enter Earth NASA spacecraft, Russia's space agency says.
oracle: The Large Hadron Collider (LHC) will end a two-year shutdown.
agent: The Large Hadron Collider (LHC) will end a end a two-year shutdown.

Table 4: Example interactions between the oracle and the agent. The oracle's changes are shown in red; deleted words are crossed out. Note that these are randomly selected examples (i.e., without cherry-picking), with the only curation being the rejection of potentially offensive examples.