IGA: An Intent-Guided Authoring Assistant

While large-scale pretrained language models have significantly improved writing assistance functionalities such as autocomplete, more complex and controllable writing assistants have yet to be explored. We leverage advances in language modeling to build an interactive writing assistant that generates and rephrases text according to fine-grained author specifications. Users provide input to our Intent-Guided Assistant (IGA) in the form of text interspersed with tags that correspond to specific rhetorical directives (e.g., adding description or contrast, or rephrasing a particular sentence). We fine-tune a language model on a dataset heuristically-labeled with author intent, which allows IGA to fill in these tags with generated text that users can subsequently edit to their liking. A series of automatic and crowdsourced evaluations confirm the quality of IGA’s generated outputs, while a small-scale user study demonstrates author preference for IGA over baseline methods in a creative writing task. We release our dataset, code, and demo to spur further research into AI-assisted writing.


Introduction
Writing can be both an exhilarating, creative experience and a frustrating slog. Can recent advances in neural language modeling help improve the human writing experience? Pretrained Transformer language models (Radford et al., 2019) have improved writing aids such as email "autocomplete", while commercial tools such as Grammarly and Microsoft Editor can rewrite full sentences to increase clarity. 1 Few existing writing assistants provide support for the underlying cognitive process of writing (Greer et al., 2016). In this paper, we explore more advanced writing assistance functions: specifically, we build an authoring assistant capable of following fine-grained user directives (e.g., add descriptive text, use idiomatic language, or paraphrase a clunky bit of wording). Our system, the Intent-Guided Assistant (IGA), combines controllable text generation with text infilling (Zhu et al., 2019; Keskar et al., 2019a; Lewis et al., 2020; Donahue et al., 2020); more specifically, we adapt the tag-based control of Keskar et al. (2019b) to include a set of rhetorical directives that our model learns to infill with relevant and fluent text. Our system can handle the following author-guided tags: cause, effect, concession (contrast), description, biography, idiom, and rephrase. User input to IGA can be as simple as a list of keywords and does not have to include well-formed text (Figure 1).
* Most of the work done during an internship at Adobe.
1 As these systems are not open-sourced, it is unclear how exactly they are implemented.
Figure 1: General concept overview of our Intent-Guided Authoring Assistant (IGA). Given context, by specifying different writing intents, the system generates output satisfying the intent. In addition to well-formed sentence fragments, keywords can also be part of user input, serving as arguments for the intents, and are preserved in the output.
We train IGA in a supervised fashion by creating a large multi-domain dataset in which spans corresponding to particular directives are replaced with a single tag: for the input "It was raining <description> trees", the ground-truth completion could be "It was raining, the trees were swaying and the wind was oppressive." To build our dataset, we use heuristics based on lexical and syntactic choice to isolate spans corresponding to each directive. For the above example, we extract the first simple declarative clause, then highlight the span that contains words (such as "oppressive") in a large list of adjectives, and finally extract keywords such as "trees" using a keyword extractor. At inference time, our model can flexibly take any tag as input: given "It was raining <contrast> trees", for example, our model inserts a contrastive clause to produce "It was raining but still the trees were not wet".
To evaluate the effectiveness and usability of our AI-assisted writing paradigm, we design IGA to be interactive, in the spirit of human-AI coauthoring. In addition to automatic and crowdsourced evaluations that demonstrate IGA's output quality, we perform a user study in which participants make use of our system for creative storytelling (Section 6). Our results show most users prefer writing with assistance from IGA compared to writing from scratch or with a non-controllable infilling model.
Our contributions are as follows:
1. We design IGA, an authoring assistant capable of controlled text generation based on explicit rhetorical directives specified by the author.
2. To train IGA, we create a large dataset (75M tokens) of text heuristically-labeled with author intent, sourced from multiple repositories. This dataset is made publicly available to facilitate future research on AI-enabled author assistance. 2
3. We validate the usefulness of IGA through automatic and crowdsourced evaluations as well as a user study involving creative writing.
2 Related Work

Theories of writing
Numerous studies within the humanities focus on modeling the process of effective writing (Flower and Hayes, 1981; Rohman, 1965; Grabe and Kaplan, 1998). We base the design of IGA on the widely cited and reproducible "cognitive process theory of writing" of Flower and Hayes (1981), which was made more comprehensive in the review work by Becker (2006). This theory posits that writing is a non-linear process that consists broadly of three steps: planning, translating and reviewing. The planning phase involves accessing one's knowledge of the topic and target audience to formulate a rough outline of the eventual output. The actual rendering of the text on paper is called translating, while the reviewing phase consists of making edits or revisions to the output. All of these steps happen concurrently, under the influence of a monitor. In our work, a human author and a language model jointly participate in the planning and translating phases, while the human (by means of an editable output interface) reviews and monitors the process.
2 Dataset and models can be found at https://github.com/SimengSun/IGA-An-Intent-Guided-Authoring-Assistant

Infilling language models
The actual implementation of IGA relies on controllable text infilling via language modeling. The ability of large-scale language models to generate fluent and coherent text has been demonstrated in several prior works (Radford et al., 2019; Brown et al., 2020; Zellers et al., 2019) when given only a few words or a sentence as a prompt. More recent research has addressed the inability of these models to infill text, or insert new words/tokens between tokens that already exist (Donahue et al., 2020; Zhu et al., 2019; Huang et al., 2020; Stern et al., 2019; Welleck et al., 2019; Liao et al., 2020; Moryossef et al., 2019). In this vein, Rashkin et al. (2020) generate coherent stories given just bullet-point plot outlines, while Cai et al. (2019) perform token insertion using a retrieval engine in combination with a language model for dialogue agents. Unlike IGA, however, none of this prior work can control generation using high-level rhetorical directives specified by an author.

Controllable Text Generation
IGA conditions its generated text on tags, which has previously been done for left-to-right language models. For example, Dathathri et al. (2020) combine a large language model with an attribute discriminator to generate text that obeys certain sentiments or topics. Meanwhile, expanding the control codes proposed in Keskar et al. (2019b), Krause et al. (2020) and the Megatron-CNTRL model (Xu et al., 2020) control the output with predicted keywords. In contrast to these works, IGA focuses on fine-grained, intra-sentential controlled infilling. Previous work has also explored controlling stylistic parameters (Ficler and Goldberg, 2017) and syntactic structures (Iyyer et al., 2018; Goyal and Durrett, 2020).
Figure 2: On the left, we show how each example is constructed for fine-tuning. On the right, we show how the final output is constructed by post-processing the output of a fine-tuned GPT-2 model at inference time. Inference-time steps shown in the figure: (1) the fine-tuned GPT-2 generates output prefixed with the input and <sep>; (2) post-processing: identify tags; (3) post-processing: infill model output back to spans.

Intent-Guided Assistant
IGA extends text infilling models with fine-grained rhetorical control. Specifically, we build on the Infilling Language Model (ILM) of Donahue et al. (2020), which fine-tunes an off-the-shelf language model such as GPT-2 on a dataset of text with masked spans. To continue with our running example, the input to this model is the sequence "It was raining <mask> trees", and a target output is "and getting harder to see the". At inference time, the blanks are substituted with the words predicted by the LM and combined with the input in a post-processing step to generate the final output: "It was raining and getting harder to see the trees". Building on this framework, we fine-tune an off-the-shelf GPT-2 medium model 3 on our dataset (described in Section 4) created for generating text conditioned on author intent. Instead of replacing spans with a generic <mask> token as in ILM, we replace spans with more fine-grained tags corresponding to rhetorical directives. For fine-tuning, we concatenate a tag-replaced sentence with the ground-truth spans that should be infilled, joining the two with a special separator token <sep>, as in Donahue et al. (2020). If multiple tags are used in the input, the multiple ground-truth spans following the <sep> token are separated by special <answer> tokens. At inference time, the model is fed a tag-replaced sentence and the <sep> token, from which it generates span(s) to infill the input tags. In a post-processing step, we replace the tags with the generated answer spans. Figure 2 summarizes the fine-tuning process (left) and the inference process (right). At inference time, we use top-k sampling (Fan et al., 2018) with k fixed to 40.
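The example format and the post-processing step described above can be sketched in a few lines. This is only an illustration under assumptions: the function names `build_training_example` and `infill_tags` and the tag list in `TAG_PATTERN` are hypothetical, whitespace handling is simplified, and the substitution-style paraphrase tag is left out.

```python
import re

SEP, ANSWER = "<sep>", "<answer>"
# Assumed tag inventory for this sketch (the paraphrase tag is handled differently).
TAG_PATTERN = re.compile(
    r"<(?:cause|effect|concession|contrast|description|biography|idiom|mask)>"
)

def build_training_example(tagged_input, spans):
    """Concatenate a tag-replaced sentence with its ground-truth spans."""
    n_tags = len(TAG_PATTERN.findall(tagged_input))
    if n_tags != len(spans):
        raise ValueError(f"{n_tags} tags but {len(spans)} spans")
    # Each ground-truth span is followed by an <answer> token.
    answers = "".join(f" {span} {ANSWER}" for span in spans)
    return f"{tagged_input} {SEP}{answers}"

def infill_tags(tagged_input, generated):
    """Replace the i-th tag in the input with the i-th generated span."""
    spans = iter(s.strip() for s in generated.split(ANSWER) if s.strip())
    # If the model produced too few spans, the remaining tags are left in place.
    text = TAG_PATTERN.sub(lambda m: next(spans, m.group(0)), tagged_input)
    return re.sub(r"\s+", " ", text).strip()

if __name__ == "__main__":
    print(build_training_example(
        "This is a really good book <cause> plot <cause> .",
        ["because the", "is always so well written"],
    ))
    print(infill_tags(
        "It was raining <mask> trees",
        "and getting harder to see the <answer>",
    ))
```

In this sketch the model itself is absent: `generated` stands for whatever text the fine-tuned language model emits after the `<sep>` token.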
To sum up, unlike the standard GPT-2 model, which only supports strict left-to-right generation, both ILM and IGA are capable of text infilling. Moreover, IGA has finer control over author writing intents than ILM, which further narrows down the output generation space. Although providing GPT-2 and ILM with specific discourse markers may result in output conforming to certain intents, they are less flexible than IGA, which can sample from multiple discourse cues.

Tag      System input and output
PARA     Input: <paraphrase> The pandemic has caused very serious problems. <paraphrase>
         Output: The pandemic has brought severe economic, social and political effects that have seriously affected many countries.
BIO      Input: Oria, <biography> , mentions that technically only humans can cry in response to emotional state.
         Output: Oria, a psychologist specializing in the study of human emotion, mentions that technically only humans can cry in response to emotional state.
CAUSE    Input: This is a really good book <cause> plot <cause> .
         Output: This is a really good book because the plot is always so well written and never predictable.
EFFECT   Input: Janet had suffered a traumatic brain injury in a car accident in 1988, <effect>.
         Output: Janet had suffered a traumatic brain injury in a car accident in 1988, thus had no memory of who she was or what happened.
CNTRA    Input: The castle was built in 1865, <contrast> .
         Output: The castle was built in 1865, though a full-scale modern restoration has been underway for the past four years.
DESCP    Input: <description> individual <description> beliefs <description> wrong .
         Output: There are individual and social beliefs that should lead us to be skeptical of the facts and the wrong.
IDIOM    Input: This report only shows the <idiom> , as many details can only be uncovered if you sign the document.
         Output: This report only shows the tip of the iceberg , as many details can only be uncovered if you sign the document.

The novelty of IGA lies not in its model architecture but in the way in which we supervise it to enable controlled fine-grained text infilling. We construct the fine-tuning dataset primarily by heuristically mining the NEWSROOM corpus (Grusky et al., 2018), the largest available summarization dataset with 1.3 million news articles. We also collect partial data from ParaNMT-50M and WikiLarge (Zhang and Lapata, 2017) for the "sentence embellishment" writing intent, and from PoMo (Kang et al., 2019) to extract post-modifiers that follow an entity. Our dataset (statistics shown in Table 1) contains 75M tokens with a mean example length of 60.5 words tokenized with NLTK (Bird et al., 2009).
Choosing a collection of tags: Before we start collecting data, we conduct an internal survey with potential users of our system to determine what writing assistance functions they would most benefit from. We surveyed nine NLP researchers about their opinions on the ideal functionality of an authoring assistant. After removing simple functions such as generating synonyms, antonyms, adjectives, and adverbs, which are already implemented in existing writing assistant tools, we condense the most requested writing intents into seven tags (described in detail below; examples provided in Table 2). Only one of them (PARA) is heavily constrained by semantic content, and the rest involve open-ended generation loosely constrained by keywords and intent.

Data collection for each writing intent
CAUSE: This tag helps an author invent a reason for the occurrence of an event. Clauses with CAUSE intent usually follow words/phrases like 'because' or 'due to'. We manually extract 16 markers, many from the discourse marker list in Sileo et al. (2019), and then mine NEWSROOM (Grusky et al., 2018) to find sentences that match any of the markers. For all mined examples, we also preserve the previous sentence as the context of the matched sentence. Simple declarative clauses that start with matched discourse markers are extracted using the shift-reduce constituency parser ZPar (Zhang and Clark, 2011). The YAKE algorithm is then applied to those clauses for keyword extraction.
EFFECT: As a conjugate writing intent of CAUSE, EFFECT is used when one needs to describe a result or consequence. It usually co-occurs with words/phrases such as 'therefore' and 'as a result'. Similar to CAUSE, we manually select 15 discourse markers that signify EFFECT, mine sentences from NEWSROOM, and highlight spans of interest using the same process as before. Specifically, we extract declarative clauses based on the constituent labels returned from the parser. Then, we decide the intent according to the starting markers of those clauses. Inside each clause, we run YAKE, an unsupervised keyword extraction algorithm, to extract keywords for the clause. All markers are included in Appendix B.
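A rough sketch of the marker-based mining for CAUSE and EFFECT follows. It rests on several assumptions: a regular expression from the marker to the end of the sentence stands in for the ZPar constituency parse, keywords are passed in directly rather than extracted with YAKE, only the markers quoted in the text are listed, and the function names are hypothetical.

```python
import re

# Only the markers quoted in the text; the full lists are in Appendix B.
CAUSE_MARKERS = ["because", "due to"]
EFFECT_MARKERS = ["therefore", "as a result"]

def find_marker_clause(sentence, markers):
    """Return (prefix, clause) where clause starts at the first matched marker.

    Crude stand-in for a constituency parser: the clause is assumed to run
    from the marker to the end of the sentence.
    """
    pattern = r"\b(?:" + "|".join(re.escape(m) for m in markers) + r")\b"
    match = re.search(pattern, sentence, flags=re.IGNORECASE)
    if match is None:
        return None
    return sentence[:match.start()].rstrip(), sentence[match.start():].rstrip(" .")

def tag_clause(clause, keywords, tag):
    """Keep keywords in place and replace every other run of tokens with `tag`.

    Returns the tagged clause and the list of ground-truth spans, in order.
    """
    keyword_set = {k.lower() for k in keywords}
    tagged, spans, run = [], [], []
    for token in clause.split():
        if token.lower() in keyword_set:
            if run:  # close the current non-keyword run
                tagged.append(tag)
                spans.append(" ".join(run))
                run = []
            tagged.append(token)
        else:
            run.append(token)
    if run:
        tagged.append(tag)
        spans.append(" ".join(run))
    return " ".join(tagged), spans

if __name__ == "__main__":
    clause = "because increasing numbers of students will not use email"
    print(tag_clause(clause, ["increasing", "email"], "<cause>"))
```

Applied to the CAUSE clause from the fine-tuning example in Appendix D, `tag_clause` yields the tagged text `<cause> increasing <cause> email` with the spans `because` and `numbers of students will not use`.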
CNTRA: Comparison is a commonly used writing technique that encompasses two separate intents: concession and contrast. Concession refers to the unexpectedness of a consequence (Webber et al.): words/phrases such as "although" and "even though" raise an expectation curbed by the rest of the sentence. Contrast is often confused with concession, but its markers include "by comparison", "in contrast", etc. We manually select 31 concession markers and six comparison markers, and mine the NEWSROOM corpus with these to obtain labeled data.
DESCP: Descriptive details are important for creative writing and can help embellish written output. To curate data for this tag, we first collect 27K descriptive adjectives based on morphological rules. 4 We then mine sentences from NEWSROOM and extract simple declarative clauses that contain the matched adjective(s). The spans are filtered to be greater than five tokens.
IDIOM: Idiomatic language makes writing more vivid and imaginative. We collect 3000 idioms online 5 and mine sentences from NEWSROOM that match any of the idioms. In order to include variants of the raw idiom, e.g. "apple of one's eye", we apply regular expressions to match various possessive forms.
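One way to match possessive variants of an idiom is sketched below. The possessive list, the treatment of "one's" as the only placeholder, and the function names are assumptions made for illustration; the actual regular expressions used for the dataset are not given in the text.

```python
import re

# Assumed set of possessive forms that may stand in for "one's".
POSSESSIVE = r"(?:my|your|his|her|its|our|their|one's|\w+'s)"

def idiom_to_regex(idiom):
    """Compile an idiom into a pattern where "one's" matches any possessive."""
    parts = [
        POSSESSIVE if word.lower() == "one's" else re.escape(word)
        for word in idiom.split()
    ]
    return re.compile(r"\b" + r"\s+".join(parts) + r"\b", re.IGNORECASE)

def tag_idiom(sentence, idiom):
    """Replace the first idiom match with <idiom>; return (tagged, span) or None."""
    match = idiom_to_regex(idiom).search(sentence)
    if match is None:
        return None
    tagged = sentence[:match.start()] + "<idiom>" + sentence[match.end():]
    return tagged, match.group(0)

if __name__ == "__main__":
    print(tag_idiom("She is the apple of my eye .", "apple of one's eye"))
```

The sketch only handles straight apostrophes and a single-word possessor; a production version would need to normalize typographic apostrophes first.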
BIO: Biographical post-modifiers are commonly used to provide a brief summary of a previously mentioned named entity. For example, "co-founder of Microsoft Corporation" fills in the blank span of the sentence "Bill Gates, ____, has pursued a number of philanthropic endeavors". PoMo (Kang et al., 2019) is an existing dataset that aligns with this writing intent. It contains sentences with post-modifiers and facts about named entities extracted from Wikidata. We use the textual data in PoMo, replace the post-modifier with <biography>, and treat the ground-truth post-modifier as the target span.
4 Adjectives are extracted from the English word frequency list at https://norvig.com/ngrams/count_1w.txt
5 Idioms are extracted from https://7esl.com/english-idioms/ and https://www.phrases.org.uk/meanings/phrases-and-sayings-list.html
PARA: Paraphrasing is a common method by which authors improve their draft writing (Flower and Hayes, 1981; Tufte, 2006). Unlike sentence simplification, the intent of our <paraphrase> tag is to paraphrase with improved writing quality, similar to embellishment. We construct parallel data for this tag by combining ParaNMT-50M, a large corpus consisting of back-translated sentence pairs, with WikiLarge, a sentence simplification dataset with parallel simple and complex sentences. The original sentence in ParaNMT-50M and the complex sentence in WikiLarge are treated as targets, while the back-translated sentence and the simplified sentence are used as the source. We use BLEURT (Sellam et al., 2020) to filter noisy pairs from ParaNMT-50M, 6 discarding pairs whose word-level edit distance is less than five. To further encourage complex paraphrases, we require the reference sentence to have more low-frequency words than the candidate sentence.
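The two surface filters mentioned above (word-level edit distance and low-frequency word count) can be sketched as follows. The BLEURT step is omitted, and everything about the frequency test is an assumption: the text does not say how "low-frequency" is defined, so the `freq` dictionary, the `rare_threshold` value, and the function names are hypothetical.

```python
def word_edit_distance(a, b):
    """Levenshtein distance over whitespace-separated tokens."""
    a_tokens, b_tokens = a.split(), b.split()
    prev = list(range(len(b_tokens) + 1))
    for i, wa in enumerate(a_tokens, 1):
        cur = [i]
        for j, wb in enumerate(b_tokens, 1):
            cur.append(min(
                prev[j] + 1,               # delete wa
                cur[j - 1] + 1,            # insert wb
                prev[j - 1] + (wa != wb),  # substitute or match
            ))
        prev = cur
    return prev[-1]

def count_rare(sentence, freq, rare_threshold):
    """Count tokens whose corpus frequency is below the (assumed) threshold."""
    return sum(1 for w in sentence.lower().split() if freq.get(w, 0) < rare_threshold)

def keep_pair(source, target, freq, min_edits=5, rare_threshold=1000):
    """Keep a (source, target) pair only if it passes both filters."""
    if word_edit_distance(source, target) < min_edits:
        return False  # too close to the source to be a useful paraphrase
    # The target (reference) should contain more rare words than the source.
    return count_rare(target, freq, rare_threshold) > count_rare(source, freq, rare_threshold)
```

Here `source` plays the role of the candidate (back-translated or simplified) sentence and `target` the reference, following the source/target assignment described in the paragraph.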

Evaluation against references
As an initial comparison of IGA and ILM, we evaluate the generated outputs of each model against reference completions from our dataset, both automatically and through a crowdsourced evaluation. We acknowledge that this type of evaluation (especially using automatic metrics) is limited for open-ended generation tasks like ours (Fan et al., 2018; Akoury et al., 2020; Rashkin et al., 2020), which is why we also conduct an in-depth user study in Section 6. While the results of these evaluations cannot reflect how practical IGA is as an authoring assistant, they do indicate that IGA is more constrained than ILM and produces output that better fulfills the writing intents.

Automatic evaluation
We compare IGA with ILM on automatic metrics such as ROUGE (Lin, 2004) and self-BLEU (Zhu et al., 2018), computed on examples from the validation set. 7 On all but the BIO tag, IGA achieves higher ROUGE and self-BLEU scores than ILM (Table 3), which shows that IGA outputs have higher coverage and lower diversity, respectively, without differing considerably in length. This result indicates that IGA is indeed conditioning its output on the tags to produce more constrained outputs. Since the infilled spans of BIO are strictly post-modifiers that follow a very specific structure (i.e., enclosed by two commas), the superior performance of ILM indicates that it memorizes this simple form of construction without requiring a separate tag input.
7 All automatic metrics are computed only on the infilled spans, excluding the context.
PARA is the only substitution-based tag in our system and is not supported by ILM. Therefore, we compare the performance of PARA with the state-of-the-art paraphraser STRAP released by Krishna et al. (2020) with the default nucleus sampling p = 0.6. We compute BLEURT scores to check semantic similarity, as well as BLEU (Papineni et al., 2002), self-BLEU (Sun and Zhou, 2012), and iBLEU (Sun and Zhou, 2012) with α = 0.8 to check the diversity of output. Table 4 indicates that IGA outperforms STRAP in all dimensions. We hypothesize that this is primarily because the diverse paraphraser in STRAP normalizes (and often simplifies) stylized text, while our PARA tag is associated with complex, embellished paraphrases during fine-tuning.
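For readers unfamiliar with iBLEU, the metric of Sun and Zhou (2012) rewards overlap with the reference while penalizing overlap with the input sentence: iBLEU = α · BLEU(output, reference) − (1 − α) · BLEU(output, input). A minimal sketch, assuming the two BLEU scores have already been computed by some BLEU implementation (the function and argument names are hypothetical):

```python
def ibleu(bleu_vs_reference, bleu_vs_input, alpha=0.8):
    """Combine two precomputed BLEU scores into iBLEU.

    bleu_vs_reference: BLEU of the paraphrase against the reference.
    bleu_vs_input:     BLEU of the paraphrase against the input sentence.
    alpha:             trade-off weight; 0.8 is the value used in the text.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * bleu_vs_reference - (1.0 - alpha) * bleu_vs_input
```

With α = 0.8, a paraphrase that copies its input verbatim is penalized by 0.2 times its BLEU against the input, so higher iBLEU favors outputs that are both adequate and lexically different from the source.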

Intrinsic crowdsourced evaluation
The above automatic evaluations can only tell us so much about IGA's capabilities. Many of our tags (e.g., DESCP, CAUSE) are open-ended, which results in a large space of acceptable outputs. Thus, measuring similarity to ground-truth span completions is not as suitable for our task as it is for more constrained tasks such as machine translation or summarization. In this section, we shift to human evaluation as a way to learn more about the behavior and usefulness of IGA.
We begin with a small-scale intrinsic evaluation to get a sense of the generation quality and adequacy of an output in fulfilling a writing intent. We randomly select 50 examples from the test set of each tag and generate outputs using both ILM and IGA for each example. For each output, three Mechanical Turkers are shown the gold completion as well as the generated text and asked to choose which is more fluent, coherent, and adequate at fulfilling the author intent, using a five-point scale to measure fine-grained preference. 8 The results of this task are inconclusive: although IGA outputs for most tags are more often preferred than those from ILM across these dimensions (see Appendix A for specifics), especially in terms of adequacy, the subjective nature of the task yields low agreement among annotators. 9 In general, annotators found the task difficult and most often chose to express no preference.

User study
Due to the limitations of the previous evaluations, we launch a user study in the same spirit as Clark et al. (2018) to understand the interactive behavior of real users. We measure whether human authors benefit from AI-assisted writing, and whether they prefer intent-guided generation to the uncontrolled ILM model. We design an interactive web demo that allows participants to write with the help of each model, logging their behavior (e.g., queries to the model, edits made on generated text) and self-reported feedback.
8 We employ workers with an approval rate higher than 96% and more than 1000 approved HITs. Each rater is paid $0.1 per HIT.
9 The best Fleiss κ (Fleiss, 1971) across all tags was only slightly over 0.2, which indicates slight agreement (Landis and Koch, 1977).
Our interactive demo is inspired by markup language editors such as Overleaf 10 or LyX 11; a screenshot of the interface is shown in Figure 3. In the textbox to the left, users type sentences with any of our supported tags. After clicking Generate, the model's output is displayed on the right-hand side. 12 By default, three samples are shown to the user, although they can increase the number of samples if they wish. After a user selects a sample, it is appended to the existing text in the top-right textbox, which contains all of the text the user has already written. Users can then edit the sample (or completely delete it) and write continuations directly into the top-right textbox. This process repeats every time the user decides to use a tag to obtain model-generated text. On the backend, the input fed to our IGA model is the concatenation of content in the main textbox (i.e., context) and the input in the assistant box, truncated at 300 tokens.
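The backend input construction can be illustrated with a short sketch. Two details are assumptions, since the text does not specify them: tokens are approximated by whitespace splitting (the real system presumably counts model subword tokens), and truncation is assumed to drop the oldest context so that the tagged assistant input is always retained. The function name is hypothetical.

```python
def build_model_input(context, assistant_input, max_tokens=300):
    """Concatenate context and tagged input, keeping the last `max_tokens` tokens.

    Assumption: truncation removes the earliest context tokens, so the most
    recent text and the user's tagged sentence stay in the model input.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    tokens = (context + " " + assistant_input).split()
    return " ".join(tokens[-max_tokens:])
```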

User study design
We recruited twelve computer science graduate students for our user study, seven of whom are native English speakers. 13 Nine of the twelve participants in the user study had creative writing experience in English prior to the evaluation, three participants had taken creative writing classes, and one was trained in media writing. We asked each participant to write short stories in response to prompts selected from WritingPrompts (Fan et al., 2018), a large dataset of stories written by users on Reddit. This task is suitable for our user study because creative writing requires diverse rhetorical directives while also not placing as much of an emphasis on world knowledge on the part of the participant (unlike writing a news article, for instance).
We ask each participant to write responses to three different prompts, where for each prompt they use one of three different writing modes: (1) BASE: writing from scratch without any AI assistance, (2) ILM: writing assisted only with the <mask> tag, and (3) IGA: writing assisted with multiple tags. To study how often users use intent-guided generation instead of uncontrolled generation when given a choice, we also include the <mask> tag in the IGA mode by simply producing outputs from the ILM model. We randomize the order of modes across subjects to mitigate respondent fatigue (Lavrakas, 2008) (e.g., one participant may write their responses using BASE, ILM, IGA while another might use IGA, BASE, ILM).
Each evaluation session lasts for approximately one hour. Before each session, the participant is instructed to read a tutorial document, which describes the system's layout and the usage of each tag. During each evaluation session, they first go over an interactive tutorial to experiment with each tag, either with provided examples or examples they invent themselves, and then start the main writing tasks. The purpose of the tutorial is to thoroughly familiarize participants with each model so they do not have to learn on the fly.
During the AI-assisted writing phase, we do not require participants to write every sentence with the tags, or even use the system at all if they choose not to (e.g., they can write their whole response from scratch). We do require them to write at least ten sentences in response to each prompt.
In each evaluation session, we record the following metrics to understand how the participants interact with the systems:
1. # of clicks on the Generate button, which takes the user-tagged sentence and outputs multiple (sampled) completions
2. # of clicks on the + button, which adds a sampled completion to the Main textbox
3. # of sentences written without any assistance
4. # of model-generated tokens that were kept and deleted by the author in the Main textbox
5. # of novel tokens inserted by an author within a model-generated completion.
We report the average number of tokens and sentences in the responses (Table 5), the average number of clicks per sentence (Table 6), tag usage of all AI-assisted output (Figure 4), and unigram precision, recall, and F1-scores of each intent tag (Table 7). In summary, users interact more frequently with IGA than ILM, generating more content (∼3 more sentences per session and ∼30 more tokens in each response), and more of their sentences on average are AI-assisted with IGA (8.0 compared to 6.6). Additionally, IGA generations are far less likely to be edited than those from ILM: Table 7 shows 69% of the generated tokens are preserved in ILM mode, compared to ∼87% in IGA mode (averaged across all tags). Interestingly, when equipped with the intent-based tags in IGA mode, the output of the uncontrolled <mask> tag, the second most often used tag in IGA mode as shown in Figure 4, is more likely to be accepted by users than in ILM mode (80 vs. 68). This is likely because when users choose the <mask> tag under IGA, they indeed have no clear intent, and are thus more likely to accept intent-free generation.
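A sketch of how unigram precision, recall, and F1 between a model-generated completion and the text the author finally kept might be computed is given below. The exact definition behind Table 7 is not spelled out, so this assumes the generated text is the hypothesis and the final edited text is the reference, with whitespace tokenization; under that assumption, precision is the fraction of generated tokens that were preserved. The function name is hypothetical.

```python
from collections import Counter

def unigram_prf(generated, final):
    """Unigram precision, recall, and F1 of `generated` against `final`."""
    gen = Counter(generated.split())
    fin = Counter(final.split())
    overlap = sum((gen & fin).values())  # multiset intersection
    precision = overlap / sum(gen.values()) if gen else 0.0
    recall = overlap / sum(fin.values()) if fin else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

if __name__ == "__main__":
    print(unigram_prf("the trees were swaying", "the tall trees were still"))
```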

Survey feedback
After each session, we also collect feedback from subjects through a post-session survey. The first part of our survey asks participants to recall their experience with IGA mode and evaluate various aspects (Table 11). We refer readers to Appendix C for a more detailed description. Participants are mildly satisfied with the model performance (3.4 / 5) and are interested in using the system for future WritingPrompts tasks (3.6 / 5), but they are polarized on how easy the system is to learn (3.6 / 5 with a standard deviation of 1.4). We also ask them to choose the writing mode they prefer the most and explain their preference. Out of 12 participants, seven prefer IGA writing mode, four prefer ILM, and only one prefers writing from scratch. That the majority of participants favor the experience of either ILM or IGA demonstrates the potential of AI-assisted writing, especially for open-ended creative tasks like story writing. The most common reason that users prefer the intent-guided generation of IGA is that it provides fine-grained control over the generated output. The four participants who prefer ILM remark that the system is much simpler to use because it has only one tag (<mask>). As one participant comments: "Once I became more comfortable with the remainder of the tags, I think it would be easier for me to write, and therefore more enjoyable. So short-term I would enjoy ILM and then long-term IGA. As someone who struggled with figuring out what to write next for short stories in elementary school, I wish this existed then!".
In the final portion of the survey, we ask them to rate the quality of each tag in IGA. If they did not use a certain tag in their writing, they are asked to give a rating for it by recalling their experience of using that tag during the tutorial mode.

Limitations
Although our user study demonstrates that subjects prefer IGA over competing models, it has many limitations. First, NLP researchers are not the right group to ideate the set of writing intents, and the computer science graduate students in the user study are not sufficiently representative of the target users. A better setup is to conduct both the ideation of intents and the user study with expert users, preferably English students or teachers. This sort of study could be done on platforms like Upwork. To validate the usefulness of the existing intents in IGA, we also need to conduct interviews with writing professionals and inquire about new prospective intents for future development.

Conclusion
In this paper, we introduce a new approach to interactive human-AI co-authoring by means of an Intent-Guided Authoring Assistant (IGA). Our model is able to infill around author-provided keywords, sentence fragments, and rhetorical instructions with fluent and coherent text. We conduct a small-scale user study which shows that our method has advantages over baseline methods on a creative writing task.

Ethics statement
Our data collection is for research purposes only, and thus consistent with the terms of use of all source corpora we mined. For the evaluation process, we strive to compensate the Mechanical Turk workers as well as participants of our user study with competitive payments. The intended use of IGA is for creative writing. Although generating factually-correct output is not a major focus of creative writing tasks, IGA often hallucinates facts about real-world entities, a phenomenon that raises ethical concerns and has become an increasing focus in text generation research (Maynez et al., 2020; Wang and Sennrich, 2020). The model can on rare occasions produce offensive outputs, due in large part to GPT-2's pretraining corpora. One potential way to reduce the toxicity of output is to apply a profanity filter as a post-processing step before the final output is returned.

A Mechanical Turk experiment
We randomly select 50 examples from the test set of each tag and obtain output from ILM and IGA respectively. Each example includes the gold reference and the model output. Each example was assigned to three Mechanical Turk workers who have an approval rate higher than 96% and more than 1000 approved HITs. Each worker was asked to rate the fluency (FL), coherence (CH) and adequacy (ADQ) of the infilled content. The first two dimensions are common in natural language generation evaluation, and judge the grammaticality of the output and how well it fits into the provided context (Çelikyilmaz et al., 2020). The last quality dimension, ADQ, measures how well the infilled content alone fulfills the target author intent. The rating is on a 5-point Likert scale. To increase inter-annotator agreement, we collapse ratings 1 and 2 to 1, rating 3 to 2, and ratings 4 and 5 to 3; the values in Table 9 are therefore reported on a 3-point scale.
Table 9: Ratings of intrinsic crowdsourced evaluation. We collapse the 5-point Likert scale to a 3-point scale with 1 (prefer reference), 2 (no preference), 3 (prefer generated text). Fleiss κ greater than 0.2 is marked with *.
In general, we find it is hard to get high agreement from the Turkers in terms of fluency, except for IDIOM. For this tag, annotators rate ILM as more fluent, mostly because ILM infills some spans with full clauses rather than short idioms, which leads raters to give higher fluency scores.
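The rating collapse and the agreement statistic used in this appendix can be sketched as follows. The collapse mirrors the mapping described above, and the agreement function follows the standard definition of Fleiss' κ for a fixed number of raters per item. The function names and the input layout (one list of rater labels per item) are assumptions of this sketch.

```python
def collapse_likert(rating):
    """Map a 5-point rating to the 3-point scale: {1,2}->1, 3->2, {4,5}->3."""
    if rating not in (1, 2, 3, 4, 5):
        raise ValueError("rating must be an integer from 1 to 5")
    if rating <= 2:
        return 1
    return 2 if rating == 3 else 3

def fleiss_kappa(ratings, categories=(1, 2, 3)):
    """Fleiss' kappa for `ratings`, a list of equal-length label lists (one per item)."""
    n_items, n_raters = len(ratings), len(ratings[0])
    counts = [[item.count(c) for c in categories] for item in ratings]
    # Mean observed agreement across items.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items
    # Chance agreement from the marginal category proportions.
    p_j = [sum(row[j] for row in counts) / (n_items * n_raters)
           for j in range(len(categories))]
    p_e = sum(p * p for p in p_j)
    if p_e == 1.0:
        return float("nan")  # undefined when every label falls in one category
    return (p_bar - p_e) / (1.0 - p_e)

if __name__ == "__main__":
    raw = [[1, 2, 5], [4, 5, 5], [3, 3, 2]]  # made-up ratings, three raters per item
    collapsed = [[collapse_likert(r) for r in item] for item in raw]
    print(collapsed, fleiss_kappa(collapsed))
```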

B Discourse markers used for data extraction
We list the discourse markers used for extracting fine-tuning examples in Table 10.

C Post-survey rating
The first section of our survey asks participants to recall their experience with IGA mode and evaluate various aspects presented in Table 11. Besides commonly asked dimensions, such as fluency, relevance, coherence, and general quality of system output, we also ask them how often the system generates output that is interesting (interesting) and that inspires them to write (Inspiration). They are also asked to rate whether they are satisfied with the system output (satisfy), whether they would like to use the system again (Use again) for the WritingPrompts task, and how easy it is to learn the system (Easy to learn). In general, participants are mildly satisfied with the model performance, but understandably have polarized views on how easy it is to learn this system, with a standard deviation of 1.4.

D Fine-tuning example
We show a fine-tuning example for each tag (intent) in Table 12.
Intent Tag   Example
PARA         <sub> the growth potential has consistently declined in this period . <sub> <sep> The growth potential has been steadily declining throughout this period . <answer>
BIO          Roger Stone , a Republican strategist , said , " Issues that were extremely successful for us in the 80 's are n't on the radar screen anymore . " But Robert Teeter , <biography> , insists that the frictions and tensions are simply the growing pains of a governing coalition . <sep> the Republican polltaker <answer>
CAUSE        I gawped in astonishment . This morning I read that the University of Exeter has had to employ social media operators to deal with inquiries , <cause> increasing <cause> email , considering it too slow and unwieldy . <sep> because <answer> numbers of students will not use <answer>
EFFECT       " I view military prisons as the overlooked campaign of 1864 ; prisons , their management and questions of exchange are taking up a massive part of the bureaucratic part of the war . " <effect> Civil War <effect> . <sep> In the end , most <answer> POWs survived <answer>
CNTRA        Part of being able to extend the network effect of your status update is having the right desktop client for broadcasting updates as well as keeping a lookout on relevant updates from other users . <concession> perfect <concession> user , we highly recommend the new Seesmic Desktop for managing multiple accounts and tracking custom search results . <sep> Though we believe the <answer> desktop client is unique to each <answer>
DESCP        It's because, contrary to what we've been told by satirists, sneering cynics and other such detritus, he is in fact a deeply witty and humane man. <description> and he looks like a chimp . <sep> He 's intelligent , perceptive <answer>
IDIOM        As the Senators prepare to face the Montreal Canadiens in Game 3 of their playoff series Sunday night ( CBC , 7 p.m . ET ) at Scotiabank Place , the Ottawa coach had his audience of assembled media <idiom> as he tried to deflect any talk about a war of words . <sep> in stitches <answer>