Automatic Document Sketching: Generating Drafts from Analogous Texts

The advent of large pre-trained language models has made it possible to make high-quality predictions on how to add or change a sentence in a document. However, the high branching factor inherent to text generation impedes the ability of even the strongest language models to offer useful editing suggestions at a more global or document level. We introduce a new task, document sketching, which involves generating entire draft documents for the writer to review and revise. These drafts are built from sets of documents that overlap in form, sharing large segments of potentially reusable text, while diverging in content. To support this task, we introduce a Wikipedia-based dataset of analogous documents and investigate the application of weakly supervised methods, including the use of a transformer-based mixture of experts, together with reinforcement learning. We report experiments using automated and human evaluation methods and discuss the relative merits of these models.


Introduction
Large pre-trained language models such as T5 and GPT-3 (Raffel et al., 2019; Brown et al., 2020) have enabled impressive progress on a variety of natural language generation tasks by producing fluent, coherent long texts (Rashkin et al., 2020; Zellers et al., 2019). While automated document-level generation seems tantalizingly within reach, a high branching factor presents significant challenges in tailoring generated documents to the specific requirements of users. Topic drift and "hallucination" of information are endemic to these models (Wiseman et al., 2017). These risks have ensured that end-user applications involving text generation (e.g., Smart Compose, Smart Reply, Grammarly) still require a human to remain in control of content and are restricted to individual sentences or even smaller segments of text (Chen et al., 2019; Kannan et al., 2016; Alikaniotis and Raheja, 2019; Prabhumoye et al., 2019; Faltings et al., 2021).

Figure 1: The right side shows a sketch for writing the report of a future democratic national convention, generated from a pile of previous reports.
Can large generative language models be used to assist user writing at the document level while the user still controls the factual content? A possible answer lies in the observation that a substantial portion of day-to-day writing involves some form of reuse. Similar documents (e.g., monthly reports, sales letters, job descriptions) are effectively recycled by changing those segments that need to be modified (Fig. 1).[1] Moreover, documents containing analogous texts are often found collocated in repositories, a common practice for organizations that manage professional documents.[2] The high branching factor that impedes the application of long-form generation might thus be mitigated if models were to exploit conventionally-structured analogous "reusable" texts to produce a "sketch" that captures prototypical document-level text structure. In this view, initial sketches would assist authors by reducing the manual editing effort (planning, inserting and deleting content, etc.). A good sketch would reflect structural patterns and reusable text, and provide indication, beyond static boilerplate, of locations where user input might be warranted. This could be especially beneficial for novice writers who would otherwise have to read many analogous documents before developing a full picture of what is entailed in writing such documents.[3] A fully implemented dynamic system might update other portions of the document to reflect modifications introduced by the user.

[1] Evidenced by the plethora of companies offering reusable business templates for enterprise use, e.g., https://www.businesswire.com/news/home/20201028005573/en/.
[2] For instance, monthly reports of world trade in grains are organized chronologically at https://usda.library.cornell.edu/concern/publications/zs25x844t?locale=en.
In this work, we propose a new task, called DOCUMENT SKETCHING, in which initial template-like prototype documents are generated from collections of analogous documents. To support this task, we collected a dataset consisting of approximately 20K Wikipedia documents with similar textual characteristics.[4] For this new task, inspired by previous work in measuring machine translation post-editing productivity (Tatsumi, 2009; Specia and Farzindar, 2010), we define an automatic evaluation metric based on Word Error Rate (Snover et al., 2006; Tomás et al., 2003). We investigate strong baseline models, including a mixture of experts model and a reinforcement learning approach, designed to handle multi-source inputs and a weak supervision setting. Finally, we provide experimental analysis of these models, using automated and human evaluation studies.
Related Work

Document Generation

Recent work leverages the success of large pre-trained language models to generate long texts such as stories (Rashkin et al., 2020), reviews (Cho et al., 2019a), and fake news (Zellers et al., 2019). Most end-user applications for assisting user writing, however, are confined to sentence-level generation (Kannan et al., 2016; Alikaniotis and Raheja, 2019; Prabhumoye et al., 2019; Faltings et al., 2021). Our work focuses on document-level writing assistance in which a document sketch is constructed from a set of similar documents.
[3] Our experiments in Section 6 also show that drafts derived from multiple analogous documents are more effective than those derived from one or two documents.
[4] We release our data and source code for dataset construction and experiments at https://github.com/ellenmellon/document_sketching.

Template-Based Generation
Some existing work induces templates as an intermediate step for performing tasks like text summarization or response generation. Most use a retrieval-based method to extract similar references from the training corpus as prototypes (Cao et al., 2018; Yang et al., 2019; Gao et al., 2019; Peng et al., 2019), and learn to separate salient information from latent template structure. Cai et al. (2019) induce an intermediate template for response generation explicitly, but from a single retrieved relevant response. Similar prototype editing work (Fabbri et al., 2020) focuses on editing short text (e.g., a question or a single sentence) or structured output (e.g., code snippets). Oya et al. (2014), Magooda and Litman (2019), and Yang et al. (2020) convert each single input text into a template with blanks using rule-based methods. Other work (Wiseman et al., 2018; Gangadharaiah and Narayanaswamy, 2020) relies on a knowledge base or a domain- or task-specific ontology to segment text sequences into templates.

Multi-Sequence Processing
Multiple Sequence Alignment (MSA), widely used in the biological domain (Sauder et al., 2000) to align multiple biological sequences like proteins, has long been leveraged for text pattern matching (Barzilay and Lee, 2003;Alonso et al., 2004). We adopt this method to align input documents and create heuristic templates as weak supervision.
Other tasks taking multiple text sequences as input include multi-document summarization, which seeks to generate an abstractive text summary of multiple input documents (Liu and Lapata, 2019;Chu and Liu, 2019), and multi-source machine translation (Nishimura et al., 2018;Garmash and Monz, 2016) that encodes input texts in multiple source languages and translates them into a target language. Cho et al. (2021) generate a question from input documents by applying a multi-encoder model with a transformer-based coordinator.

Problem Definition
We introduce the task of document sketching, which aims to facilitate the authoring process by generating a template-like document draft based on a collection of sampled similar documents. Formally, the task is defined as follows: given a set of n documents $X = \{x_1, x_2, \ldots, x_n\}$, generate a text sequence s that can be used as a sketch to reduce the human effort involved in composing a target document y.
Evaluation Metrics As in most text generation tasks, we rely on human evaluation (see details in Section 6.3) to draw final conclusions on system comparison. However, due to the high cost of human evaluation, we use automatic evaluation metrics for system development.
It is difficult (and expensive) to collect human-written sketches as references. However, since a generated sketch s is used to help the user complete writing a target document y (i.e., post-editing), we can instead use the target document as a reference and calculate the extra edits required to transform a sketch into that target document. Inspired by previous uses of word error rate (WER) in evaluating machine translation (Snover et al., 2006; Tomás et al., 2003) and by evidence of reasonable correlation between WER and human post-editing productivity (Tatsumi, 2009; Specia and Farzindar, 2010), we propose to assess the effectiveness of s in the completion of y based on WER, computed from the Levenshtein distance (abbreviated 'lev') between s and y:

$$\text{score}(s, y) = 1 - \text{WER}(s, y), \qquad \text{WER}(s, y) = \frac{\text{lev}(s, y)}{|y|} \tag{1}$$

The higher score(s, y) is, the fewer word-level insertion and deletion edits a user would need to complete writing y when starting from s. To account for the lexical and phrasal variety of ways to write a target document, we calculate the average score over a set of reference documents $Y = \{y_1, y_2, \ldots, y_m\}$:

$$\text{score}(s, Y) = \frac{1}{m} \sum_{j=1}^{m} \text{score}(s, y_j) \tag{2}$$

where m denotes the number of reference documents.
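To make the metric concrete, the following is a minimal Python sketch of Eqs. (1)-(2), assuming whitespace tokenization and an insertion/deletion edit distance as described above; the helper names (lev, score, avg_score) are illustrative, not taken from the released code.

```python
def lev(s, y):
    """Minimum number of word-level insertions and deletions turning s into y."""
    prev = list(range(len(y) + 1))          # distance from empty prefix of s
    for i, tok_s in enumerate(s, 1):
        cur = [i]                           # i deletions to reach empty prefix of y
        for j, tok_y in enumerate(y, 1):
            if tok_s == tok_y:
                cur.append(prev[j - 1])     # tokens match: carry distance over
            else:
                cur.append(min(prev[j] + 1,      # delete tok_s
                               cur[j - 1] + 1))  # insert tok_y
        prev = cur
    return prev[-1]

def score(sketch, reference):
    s, y = sketch.split(), reference.split()
    return 1.0 - lev(s, y) / max(len(y), 1)      # Eq. (1): 1 - WER(s, y)

def avg_score(sketch, references):
    return sum(score(sketch, y) for y in references) / len(references)  # Eq. (2)
```

Note that score(s, y) can go negative when a sketch requires more edits than writing y from scratch, and an empty sketch scores exactly 0.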

Data
We collect our dataset from the English Wikipedia dump (June 20, 2020). We first group documents into collections with analogous texts, with the observation that articles with shared reusable structural texts also tend to have similar titles. These collections are then split into training, validation and test sets. This dataset is designed for a weakly supervised setting, as gold sketches are unavailable.

Wikipedia Document Collections
We observe that Wikipedia article titles provide a strong indication of whether documents are likely to share structural text (i.e., can be put in the same collection). For example, it is reasonable to consider articles with titles like "Super Bowl I", "Super Bowl II" and so on to comprise a document collection about the annual championship game. Therefore, we group documents whose titles are identical but for one token at a specific position. In the "Super Bowl" example, the document titles are the same up to the third token in each title, and the collection can be named "Super Bowl" with the varying token left as a slot. It is worth noting that collecting these articles based on their titles simply reflects how similar documents in Wikipedia tend to cluster; our task setup only requires a set of references as input (e.g., documents organized in the same directory), without the need for reference documents to share titles.

Title matching can introduce noise into the extracted document collections, so we apply simple yet effective restrictions in the extraction pipeline. We empirically set the minimum number of documents per collection to 15 and the minimum document title length to 3 tokens. We randomly selected and inspected over 200 extracted collections, and found that 90% of them contain analogous documents on a given topic (examples in Tab. 1). The remaining 10% are less clear, but can be usefully grouped into 3 classes as in Tab. 1.

To further reduce noise in each collection, we apply Eq. (2) to calculate the average similarity score of each document against the other documents in its collection, removing those with an average score lower than an empirically chosen threshold of −1.5. Finally, we keep document collections with more than 5 documents and truncate them to a maximum of 50 documents, then divide the collections into train, validation, and test sets in a 0.8/0.1/0.1 ratio. The collection statistics are summarized in Tab. 2.
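The grouping heuristic can be illustrated with a short sketch: two titles fall in the same collection if they match after wildcarding one token position. The function and variable names below are hypothetical, and the simplification that a title may enter several candidate collections is ours, not the paper's pipeline.

```python
from collections import defaultdict

def group_by_title(titles, min_len=3, min_docs=15):
    """Group titles that are identical but for one token at a fixed position."""
    collections = defaultdict(list)
    for title in titles:
        tokens = title.split()
        if len(tokens) < min_len:        # minimum title length of 3 tokens
            continue
        for i in range(len(tokens)):
            # key = title with position i wildcarded, e.g. ("Super", "Bowl", "_")
            key = (i, tuple(tokens[:i] + ["_"] + tokens[i + 1:]))
            collections[key].append(title)
    # keep only keys meeting the minimum collection size (15 in the pipeline)
    return {k: v for k, v in collections.items() if len(v) >= min_docs}
```

In the "Super Bowl" example, all titles share the key with the third position wildcarded, so they land in a single collection.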

Task Data Points
We divide each document collection into multiple smaller collections of analogous documents. For standard supervised training, each data point consists of 1 document d used to construct a heuristic sketch as weak supervision (see details in Section 5) and up to 9 input documents X. For evaluation, an additional 4 documents are used as references Y for calculating the score in Eq. (2). Since document collections whose title tokens differ by a number usually contain documents about events or entities from different times, we order the documents in ascending numerical order to imitate practical scenarios where humans sketch a document by looking at collections of previously written documents. We divide each collection into data points with up to 14 (= 1 + 9 + 4) documents for all dataset splits, yielding 24k, 3.1k and 3.0k data points (smaller collections) for the train, validation, and test sets, respectively.

Approach
Since gold document sketches are not available for supervised training, we first perform weakly supervised training by constructing heuristic sketches as targets. Then we apply reinforcement learning strategies for text generation.

Weakly-Supervised Learning
Heuristic Labels As described in Section 4, each data point has one document d for creating the weakly supervised sketch. We conduct pair-wise sequence alignment (using the Ratcliff-Obershelp algorithm (Ratcliff and Metzener, 1988)) between d and each input document $x_i \in X$, and compute the relative alignment ratio of each token in d (the number of input documents in which the token is aligned, divided by |X| = n). We retain tokens with a ratio above a threshold and replace the remaining tokens with an ellipsis token to form the heuristic target sketch s given X. Consecutive ellipses are merged. The threshold is empirically set at 0.6, which yields the highest average score(s, Y) on the validation set.
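This labeling step can be sketched with Python's difflib, whose SequenceMatcher implements the Ratcliff-Obershelp algorithm. The 0.6 threshold follows the text; the function itself is a simplified illustration, not the released implementation.

```python
from difflib import SequenceMatcher

ELLIPSIS = "..."

def heuristic_sketch(d_tokens, input_docs, threshold=0.6):
    """Keep tokens of d aligned in enough input documents; elide the rest."""
    counts = [0] * len(d_tokens)
    for x_tokens in input_docs:
        sm = SequenceMatcher(a=d_tokens, b=x_tokens, autojunk=False)
        for block in sm.get_matching_blocks():      # Ratcliff-Obershelp alignment
            for i in range(block.a, block.a + block.size):
                counts[i] += 1
    sketch = []
    for tok, c in zip(d_tokens, counts):
        if c / len(input_docs) >= threshold:
            sketch.append(tok)
        elif not sketch or sketch[-1] != ELLIPSIS:  # merge consecutive ellipses
            sketch.append(ELLIPSIS)
    return sketch
```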

Mixture of Experts (MoE)
To generate an initial document sketch, the model needs to take multiple input documents. The most obvious approach is to concatenate all input documents into a single long sequence and feed it into an encoder-decoder model like T5 (Raffel et al., 2019). However, processing long sequences in such models is memory-intensive, and the lack of structure makes it difficult for the model to perform document-level coordination. We therefore use a mixture of experts (MoE) framework, in which a coordinator decodes a token at each timestamp by taking the hidden state and the output vocabulary distribution from each expert, where each expert processes a single document. Fig. 2 shows the overall framework of MoE.
The experts share the same encoder-decoder structure and model parameters, and each aims to generate s by encoding a single input document. At each decoding timestamp t, the i-th expert encodes $x_i$ as well as the previously decoded tokens from the coordinator $\hat{s}_1 \ldots \hat{s}_{t-1}$ (or the gold tokens $s_1 \ldots s_{t-1}$ during training), and outputs a probability distribution $\pi_i^t$ over all vocabulary words. The coordinator is a transformer-based encoder that takes the current hidden state $h_i^t$ of each i-th expert and outputs a weight $w_i^t$ through a final linear layer. These weights are used to calculate a weighted sum of the probability distributions from the corresponding experts:

$$\pi^t = \sum_{i=1}^{n} w_i^t \, \pi_i^t$$

where $\pi^t$ is the final distribution for generation at timestamp t.
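A simplified sketch of the coordinator's mixing step follows, assuming each expert has already produced its decoder hidden state and vocabulary distribution for the current timestamp. The layer sizes (T5-base's 768-dimensional states), the softmax normalization of the weights, and the class name are assumptions for illustration.

```python
import torch
import torch.nn as nn

class Coordinator(nn.Module):
    def __init__(self, d_model=768, n_heads=12, n_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.scorer = nn.Linear(d_model, 1)   # final linear layer -> weight w_i^t

    def forward(self, hidden, dists):
        # hidden: (batch, n_experts, d_model) expert decoder states h_i^t
        # dists:  (batch, n_experts, vocab)   expert distributions pi_i^t
        enc = self.encoder(hidden)            # coordinate across experts
        w = torch.softmax(self.scorer(enc).squeeze(-1), dim=-1)  # (batch, n_experts)
        return torch.einsum("be,bev->bv", w, dists)  # pi^t = sum_i w_i^t * pi_i^t
```

At inference, the mixed distribution $\pi^t$ feeds standard beam search, and each chosen token is fed back to every expert's decoder.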

Reinforcement Learning
We leverage reinforcement learning (RL) to further improve generation quality. For each training example with input X, we generate a sequence $\hat{s}$, sampled from the model's probability distribution at each time step, $p(\hat{s}_t \mid \hat{s}_1 \ldots \hat{s}_{t-1}, X)$. We observe that directly optimizing the evaluation function proposed in Eq. (2) at the sequence level, using a vanilla policy gradient (PG) or a self-critical sequence training (TD-SCST) algorithm (Rennie et al., 2017; Paulus et al., 2018; Pasunuru and Bansal, 2018), can lead to instability during training, as the reward cannot be calculated until the end of generation (Celikyilmaz et al., 2018). Therefore, we instead use a token-level incremental reward based on the change to the original reward function $r(\hat{s}, Y) = \text{score}(\hat{s}, Y)$ from each sampled token $\hat{s}_t$, given references Y:

$$r_t = r(\hat{s}_1 \ldots \hat{s}_t, Y) - r(\hat{s}_1 \ldots \hat{s}_{t-1}, Y)$$

The training objective can be written as:

$$L_{RL} = -\sum_{t=1}^{T} r_t \log p(\hat{s}_t \mid \hat{s}_1 \ldots \hat{s}_{t-1}, X)$$

where $T = |\hat{s}|$. Since optimizing the RL loss alone runs the risk of compromising the language model (Paulus et al., 2018; Pasunuru and Bansal, 2017), we use a mixed loss:

$$L = \lambda L_{RL} + (1 - \lambda) L_{ML}$$

where $L_{ML}$ is the maximum-likelihood loss and $\lambda$ is a hyperparameter to be tuned.
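The incremental reward and mixed loss can be sketched as below; r is the sequence-level reward score(·, Y) from Eq. (2), and the function names are illustrative rather than taken from the released code.

```python
def incremental_rewards(sampled_tokens, references, r):
    """r_t = r(s_1..s_t, Y) - r(s_1..s_{t-1}, Y) for each sampled token."""
    rewards, prev = [], 0.0
    for t in range(1, len(sampled_tokens) + 1):
        cur = r(sampled_tokens[:t], references)   # reward of the prefix so far
        rewards.append(cur - prev)                # change caused by token s_t
        prev = cur
    return rewards

def mixed_loss(log_probs, rewards, ml_loss, lam=0.99):
    # L_RL = -sum_t r_t * log p(s_t | s_<t, X), mixed with the
    # maximum-likelihood loss to keep the language model fluent.
    rl_loss = -sum(r * lp for r, lp in zip(rewards, log_probs))
    return lam * rl_loss + (1 - lam) * ml_loss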

Setup
To leverage the recent success of transformer-based generation models, the neural generation models in our experiments are initialized with the base version of T5 (Raffel et al., 2019; Wolf et al., 2020), an encoder-decoder architecture pre-trained on a variety of text-to-text tasks. All hyperparameters are tuned on the validation set. For MoE, we first fine-tune T5-base to obtain a "single expert" model to initialize each individual component model of the MoE. The "single expert" is trained with a sequence-to-sequence objective as in T5, on each of the n training (input, output) pairs (x_1, s), (x_2, s), ..., (x_n, s) from a data instance. This is followed by training a complete MoE model as described in Section 5. We use beam search as the decoding strategy for all generative models, with a beam size of 4 (the default value in T5-base).

We observe that consecutive ellipses and uninformative tokens hurt readability. To improve the readability of output sketches, we apply minor post-processing to all models in our experiments: we merge consecutive ellipses if all tokens between them are punctuation or among the 30 most frequent tokens; sentence-terminating periods are excepted. To avoid models learning to suppress ellipses entirely, which would hurt readability in a different way, we set the cost of deleting an ellipsis in the reward function to be much smaller (0.1). In addition, we randomly select three seed values for each model. Each model is trained with the above parameter value combinations. We apply grid search for T5 and MoE, and random search for MoE+RL. Hyperparameters are tuned based on the automatic evaluation score (1 − WER) on the validation set. The best hyperparameter configurations for the best-performing models were: T5: batch size = 2, warm-up steps = 4000, learning rate = 3e-5; MoE: batch size = 15, warm-up steps = 8000, learning rate = 3e-5, warm-up steps (coordinator) = 4000, learning rate (coordinator) = 1e-4; MoE+RL: batch size = 4, warm-up steps = 4000, learning rate = 3e-5, warm-up steps (coordinator) = 4000, learning rate (coordinator) = 3e-5, λ = 0.99.
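The ellipsis-merging post-processing described above can be illustrated with a short sketch. The FREQUENT set and the exact treatment of the merged span are assumptions consistent with the stated rule, not the paper's code.

```python
import string

FREQUENT = set()  # to be filled with the 30 most frequent training tokens

def ignorable(tok):
    # sentence-terminating periods are excepted from merging
    return tok != "." and (tok in FREQUENT or tok in string.punctuation)

def merge_ellipses(tokens, ellipsis="..."):
    """Collapse runs of ellipses separated only by ignorable tokens."""
    out, i = [], 0
    while i < len(tokens):
        if tokens[i] == ellipsis:
            j = i + 1
            while j < len(tokens) and ignorable(tokens[j]):
                j += 1
            if j < len(tokens) and tokens[j] == ellipsis:
                i = j          # drop this ellipsis plus the ignorable span;
                continue       # the ellipsis at j is examined next
        out.append(tokens[i])
        i += 1
    return out
```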

Systems
Non-neural systems include the following approaches:

Last-pair: The output is the aligned boilerplate text between the last pair of ordered input documents (as described in Section 4; in cases where the differing title token is time-related, the last pair refers to the most recent two documents).

Last-retrieval: The last document is retrieved as the initial draft.

MSA: We apply the multi-sequence alignment approach (Barzilay and Lee, 2003) to align tokens in the input documents and replace tokens that have too few alignments (threshold tuned on the validation set) with ellipses.

Consensus-MSA: This resembles how we generate heuristic drafts as weak supervision in Section 5.1, except that instead of using the held-out document d to create the skeleton, we use the input document $x_i$ that gives the highest value of $\sum_{j=1}^{n} \text{score}(x_i, x_j)$.

We evaluate the following neural approaches:

T5 (zero-shot): We directly apply T5 without fine-tuning. We use T5 to encode each input document individually, and at each decoding timestamp we combine the probability distributions from all T5 decoders with average pooling. As T5 rarely generates ellipses, we insert an ellipsis token if the probability of the top candidate token w from the combined distribution is below a threshold α tuned on the validation set (i.e., p(w) < α). We use the "summarize:" prefix in this setting, as summarization is the T5 pre-training task most similar to ours.

T5 (doc-finetune): We fine-tune T5 with all input documents concatenated into a single string (with a special separator token) and each target document (instead of the heuristic sketch) as the generation target.

T5: Similar to the T5 (doc-finetune) setting, but with the heuristic sketch defined in Section 5 as the generation target.

MoE: The mixture of experts model described in Section 5, also trained with the heuristic sketches as weak supervision.

MoE + RL: The RL model described in Section 5, with the trained MoE as the warm start.
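The zero-shot T5 decoding rule reduces to a single combination step per timestamp, sketched below; tensor shapes and the ellipsis_id argument are illustrative assumptions.

```python
import torch

def combine_step(dists: torch.Tensor, alpha: float, ellipsis_id: int) -> int:
    """Pick the next token from per-document next-token distributions."""
    # dists: (n_docs, vocab_size), one distribution per input document
    avg = dists.mean(dim=0)        # average pooling across documents
    p, w = avg.max(dim=-1)         # top candidate and its probability
    return ellipsis_id if p.item() < alpha else int(w.item())
```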

Automatic Evaluation
We use the automatic evaluation of Section 3 for system development, while relying on human evaluation to draw final conclusions on system comparison. However, we observe reasonable consistency between our automatic and human evaluation results. We report automatic evaluation results in Tab. 3. The first numerical column includes overall results on the test set, and the remaining columns categorize the results into three roughly equal-sized levels of similarity among documents within the input document set, estimated on the basis of average pair-wise edit distance normalized by length.

Among non-neural models, we observe that multi-sequence alignment approaches outperform last-retrieval and last-pair, where only one or two input documents are leveraged to produce the initial draft. We also notice that consensus-MSA, with a "central" document as the skeleton, is more effective than the standard MSA that treats all inputs equally.

When directly applying zero-shot T5, we notice that when the threshold α described in Section 6.2 equals 1.0 (i.e., the output is always an empty string), the model achieves its best score, 0, which is no better than starting from an empty draft. This is partly because none of the T5 pre-training tasks is the same as ours. In addition, simply exploiting the generation probability at each decoding timestamp can be problematic. For example, when decoding a frequently appearing multi-token entity, the first token might have low probability given the preceding context, while subsequent tokens can be more probable simply because they frequently co-occur with the first token. T5 with supervision from the target document (doc-finetune) yields a much lower score than T5 fine-tuned with the heuristic draft. This is mostly because the complete target document contains much more information than can be inferred from the input; the model learns to hallucinate facts, necessitating heavy deletion by users.
MoE outperforms T5, which indicates that having a coordinator to communicate between the individual encoders and decoders effectively improves model performance. We also experiment with coordinators using a simple linear layer and find that transformer-based ones are much more effective. Applying RL on top of a trained MoE model further improves performance in automatic evaluation. We denote the RL approach as MoE+RL in Tab. 3. Warm-starting RL with an MoE model that is close to full convergence gives slightly better results than with a fully converged MoE model.
The rightmost three columns in Tab. 3 divide test examples into three roughly equal-sized levels of similarity among documents within the input document set ('high', 'medium', 'low'). Scores for both last-retrieval and T5 (doc-finetune) drop dramatically as input similarity decreases, mostly because their outputs contain a lot of hallucinated content, an issue that is even more severe when there is little overlapping structure or factual content among the input documents. MoE-based models are much more robust to low input similarity than the baselines. For the group with the most similar input documents, consensus-MSA gives the best score, while MoE-based models yield much better performance in the other two groups.
Human Evaluation In order to avoid bias in human judgments, we control possible confounding factors, including sequence length and the number of ellipses in a sequence, during decoding (Nakov et al., 2012; Guzmán et al., 2015). We apply a tunable penalty for each confounding factor (at inference time only) for each neural model (Murray and Chiang, 2018) when generating examples for human evaluation, such that the average output length and number of ellipses per sequence are almost the same for all neural systems compared in Tab. 4. We note that this normalization has little effect on the automatic evaluation score of each system, and the system ranks do not change.
Human evaluation was conducted using crowdsourced workers. Judges were presented with paired randomized outputs and target documents, and were instructed to choose which output they would prefer as a starting point for writing the target document in order to save editing time. Judgments were based on a five-point Likert scale, and ties were permitted. Four judges evaluated each pair, and metrics were imposed to block poorly performing judges. Sample sizes are 1000 for all system pair comparisons.
Results in Tab. 4 confirm that MoE outperforms consensus-MSA and T5. Although RL helps improve automatic evaluation scores on top of MoE, human judges mostly prefer MoE over MoE+RL. The fact that applying RL can hurt readability (Paulus et al., 2018;Pasunuru and Bansal, 2018) may explain why MoE+RL achieves higher automatic scores yet worse human scores than MoE. We observe that MoE+RL has occasional difficulty predicting where ellipsis tokens should be, which hurts sketch readability. Moreover, since our task is a subjective one (e.g., different preferences for more/less verbose sketches), a customized RL reward (e.g., different cost for deletion and insertion for WER calculation) should be applied in real applications to reflect different user preferences.

Analysis
We investigated differences in model performance when the input documents have different similarity scores. When similarity is lower, models tend to produce drafts that are harder to interpret because they generate a higher percentage of generic or ellipsis tokens, although MoE and MoE+RL are less vulnerable to this. Tab. 7 shows the average percentage of generated ellipsis or punctuation tokens for each system when input document similarity is high or low. When the similarity score is higher, the two statistics do not differ greatly between systems, except that MSA has a much higher percentage of ellipses. When similarity is low, MoE and MoE+RL have much lower percentages of ellipses and punctuation than the other models.
Tab. 5 shows two examples of system outputs, drawn from clusters of high and low input similarity respectively. When input similarity is high, content tokens from consensus-MSA are included in the target document, while the neural models tend to generate more hallucinated tokens; this explains the better score consensus-MSA achieves in the higher input similarity cluster. We also notice that when similarity is high, consensus-MSA generally has better coverage of the content shared among input documents, and its generated drafts are on average more than 10% longer than those of other systems. When similarity is low, the second example shows that all systems generate shorter outputs due to less shared structure between input documents. In this case, consensus-MSA starts to include more content that does not necessarily appear in the target, while MoE-based models generate more reasonable sketches with fewer of the ellipses and uninformative tokens that hurt sketch readability.

Discussion
Since gold document sketches are difficult to annotate, challenges remain in how to leverage or create better weak supervision. We may also want models to adjust to users' personalized preferences over document sketches (e.g., some may prefer more deletions than insertions, or vice versa) in practical application deployments, where RL can play a major role by optimizing models directly with customized rewards.
We show that MoE-based models tend to generate more readable sketches by outputting fewer ellipses and functional tokens, though in some cases it can be difficult to guess what should fill a given ellipsis. An interesting future study could examine how the placement and number of ellipses in a sketch affects readability. We would also like to explore whether replacing ellipsis tokens with some more contentful label (e.g., entity type) or a retrieved sample content for a gap (Xu et al., 2021) could help make sketches more interpretable.
Table 7: Percentage of ellipsis tokens (among all generated tokens) and punctuation tokens (among generated tokens excluding ellipses). 'High' and 'low' refer to the average pair-wise input document similarity level.

An initial document sketch provides the reusable text structure that allows a user to fill in detailed content. Interactive document generation can be seen as the next step of our document sketching task, one that dynamically fills in more local content based on users' inputs. We suggest that RL could again
play a major role towards building such writing assistants, with rewards from a real user or a user simulator. We notice, for example, in Tab. 6 that given user input in the first sentence, the subsequent generated sentences become much more specific and relevant to the new input.

Conclusions
We have presented a new task, DOCUMENT SKETCHING, designed to generate draft documents that users can edit. To support this task, we introduced a new dataset and a weakly supervised learning setting. Experimental results show that deep learning models outperform established multi-sequence alignment approaches.

Acknowledgments
We thank members of Microsoft Research and University of Washington's NLP groups who provided feedback and insights to this work. In particular, we thank Chenyan Xiong for providing insights during early project discussions. We also thank Michael Lee, Sara Ng, Sitong Zhou, and Trang Tran for their assistance and feedback on the human evaluation setup. We also thank Zhang Li, Kosh Narayanan, Chandra Chikkareddy, Si-Qing Chen, and Weixin Cai of Microsoft's Natural Language Experience team for their insights into authoring experience.

Ethical Considerations
We believe this paper to be part of a body of work that can mitigate some of the ethical pitfalls (Bender et al., 2021) of large pre-trained models such as GPT-3 (Brown et al., 2020). While the present work does not entirely prevent misuse, e.g., by a malicious user trying to promote misinformation, Automatic Document Sketching can be more broadly framed as a form of controllable generation, which places greater control in the hands of users. As such, it is complementary to ongoing research on fake news detection (Zellers et al., 2019), toxicity detection (Han and Tsvetkov, 2020; Pavlopoulos et al., 2020), and consistency detection in, e.g., summarization (Maynez et al., 2020). Human subjects (crowdworkers) were paid at a rate higher than the legal minimum wage in our locale (Washington State, U.S.A.).