Expository Text Generation: Imitate, Retrieve, Paraphrase

Expository documents are vital resources for conveying complex information to readers. Despite their usefulness, writing expository text by hand is a challenging process that requires careful content planning, obtaining facts from multiple sources, and the ability to clearly synthesize these facts. To ease these burdens, we propose the task of expository text generation, which seeks to automatically generate an accurate and stylistically consistent expository text for a topic by intelligently searching a knowledge source. We solve our task by developing IRP, a framework that overcomes the limitations of retrieval-augmented models and iteratively performs content planning, fact retrieval, and rephrasing. Through experiments on three diverse, newly-collected datasets, we show that IRP produces factual and organized expository texts that accurately inform readers.


Introduction
Expository writing is intended to inform a reader about a topic in a clear and logical manner (Weaver and Kintsch, 1991). Such writing is highly prevalent online, appearing in various forms such as university descriptions, medical information, and Wikipedia articles. Despite its importance, writing expository text is a difficult task, as it requires the author to carefully plan content for the document, synthesize information from multiple sources, and faithfully paraphrase the gathered facts.

[Figure 1: Expository texts as college descriptions produced by IRP, the LED language model (Beltagy et al., 2020), and one-shot ChatGPT, compared to the ground truth. IRP and LED use a factual corpus as input and ChatGPT is prompted with one training example. Differently highlighted text indicates significant factual errors. Bold indicates similar organization and wording.]

Expository text generation takes a topic as input (e.g., a college name) to generate an expository document (e.g., a college description). To produce this document, there are two necessary steps: 1) searching a large knowledge source, which may be the web or a provided factual corpus, for facts on the topic of interest; and 2) faithfully synthesizing and paraphrasing said information into a structured, readable document.
We position expository text generation as a knowledge-intensive text generation task with two properties. First, expository texts are organized and phrased similarly, as shown by the gold college descriptions in Figure 1, which discuss the institutions' founding, enrollment, location, and campus size, in that order. This is a key difference from summarization. Typical summarization tasks prioritize condensing salient information and thus, the style of the output should match the style of the input text (Zhong et al., 2020). In expository text generation, we are primarily focused on generating factual text in a consistent style, where this style is learned from examples of output documents, and may not be reflected in the input knowledge source. This makes our proposed task more similar to data-to-text generation (Gatt and Krahmer, 2018), but we seek to generate natural language text from a large, unstructured factual source, rather than structured data as in data-to-text generation.
Second, expository text generation provides a generic document title as the only search query to obtain information for the output. This is a key difference from other query-based tasks such as long-form question answering (LFQA) (Fan et al., 2019), which provide fine-grained search queries as inputs (e.g., "How do Jellyfish function without brains or nervous systems?"). In LFQA, these queries are sufficient to retrieve the information for answering the question. However, in expository text generation, the document title alone is insufficient to retrieve facts for the output, especially when nuanced information is required (e.g., university ranking). Hence, effective expository text generation models must perform additional reasoning and planning to create fine-grained search queries, as demonstrated by our ablation studies (§5.3).
Current language models (LMs) (Lewis et al., 2020a; Raffel et al., 2020; Touvron et al., 2023) are unaligned with these goals of expository text generation, due to their inability to plan which specific facts should be retrieved for the output (Puduppully et al., 2019) and their tendency to hallucinate when paraphrasing facts (Ji et al., 2022). For example, in Figure 1, we find that ChatGPT and LED generate documents that resemble college descriptions, but hallucinate many incorrect key facts, such as the institution's founding and enrollment, significantly weakening the credibility of the document.
To address these issues, we propose an iterative framework named Imitate, Retrieve, Paraphrase (IRP). While LMs struggle to perform content planning, fact retrieval, and faithful rephrasing, IRP explicitly distributes these tasks across three key modules: the Imitator, Retriever, and Paraphraser. First, to decide which facts from the corpus should be included in the next sentence of the output, the Imitator generates a content plan in the style of the expository document domain. The Retriever then uses this content plan to find relevant, up-to-date information from the factual corpus. Finally, to maintain the flow of the expository document, the Paraphraser faithfully rewords the retrieved information in the style of the content plan. IRP repeats this process at the sentence level, allowing the model to focus on preserving the factuality of individual sentences rather than the entire output at once. Unlike existing models, IRP explicitly addresses content planning, fact retrieval, and faithful rephrasing, resulting in high-quality outputs (shown in Figure 1).
To our knowledge, this is the first work to explore the task of expository text generation. Hence, we study the effectiveness of IRP on three diverse, newly-collected datasets: university descriptions, medical information on prescription drugs, and history sections of Wikipedia articles in computer science. Through extensive experiments, we show that IRP produces higher-quality expository text compared to state-of-the-art models. Further, we observe that human judges find the outputs of IRP more factually correct than those of these baselines. Our contributions can be summarized as follows: 1) We introduce the task of expository text generation, which aims to generate a factual document that clearly informs readers about a specified topic.
2) We propose the IRP framework to address the challenges of expository text generation. IRP iteratively and explicitly tackles the key steps of content planning, fact selection, and faithful rephrasing.
3) Through extensive experiments on three diverse datasets, we demonstrate that IRP produces highly factual documents, both in terms of automated metrics and human evaluation, while also adhering to the style of the expository document domain. 4) We conduct ablation studies and investigate the factual errors produced by IRP to suggest new directions for expository text generation research.
Expository text generation is a knowledge-intensive task, as it requires searching a knowledge source. However, the IRP model has two notable differences from traditional RAG models. First, to discover information for expository text generation, IRP uses learned, fine-grained queries in the form of stylistic content plans, while RAG models primarily use the document title as a query. Fine-grained queries allow IRP to better locate the necessary information for the output, resulting in improved factuality (§5.1). Second, IRP is an iterative RAG model, meaning that it generates text sentence-by-sentence and can attend to shorter pieces of text at a time. We find that this incremental generation is essential for preserving factual consistency (§5.3).
However, these models have been developed for summarization tasks with short document inputs and cannot be directly applied to expository text generation, which leverages a large factual source.

Wikipedia Generation
The closest task to expository text generation is Wikipedia generation (Sauper and Barzilay, 2009), which seeks to automatically create Wikipedia articles from article titles. Typically, this is achieved by retrieving input text from the web, followed by re-ranking and generation steps (Banerjee and Mitra, 2015, 2016; Liu et al., 2018; Pochampally et al., 2021). However, these models are often tailored for specific Wikipedia domains. In expository text generation, we seek to create models that can generalize to different domains of expository documents.

Method
IRP tackles expository text generation where the input is 1) a topic t for the expository document and 2) a large corpus of factual sentences C = {x_j} related to topic t. As an initial study, we fix the corpus C to compare models, but in practice, C can be acquired in real time for up-to-date information. To guide the initial generation, IRP also takes 3) a sequence of words r = {r_k} to prefix the expository document. Using these inputs, IRP aims to produce a sequence of sentences D = {y_i} that comprises the generated expository text. The document D must contain accurate information about the topic t from the corpus C, organized in a clear and logical way.
As illustrated in Figure 2, IRP leverages three components: 1) a style Imitator p(y_i | y_1:i−1) that generates a stylistic content plan y_i for the next sentence in the expository document, based on the current state of the output y_1:i−1 (or the prefix r in the first iteration); 2) a Retriever p(x_j | y_i) that returns the top-k factual sentences x ⊆ C most related to the content plan y_i; and 3) a Paraphraser p(z | x, y_i) that combines the semantics of x and the syntax of y_i into a reworded sentence z. We will describe each of these modules, followed by how they are ensembled and trained for the full IRP model.

Imitator
To find relevant facts for the next sentence of the expository text, we must first plan which facts need to be included. Hence, the Imitator p(y_i | y_1:i−1) generates a content plan y_i in the style of the expository document domain for the next sentence in the output document, conditioned on the current sentences in the output y_1:i−1 (or the user-specified prefix r in the first iteration). We seek to mimic the expert content planning of the expository documents in the training set. Instead of imitating content planning at the sentence level and optimizing p(y_i | y_1:i−1) directly, we consider the more relaxed problem of minimizing the cross-entropy loss of token prediction for the expository documents in the training set, i.e., ∀w_j ∈ D_d, ∀D_d ∈ {D_1, ..., D_n}:

λ_imit = − ∑_{D_d ∈ {D_1,...,D_n}} ∑_{w_j ∈ D_d} log p(w_j | w_1:j−1)

We leverage GPT-2 (Radford et al., 2019) to minimize λ_imit through causal language modeling.
During each iteration of IRP, we create a stylistic content plan y_i from the sentences y_1:i−1 (or the prefix r) by first flattening y_1:i−1 (or r) into a list of tokens s = [s_1, s_2, ..., s_m]. We initialize the causal language model with context s and iteratively generate a content plan y_i = [s_{m+1}, s_{m+2}, ..., <|EOS|>] until the end-of-sentence token is reached. By stopping at <|EOS|>, we obtain a single sentence that outlines the content needed for the next sentence of the expository document. If GPT-2 generates the end-of-text token, the document is completed.
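The decoding loop above can be sketched in a model-agnostic way. Here `next_token` stands in for any causal LM's next-token step (the paper uses GPT-2); the function name, signature, and safety cap are our own illustrative assumptions, not the authors' implementation.

```python
# Sketch of the Imitator's sentence-level decoding loop. Assumption:
# `next_token` is any callable mapping a token prefix to the next token,
# standing in for GPT-2; the markers mirror the paper's <|EOS|> and
# <|endoftext|> tokens.

def generate_content_plan(next_token, context, eos="<|EOS|>",
                          end_of_text="<|endoftext|>", max_len=64):
    """Extend `context` one token at a time until a sentence boundary.

    Returns (plan_tokens, done), where `done` is True when the language
    model signals that the whole document is finished.
    """
    plan = []
    state = list(context)          # flattened tokens s = [s_1, ..., s_m]
    for _ in range(max_len):
        tok = next_token(state)    # next token given s_1..s_m
        if tok == end_of_text:     # document complete: stop iterating IRP
            return plan, True
        plan.append(tok)
        state.append(tok)
        if tok == eos:             # one full content-plan sentence emitted
            return plan, False
    return plan, False             # safety cap on runaway generation
```

Stopping at the first `<|EOS|>` is what keeps each IRP iteration focused on a single sentence of the output.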

Retriever
In order to effectively produce the information described in the content plan, we seek to narrow the search space of where these facts could occur. Thus, given a stylistic content plan y_i produced by the Imitator, the Retriever p(x | y_i) searches for the top-k candidate facts x ⊆ C that contain the content described in y_i. We find that existing retrievers, such as DPR (Karpukhin et al., 2020) and BM25 (Robertson et al., 1995), fail to complete this task, as the hallucinated entities in the content plan y_i impair these models' search capabilities. For example, when generating an expository document for ML history, the content plan y_i may be the sentence "Machine learning was named by Noam Chomsky in 1984." The hallucinated terms "Noam Chomsky" and "1984" should be ignored when searching for the correct facts, but we find that DPR and BM25 still weigh these terms in their implementations.
To address this issue, we fine-tune DistilBERT (Sanh et al., 2019) on the task of classifying the position of each sentence in the expository document. In doing so, we find that DistilBERT more effectively ignores hallucinated entities during retrieval, which we analyze in §5.5. DistilBERT performs fairly well on the classification task, given the consistent, logical format and style of expository documents within the same domain.
We compute the relevance of each sentence x_j ∈ C to the content plan y_i by taking the dot product of x_j and y_i, both embedded by DistilBERT. To obtain these embeddings, we feed each sentence through the classifier and take its representation in the last layer, averaged over all tokens:

p(x_j | y_i) ∝ d(x_j)^T q(y_i). (4)

The top-k most relevant factual sentences x ⊆ C to y_i will have the k highest values for p(x_j | y_i), which can be obtained through Maximum Inner-Product Search (Shrivastava and Li, 2014).
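The scoring rule in Eq. 4 amounts to an exhaustive inner-product search over the corpus. A minimal sketch, assuming `plan_vec` and `fact_vecs` are the mean-pooled DistilBERT embeddings q(y_i) and d(x_j) (here, any vectors work):

```python
# Minimal sketch of p(x_j | y_i) ∝ d(x_j)^T q(y_i). Exhaustive search is
# shown for clarity; a real system would use an approximate MIPS index.

def top_k_facts(plan_vec, fact_vecs, k):
    """Return the indices of the k facts whose embeddings have the
    largest inner product with the content-plan embedding."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    scores = [(dot(v, plan_vec), j) for j, v in enumerate(fact_vecs)]
    scores.sort(reverse=True)
    return [j for _, j in scores[:k]]
```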

Paraphraser
To ensure the expository document flows smoothly, we must reword the retrieved factual information in the style of the expository document domain. Thus, after obtaining a stylistic content plan y_i and factual sentences x, the Paraphraser p(z | x, y_i) must generate a single sentence z aligned to the syntax of y_i and the semantics of x. To achieve this goal, we formulate a variation on text generation with syntactic exemplars (Chen et al., 2019; Lin et al., 2020). We aim to minimize the cross-entropy loss of token prediction for z, conditioned on y_i and x:

λ_para = − ∑_j log p(z_j | z_1:j−1, x, y_i)

We minimize λ_para with BART (Lewis et al., 2020a), a seq2seq transformer-based language model. We modify the input so that x and y_i are surrounded by custom <|fact|> and <|style|> tokens. Our problem formulation differs from traditional text generation with syntactic exemplars in that the input x contains multiple sentences instead of one. This change is necessary, as the information outlined in the content plan y_i may be distributed across multiple sentences. Thus, BART must learn to aggregate information from multiple sentences while adhering to the style of the content plan.
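The input layout for the Paraphraser could be serialized as follows. The exact serialization below is an assumption on our part; the paper only states that x and y_i are wrapped in custom `<|fact|>` and `<|style|>` tokens.

```python
# Sketch of the Paraphraser's input layout for a seq2seq model such as
# BART. Assumption: facts and the plan are concatenated between paired
# marker tokens; the real ordering/delimiters may differ.

def build_paraphraser_input(facts, plan):
    """Serialize the retrieved facts x and the stylistic content plan
    y_i into one input sequence."""
    fact_span = " ".join(facts)
    return f"<|fact|> {fact_span} <|fact|> <|style|> {plan} <|style|>"
```

The marker tokens let the model distinguish which part of the input supplies semantics (the facts) and which supplies syntax (the plan).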

The Iterative IRP Framework
The Imitator, Retriever, and Paraphraser are ensembled to generate expository documents, as detailed in Algorithm 1. After the topic t, prefix r, and factual corpus C are provided, the Imitator first uses r as the initial context for GPT-2 to generate a stylistic content plan y_i. Next, the Retriever embeds y_i and each sentence x_j ∈ C with DistilBERT, in order to find the top-k factual sentences x ⊆ C most similar to y_i. Finally, the Paraphraser uses BART to combine the syntax of y_i and the semantics of x into a single sentence z, which is appended to the output D. The next prefix for the Imitator is set to D, and the process is repeated until the generated content plan y_i is the <|endoftext|> token.
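The loop in Algorithm 1 can be sketched with the three modules abstracted as callables. The callables `imitate`, `retrieve`, and `paraphrase` stand in for the GPT-2, DistilBERT, and BART components; the sentence cap is our own safety assumption.

```python
# Sketch of the iterative IRP loop (Algorithm 1), with the three trained
# modules passed in as functions so the control flow is self-contained.

def irp_generate(imitate, retrieve, paraphrase, prefix, corpus,
                 k=15, max_sents=20):
    """Iteratively build the expository document D sentence by sentence."""
    doc = []
    context = prefix
    for _ in range(max_sents):
        plan = imitate(context)            # stylistic content plan y_i
        if plan == "<|endoftext|>":        # Imitator signals completion
            break
        facts = retrieve(plan, corpus, k)  # top-k factual sentences x
        z = paraphrase(facts, plan)        # faithfully reworded sentence
        doc.append(z)
        context = " ".join(doc)            # next prefix is the output so far
    return doc
```

Because the Imitator is always conditioned on the document generated so far, each new content plan follows the style and ordering learned from training documents.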

Training
The Imitator, Retriever, and Paraphraser modules are trained independently to tackle expository text generation. We will now describe how we modify an expository text generation training set to train each of these components. The training set contains pairs of factual corpora (input) and expository documents (output), i.e., (C, D). The Imitator and Retriever are trained without modifying the training set, solely leveraging the expository documents. The Imitator performs causal language modeling with GPT-2 on each document D, while the Retriever uses DistilBERT to classify the position of every sentence comprising each document D.
To train the Paraphraser, we require triplets of stylistic content plans y_i, sets of factual sentences x, and reworded sentences z. For a given triplet, we can obtain the reworded sentence z by selecting any of the sentences found in an expository document D. Working backwards, we represent the stylistic content plan y_i as a sentence from a different expository document that has high similarity to z, where similarity is calculated with Eq. 4. We obtain x in a similar manner, using z to retrieve the top-k factual sentences x ⊆ C, also according to Eq. 4. By using z instead of y_i to retrieve x, we can be more confident that x will contain the information needed to reconstruct z, reducing the need for the Paraphraser to hallucinate during training.
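The triplet-construction procedure above can be sketched as follows. Here `sim` is an assumption standing in for the Eq. 4 dot-product similarity over Retriever embeddings; any sentence-similarity function illustrates the control flow.

```python
# Sketch of assembling one Paraphraser training triplet (y_i, x, z).
# Assumption: `sim(a, b)` scores sentence similarity, standing in for the
# Retriever-embedding dot product of Eq. 4.

def build_triplet(z, other_doc_sents, corpus, sim, k):
    """For a target sentence z: the style exemplar y_i is the most
    similar sentence from a *different* document, while the facts x are
    retrieved with z itself (not y_i), so x is likely to contain the
    information needed to reconstruct z."""
    y_i = max(other_doc_sents, key=lambda s: sim(s, z))
    x = sorted(corpus, key=lambda s: sim(s, z), reverse=True)[:k]
    return y_i, x, z
```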

Datasets
We test the capabilities of IRP on three diverse, newly-collected datasets. 1) U.S. News† is a corpus of 433 college descriptions from the top 500 ranked colleges on U.S. News. We select the college name as the topic of the document. 2) Medline contains information for 844 medications from MedlinePlus‡, a medical library supported by the National Institutes of Health. We select the medication name as the topic of the document. 3) WikiCS is a collection of the first paragraphs of history sections from 500 Wikipedia§ articles in computer science. We select the Wikipedia article title as the topic of the document. For each dataset, we create a 70/10/20 train/validation/test split. No data from the test set is used to train or validate any of the models or components of IRP. We provide full details for dataset collection in Appendix A.1.
For the scope of this work, we assume that the best corpus C has already been obtained for each document D. This is an approximation of the realistic scenario where C is retrieved in real time. To obtain these ideal corpora, we collect documents from the web, reverse engineering from the document D. We web scrape sentences W from the top-5 web pages returned when using the topic t and each sentence of D as search queries. We exclude pages that contain the ground truth document D. We find that in almost all cases, the retrieved sentences W provide all of the necessary information for generating D. Still, to ensure comprehensive coverage, we create two versions of each dataset, one where C = W and one where C = W ∪ D, denoted without doc and with doc, respectively. To introduce variation in the with doc datasets, we perform back translation (Mallinson et al., 2017) on the sentences from D, which does not affect their content. The corpora C are shuffled in each entry of the datasets.

Baselines
We compare IRP with the following baselines: 1) LLaMa (Touvron et al., 2023) is an LLM shown to have competitive performance with GPT-3. We choose the 7B version of LLaMa and prompt with 5 representative training examples. LLaMa prefixes its output with the same prefixes used by IRP.
2) LLaMa+Retr is LLaMa with an extra input of the top-5 sentences retrieved by DPR from the factual corpus, using the document title as the query.
3) LED (Beltagy et al., 2020) is a seq2seq LM leveraging the Longformer model to encode and decode long documents. LED uses the factual corpus as input to generate the expository document. 4) RAG (Lewis et al., 2020b) leverages a retriever and a generator to produce outputs specific to a user query. We choose DPR (Karpukhin et al., 2020) as the retriever and BART Large as the generator. RAG uses the factual corpus and the topic (query) as inputs to produce expository documents. 5) BART is trained to generate the output using the topic as the sole input. This model helps us assess whether other models use the factual corpus, or whether they simply memorize the style of the expository text.

Training Setup
IRP uses GPT-2 Large, DistilBERT Base, and BART Large for the Imitator, Retriever, and Paraphraser, respectively. We use "[topic] is," "[topic] is used to treat," and "[topic] was first created" as the prefixes for U.S. News, Medline, and WikiCS, respectively. We selected these prefixes by assessing the most common prefixes of outputs in the training set. As a quality control check after generation, we filter sentences deemed repetitive by the Retriever embeddings (cosine similarity above 0.98). For all models, we manually select hyperparameters by assessing R1 and validation loss. We discuss all training details in Appendix A.2.
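The repetition filter above can be sketched as follows. Here `embed` is an assumption standing in for the Retriever's DistilBERT embeddings; the greedy keep-first policy is our reading of the quality-control step.

```python
# Sketch of the post-generation quality-control filter: drop a sentence
# whose embedding has cosine similarity above the threshold with any
# earlier kept sentence. Assumption: `embed` maps a sentence to a vector.

from math import sqrt

def drop_repetitive(sentences, embed, threshold=0.98):
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na, nb = sqrt(sum(x * x for x in a)), sqrt(sum(x * x for x in b))
        return dot / (na * nb)
    kept, kept_vecs = [], []
    for s in sentences:
        v = embed(s)
        if all(cos(v, u) <= threshold for u in kept_vecs):
            kept.append(s)
            kept_vecs.append(v)
    return kept
```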

Quantitative Metrics
We evaluate the quality of the generated documents with two sets of metrics. First, we use traditional metrics: ROUGE-1 (R1) and ROUGE-2 (R2) (Lin, 2004), BLEU (Papineni et al., 2002), and METEOR (Denkowski and Lavie, 2014) measure the similarity between the predicted and true outputs.
However, as these metrics have low correlations with human judgements of factuality (Kryscinski et al., 2020; Fabbri et al., 2022), we also adopt state-of-the-art factuality metrics. First, we calculate the average percentage of tokens in the generated document that are Hallucinated, meaning that they do not appear in the input corpus. Halluc indicates whether the generated text is faithful to the input corpus. Next, we use FactCC, a classifier trained to detect factual errors between a source text and a claim (Kryscinski et al., 2020). We use the true output as the source text and each sentence of the generated document as the claim, and report the average proportion of source/claim pairs that FactCC predicts as factually consistent. Finally, as research has suggested that natural language inference has a high correlation with human judgements of factuality (Maynez et al., 2020), we calculate whether the generated document is entailed (NLI-Ent) or contradicted (NLI-Contr) by the true output. We use a DistilBERT classifier pre-trained on the MNLI dataset (Williams et al., 2018) (accuracy of 0.82), and report the average proportion of the generated sentences that are predicted to be entailed (and contradicted) by the true output.
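The Halluc metric can be sketched as below. Whitespace tokenization and lowercasing are our assumptions; the paper does not specify the tokenizer.

```python
# Sketch of the Halluc metric: the fraction of generated tokens that
# never appear anywhere in the input corpus. Assumption: simple
# lowercased whitespace tokenization.

def halluc_rate(generated, corpus_sentences):
    corpus_vocab = set()
    for s in corpus_sentences:
        corpus_vocab.update(s.lower().split())
    toks = generated.lower().split()
    if not toks:
        return 0.0
    missing = sum(1 for t in toks if t not in corpus_vocab)
    return missing / len(toks)
```

A lower value means the generated text stays closer to the vocabulary of its evidence, a cheap proxy for faithfulness to the input corpus.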

Performance Comparison
In Table 1, we observe that IRP obtains the highest factuality scores, achieving the strongest results for 20 out of the 24 calculated factuality metrics. Further, we find that apart from Medline, IRP outperforms baselines on almost all traditional metrics. These findings confirm that the Paraphraser faithfully rewords the factual information in the style of the expository document domain. Previous works have shown that prioritizing factuality leads to a drop in ROUGE (Goodrich et al., 2019; Maynez et al., 2020). However, IRP adheres to factual accuracy while maintaining the style of the expository document domain, suggesting the benefit of separating expository text generation into the steps of content planning, fact selection, and rephrasing.
We also find that in some cases, BART obtains factuality scores competitive with LLaMa+Retr, implying that the LLM equipped with DPR cannot effectively leverage the factual corpus. Further, although RAG is typically effective in knowledge-intensive tasks, the model produces more factual inaccuracies than IRP. Both findings suggest that the document title used by LLaMa+Retr and RAG is an insufficient query for retrieving the factual information needed to produce expository text. Hence, fine-grained queries, such as the stylistic content plans used by IRP, are necessary for obtaining all of the facts to include in the expository document.
Finally, we note that most models achieve much better performance on the with doc datasets. However, the with doc scenario is unrealistic, indicating that future models should prioritize the knowledge acquisition step of building C, as it largely dictates the factuality of the output. We believe that studying models that search the web during inference (e.g., LLMs with search engines) is a promising next step towards stronger expository text generation models.

Human Evaluation
For Fact, we provide annotators with the true output and encourage them to use external resources (Google Search). We observe high annotator agreement for Fact and Style, with Krippendorff's α (Krippendorff, 2011) of over 0.80 on each dataset. IRP strikes the best balance between factuality and style (Avg) in 3/4 datasets and competes with ChatGPT on the fourth dataset (Table 2), despite having fewer parameters (1.2B vs. 175B). Generally, we note that the LLMs (ChatGPT, LLaMa+Retr) perform well in factuality but poorly in style, while the opposite is true for the seq2seq LMs (LED, RAG).

Ablation Study
We conduct an ablation study (Table 3; full results in Table 5) to determine the contribution of two key design choices of IRP. We find that using stylistic content plans instead of the document title for retrieval, and generating text iteratively instead of all at once, both improve the fluency and factuality of IRP.

Factual Error Investigation
To investigate the errors produced by IRP, we invite one computer science student to annotate 30 expository texts generated by IRP from the without doc test sets. First, we ask the annotator to identify factual errors in the generated text compared to the true output. We then ask whether each error occurred because 1) a factual inconsistency exists in the retrieved facts (i.e., there was no hallucination), 2) no suitable fact could have been obtained by the Retriever, as it does not exist in the input corpus, 3) the fact exists in the input corpus, but the Retriever could not locate it, or 4) the Retriever located the correct fact, but the Paraphraser could not generate it faithfully. We store each step of IRP so the annotator can answer this question. We find that the majority of factual errors are due to inconsistencies in the retrieved facts, rather than weaknesses of the Retriever, the Paraphraser, or data collection (Figure 4). For example, one source may report a different university ranking compared to U.S. News. This poses an interesting question for future lines of work on expository text generation: how can we leverage fact verification to accurately select information when multiple options exist?

Retriever Embedding Analysis
The Retriever ignores hallucinated entities when creating sentence embeddings, resulting in increased performance compared to pre-trained retrievers. We visualize this property in Figure 3 and find that the Retriever puts less weight on factually specific entities (e.g., "Emory" and "Prologue"). As a result, the Retriever can focus its embeddings on the more important terms for retrieving information (e.g., "belongs," "class," and "medications").

Sample Expository Document Outputs
We provide examples of documents generated by models on our three datasets in Appendix A.3.

Conclusion
We introduce the task of expository text generation and develop IRP to overcome the inability of LMs to solve this task. IRP separately and iteratively performs content planning, fact selection, and faithful rephrasing to generate high-quality expository text. Automatic and human evaluations on three datasets demonstrate that IRP adheres to factual accuracy while maintaining the style of the expository document domain. To suggest future directions for expository text generation research, we analyze the factual errors produced by IRP, conduct ablation studies, and visualize the Retriever of IRP.

¶ https://github.com/cdpierse/transformers-interpret

Limitations
One drawback of IRP is that three separate components need to be trained for each expository document domain. However, we argue that the cost of training is not overwhelmingly high and only differs from state-of-the-art baselines by a few hours. Further, we feel that the improvements in factuality and, in many cases, style and fluency, justify this slightly higher cost of training. Since IRP consists of three separate components and is not trained end-to-end, there is also still room for improvement on expository text generation.
Further, we find that expository text generation frameworks have a large performance gap between the with doc and without doc datasets (§5.1). As discussed in the paper, we believe this gap can be overcome by studying and developing models that can retrieve information from the web in real time during inference. For example, instead of using stylistic content plans to search a provided factual corpus, these plans could be reworded into search queries to retrieve up-to-date information from Google in real time, thus overcoming any limitations of a provided factual corpus. If future work in this direction results in expository text generation models that can perform live retrieval during inference, they can also be compared and benchmarked against LLMs equipped with web search engine plugins.

Ethical Considerations
The primary goal of IRP is to maintain factual consistency in its generated expository documents. However, as with all text generation frameworks, IRP may produce factual errors, as shown in §5.4. Future expository text generation models could improve factuality by performing fact verification, retrieving live information from the web during inference, or incorporating external knowledge sources.
Further, the Paraphraser is the key component of IRP that ensures the generated text is faithful to the factual corpus. However, there is always the possibility of someone creating their own Paraphraser aligned with the goal of producing misinformation or deceptive claims from true information, and plugging this malicious component into IRP. We hope that future research will result in safeguards to detect and combat these potential risks of seq2seq language models.

A.1 Detailed Dataset Collection
To obtain the expository documents D for each dataset, we web scrape the respective websites with BeautifulSoup||. We could not find specific research licenses for the three datasets, but note that they are free to access and publicly available online. Further, we found that each dataset has been analyzed in previous NLP research papers. For a given expository document D and its topic t, we will now explain how we obtain the set of factual sentences W, briefly described in §4.1. First, we break up D into a set of sentences {y_i}. For each sentence y_i, we obtain the URLs of the top 5 search results using the query "[t] [y_i]". After repeating this for each sentence, we flatten the list of URLs into a unique set and filter the URLs that contain the ground truth expository document (e.g., for the U.S. News dataset, we filter all URLs that contain the substring "usnews"). We then use BeautifulSoup to obtain the text of all of the <p> tags. Using the nltk sentence tokenizer**, we extract all sentences and flatten them into a unique set. We clean sentences by keeping alphanumeric symbols and punctuation with regex, as well as applying unidecode†† to ensure sentences contain only ASCII characters. All information is in English, and we studied a sample of sentences to ensure that there was no offensive language in the dataset. We use a regex script to filter personal information, and analyzed a large sample of corpora from each test set to ensure personal information does not exist in the dataset. In Table 4, we display summary statistics of each dataset after this process.
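The sentence-cleaning step can be sketched as below. The exact regex and the paper's use of the `unidecode` package are approximated here with stdlib-only code; the character whitelist is our assumption.

```python
# Sketch of the sentence-cleaning step: coerce to ASCII and keep only
# alphanumeric characters and common punctuation, then normalize
# whitespace. Approximates the paper's regex + unidecode pipeline.

import re
import unicodedata

def clean_sentence(s):
    # Decompose accented characters, then drop anything non-ASCII.
    s = unicodedata.normalize("NFKD", s).encode("ascii", "ignore").decode()
    # Keep letters, digits, whitespace, and common punctuation.
    s = re.sub(r"[^A-Za-z0-9\s.,;:!?'\"()\-]", "", s)
    # Collapse runs of whitespace left behind by the removals.
    return re.sub(r"\s+", " ", s).strip()
```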

A.2 Detailed Training Setup
The Imitator is trained with GPT-2 Large (774M parameters) through the aitextgen‡‡ Python package. We choose a batch size of 1 and a learning rate of 1e-3, and train the model for 3000 steps. All other parameters are set to the default values of the aitextgen implementation. LED uses an input size of 16384 and is trained with a batch size of 8, a learning rate of 5e-5, 8 gradient accumulation steps, and 1500 warm-up steps, for 8 epochs. The generator of RAG and the BART baseline are trained with the same parameters as the Paraphraser. The Retriever of RAG selects k = 15 sentences. All unstated parameters are the default values of their respective implementations. We ensure that each model is trained until the validation loss and/or ROUGE-1 score converges.
For the LLMs (LLaMa, LLaMa+Retr, ChatGPT), we perform 3-shot prompting using 3 manually selected, representative input/output training examples. We assess the outputs of all baselines and perform the same quality control check as IRP, filtering semantically repetitive sentences to improve the fluency of the baseline outputs.

A.3 Qualitative Analysis
In Tables 6, 7, and 8, we present examples of expository documents generated by IRP on U.S. News, Medline, and WikiCS, respectively, on both versions of the datasets (with doc and without doc). We also display the topic of the expository document and the true output. In these examples, we can see that IRP produces text with high factual accuracy without sacrificing fluency or the style of the expository document domain.
Further, in Tables 9, 10, and 11, we directly compare the expository document outputs of IRP and the baselines (LED, RAG, LLaMa, LLaMa+Retr) on U.S. News, Medline, and WikiCS, respectively. On U.S. News, we find that the baselines tend to produce factual errors related to many of the key details, such as the institution's founding and tuition. On Medline, we find that the baselines struggle to generate accurate drug classes and explanations for how the medications affect the human body. Some generated documents also contain phrases that are repetitive and difficult to understand. On WikiCS, we find that the baselines are mostly factually accurate, but the documents lack overall structure and coherence. Compared to other models, LLaMa struggles the most with preserving the style of the expository document domain.

A.4 Human Evaluation
We display the set of instructions given to human annotators for evaluating the style and factual accuracy of expository documents in Figure 5.

In the following survey, you will read a total of 200 generated college descriptions in the style of U.S. News. Please rate each document on a scale of 1-5 in Style Adherence and Factual Accuracy. Please use the following guidelines for these two attributes, illustrated with sentences about the fictitious "Moon University."

Style Adherence: How similar is the generated text compared to the true output? Do they organize the same information in the same order, generally using the same phrasing? We are not concerned with whether the factual information is correct, but rather whether the factual information is being described/outlined appropriately.

Example True Output:
Moon University is a public institution that was founded in 2022.
Example Ratings:
1 - Moon University has a total enrollment of 50 students.
3 - Moon University was founded in 2022, is a public university, and enrolls 50 students.
5 - Moon University is a public institution that was founded in 2005.
Factual Accuracy: How accurate is the information conveyed in the document? Are there significant factual inconsistencies or errors? We will provide the ground truth output along with the generated document. Please use Google to verify factual errors if they are not obvious.

Example True Output:
Moon University is a public institution that was founded in 2022. It has a total enrollment of 50.
Example Ratings:
1 - Earth University is a private institution that was founded in 1990. It has 10,000 students.

Diltiazem
Diltiazem is used to treat high blood pressure and control angina pectoris (chest pain). Diltiazem belongs to a class of medications called calcium channel blockers. It relaxes blood vessels so the heart does not have to pump as hard. It also increases blood and oxygen supply to the heart.
Diltiazem is used to treat high blood pressure and angina pectoris (a condition in which the heart is unable to pump enough blood to all parts of the body). Diltiazem belongs to a class of medications called calcium channel blockers. It works by relaxing blood vessels in the body and heart and lowering the heart rate.
Diltiazem is used to treat certain types of heart rhythm disorders such as atrial fibrillation (a condition in which the heart beats irregularly, causing shortness of breath, dizziness and fatigue). Diltiazem belongs to a class of medications called calcium channel blockers. It works by relaxing blood vessels so that blood can flow to the heart.

Midostaurin
Midostaurin is used with other chemotherapy drugs to treat certain types of acute myeloid leukemia (AML; a type of cancer of the white blood cells). Midostaurin is also used for certain types of mastocytosis (a blood disease in which there are too many mast cells [a certain type of white blood cells]). Midostaurin belongs to a class of drugs called kinase inhibitors. It works by blocking the action of the abnormal protein that signals the proliferation of cancer cells, helping to stop the spread of mast and cancer cells.
Midostaurin is used with other chemotherapy drugs to treat certain types of acute myeloid leukemia (AML; a type of cancer of the white blood cells).

The term "business process modeling" itself was coined in the 1960s in the field of systems engineering by S. Williams in his 1967 article "Business Process Modelling Improves Administrative Control." His idea was that techniques for obtaining a better understanding of physical control systems could be used in a similar way for business processes. It was not until the 1990s that the term became popular.

Daridorexant is used alone or in combination with other medications to treat insomnia (insomnia or disturbed sleep). It is also used to treat restless leg syndrome (RLS; restless legs syndrome; a condition in which the legs do not function properly and the person can experience weakness, numbness, loss of muscle coordination and problems falling asleep) in adults and children 6 years of age and older who have not been successfully treated with benzodiazepines (narcotic) or sedatives (sudden, unexpected bouts of insomnia that can occur when the person is not treated with these medications).

BART
Daridorexant is used alone or in combination with other medications to treat high blood pressure. It is also used to treat agina pectoris (chest pain that occurs when the heart does not pump enough blood to all parts of the body) in adults and children 2 years of age and older. Daridorezant belongs to a class of medications called angiotensin II receptor antagonists. It blocks the action of certain natural substances that tighten blood vessels, allowing blood to flow more smoothly through the body.

LED
The term "business process modeling" was coined by the systems engineer S. Williams in 1967.
Fractal patterns have been found in the paintings of American artist Jackson Pollock.Pollock's paintings were composed of chaotic dripping and splattering, and Pollock's paintings were composed of chaotic dripping and splattering.

RAG
A business process model is a graphical representation of a business process or workflow and its related sub processes. A process model can grow out of procedural maps of day to day operations.
Business Process Management (BPM) was originally developed in the 1980s by a group of researchers at the University of Illinois at Urbana-Champaign under the name of Business Process Management.Their goal was to "build a business process model that would allow companies to identify and manage their business processes in a way that would reduce costs and increase productivity.

BART
The concept of business process management dates back to the 1960s, with the introduction of the concept of "business process management" in the United States.Business process management was originally developed in the 1970s by the University of California, Berkeley, and the National Institute of Standards and Technology (NIST).

Figure 2 :
Figure 2: Overview of the IRP framework. The Imitator (GPT-2) first produces a stylistic content plan that outlines the information to be discussed in the next sentence. The Retriever (DistilBERT) uses this content plan to find relevant factual sentences in the input corpus. The Paraphraser (BART) rewords the factual sentences in the style of the stylistic content plan. This sentence is the next input for the Imitator, and the process is repeated.
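The iterative loop of Figure 2 can be sketched as follows. Here `imitator`, `retriever`, and `paraphraser` are stand-ins for the fine-tuned GPT-2, DistilBERT, and BART components, and the stopping condition shown (a fixed sentence budget) is an assumption made only for the sketch.

```python
def irp_generate(topic, corpus, imitator, retriever, paraphraser,
                 max_sentences=5):
    """One IRP decoding loop: plan -> retrieve -> paraphrase, repeated.

    imitator(topic, document_so_far) -> stylistic content plan (str)
    retriever(plan, corpus)          -> relevant factual sentences (list)
    paraphraser(facts, plan)         -> next sentence in the target style
    """
    document = []
    for _ in range(max_sentences):
        plan = imitator(topic, document)      # content planning
        facts = retriever(plan, corpus)       # fact retrieval
        sentence = paraphraser(facts, plan)   # rephrasing
        document.append(sentence)             # input to the next Imitator step
    return " ".join(document)
```

With toy callables in place of the three models, the loop produces one sentence per iteration, each conditioned on the document generated so far.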

Figure 4 :
Figure 4: Distribution of IRP factual error types.
3 - Moon University is a public institution founded in 2000. It has a total enrollment of 75.
5 - Moon University, founded in 2022, is a public institution. A total of 50 students are enrolled.

Figure 5 :
Figure 5: Evaluation instructions for expository document generation on U.S. News.

Table 2 :
Human evaluation of style and factuality of expository documents on a 5-point Likert scale. Avg is the average of the style and factuality scores. ChatGPT is prompted with three representative topic/output examples.

Table 3 :
R1 and factuality of IRP versus Generating text All at once instead of iteratively, and using a Topic Query over stylistic content plans on WikiCS with doc.

Figure 3 :
Figure 3: Visualized token attribution scores for the classification task performed by the Retriever on a sample of sentences from each test set. Darker shades of blue indicate higher token attribution scores. Scores are calculated with transformers-interpret, a Python library that leverages Captum (Kokhlikyan et al., 2020).
Similar to previous works (Hua and Wang, 2019; Balachandran et al., 2022), we ask annotators to evaluate each baseline on Style adherence to the true output (i.e. organization and phrasing) and Factuality on a 5-point Likert scale.

Table 4 :
Summary statistics of U.S. News, Medline, and WikiCS datasets for expository document generation.

Table 5 :
Comparison of traditional metrics (ROUGE-1, ROUGE-2, BLEU, METEOR) and factuality metrics (Hallucinations, FactCC, Entailment, Contradictions) for IRP ablations. Gen All generates the text at once rather than sentence-by-sentence. Topic Query uses the topic as the search query in the factual corpus instead of the stylistic content plans used by IRP. Best results are in bold.

Table 6 :
Sample expository documents generated by IRP on U.S. News.

Table 7 :
Sample expository documents generated by IRP on Medline.

Table 8 :
Sample expository documents generated by IRP on WikiCS.

Table 9 :
Baseline comparison of generated expository documents on U.S. News for Southeastern University. Red text indicates factual errors.

Daridorexant is used to treat insomnia (insomnia; a sleep disorder in which sleep is disturbed and difficult to control) in adults and children 12 years of age and older. Daridorexant belongs to a class of drugs called orexin receptor antagonists. It works by slowing activity in the brain to enable sleep.

RAG
Daridorexant is used alone or in combination with other medications to treat insomnia (insomnia or disturbed sleep) in adults and children 6 years of age and older. It works by slowing activity in the brain to enable sleep.

Table 10 :
Baseline comparison of generated expository documents on Medline for Daridorexant. Red text indicates factual errors. Blue text indicates incoherent, repetitive, or difficult-to-understand phrasing.

True Output
Business process modeling techniques such as flowchart, functional flowblock chart, control flowchart, Gantt chart, PERT chart, and IDEF have been around since the beginning of the 20th century. Gantt charts were among the first to appear around 1899, flowchart in the 1920s, functional flowblock chart and PERT in the 1950s, data flowchart and IDEF in the 1970s. Modern methods include Unified Modeling Language and Business Process Model and Notation, but they are only a fraction of the methods used over the years to document business processes. S. Williams coined the term "business process modeling" in the 1960s in his 1967 article "Business Process Modelling Improves Administrative Control."

IRP (Ours)
S. Williams coined the term "business process modeling" in the 1960s in his 1967 article "Business Process Modelling Improves Administrative Control." These process models help organizations document workflows, surface key metrics, pinpoint potential problems and intelligently automate processes. Business process modeling became the foundation of new methods, for example those that supported data collection, data flow analysis, process flow diagrams and reporting facilities. Business process modeling techniques have been developed over the decades to support specific business needs. The term "business process modeling" itself was coined in the 1960s in the field of systems engineering by S. Williams in his 1967 article "Business Process Modelling Improves Administrative Control."

Table 11 :
Baseline comparison of generated expository documents on WikiCS for Business Process Modeling. Red text indicates factual errors. While LED is factually correct, it does not provide enough detail on the with doc dataset and fails to discuss the topic of Business Process Modeling adequately on the without doc dataset.