Sequentially Controlled Text Generation

While GPT-2 generates sentences that are remarkably human-like, longer documents can ramble and fail to follow human-like writing structure. We study the problem of imposing structure on long-range text. We propose a novel controlled text generation task, sequentially controlled text generation, and identify a dataset, NewsDiscourse, as a starting point for this task. We develop a sequentially controlled text generation pipeline with both generation and editing components. We test different degrees of structural awareness and show that, in general, more structural awareness results in higher control accuracy, grammaticality, coherence and topicality, approaching human-level writing performance.


Introduction
Imagine that you are tasked with: Write a "Related Works" section. Would it help to know the past structure of the article (e.g., that it comes after the "Discussion" section)? How about the full structure of the article (e.g., after the "Introduction" but before the "Problem Statement")?
On the other hand, although numerous works have focused on content planning using keywords (Yao et al., 2019), plot design (Rashkin et al., 2020) and entity tracking (Peng et al., 2021), macro-structural control has been relatively understudied.
Figure 1: We study the task of sequentially-controlled generation: generating documents exhibiting structure given by a sequence of local control codes. Shown is a news article with its Van Dijk structure (Van Dijk, 2013) and headline. Our models take as input the headline and discourse tags and generate a sequence of sentences. We explore the degree of structural awareness (local, past-aware or full-sequence) for controlling each sentence in the document, with the goal of generating the most structurally faithful, coherent and topical text.
So, in this work, we study (1) how to impose macro-structural control on narrative text generation and (2) how much structural awareness during generation contributes to well-structured and fluent text. We propose a novel task, sequentially controlled text generation. In this task, the user provides a prompt as well as a sequence of local control codes, each of which guides the generation of a single sentence. (In our experiments, we use headlines as prompts and Van Dijk (2013) discourse tags as control codes (Figure 1).) We develop methods to address this task, expanding prior work focused on single-control-code generation (Keskar et al., 2019; Dathathri et al., 2019; Yang and Klein, 2021). As in prior work, the controlled generation problem is decomposed into a discriminator and a generator. However, in this work, the discriminator learns to incorporate an entire sequence of control codes. We hypothesize that information about structural intention can positively impact generative output (intuition for this is given in the hypothetical at the beginning of this introduction). We show that our methods improve structural cohesion and certain aspects of coherence over naive GPT-2 output.
Next, we hypothesize that more structural awareness improves generation. Again, we refer to the introduction's hypothetical: humans craft text according to how it fits into a document's full structure (Chenlo et al., 2014), so a generative model should similarly benefit from having such information. We test this hypothesis by varying the discriminator's conditional independence assumptions. We experiment with three different degrees of control: local-only (where the discriminator is only aware of the current sentence's control code), past-aware (where the discriminator is aware of the current sentence's control code and all previous control codes), and full-sequence (where the discriminator is aware of the entire document's sequence of control codes). We show that more structural awareness, especially of past structure, helps generate the highest-quality text. Finally, we show how to re-introduce a degree of local control by combining structurally-aware generation methods with a local sentence-level editing technique.
In summary, our novel contributions are: • We propose a novel task, sequentially controlled text generation, and identify a discourse schema (Van Dijk, 2013) and dataset (Choubey et al., 2020) to explore this task (Sections 2, 4).
• We combine two different approaches in controlled text generation, generation and editing, and show that the highest-quality text is generated when both of these approaches are used (Section 3).
• We use our methods to study the degree of structural control that yields the highest-quality text: local, past-aware and full-sequence control. We show that, overall, full-sequence control produces optimal text over an array of metrics (Section 7).
We see this work opening the door to a variety of follow-on directions: giving users control over the macro-structure of their generated output can allow users to quickly prototype different structures for their work. It can allow them to work in tandem with a generative algorithm to infill missing structural components in a piece of writing. It might even allow them to produce different versions of the same story for readers with different reading preferences. Finally, we also see macro-structural control providing a natural complement to, and being used in tandem with, other forms of controlled generation, like fact-aware generation (Logan IV et al., 2019) or creative generation (Goldfarb-Tarrant et al., 2020; Tian and Peng, 2022; Peng, 2022), to yield more engaging and useful generative content.

Problem Statement
We assume, as input, a headline sentence, X_0, and a sequence of control codes ⃗c = c_1, ..., c_S of length S (i.e., one for each sentence we wish to generate in the document; adjacent codes can be of the same type). We wish to produce, as output, a document X of length S as a sequence of sentences X = X_1, ..., X_S, each composed of a sequence of words. We define the sequentially controlled text generation objective as:

p(X | X_0, ⃗c) = ∏_{k=1..S} ∏_i p(x_i | x_{<i}, X_{<k}, ⃗c)    (1)

where x_i is a word in sentence k, x_{<i} are the preceding words in that sentence, and X_{<k} are the preceding sentences (including the headline, X_0). c_k is the control code for sentence k. We assume that ⃗c, the entire sequence of control codes for a document, is given.
We use Bayes' rule to factorize the per-token term of Equation 1, t₁, into:

t₁ = p(x_i | x_{<i}, X_{<k}, ⃗c) ∝ p(x_i | x_{<i}, X_{<k}) · p(⃗c | x_{≤i}, X_{<k})    (2)

where the first factor on the right, t₂, is calculated using a standard pretrained language model (PTLM) and the second, t₃, is calculated by a trained discriminator. This allows us to maximally re-use naively trained language models and, as we show, is far more resource-efficient than fine-tuning a prompt-based model.
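This factorization amounts to reweighting the language model's next-token distribution by the discriminator's control-code likelihood and renormalizing. A minimal sketch (all probabilities and the vocabulary are illustrative; `reweight_next_token` is a hypothetical helper, not part of our released code):

```python
def reweight_next_token(lm_probs, disc_code_prob):
    """Combine a naive LM's next-token distribution (the t2 term) with a
    discriminator's control-code likelihood per candidate token (the t3
    term), per t1 ∝ t2 * t3, then renormalize over the vocabulary."""
    joint = {tok: lm_probs[tok] * disc_code_prob[tok] for tok in lm_probs}
    z = sum(joint.values())
    return {tok: p / z for tok, p in joint.items()}

# Toy 3-word vocabulary: the discriminator prefers fire-related tokens
# for, say, a "Main Event" code (numbers are made up for illustration).
lm_probs = {"fire": 0.5, "blaze": 0.3, "cat": 0.2}          # p(x_i | x_<i, X_<k)
disc_code_prob = {"fire": 0.6, "blaze": 0.7, "cat": 0.05}   # p(c | x_<=i, X_<k)
posterior = reweight_next_token(lm_probs, disc_code_prob)
```

Tokens the discriminator deems inconsistent with the control code (here, "cat") lose probability mass to consistent ones.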
Three approximations for t₃ are:

Local-only: p(⃗c | x_{≤i}, X_{<k}) ≈ p(c_k | x_{≤i}, X_{<k})    (3)
Past-aware: p(⃗c | x_{≤i}, X_{<k}) ≈ ∏_{j ≤ k} p(c_j | x_{≤i}, X_{<k})    (4)
Full-sequence: p(⃗c | x_{≤i}, X_{<k}) ≈ ∏_{j = 1..S} p(c_j | x_{≤i}, X_{<k})    (5)

In the local-only model, we assume each control code c_k is conditionally independent of the other control codes given x_i. Thus, our generator model t₁ is made aware only of local structure: the control code c_k pertaining to the current sentence, k. Because of this conditional independence assumption, local-only control is similar to prior work that used only single control codes, where the goal was to generate a single sentence, p(x|c) = ∏_{i=1..n} p(x_i | x_{<i}, c) (Keskar et al., 2019). However, we show that we can remove these independence assumptions and study more complicated structural control which, as we show later, produces more coherent output.

Past-Aware
In the past-aware model, we assume autoregressive dependence between control codes, conditioned on x. Control codes for future sentences, c_{>k}, are conditionally independent. In Equation 1, this results in x_i being dependent on c_k and the preceding sequence of control codes, c_{<k}.

Full-Sequence
In the full-sequence model, we make no conditional independence assumptions.
We can restrict both the past-aware and the full-sequence approximations to a sliding window around sentence k [2]. We can also add a prior on p(⃗c) to induce a discount factor. This focuses the generator on control code c_k and down-weights surrounding control codes.
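One plausible reading of the windowed, discounted discriminator score is a sum of per-code log-probabilities, restricted to a window of width w and geometrically discounted by offset (the helper name and the b^|d| discount form are our illustrative assumptions):

```python
def windowed_discounted_logprob(code_logprobs, k, w, b):
    """Sum discriminator log-probs for the control codes in a window of
    width w around the current sentence k, discounting the code at
    offset d = |j - k| by b**d. b = 1 applies no discount; b = 0
    recovers local-only control (only the d = 0 term survives)."""
    total = 0.0
    for j, lp in enumerate(code_logprobs):
        d = abs(j - k)
        if d <= w:
            total += (b ** d) * lp
    return total
```

With `code_logprobs = [-1.0, -2.0, -0.5]` and `k = 1`, a discount of `b = 0` returns only the current code's score (-2.0), while `b = 1` sums all in-window scores (-3.5).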
In the next section, we show how to model these objectives. We first describe the discriminator we use as our control-code model, then the controlled generation techniques and the editing techniques we adapt.

Methodology
As described in Section 2, we can efficiently do generation by combining a naively-trained language model with a discriminator. Hence, the discriminator is the main architectural component that allows us to incorporate inter-dependencies between control code sequences. We start by describing how our discriminator models different degrees of structural awareness (Equations 3, 4 and 5) in Section 3.1.
We design a generation pipeline to balance structural and local awareness. The flow we use to accomplish this is depicted in Figure 3. The first step is Generation. Here, we sample each word x_i using techniques described in Section 3.2, which allow us to leverage our discriminator to impose structural control. When we have completed a sentence, we move to Editing. Here, we edit the sentence to further impose local control on each sentence, updating x to optimize a variation of Equation 1, p(x_i | x_{−i}, c_k), discussed in Section 3.3.
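The alternation between the two steps can be sketched as a simple loop; `generate_sentence` and `edit_sentence` below are stand-ins for the Section 3.2 and 3.3 components, and the function name is our own:

```python
def generate_document(headline, control_codes, generate_sentence, edit_sentence):
    """Alternate structurally-aware generation and locally-aware editing,
    one sentence per control code, as in Figure 3. The callables are
    placeholders for the actual generator and editor components."""
    context = [headline]
    for k, code in enumerate(control_codes):
        # Generation: sample a sentence under structural control.
        draft = generate_sentence(context, control_codes, k)
        # Editing: impose local control for the current code.
        final = edit_sentence(draft, code)
        context.append(final)
    return context[1:]  # generated sentences, headline excluded
```

Any generator/editor pair with these signatures can be plugged in, which is what lets us swap HSC, DPC and the editor in and out in our experiments.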

Discriminator
The discriminator we construct takes as input a sequence of sentences (X) and a sequence of local control tags (⃗c). Our architecture combines a sentence-classification model, similar to that used in Spangher et al. (2021a), with a separate label-embedding architecture to incorporate knowledge of c_{<j}. Hence, we can make predictions for c_j based not only on x but also on prior tags, c_{<j}, allowing us to model structural dependencies (Equation 2). For a full description of the architecture, see Appendix A.
[2] i.e., t₃ ranges only over j = k−w, ..., k+w instead of the full sequence of sentences. In practice, we use w = 3.
Figure 3: Generation process. First, we perturb the output of a language model using a structurally-aware classifier to approximate p(x_i | x_{<i}, X_{<k}) p(⃗c | x_{<i}, X_{<k}), and generate word x_i by sampling from the perturbed distribution. When we generate an <eos> token, we edit the sentence. We use a discriminator to identify class-salient words to mask, generating masked sentence M, and infill to boost class likelihood.
We train it to model the local-only, past-aware and full-sequence control variants expressed in Section 2: we train separate prediction heads to make predictions on c_{k−w}, ..., c_k, ..., c_{k+w}, i.e., labels from −w, ..., +w steps away from the current sentence k. For local-only control (Equation 3), we only use predicted probabilities from the main head, k. For past-aware control (Equation 4), we multiply predicted probabilities from heads prior to the current sentence, < k, and for full-sequence control (Equation 5), we multiply predicted probabilities from all heads. We now describe how we use these predictions.
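In log space, "multiplying predicted probabilities from heads" is just summing per-head log-probs over the offsets each variant uses. A sketch under that reading (function name and example numbers are ours):

```python
def structural_logprob(head_logprobs, variant, w=3):
    """Combine per-head discriminator log-probs for offsets d = -w..+w
    under the three control variants. `head_logprobs` maps offset d to
    the log-prob the head at k+d assigns to the gold code there."""
    if variant == "local-only":
        offsets = [0]                    # main head only (Eq. 3)
    elif variant == "past-aware":
        offsets = range(-w, 1)           # current and prior heads (Eq. 4)
    elif variant == "full-sequence":
        offsets = range(-w, w + 1)       # all heads (Eq. 5)
    else:
        raise ValueError(variant)
    return sum(head_logprobs[d] for d in offsets if d in head_logprobs)

heads = {-1: -0.7, 0: -0.2, 1: -1.1}  # illustrative log-probs
```

Offsets that fall outside the document (missing keys) are simply skipped, mirroring the truncated windows at a document's edges.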

Generation
We combine our discriminator's predictions with a naive PTLM to solve Equation 2 in two different ways: Hidden-State Control, based on Dathathri et al. (2019), and Direct-Probability Control, based on Yang and Klein (2021).
Hidden-State Control (HSC): Wolf et al. (2019)'s GPT-2 implementation caches hidden states H to produce logits approximating p(x_i | x_{<i}). We perturb these hidden states, resulting in Ĥ, which produce logits approximating Equation 1 instead. We generate H from a naive PTLM and use it to make a prediction ĉ using our discriminator. We then calculate the loss L(ĉ, c) and backpropagate to H to derive Ĥ.
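The core move of HSC, gradient steps on a cached activation rather than on model weights, can be illustrated on a toy two-dimensional "hidden state" with a fixed logistic discriminator (the real method backpropagates through GPT-2 and our full discriminator; this sketch, including the function name, is purely illustrative):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def perturb_hidden_state(h, w, steps=10, lr=0.5):
    """Toy analogue of HSC: nudge a 'hidden state' h by gradient descent
    on a logistic discriminator's loss L = -log sigmoid(w . h), so the
    perturbed ĥ makes the target control code more likely."""
    h = list(h)
    for _ in range(steps):
        p = sigmoid(sum(wi * hi for wi, hi in zip(w, h)))
        # dL/dh = -(1 - p) * w for the target class, so we step along +w.
        h = [hi + lr * (1.0 - p) * wi for hi, wi in zip(h, w)]
    return h

h0 = [0.1, -0.2]     # cached hidden state (illustrative)
w = [1.0, 2.0]       # discriminator weights (illustrative)
h_hat = perturb_hidden_state(h0, w)
```

After a few steps the discriminator assigns the target code a strictly higher probability under ĥ than under the original H, which is the effect HSC relies on.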
Direct-Probability Control (DPC): we calculate p(c_k | x_{i,j}, x_{<i}, X_{<k}, c_{−k}) for each candidate token x_{i,j} using our discriminator. We directly multiply these probabilities to calculate Equation 1.
Note that the HSC and DPC algorithms are extensions of previous work: the difference is that here they are used to model control-code sequences rather than single tags. The key components that allow this are our discriminator, which makes predictions based on label sequences, and our algorithm which, as shown in Figure 3, increments codes each time an <eos> token is generated.

Editing
After we have finished generating a sentence, we edit it to introduce more discourse markers of the local control code.
We identify words in our input sequence that have the most impact on control-code prediction by taking the gradient of the discriminator's loss onto input tokens and masking full words, following Ross et al. (2021). We use only the current-sentence prediction made by our discriminator (i.e., Equation 3), so that we impose local control on the sequence even in settings where the generator imposes structural control. We cull the high-gradient words based on heuristics [7] to encourage the editor to introduce explicit discourse markers. We fine-tune a label-aware infilling model (Raffel et al., 2019) to generate candidate edits [8] given the masked input. We mask, infill and generate edit candidates (n = 10) until we have produced a sentence with an increased class likelihood, p(c_k | x̂_k) > p(c_k | x_k). We select edits on the basis of class likelihood and perplexity [9].
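The mask-and-infill loop with its acceptance criterion can be sketched as follows; `mask_words`, `infill` and `class_prob` are stand-ins for the gradient-based masker, the label-aware T5 infiller and the discriminator, and the function name is our own:

```python
def edit_sentence(sentence, code, mask_words, infill, class_prob,
                  n_candidates=10, max_rounds=3):
    """Sketch of the editing loop: mask class-salient words, generate
    candidate infills, and keep the best candidate, stopping once some
    candidate's class likelihood beats the original sentence's."""
    base = class_prob(sentence, code)
    best, best_p = sentence, base
    for _ in range(max_rounds):
        masked = mask_words(best, code)
        for cand in (infill(masked, code) for _ in range(n_candidates)):
            p = class_prob(cand, code)
            if p > best_p:
                best, best_p = cand, p
        if best_p > base:  # acceptance: p(c_k | edited) > p(c_k | original)
            break
    return best
```

In our actual pipeline the surviving candidates are further ranked by document perplexity; that tie-breaking step is omitted here for brevity.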
For more comparison with, and distinction from, previous work for both Generation and Editing, see Appendices D.1 and E.

Datasets and Schema
The form of sequential control we study is discourse: i.e., the functional role sentences play in a document's larger argumentative purpose. We use a news discourse schema proposed by Van Dijk (2013). Choubey et al. (2020) apply this schema and annotate a dataset, NewsDiscourse, consisting of 802 articles from three outlets (nytimes.com, reuters.com and xinhuanet.com), tagged at the sentence level. Their schema consists of 9 classes: { Main Event, Consequence, Current Context, Previous Event, Historical Event, Anecdotal Event, Evaluation, Expectation }. Although each sentence is tagged with a code, codes often repeat. For example, an entire paragraph can be tagged with Main Event sentences. We show a partial sample in Figure 1. We adopt this schema to describe each news article's structure.

[7] Words that are not proper nouns, named entities (except the DATE class) or adjectives, as we find these categories are more likely to be topic words spuriously correlated with control codes.

[8] A T5 model trained using a specific input template incorporating the label, e.g., "label: Background. text: The senator <MASK> to the courtroom to <MASK>."

[9] Perplexity of the entire generated document so far, PPL(x̂_k ⊕ X_{<k}), is used as a selection criterion to encourage edits preserving the logical flow of the document.
We also use a dataset of unlabeled news articles to fine-tune a GPT-2 model for news. We sample 30,000 documents from this dataset such that the distribution of sentence lengths matches the distribution of sentence lengths in the Choubey et al. (2020) dataset.
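One simple way to perform this kind of distribution-matched sampling is to repeatedly draw a target length from the reference corpus and then take an unused document of that length; the sketch below (our own, with a hypothetical function name) treats each document as a list of sentences:

```python
import random

def sample_matching_lengths(docs, ref_lengths, n, seed=0):
    """Sample n documents so their sentence-count distribution matches a
    reference corpus: draw a target length from `ref_lengths` each time,
    then pop an unused document from that length's bucket."""
    rng = random.Random(seed)
    buckets = {}
    for d in docs:
        buckets.setdefault(len(d), []).append(d)
    sample = []
    while len(sample) < n:
        target = rng.choice(ref_lengths)
        if buckets.get(target):  # skip draws whose bucket is exhausted
            sample.append(buckets[target].pop())
    return sample
```

Because targets are drawn with replacement from the reference lengths, the sampled lengths converge to the reference distribution as n grows (assuming the unlabeled pool has enough documents in each bucket).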

Implementation Details
We fine-tune a GPT2-base model on a large news corpus with a max word-piece length of 2048. We use this model for naive PTLM language modeling as well as for the sentence embeddings in our discriminator. Further implementation details are discussed in Appendix A.
We briefly discuss the discriminator results here. As shown in Figure 4, the primary head, p, has a Micro F1-score of .65, which approaches state-of-the-art on this dataset. However, performance degrades rapidly for heads farther from p. For more results on discriminator performance, including experimental variations, see Appendix A.1.

Experiments
We sample 10 documents from the test set of our discourse dataset (n = 200) to test different pipeline settings. The input to our models is a headline (as a prompt) and the full sequence of gold discourse labels of that document.
Baselines We compare our experimental pipelines (Section 3) with the following baselines: (1) Naive GPT-2 generation given only the headline as input (i.e., no control codes), (2) a fine-tuned Prompting approach and (3) the original Human-written articles.
For (2), we directly train a class-conditional language model to generate text by including labels in the prompt, as in Keskar et al. (2019). Local-only prompting is achieved by including only the local control code (and prior generated sentences) in the prompt, updating the prompt to generate each new sentence. For past-aware prompting, we include all control codes prior to our current sentence in the prompt, and update on every new sentence. Finally, for full-sequence prompting, we include the full sequence of control codes in the prompt. (See Appendix C for more details and examples of prompt design.) For each of these baselines, we test with and without editing (with the human-written text being edited by our algorithm in Human, and the generated text being edited in all other trials).
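The three prompting variants differ only in which slice of the code sequence is prepended. A sketch of the prompt construction (the bracketed-tag format is illustrative; Appendix C gives the templates actually used):

```python
def build_prompt(headline, codes, generated, k, variant):
    """Construct the class-conditional prompt for sentence k under the
    three prompting baselines. `generated` holds previously generated
    sentences; the "[code]" tag format is a stand-in for the real template."""
    if variant == "local-only":
        tags = [codes[k]]          # current code only
    elif variant == "past-aware":
        tags = codes[: k + 1]      # current and all prior codes
    elif variant == "full-sequence":
        tags = codes               # the entire code sequence
    else:
        raise ValueError(variant)
    tag_str = " ".join(f"[{t}]" for t in tags)
    return " ".join([headline, tag_str] + generated)
```

The prompt is rebuilt from scratch for each new sentence, which is how the local-only and past-aware variants "update the prompt" as generation proceeds.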
Evaluation For all pipelines, we select the best hyperparameter configurations based on perplexity and model-assigned class likelihood. Then, we manually annotate each generated document for 4 metrics: Accuracy (0-1), Grammar (1-5), Logical Flow (1-5) and Topicality (1-5). We recruit two expert annotators with journalism experience to perform annotations blindly, without awareness of which generation pipeline was used, and find moderate agreement, κ ∈ [.36, .55], across all categories. For more details, see Appendix G. We also record the model-dependent and non-model automatic metrics used by See et al. (2019), described further in Appendix B.

Results
Best Overall Trial We show automatic and human metrics for the subset of pipelines with top-performing hyperparameters in Table 2. In general, the highest-performing generation pipelines are all variations of DPC with either past-aware or full-sequence structural control.
We observe that DPC with past-aware control and editing has the highest class-label accuracy, nearly approaching the human trials. The top-performing pipelines for grammar and topicality are DPC with full-sequence control and without editing. GPT-2 performed best only for Logical Flow, which was surprising but could perhaps be because the unconstrained nature of GPT-2's generation allowed it to hallucinate a flow that seemed consistent even if it was poorly structured.

Effect of Different Pipeline Components
We show the distributional shifts in performance across all trials in Figures 5 and 6. Structural control has a largely positive effect on generated text. In Figure 5, we find that full-sequence models are, on average, able to generate the most label-accurate sentences with the best grammar, logical flow and topicality. Finally, editing improves accuracy, grammar and logical flow (Figure 6). The original human-generated text is our gold standard, and it is highly class-accurate, grammatical, coherent and topical. Interestingly, as seen in Table 2, editing can also be applied to human-written text to boost label accuracy, but at the expense of coherence.

Discussion
We set out to answer two questions in this research: (1) whether we could impose structural control over generated documents and (2) what kinds of structural control (local-only, past-aware, or full-sequence) had the greatest effect on discourse, flow, topicality and grammaticality. Our novel pipelines, which extend various discriminator-based approaches for generation and editing, approach human-level performance. However, a gap between our models' output and human-generated text still remains across all metrics, suggesting the need for more research.
Insight #1: Some structural information improves all metrics of quality. Our structural exploration suggests that, for the best-performing pipelines, past structural information (along with editing) boosts class accuracy the most, but knowledge of the full sequence does not. In the analogy given in the Introduction, this equates to: to write a "Related Works" section, it helps to know that it comes after the "Introduction" vs. the "Discussion", but not what sections come after it. This is perhaps because enough signal is already given by the past sequence and the full sequence just adds more noise. However, full-sequence information does yield the best grammar and topicality. This might indicate a regularizing role played by the full sequence. In general, we suspect that past-aware modeling and editing both push the model more towards the class label at the expense of topicality, flow and grammar, while full-sequence control does the opposite. In practice, some combination of these pipeline components might be desired.

Insight #2: Even a weak discriminator can control generation. The weakness of our discriminator is one reason why HSC may have performed poorly. However, in other trials we see strong accuracy. Thus, even with a weak classifier, we can control generation. This might be because even a weak discriminator can still give relative differences between generations that do or do not match the control code.
Insight #3: Evaluating text candidates using multiple models' perplexities might result in better selections. Just as surprisingly, editing also has an overall average positive effect on generation accuracy and generation quality (Figure 6). We had hypothesized that, because the editor makes locally-aware infilling decisions, it would improve class accuracy but hurt other metrics of document quality, like topicality and flow. Indeed, for the top-performing trials, like DPC and Human, editing only improves class accuracy. However, grammar and flow improve in other trials. This could be because, as mentioned in Section 3.3, we selected candidates based on how well they make sense in the document. This also suggests that using multiple PTLMs to select for better quality combines the different virtues of each model.

Error Analysis:
We observed that sentence tokenizing remained a huge challenge. Many of the grammar errors that our annotators observed were from sentences that ended early, i.e., after decimal points. Indeed, the correlation between sentence length and grammar is relatively high (r = .34).
One reason for this could be that error-prone sentence-tokenizing models provided faulty training data during pretraining of LMs. This will continue to hinder document-level structural work, which often relies on a model accurately ending a sentence. Another observation, in Table 2, is that perplexity doesn't necessarily correlate with human judgements of quality, especially for more complex writing like financial news reporting.

Related Work
Discourse-Aware Narrative Text Generation.
Generating narrative text, such as news articles and scientific reports, has been a long-standing problem in NLP. Early work relied on templates (Xu et al., 2018; Wiseman et al., 2018), rules (Ahn et al., 2016; Leppänen and Toivonen, 2021), or specialized architectures (Fan et al., 2018; Bosselut et al., 2018) that are hard to generalize. Recently, pre-trained Transformers have shown impressive capabilities to produce fluent text, and a few works seek to adapt them to document-level generation with appropriate discourse structures. One work, DiscoDVT (Ji and Huang, 2021), uses a discrete variational auto-encoder (VAE) with a latent space guided by explicit Penn Discourse Treebank (PDTB) relations (Prasad et al., 2008). We are excited by this work, which shows strong improvements in coherence. While our work is able to learn from more abstract structural tags rather than low-level PDTB relations, our approach is fully supervised. We are excited by the semi-supervised nature of Ji and Huang (2021)'s approach, which may allow it to learn discourse structures that are less well annotated. Also, possible extensions of this approach using hierarchical discrete VAEs (Razavi et al., 2019) or diffusion models (Li et al., 2022) might provide users control over higher-order macro-structures in text, such as those explored in Spangher et al. (2021b).

Controlled Generation
The black-box nature of neural generation models poses challenges for many real-world applications (Wiseman et al., 2017; Holtzman et al., 2019). Researchers have designed various techniques to control syntactic structure (Goyal and Durrett, 2020), sentiment (Hu et al., 2017; Fu et al., 2018; Luo et al., 2019), and language style (Niu and Bansal, 2018; Cao and Wang, 2021). Most notably, the CTRL model (Keskar et al., 2019) conditions the output by incorporating textual control codes during the pretraining stage. However, such training is resource-intensive and requires large datasets. Alternatively, PPLM (Dathathri et al., 2019), FUDGE (Yang and Klein, 2021), GeDi (Krause et al., 2021), and NADO (Meng et al., 2022) achieve inference-time control by either directly manipulating the generator's hidden states or adjusting the probability distribution over the output vocabulary. Our work differs from prior work in that we tackle structured control instead of a single attribute. Our task, though, can be relatively easily addressed by extensions to these frameworks, and we look forward to future work that might improve further on the results we showed.
Sequentially Controlled Generation Sequential control for text generation has been explored from many angles, from symbolic planning approaches (Meehan, 1976; Lebowitz, 1987) to keyword-based approaches (Yao et al., 2019) and concept-, event- and entity-driven planning approaches (Rashkin et al., 2020; Peng et al., 2021; Alabdulkarim et al., 2021; Han et al., 2022). We are the first, to our knowledge, to utilize a purely latent control structure based on discourse structures. There is increasing interest in exploring how discourse can be used to guide generation (Ghazvininejad et al., 2021; Cohan et al., 2018), from early works developing discourse schemas for generation (Mann, 1984; Stede and Umbach, 1998) to evaluating creative generation pipelines (Hua and Wang, 2020). However, neither direction allows discourse structures to be explicitly controlled in generation.
Editing. Most existing neural models generate text in one shot, from left to right. Recently, an emerging line of research (Guu et al., 2018; Malmi et al., 2019; Kasner and Dušek, 2020) has explored editing as part of the generation pipeline to further improve output quality or satisfy certain desired constraints. Our work builds on the MiCE framework (Ross et al., 2021), which was originally designed for generating contrastive explanations. Our observation that editing increased both the discourse adherence and the coherence of the generated text adds to a growing body of evidence that editing can play an important and modularized role in larger creative generation pipelines. We see that editors such as simplification editors (Dong et al., 2019), factual correction editors (Cao et al., 2020) and stylistic editors (Kabbara and Cheung, 2021), each aimed at different desired attributes of text, can possibly play a role in multi-attribute control (Li et al., 2022).
Finally, we see overlaps with an earlier paradigm of generative modeling: Bayesian models for text like Latent Dirichlet Allocation (LDA) (Blei et al., 2003) and, more interestingly, sequential variants (Du et al., 2012). There is recent work marrying PPLM-style controlled text generation with topic modeling (Carbone and Sarti, 2020). Such directions might lead to more hierarchical, structural control.

Conclusion
We have formalized a novel direction in controlled text generation: sequentially controlled text generation. We extended different techniques in controlled text generation to fit this direction, and have shown how a news discourse dataset can be used to produce news articles exhibiting human-like structure. We have explored which degrees of structural awareness yield the most human-like output: more structural control yields higher-quality output. And we have shown how to combine structural control with local editing. We have probed different parts of our pipeline to show the effects of each part.

Acknowledgements
Alexander Spangher acknowledges support from Bloomberg's Data Science Ph.D. Fellowship. This work was conducted while Alexander was an intern at Bloomberg. We also acknowledge Jonathan May for helpful early conversations about this work. We acknowledge Vivienne Litzke for annotation effort and further conversations.
Ethics Statement

Limitations
A central limitation of our work is that the datasets we used to train our models are all in English. As mentioned previously, we used Choubey et al. (2020)'s NewsDiscourse dataset, which consists of the sources nytimes.com, reuters.com and xinhuanet.com. Although xinhuanet.com is a Chinese source, they used English-language articles. Additionally, we used an unlabeled news dataset from Kaggle for fine-tuning GPT2-base and for calculating some automatic metrics like % Unseen Words. We filtered this dataset down to two English-language, Western domains, nytimes.com and reuters.com, in order to match the domains as closely as possible to the NewsDiscourse dataset.
Thus, we must view our work in discourse generation with the important caveat that non-Western news outlets may not follow the same discourse structures in writing their news articles.We are not aware of existing Van Dijk-style (Van Dijk, 2013) datasets towards which we could provide an exact comparison.But, we hope in future work to look at other kinds of discourse structures that might exist in other languages.

Risks
There is a risk that this work will be used for misinformation or disinformation. This risk is acute in the news domain, where fake news outlets peddle false stories that attempt to look true (Boyd et al.; Spangher et al., 2020). Along this vein, there is the aforementioned work using discourse structure to identify misinformation (Abbas, 2020; Zhou et al., 2020), and the risk in developing better discourse-aware generation tools is that these misinformation detectors might lose their effectiveness.
There is also a non-malicious misinformation risk, as large language models have been known to generate hallucinated information (Choubey et al., 2021). The more such threads of research are pursued without an accompanying focus on factuality and truth, the more risk we run of polluting the information ecosystem. However, like others (Dathathri et al., 2019), we see value in continuing this direction of research, even if this current work is not the final output we wish to see being used by non-researchers in the world. It is one step along the way.
There is also a risk that news articles in either of our datasets contain potentially libelous or defamatory information that has been removed from the publishers' websites since the datasets were collected. However, we do not release either of the datasets we use, so we do not see our actions as privacy-violating.
We were unable to ascertain the license for the Kaggle dataset. It has been widely used in the academic literature, including in papers published in ACL venues (Pathak and Srihari, 2019) and others (Alhuqail, 2021). We corresponded with the authors and opened a discussion question [URL withheld to preserve anonymity] seeking more information about the license. The authors are public about their desire to have their dataset used, and we have had independent lawyers at a major media company ascertain that this dataset is low-risk for copyright infringement.

Computational Resources
The experiments in our paper required computational resources. We used eight 30GB NVIDIA GPUs, along with AWS storage and CPU capabilities. We designed all our models to run on one GPU, so they did not need to utilize model or data parallelism. However, we recognize that not all researchers have access to this type of equipment. We used Huggingface GPT2-base models for our predictive tasks, and will release the code of all the custom architectures that we constructed. Our models do not exceed 300 million parameters.

Annotators
We recruited annotators from professional networks. Both consented to annotate as part of the experiment in exchange for acknowledgement. One is a graduate student studying in Europe, and the other is a former journalist. One annotator is female, and the other is male. One is half-Asian and half-white identifying, the other is white. Both identify as cis-gender. This work passed IRB review.

A.2.1 Discount Factor

We see in Figure 8 that discount factor b has a non-linear effect on the output. In accordance with our prior results, b = 0 is the lowest-performing variant across all four human-quality metrics. b = .33 seems to be the most effective discount factor overall, and yields the best output for accuracy and logical flow, while b = 1 yields the best-performing output for grammar and topicality. We conclude that a finer-grained balance of local control and structural control might be important overall, but in some cases more structural control might help, as noted previously.

A.2.2 Hidden-State Control (HS)
In Dathathri et al. (2019), the authors find anywhere between 3 and 10 backpropagation steps is acceptable. In this work, we use 10 steps with a small step size. We also test different regularizations, also explored in Dathathri et al. (2019), on the output logits generated from Ĥ. We experiment with different hyperparameters for one of the regularizations: l' = γ·l + (1 − γ)·l_0, where l_0 are the naive, unperturbed logits. We experiment with different values of γ from 0 (fully unperturbed) to 1 (fully perturbed).
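A minimal numpy sketch of this interpolation (the function name and toy logit values are our own; `gamma` plays the role of γ above):

```python
import numpy as np

def fuse_logits(perturbed, unperturbed, gamma):
    """Interpolate logits perturbed by hidden-state control with the naive
    model's unperturbed logits. gamma=0 is fully unperturbed; gamma=1 is
    fully perturbed."""
    return gamma * np.asarray(perturbed) + (1.0 - gamma) * np.asarray(unperturbed)

# Toy example over a 4-token vocabulary.
l_perturbed = np.array([2.0, 0.0, -1.0, 0.5])
l_naive = np.array([0.0, 1.0, 0.0, 0.0])
fused = fuse_logits(l_perturbed, l_naive, gamma=0.5)
```

At γ = 0 this recovers naive GPT-2 sampling, which is why the extreme settings bracket the trade-off between fluency and control.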

A.2.3 Direct-Probability Control (DPC)
The authors of Yang and Klein (2021) offer an innovation by training their classifier p(c|x) to consider subsequences p(c|x_1, ..., x_i) for all i, ostensibly improving the accuracy of their joint probability calculation mid-sequence. This is in contrast to Dathathri et al. (2019)'s training regimen, which only considers full sequences p(c|x_1, ..., x_n). However, Yang and Klein (2021) do not provide ablations to show whether it is this training regimen, or their direct calculation of p(x)p(c|x), which is responsible for the improvements they observe. In this work, we perform this ablation and find that it makes a negligible difference, according to automatic evaluation metrics. We also introduce a mean fusion (Stahlberg et al.) into the p(x)p(c|x) joint likelihood: γp(c|x) + (1 − γ)p(x), and test different values of γ.
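As a rough sketch (our own function name, shown over toy candidate probabilities rather than real model outputs), the mean fusion above can be used to score candidate continuations as:

```python
import numpy as np

def mean_fusion_scores(p_class_given_x, p_x, gamma):
    """Linear (mean) fusion of the discriminator probability p(c|x) and the
    language-model probability p(x) for each candidate continuation:
    gamma * p(c|x) + (1 - gamma) * p(x).
    gamma=1 uses only the control signal; gamma=0 only the LM."""
    return gamma * np.asarray(p_class_given_x) + (1.0 - gamma) * np.asarray(p_x)

# Rank three hypothetical candidate tokens.
scores = mean_fusion_scores([0.9, 0.2, 0.4], [0.1, 0.6, 0.5], gamma=0.5)
best = int(np.argmax(scores))
```

Here candidate 0 wins: its strong control score outweighs its low LM probability at γ = 0.5, which is exactly the balance the hyperparameter sweep explores.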

B Automatic Metrics List
Here, we discuss the automated metrics reported in Table 2. They are largely based on metrics proposed in See et al. (2019).

B.1 Metrics Reported in Paper
Label Probability: We measure the label probability assigned to the gold-truth class label given in our input sequence: p(c | c_<s, x_i, x_<i, X_<s). We use head_p, or the current head, in the discriminator shown in Figure 7.
Perplexity: Perplexity is calculated using the fine-tuned GPT-2 model, which we fine-tuned on 30,000 news articles.
Diverse N-grams: We measure the likelihood that an n-gram in one sentence will be unique compared with the entire document. In other words:

Diverse N-Grams(s, d) = (# unique n-grams in sentence s) / (# n-grams in document d)

We calculate the set of n-grams per document as the total number of 1-, 2- and 3-grams in that document. We calculate one measurement per sentence in the document, and average these scores together.
Sentence Length: We measure the total number of words in the sentence, based on word-level tokenization using https://spacy.io/.
Unseen Words: We use an external corpus of 30,000 news articles to determine a typical, large news vocabulary. Any words that are outside of this vocabulary are considered "Unseen Words". For our purposes, we are most interested in exploring malformed words, which are sometimes generated by the language model. However, unseen words might also be proper nouns.
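The Diverse N-grams metric above can be sketched as follows (a minimal reading of the definition, taking "unique" to mean an n-gram that occurs exactly once in the document, which is our assumption):

```python
from collections import Counter

def ngrams(tokens, n):
    """All n-grams (as tuples) of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def diverse_ngram_score(sentences, ns=(1, 2, 3)):
    """Average, over sentences, of (# n-grams of the sentence occurring
    exactly once in the document) / (# n-grams in the whole document),
    using 1-, 2- and 3-grams as described above."""
    doc_counts = Counter()
    per_sentence = []
    for toks in sentences:
        grams = [g for n in ns for g in ngrams(toks, n)]
        per_sentence.append(grams)
        doc_counts.update(grams)
    total = sum(doc_counts.values())
    if not per_sentence or total == 0:
        return 0.0
    scores = [sum(1 for g in grams if doc_counts[g] == 1) / total
              for grams in per_sentence]
    return sum(scores) / len(scores)

# Tiny toy document of two tokenized sentences (hypothetical data).
score = diverse_ngram_score([["a", "b"], ["a", "c"]])
```

A document that repeats itself verbatim scores near zero, which is the failure mode of naive generation this metric is meant to surface.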

C Further Details
As a baseline, we train a language model to directly calculate p(x_i | x_<i, X_<s, c⃗), following Keskar et al. (2019). We design the following prompt structure to simulate the baseline, past-aware and full-sequence control variants. The prompts are specific to the current sentence being generated. We first start by generating sentence 1, where the prompt for both Baseline and Past-Aware is:

Headline: <Headline> Labels: <Label 1> Sentences:

Then, we let the model generate the first sentence and stop when we generate the <EOS> character. We then regenerate the prompt to include the previously generated sentence and update the tags, so Baseline becomes:

Headline: <Headline> Labels: <Label 2> Sentences: <Sentence 1>

and Past-Aware becomes:

Headline: <Headline> Labels: <Label 1> <Label 2> Sentences: <Sentence 1>

We continue in this fashion, resetting the prompt each time, until we have finished generating sentences for all the tags in our input data.
The Full-Sequence process is very similar, except we do not need to update the label-space, since by default the model is exposed to the full sequence of tags before generation.
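The prompt-regeneration scheme above can be sketched as follows (the function name and toy headline, labels and sentences are our own; the actual model call is omitted):

```python
def build_prompt(headline, labels, sentences, structure="baseline", all_labels=None):
    """Construct the prompt for generating the next sentence.
    - baseline: show only the current sentence's label
    - past-aware: show all labels up to and including the current one
    - full-sequence: show the entire label sequence from the start
    `labels` holds the tags for sentences generated so far plus the current one."""
    if structure == "baseline":
        shown = [labels[-1]]
    elif structure == "past-aware":
        shown = list(labels)
    else:  # full-sequence
        shown = list(all_labels) if all_labels is not None else list(labels)
    return "Headline: {} Labels: {} Sentences: {}".format(
        headline, " ".join(shown), " ".join(sentences)).strip()

# Generating sentence 2 under the past-aware variant (hypothetical data):
prompt = build_prompt("Company announces merger",
                      ["Main Event", "Consequence"],
                      ["The firms agreed to merge."],
                      structure="past-aware")
```

Each generated sentence is appended to `sentences` and the next label to `labels`, then the prompt is rebuilt from scratch, matching the reset-and-regenerate loop described above.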

D Editing
In this section, we describe the various components of the editing model. First, we note the differences between our approach and Ross et al. (2021)'s method. Then, we discuss the infilling model and the discriminator.

Ross et al. (2021) designed their editor to flip classifier predictions. So, they edited input x → x′ until argmax_c p(c|x′) ≠ argmax_c p(c|x). Then, Δ(x, x′) was given as the explanation for the flip. We are not concerned with flipping predictions so much as maximizing the probability of the ground-truth label. So, we design our objective to be x → x′ until p(c|x′) > p(c|x).
To understand why the loss-gradient on the input can provide feature importance, consider the first-order Taylor approximation of the loss, l(x) ≈ l(a) + l′(a)(x − a). Here, the gradient of the loss at a, l′(a), can be seen as a set of linear weights similar to logistic regression coefficients, which are commonly used for feature importance.
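This Taylor view can be checked numerically on a toy classifier (the logistic loss and the weight vector here are our own stand-ins, not the paper's discriminator):

```python
import numpy as np

def loss(x, w):
    """Toy logistic loss for one example with gold label 1: -log sigmoid(w.x)."""
    z = float(np.dot(w, x))
    return float(np.log1p(np.exp(-z)))

def grad_loss(x, w):
    """Analytic gradient of the loss w.r.t. the *input* x: -sigmoid(-z) * w."""
    z = float(np.dot(w, x))
    return -w / (1.0 + np.exp(z))

w = np.array([2.0, -1.0, 0.5])  # hypothetical classifier weights
a = np.array([1.0, 1.0, 1.0])   # the current input
g = grad_loss(a, w)             # per-feature importance weights at a

# First-order Taylor approximation l(x) ~= l(a) + g . (x - a):
x = a + np.array([0.1, 0.0, 0.0])
approx = loss(a, w) + float(np.dot(g, x - a))
```

Features with large-magnitude entries in `g` change the loss most per unit of change, which is exactly how high-salience edit candidates are ranked.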
We also wished to restrict editing to explicit discourse markers rather than spuriously correlated words, so we heuristically excluded all proper nouns, named entities (except DATE) and adjectives from the edit candidate set. Table 6 shows explicit discourse markers in the news discourse context. Here, we show the top words associated with each discourse class. Some words affect the tense of the sentence, others inject epistemological uncertainty, still others time-peg events to certain days.

D.2 Infilling Model
We train a label-aware infilling model in a similar manner to Ross et al. (2021). Our prompt is:

label: <label> text: Lorem Ipsum <mask> Lorem <mask> Ipsum.

where the masks replace high-salience words, which we discovered as described above. We format samples using sentences in our training dataset, and train a T5 model as described by the authors.
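A sketch of how one training sample might be formatted under this scheme (function name and example data are our own; the real pipeline would use T5's sentinel tokens, <extra_id_0>, <extra_id_1>, ..., rather than a single mask string):

```python
def format_infill_sample(label, tokens, salient_idx, mask_token="<mask>"):
    """Build one (source, target) pair for the label-aware infiller:
    'label: <label> text: ...' with high-salience words replaced by masks.
    `salient_idx` is the set of token positions flagged by the saliency step."""
    masked = [mask_token if i in salient_idx else t for i, t in enumerate(tokens)]
    source = "label: {} text: {}".format(label, " ".join(masked))
    target = " ".join(tokens[i] for i in sorted(salient_idx))
    return source, target

# Masking two salient discourse-marker positions in a toy sentence:
src, tgt = format_infill_sample(
    "Expectation",
    ["The", "deal", "could", "stall", "next", "week"],
    {2, 3})
```

Conditioning the source on the label is what lets the infiller propose replacements that push the sentence toward the desired discourse class.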

D.3 Possible Improvements
We note that this infilling method directly models p(x | M(x), c), i.e., the likelihood of infilled words given a label and a masked sentence. Another possible approach to this problem would be to use a naive infiller and Bayes' rule, as done in the generation phase of this paper, to generate logits p(x | M(x)) p(c | x, M(x)). This could possibly improve the editor for the same reasons Dathathri et al. (2019) and Yang and Klein (2021) observed an improvement over CTRL (Keskar et al., 2019).

An example edit pair from Table 4:
Before: The deal comes as insurers and drugmakers struggle with competition from Medicare prescription drugs.
After: The deal could stall as insurers and drugmakers struggle with competition for Medicare prescription drugs.

Another aspect of the editor that we noticed was that it could sometimes degrade the coherency and topicality of the document. This is especially evident in the human trials. We partially addressed this by selecting candidate edits based on the perplexity of the whole document. We could have mitigated this further by giving our infiller the entire document as context.

E Further Methods Comparison
The standard controlled text generation setup is typically expressed as follows:

p(x | c) = ∏_{i=1}^{n} p(x_i | x_<i, c)    (7)

where x is the output sequence and c is a single control code (for example: sentiment (Dathathri et al., 2019)). Here, x is a single sentence (or paragraph) of n words, factorized autoregressively into words x_i and previous words x_<i. Previous approaches to controlled text generation (Dathathri et al., 2019; Yang and Klein, 2021) factorize the right term of Equation 7 as follows:

p(x_i | x_<i, c) ∝ p(x_i | x_<i) p(c | x_≤i)    (8)

As in Equation 7, this factorization decomposes our sequentially controlled text generation model into an uncontrolled language model and a control-code model. The key difference between Equations 8 and 2 is in the second term, i.e. how we choose to model the control codes (the difference in the first term is simply a rather trivial extension of a naive language model from sentence- or paragraph-level generation to document-level generation).
We show a direct comparison of all of our generation approaches in Figure 5. Here, we show that Direct Probability Control has the best effect over naive GPT-2 for class-accuracy and, perhaps surprisingly, for Grammar and Topicality as well.

F Ovid's Unicorn Is Not Structural
We annotate the famous Ovid's Unicorn news article generated and presented by the original GPT-2 authors (Radford et al., 2019).
We analyze this article as we have analyzed our generation models in Section 7. One of our annotators gave each sentence the Van Dijk discourse label that best fits (Van Dijk, 2013), and the other assessed whether it actually fit. This is not an apples-to-apples comparison with the Label Acc. column in Table 2, because we are assessing the accuracy of the label that we chose after reading the text.
We next measured the likelihood that an article with the discourse structure of Ovid's Unicorn would exist naturally. We build a simple bigram model for tags, p(c_{t+1} | c_t), to calculate the total probability of a tag sequence. We show in Figure 9 the typical transitions between discourse labels in the news discourse dataset. We fit our simple bigram model using label sequences in the training dataset, and calculate the average log-likelihood of the tag sequence for each document in our test dataset. The median across these is shown in Table 5. As can be seen, sequences in the test dataset are far more likely than the Ovid's Unicorn article, which falls outside of the 95th percentile of the distribution of typical articles.

Main Event: The major subject of the news report. It can be the most recent event that gave rise to the news report, or, in the case of an analytical news report, it can be a general phenomenon, a projected event, or a subject.
Consequence: An event or phenomenon that is caused by the main event or that directly succeeds the main event.

Previous Event: A specific event that occurred shortly before the main event. It either directly caused the main event, or provides context and understanding for the main event.

Current Context: The general context or world-state immediately preceding the main event, to help the readers better understand and contextualize the main event. Similar to Previous Event, but not necessarily tied to a specific event.

Historical Event: An event occurring more than 2 weeks prior to the main event. Might still impact or cause the main event, but is more distal.

Expectation: An analytical insight into future consequences or projections made by the journalist.

Evaluation: A summary, opinion or comment made by the journalist on any of the other discourse components.

Anecdotal Event: Sentences describing events that are anecdotal; such events may happen before or after main events. Anecdotal events are specific events with specific participants. They may be uncertain and cannot be verified. A primary purpose of this discourse role is to provide more emotional resonance to the main event.
In Table 6, we attempt to provide more insight into the different news discourse elements by modeling them using logistic regression.

G Annotation
We recruit two manual annotators, one with >1 year and the other with >4 years of journalism experience. Both annotators offered to perform these tasks voluntarily in exchange for acknowledgement.

Figure 12: Visual of the annotation task interface that we asked our annotators to use. We presented annotators with class labels and asked them to simply determine Y/N whether the label was accurate. We also added a question to probe topicality. (Prompting Baseline is the method generating the text currently seen in the interface.)

For their reference, we showed the annotators the label definitions (shown in Section F.1) and a decision-tree (shown in Figure 11). The decision-tree breaks down key components of discourse reasoning.
Additionally, we gave them training annotation questions for practice.For the training task, they were asked to view human-written sentences from 10 articles and go through the step-by-step question process based on the decision tree.These labels were checked with the gold labels from the training dataset, and they trained until they were answering questions with >80% accuracy.
The interface we used to collect annotations is shown in Figure 12. Annotators were blind to the method that generated the text but were shown the desired true labels and simply had to agree Y/N whether the label fit. (An earlier interface that asked annotators to assign their own tags was too difficult.) For Grammar, we asked them to count the number of grammar mistakes per sentence (1: >6, 3: 2-4, 5: 0). For Logical Flow, we used a qualitative metric (1: "Poor", 3: "OK", 5: "Great"). For Topicality, we also used a qualitative metric (1: "Not at all", 3: "OK", 5: "Great").
Figure 2: Discourse structure (Van Dijk, 2013) of articles generated according to different processes. (a) Structure of human-written articles. (b) Structure of naively generated GPT-2 articles. (c) Structure of sequentially controlled GPT-2 articles. The likelihood of a tag in the kth fraction of a news article is shown. Machine-generated structure is labeled by humans.

Figure 5: Comparison of different structural control methods across different pipelines and hyperparameters.

Figure 6: The effect of editing, across different pipelines and hyperparameters.

Figure 8: The effect of discount factor b, across different pipelines and hyperparameters.
Table 4: A selection of sentences and the edit operations performed on them. The editor focuses on (a) temporal relations, (b) conditional statements and (c) explicit discourse markers (e.g. "expect"), and corrects grammar.

Figure 10: Different generation methods, across different pipelines and hyperparameters.

F.1 Van Dijk Discourse-based Schema Introduced in Choubey et al. (2020)

The schema used for News Discourse, introduced by Choubey et al. (2020), was based on the schema introduced by Van Dijk (2013). As such, the classification guidelines were as given in the label definitions above.

Figure 11: Tree shown to annotators for reference on the manual annotation task.
Direct-Probability Control (DPC): We calculate p(x_i | x_<i, X_<s) to identify the 200 most likely x_i under the naive language model, {x_{i,j}}_{j=0}^{200}.

Table 1: Discriminator performance on test data. F1 scores for p(c_j | X_<k, x_<i, c_<j) predictions. Sentence index k and word index i are fixed: we show error for using the current sentence to predict all past, current and future labels.

Belarus food retailer lines up landmark overseas IPO
[Main Event] SILVER SPRING, Colo. (AP) - A company in Finland is poised to list its first international grocery chain in the United States after a yearlong search, raising concerns about potential corporate conflicts of interest and legal issues around using foreign intellectual property. <|endoftext|>
[Previous Event] The world's largest retailer of meat and fish products hopes to become one day the leader in convenience groceries. <|endoftext|>
[Expectation] It says it could become a leader in fresh fruits and vegetables, as well. <|endoftext|>
[Evaluation] About 300 Finnish companies have applied to list on the New York Stock Exchange under a new plan to bypass the U. S. government and create "licenses" that allow them in many other countries, according to a regulatory filing. <|endoftext|>
[Evaluation] The European Union prohibits using foreign intellectual property for purposes outside the country, and Finland's National Stock Exchange does not require approval before selling a company'lls share in its new company. <|endoftext|>
[Evaluation] "What's at stake here is transparency, a sense of fairness to all the stakeholders. <|endoftext|>
[Evaluation] I don't think it's right for companies to have intellectual property rights," says Michael Vakilainen, an independent analyst who has tracked the company since 2008. <|endoftext|>
[Expectation] He says there are potential conflicts of interest, because one partner is the government. <|endoftext|>
[Expectation] "What if you're a government contractor?" <|endoftext|>

Sample document generated. Generation Method = Direct Prob. Control. Structure = Past Aware. Edited = False.
The deal is significant for Wind Energy, which has operations mostly in New York. [Current Context]
8 billion shares sold in all of 2015. 8 billion shares were traded in all of China. [Expectation]

Table 5: Log-likelihood of tag-sequence, according to a simple bigram model p(c_{t+1} | c_t), trained by counting tag sequences in the training dataset. 5th/50th/95th percentiles shown for the test set.
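The bigram tag model behind Table 5 can be sketched as follows (the add-one smoothing and the toy tag sequences are our assumptions; the paper only specifies counting transitions):

```python
import math
from collections import Counter

def train_bigram(tag_sequences, vocab):
    """Count tag transitions (with a start symbol <s>) and return an add-one
    smoothed conditional probability function p(c_{t+1} | c_t)."""
    counts, context = Counter(), Counter()
    for seq in tag_sequences:
        prev = "<s>"
        for tag in seq:
            counts[(prev, tag)] += 1
            context[prev] += 1
            prev = tag
    V = len(vocab)
    def prob(prev, tag):
        return (counts[(prev, tag)] + 1) / (context[prev] + V)
    return prob

def avg_log_likelihood(seq, prob):
    """Average per-tag log-likelihood of a document's tag sequence."""
    prev, total = "<s>", 0.0
    for tag in seq:
        total += math.log(prob(prev, tag))
        prev = tag
    return total / len(seq)

# Toy training data (hypothetical tag sequences):
vocab = ["Main Event", "Consequence", "Evaluation"]
prob = train_bigram([["Main Event", "Consequence"],
                     ["Main Event", "Evaluation"]], vocab)
score = avg_log_likelihood(["Main Event", "Consequence"], prob)
```

Scoring every test document this way yields the percentile distribution in Table 5, against which the Ovid's Unicorn tag sequence is compared.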

Table 6: Top predictive words for each discourse type (top positive β coefficients for a logistic regression trained to predict y = news discourse tag per sentence using X = a bag-of-words representation of each sentence).
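The construction behind Table 6 can be sketched as follows (the coefficient matrix and vocabulary here are toy stand-ins; in practice `coef` would come from a fitted multiclass logistic regression over bag-of-words features):

```python
import numpy as np

def top_words_per_class(coef, vocab, k=2):
    """Given a logistic-regression coefficient matrix of shape
    (n_classes, n_words), return the k words with the largest positive
    beta coefficient for each class."""
    out = {}
    for c in range(coef.shape[0]):
        idx = np.argsort(coef[c])[::-1][:k]
        out[c] = [vocab[i] for i in idx if coef[c, i] > 0]
    return out

vocab = ["expect", "yesterday", "said", "will"]
coef = np.array([[0.1, 1.2, 0.3, -0.5],   # e.g. a past-event class
                 [1.5, -0.2, 0.1, 1.1]])  # e.g. an expectation class
tops = top_words_per_class(coef, vocab, k=2)
```

Filtering to strictly positive coefficients keeps the lists interpretable as words whose presence pushes a sentence toward the class, rather than away from competitors.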